Purpose: This document catalogs the complete technology stack used in the AI e-Invoicing platform, along with viable alternatives for each component. It serves as a reference for architectural decisions and explores emerging solutions that could enhance or replace current implementations.
```mermaid
graph TB
    subgraph "Frontend Layer"
        UI[Streamlit Dashboard]
        Chat[Chatbot Tab]
    end
    subgraph "API Layer"
        API[FastAPI]
        ChatAPI[Chatbot API]
        Auth[FastAPI Users + JWT]
    end
    subgraph "Processing Layer"
        Orch[Orchestrator AsyncIO]
        PDF[PDF Processor<br/>Docling/PyPDF]
        IMG[Image Processor<br/>PaddleOCR]
        XLS[Excel Processor<br/>Pandas]
    end
    subgraph "Intelligence Layer"
        LLM[LLM Layer<br/>DeepSeek/OpenAI/Gemini]
        RAG[RAG Framework<br/>LlamaIndex]
        ChatEngine[Chatbot Engine<br/>Cascade Query Strategy]
        Valid[Validator<br/>Pydantic]
    end
    subgraph "Data Layer"
        PG[(PostgreSQL)]
        Vec[(pgvector<br/>Vector Search)]
        FTS[(PostgreSQL<br/>Full-Text Search)]
        Queue[(pgqueuer)]
        Minio[MinIO S3]
    end
    UI --> API
    Chat --> ChatAPI
    ChatAPI --> ChatEngine
    API --> Orch
    Orch --> PDF
    Orch --> IMG
    Orch --> XLS
    PDF --> LLM
    IMG --> LLM
    XLS --> LLM
    LLM --> Valid
    LLM --> RAG
    ChatEngine --> Vec
    ChatEngine --> FTS
    RAG --> Vec
    Valid --> PG
    Orch --> Queue
    API --> Minio
```
| Component | Current Choice | Version | Purpose |
|---|---|---|---|
| Dashboard UI | Streamlit | 1.39.0+ | Interactive data visualization, human-in-the-loop (HITL) review |
| PDF Viewer | streamlit-pdf-viewer | Latest | In-app PDF preview |
| Charting | Plotly | 5.18.0+ | Quality metrics, trends visualization |
| Component | Current Choice | Version | Purpose |
|---|---|---|---|
| Web Framework | FastAPI | 0.115.0+ | Async REST API, dependency injection |
| ASGI Server | Uvicorn | 0.32.0+ | High-performance async server |
| Authentication | fastapi-users + JWT | Latest | User management, token auth |
| File Upload | python-multipart | 0.0.12+ | Multipart form handling |
| Validation | Pydantic v2 | 2.9.0+ | Request/response schemas, data validation |
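Pydantic v2 lets the business rules live in the schema itself, so malformed extractions are rejected at the API boundary. A minimal sketch, assuming Pydantic v2; the field names (`subtotal`, `tax_rate`, `total`) are illustrative, not the platform's actual schema:

```python
from pydantic import BaseModel, model_validator

class InvoiceSchema(BaseModel):
    """Illustrative invoice schema; real field names may differ."""
    vendor: str
    subtotal: float
    tax_rate: float  # e.g. 0.09 for 9%
    total: float

    @model_validator(mode="after")
    def check_math(self) -> "InvoiceSchema":
        # Business rule: total must equal subtotal * (1 + tax_rate) within a cent
        expected = round(self.subtotal * (1 + self.tax_rate), 2)
        if abs(expected - self.total) > 0.01:
            raise ValueError(f"total {self.total} != expected {expected}")
        return self
```

When such a model rejects a request body, FastAPI returns a 422 automatically, so validation and API error handling come from the same declaration.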
| Component | Current Choice | Version | Purpose |
|---|---|---|---|
| PDF Extraction | Docling | 1.0.0+ | Layout-aware PDF to Markdown conversion |
| PDF Fallback | PyPDF | 5.0.0+ | Simple text extraction |
| OCR Engine | PaddleOCR | 2.7.0+ | Multi-language OCR (CPU optimized) |
| OCR Framework | PaddlePaddle | 2.6.0+ | Deep learning framework for OCR |
| Excel/CSV Parser | Pandas | 2.2.0+ | Tabular data ingestion |
| Excel Binary | openpyxl | 3.1.0+ | .xlsx file reading |
| Markdown Export | tabulate | (via pandas) | DataFrame to markdown conversion |
| Component | Current Choice | Version | Purpose |
|---|---|---|---|
| LLM (Primary) | DeepSeek-V3 / DeepSeek-Chat | API | Cost-effective structured extraction, chatbot responses |
| LLM (Fallback) | OpenAI GPT-4o / Gemini | API | High-accuracy extraction |
| RAG Framework | LlamaIndex | 0.11.0+ | Document indexing, retrieval, agentic workflows |
| Embeddings | sentence-transformers | 2.2.0+ | Semantic search, vector embeddings (all-MiniLM-L6-v2) |
| Orchestration | AsyncIO (native) | Python 3.12 | Parallel processing, non-blocking I/O |
| Chatbot Engine | Custom (brain/chatbot/) | - | Session management, rate limiting, hybrid retrieval |
| Query Strategy | Cascade Fallback | - | Vector search → SQL text search (see Query Strategy Analysis) |
| Component | Current Choice | Version | Purpose |
|---|---|---|---|
| RDBMS | PostgreSQL | 15+ | Core relational data storage |
| ORM | SQLAlchemy 2.0 | 2.0.36+ | Async ORM, migrations |
| DB Driver | asyncpg | 0.30.0+ | High-performance async PostgreSQL driver |
| Migrations | Alembic | 1.14.0+ | Schema version control |
| Vector Store | pgvector | 0.2.0+ | Embedding storage for semantic search |
| Job Queue | pgqueuer | 0.11.0+ | Background task processing |
| Object Storage | MinIO | Latest | S3-compatible document storage |
| Component | Current Choice | Version | Purpose |
|---|---|---|---|
| Logging | structlog | 24.4.0+ | Structured JSON logging |
| Config Management | pydantic-settings | 2.6.0+ | Environment-based configuration |
| Encryption | cryptography | 43.0.0+ | File encryption at rest |
| Environment | python-dotenv | 1.0.1+ | .env file management |
| Container | Docker Compose | Latest | Service orchestration |
| Component | Current Choice | Version | Purpose |
|---|---|---|---|
| Test Framework | pytest | 8.3.0+ | Unit, integration, contract tests |
| Async Testing | pytest-asyncio | 0.24.0+ | Async test support |
| HTTP Testing | httpx | 0.27.0+ | API test client |
| Linting | ruff | 0.6.0+ | Fast Python linter |
| Type Checking | mypy | 1.11.0+ | Static type analysis |
| Technology | Pros | Cons | Use Case Fit |
|---|---|---|---|
| 🟢 Streamlit (Current) | • Rapid prototyping • Python-native • Built-in components • Easy HITL workflows | • Limited customization • Not ideal for public apps • Session state quirks | ✅ Perfect for internal dashboards |
| React.js + Next.js | • Highly customizable • Better performance • Modern UX patterns • Public-facing ready | • Requires separate frontend team • More boilerplate • Slower development | |
| Gradio | • Similar to Streamlit • Better for ML demos • Automatic API generation | • Less mature ecosystem • Fewer components | |
| Reflex | • Python full-stack • React under the hood • Type-safe | • Very new (2023) • Limited community | 🔮 Watch for future adoption |
| Dash (Plotly) | • Enterprise-grade • Advanced visualizations • Production-ready | • Steeper learning curve • More verbose code | |
| Technology | Pros | Cons | Current Usage |
|---|---|---|---|
| 🟢 Docling (Current) | • Layout preservation • Table extraction • Markdown output • IBM-backed | • Newer project • GPU-heavy (optional) | ✅ Primary for complex PDFs |
| PyPDF (Fallback) | • Lightweight • Pure Python • Fast for simple PDFs | • No layout understanding • Poor table handling | ✅ Fallback only |
| Unstructured.io | • Multi-format support • Cloud & local • Active development | • Heavyweight dependency • Commercial licensing | 🟡 Strong alternative to Docling |
| PyMuPDF (fitz) | • Very fast • Image extraction • Low memory | • C++ dependency • License restrictions (AGPL) | 🟡 Consider for speed-critical paths |
| pdfplumber | • Table-focused • Visual debugging • Accurate coordinates | • Slower than PyMuPDF • Less layout context | 🟡 Good for table-heavy invoices |
| Apache Tika | • 1000+ formats • Battle-tested • Enterprise support | • Requires Java runtime • Heavier footprint | ❌ Too heavyweight for our use case |
| Azure Document Intelligence | • Excellent accuracy • Pre-trained invoice models • Microsoft support | • High cost ($1.50/1000 pages) • Cloud dependency • Vendor lock-in | ❌ Expensive for high volume |
| Technology | Pros | Cons | Current Usage |
|---|---|---|---|
| 🟢 PaddleOCR (Current) | • Multi-language (80+) • CPU-friendly • Open-source • Chinese text excellent | • Model size ~200MB • Slower than Tesseract | ✅ Primary OCR engine |
| DeepSeek-OCR | • State-of-the-art accuracy • Multimodal understanding • Context-aware | • Not publicly released yet • Likely requires API • Unknown pricing | 🔮 Monitor for release - could replace PaddleOCR |
| Tesseract 5.x | • Fast • Lightweight • Ubiquitous | • Lower accuracy on Chinese • Requires preprocessing | 🟡 Good for English-only invoices |
| EasyOCR | • 80+ languages • PyTorch-based • Good accuracy | • GPU-hungry • Slower on CPU | 🟡 Alternative to PaddleOCR |
| TrOCR (Hugging Face) | • Transformer-based • SOTA accuracy • Fine-tunable | • Requires GPU • Larger models | |
| Google Cloud Vision | • Excellent accuracy • Handles handwriting • Managed service | • $1.50/1000 images • Privacy concerns • Network latency | ❌ Too expensive at scale |
| AWS Textract | • Invoice-specific • Key-value extraction • High accuracy | • $0.015/page • Vendor lock-in | ❌ Cost-prohibitive |
| Technology | Pros | Cons | Current Usage |
|---|---|---|---|
| 🟢 DeepSeek-V3 (Current) | • $0.14/1M tokens (input) • 128K context • Strong structured output • Function calling | • New model (Dec 2024) • Less tested than GPT | ✅ Primary for cost efficiency |
| OpenAI GPT-4o (Fallback) | • Proven reliability • Best-in-class reasoning • Vision support | • $2.50/1M tokens • 17x more expensive | ✅ Fallback for critical extractions |
| Claude 3.5 Sonnet | • Excellent at structured tasks • 200K context • Strong reasoning | • $3/1M tokens • Rate limits | 🟡 Consider for complex documents |
| Gemini 1.5 Pro | • 1M token context • Native multimodal • Competitive pricing | • Inconsistent quality • Regional availability | 🟡 Alternative to DeepSeek |
| Qwen 2.5 | • Open weights • Self-hostable • Good Chinese support | • Requires GPU infrastructure • Lower accuracy than GPT | |
| Llama 3.1 (405B) | • Open weights • Strong reasoning • Self-hostable | • 405B requires 8x A100s • High infra cost | ❌ Too resource-intensive |
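Whichever provider answers, the pipeline depends on getting clean JSON back; the extraction stage described later in this document uses manual JSON parsing for reliability. A minimal, hedged sketch of that kind of defensive parsing (not the actual implementation), tolerating code fences and surrounding chatter:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Pull the first {...} object out of an LLM reply, tolerating ```json
    fences and surrounding prose. Raises ValueError if none is found."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM output")
    return json.loads(match.group(0))

reply = 'Sure, here is the invoice:\n```json\n{"vendor": "Acme", "total": 123.45}\n```'
data = parse_llm_json(reply)
print(data["vendor"])  # Acme
```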
| Technology | Pros | Cons | Current Usage |
|---|---|---|---|
| 🟢 LlamaIndex (Current) | • Document-centric • Rich ecosystem • Pydantic integration • Strong structured extraction | • Heavy dependency tree • Frequent breaking changes | ✅ Primary RAG framework |
| LangChain | • Most popular • Extensive integrations • Agent framework | • Overly complex abstractions • Slower updates | 🟡 Alternative if agent needs grow |
| 🔥 RagFlow | • Open-source RAG engine • Built-in chunking strategies • Document parsing pipeline • Visual workflow designer • Multi-tenant support | • Newer project (2024) • Smaller community • Less mature docs | 🔮 Strong alternative - purpose-built for document AI |
| Haystack | • Production-ready • Pipeline-focused • Deepset support | • Steeper learning curve • Less structured extraction | 🟡 Consider for complex pipelines |
| txtai | • Lightweight • Embeddings-first • Fast indexing | • Fewer integrations • Simpler feature set | |
| 🔥 LangExtract | • Zero-shot extraction • No training needed • Schema-driven • Built on LangChain | • Less flexible than custom prompts • Still requires LLM API | 🔮 Consider for quick wins - simpler than custom prompts |
| Marvin | • Pydantic-native • Type-safe extraction • Elegant API | • Smaller ecosystem • Less documentation | 🟡 Alternative for Pydantic users |
| Technology | Pros | Cons | Current Usage |
|---|---|---|---|
| 🟢 PostgreSQL + pgvector (Current) | • All-in-one solution • ACID guarantees • Mature ecosystem • Cost-effective • Supports both vector and full-text search | • Vector search slower than specialized DBs • Manual tuning needed | ✅ "Complexity Collapse" strategy ✅ Chatbot uses cascade: vector → SQL fallback |
| Pinecone | • Purpose-built vectors • Managed service • Fast similarity search | • $70/month minimum • Vendor lock-in | ❌ Unnecessary with pgvector |
| Weaviate | • Open-source vector DB • Hybrid search built-in • Self-hostable | • Additional infrastructure • Overkill for our scale | |
| Qdrant | • Rust-based speed • Filtering support • Good docs | • Another service to manage | |
| Chroma | • Lightweight • Embedded mode • Developer-friendly | • Less production-ready • Limited scale | |
Query Strategy Details:
- Current: Cascading fallback (vector search → SQL text search)
- Planned: Parallel hybrid search with Reciprocal Rank Fusion (RRF)
- See comprehensive analysis: Query Strategy Analysis
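Independent of the storage layer, the current cascade can be sketched as: run the pgvector similarity pass first, and fall back to PostgreSQL full-text search only when the semantic pass comes back empty. The retriever callables below are stand-ins for the real pgvector and tsvector queries:

```python
from typing import Callable, List, Tuple

def cascade_query(
    question: str,
    vector_search: Callable[[str], List[str]],
    text_search: Callable[[str], List[str]],
    min_hits: int = 1,
) -> Tuple[List[str], str]:
    """Return (results, strategy): semantic retrieval first, text search as fallback."""
    hits = vector_search(question)
    if len(hits) >= min_hits:
        return hits, "vector"
    return text_search(question), "fts"

# Stub retrievers standing in for the real pgvector / full-text queries
results, strategy = cascade_query(
    "unpaid Acme invoices",
    vector_search=lambda q: [],          # semantic pass finds nothing
    text_search=lambda q: ["INV-0042"],  # keyword pass catches it
)
print(strategy)  # fts
```

The planned RRF variant would instead run both retrievers in parallel and merge their ranked lists, trading one extra query for better recall.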
| Technology | Pros | Cons | Current Usage |
|---|---|---|---|
| 🟢 MinIO (Current) | • S3-compatible • Self-hosted • No egress fees | • Infrastructure overhead • Manual backups | ✅ Cost control strategy |
| AWS S3 | • Managed service • 99.999999999% durability • Global CDN | • Egress costs • Vendor lock-in | 🟡 Consider for production SaaS |
| Local Filesystem | • Zero cost • Simple | • No redundancy • Not scalable | ❌ Development only |
| Technology | Pros | Cons | Current Usage |
|---|---|---|---|
| 🟢 pgqueuer (Current) | • Uses existing Postgres • ACID guarantees • Simple setup | • Not as feature-rich as Celery • Postgres becomes SPOF | ✅ Simplicity wins |
| Celery | • Battle-tested • Rich features • Monitoring tools | • Requires Redis/RabbitMQ • Complex setup | |
| Dramatiq | • Simpler than Celery • Better API | • Smaller ecosystem | 🟡 Alternative to Celery |
| Temporal | • Workflow orchestration • Durable execution • Enterprise-grade | • Heavy infrastructure • Overkill for MVP | ❌ Too complex for current needs |
- Repository: infiniflow/ragflow
- Stars: ~20K (as of Jan 2025)
- What it is: A complete RAG engine with built-in document parsing, chunking, and retrieval
- Key Advantages:
- ✅ Visual workflow designer - no-code pipeline building
- ✅ Built-in document parsers - handles PDF, Word, Excel natively
- ✅ Intelligent chunking - better than naive splitting
- ✅ Multi-tenant support - SaaS-ready architecture
- ✅ Integrated UI - reduces need for custom dashboard
- When to Consider:
- If our LlamaIndex complexity grows
- If we need multi-tenant isolation
- If visual workflow management becomes valuable
- Integration Path:
```python
# Could replace the brain/ layer entirely
# RagFlow handles: parsing → chunking → embedding → retrieval
# We'd keep: validation, database, API layers
```
- Repository: Part of LangChain ecosystem
- What it is: Schema-driven extraction without training examples
- Key Advantages:
- ✅ Simpler than custom prompts - just define Pydantic schema
- ✅ Zero-shot - no few-shot examples needed
- ✅ Type-safe - leverages Pydantic validation
- When to Consider:
- If we want to simplify `brain/extractor.py`
- If prompt engineering becomes a bottleneck
- Example:
```python
from langextract import extract
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total: float
    date: str

# That's it - no prompt engineering needed
result = extract(Invoice, raw_text)
```
- Status: Rumored/announced, not publicly available
- Expected Advantages:
- ✅ Multimodal understanding - combines vision + text reasoning
- ✅ Context-aware - understands invoice semantics during OCR
- ✅ Potential accuracy boost - could eliminate extraction errors
- Risk: May require API access (not self-hostable)
- Action: Monitor for release announcement - could be game-changer
- Trend: Using Docling for layout + Unstructured for preprocessing
- Advantages:
- ✅ Best of both worlds - Docling's layout + Unstructured's robustness
- ✅ Better table handling - especially for complex multi-page tables
- When to Consider: If current PDF extraction quality is insufficient
- Repository: PrefectHQ/marvin
- What it is: AI engineering framework focused on type-safe extraction
- Key Advantages:
- ✅ Elegant API - most Pythonic extraction library
- ✅ Zero boilerplate - uses decorators
- ✅ Built-in validation - Pydantic integration
- Example:
```python
import marvin

@marvin.fn
def extract_invoice(text: str) -> ExtractedDataSchema:
    """Extract invoice data from text"""
    # That's the entire implementation

result = extract_invoice(raw_text)
```
| Component | Cloud Solution | Monthly Cost | Our Stack | Monthly Cost |
|---|---|---|---|---|
| OCR | Google Vision | $150 | PaddleOCR (self-hosted) | $0 |
| PDF Parsing | Azure Document Intelligence | $150 | Docling (self-hosted) | $0 |
| LLM Extraction | GPT-4o only | $250 | DeepSeek-V3 primary | $35 |
| Vector Database | Pinecone | $70 | pgvector (in Postgres) | $0 |
| Job Queue | AWS SQS | $10 | pgqueuer (in Postgres) | $0 |
| Object Storage | AWS S3 | $50 | MinIO (self-hosted) | $10 |
| Database | AWS RDS | $200 | PostgreSQL (self-hosted) | $30 |
| Total | | $880/month | | $75/month |
Cost Savings: ~90% reduction using our "Complexity Collapse" approach
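The headline figure follows directly from the table; checking the arithmetic:

```python
cloud = 150 + 150 + 250 + 70 + 10 + 50 + 200  # Vision + Azure DI + GPT-4o + Pinecone + SQS + S3 + RDS
ours = 0 + 0 + 35 + 0 + 0 + 10 + 30           # self-hosted stack + DeepSeek API
savings = 1 - ours / cloud
print(f"${cloud} -> ${ours}/month, {savings:.1%} saved")  # ≈ 91.5%, i.e. roughly 90%
```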
Stay with the current stack while:
- MVP/early stage (current state)
- Processing < 50K invoices/month
- Team is Python-focused
- Infrastructure budget is limited

Re-evaluate when:
- Extraction accuracy < 85% (try DeepSeek-OCR when available)
- PDF parsing fails on > 10% of documents (evaluate Unstructured.io)
- Processing time > 30s/invoice (optimize or add GPU)
- Team grows to include frontend specialists (consider React.js)
- Security audit requires a zero-LLM-API-call mode (switch to self-hosted Llama)
- Scaling to 1M+ invoices/month (migrate to managed services)
- Multi-tenant SaaS launch (add RagFlow or similar isolation)
```mermaid
%%{init: {'theme':'base'}}%%
quadrantChart
    title AI E-Invoicing Technology Radar
    x-axis "Adopt" --> "Hold"
    y-axis "Trial" --> "Assess"
    quadrant-1 "ADOPT"
    quadrant-2 "TRIAL"
    quadrant-3 "ASSESS"
    quadrant-4 "HOLD"
    DeepSeek-V3: [0.9, 0.9]
    PaddleOCR: [0.85, 0.85]
    Streamlit: [0.8, 0.9]
    FastAPI: [0.95, 0.95]
    PostgreSQL: [0.9, 0.95]
    RagFlow: [0.4, 0.8]
    LangExtract: [0.5, 0.7]
    Marvin: [0.45, 0.6]
    Unstructured: [0.55, 0.75]
    DeepSeek-OCR: [0.3, 0.5]
    React-Dashboard: [0.25, 0.4]
    Temporal: [0.15, 0.3]
    Azure-AI: [0.05, 0.1]
    Pinecone: [0.1, 0.15]
```
- ADOPT (Top-Right): Current production stack - proven and reliable
- TRIAL (Top-Left): Actively evaluate - RagFlow, LangExtract, Marvin
- ASSESS (Bottom-Left): Monitor developments - DeepSeek-OCR, React dashboard
- HOLD (Bottom-Right): Avoid - expensive cloud services
Each invoice goes through these stages:
1. File Ingestion: The file is read and hashed (SHA-256) for duplicate detection
2. OCR/Text Extraction: The image or PDF is processed to extract text (PaddleOCR for images, Docling for PDFs)
3. AI Extraction: Structured data (vendor, amounts, dates, etc.) is extracted with DeepSeek-chat, using manual JSON parsing for reliability
4. Validation: Business rules are checked (math validation, tax rate constraints, etc.)
5. Self-Correction: If validation fails, the system attempts to refine the extraction by capping confidence and adjusting math logic
6. Storage: Results are stored in PostgreSQL with JSON-safe serialization
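The ingestion stage's duplicate detection needs nothing beyond the standard library; a minimal sketch of chunked SHA-256 hashing:

```python
import hashlib
import tempfile

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large scans are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# A repeated upload produces the same digest, so a unique index on the
# hash column is enough to reject duplicates at insert time.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
print(file_sha256(tmp.name))
```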
Processing Status:
- `pending` - Initial state
- `queued` - Added to processing queue
- `processing` - Currently being processed
- `completed` - Successfully processed
- `failed` - Processing failed (check `error_message`)
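The status values form a simple linear state machine; a hedged sketch (the transition map below is an assumption inferred from the status list, not the actual implementation):

```python
from enum import Enum

class Status(str, Enum):
    PENDING = "pending"
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

# Assumed legal transitions, inferred from the status descriptions
ALLOWED = {
    Status.PENDING: {Status.QUEUED},
    Status.QUEUED: {Status.PROCESSING},
    Status.PROCESSING: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),   # terminal
    Status.FAILED: set(),      # terminal
}

def advance(current: Status, new: Status) -> Status:
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Committing each transition immediately (as the status-tracking notes below describe) keeps the dashboard's view of `status` consistent with the worker's.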
- Check backend logs for error messages (logs include processing stage information)
- Verify the file exists in the `data/` directory
- Check the database connection in the `.env` file: `DATABASE_URL` must be set
- Ensure all dependencies are installed: `pip install -e ".[dev]"`
- Run database migrations: `alembic upgrade head`
- Check file permissions: ensure the `data/` directory is writable
- Verify the file is not corrupted: check that its size is > 0
- Check for missing processing libraries (OCR, PDF): error messages will indicate which library is missing
- Make sure you've processed at least one invoice
- Check the status filter in the sidebar (may be filtering out your invoices)
- Verify the database connection: check that the `.env` file has `DATABASE_URL`
- Check dashboard logs for database query errors
- Verify the database schema is up to date: `alembic current` should show the latest migration
- Check if the backend is running: `curl http://localhost:${API_PORT}/health`
- Verify port 8000 is not in use by another service
- Check API logs for startup errors
- Verify database is accessible: health endpoint will show "degraded" if database issues exist
- Project structure with three-layer architecture (Sensory, Brain, Interaction)
- PostgreSQL database with pgvector extension
- Async SQLAlchemy 2.0 ORM models
- File processing pipeline (PDF, Excel, CSV, Images)
- Basic data extraction and validation framework
- FastAPI REST API with async endpoints
- Streamlit review dashboard
- File-level encryption at rest
- SHA-256 file hashing for duplicate detection
- Database migrations with Alembic
- Non-blocking PaddleOCR initialization (prevents system crashes)
- Comprehensive error handling with user-friendly messages
- Database schema health checks and connection retry logic
- Enhanced logging with processing stage tracking
- OCR timeout handling with retry logic (180s default, up to 10min for large images)
- File validation (size limits, corruption checks)
- Background processing with proper session management
- Status tracking with immediate database commits
- Performance monitoring (processing time tracking)
- Docling integration for advanced PDF processing
- DeepSeek-chat integration as primary extraction LLM
- pgqueuer setup for background job management
- Enhanced validation rules (Tax rate auto-detection, Line item math fallback)
- Chatbot engine for conversational invoice querying
- Robust transaction management with explicit rollbacks
- Multi-agent coordination for complex multi-page document reconciliation
- Integration with external ERP (Odoo/SAP) APIs
- Enhanced local embedding model performance tuning
- FastAPI Best Practices
- LlamaIndex Structured Extraction
- pgvector Performance Tuning
- Streamlit Components Gallery
- RagFlow: https://github.com/infiniflow/ragflow
- Docling: https://github.com/DS4SD/docling
- Marvin: https://github.com/PrefectHQ/marvin
- PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
- Document AI Platforms: Azure Form Recognizer, AWS Textract (for accuracy baseline)
- Workflow Orchestration: Apache Airflow, Prefect, Temporal (if complexity grows)
- Full-Stack Python: Reflex, FastUI, NiceGUI (alternatives to React)
| Date | Change | Rationale |
|---|---|---|
| 2025-01-08 | Initial document creation | Catalog current stack and alternatives |
| 2025-01-08 | Added RagFlow, LangExtract, Marvin | Popular emerging solutions in document AI space |
| 2025-01-08 | Added cost comparison | Justify "Complexity Collapse" approach |
This document should be updated when:
- A major technology is added/replaced in the stack
- A new promising alternative emerges in the ecosystem
- Cost/performance benchmarks change significantly
- Team makes a technology decision (document rationale here)
Maintainer: AI E-Invoicing Team