Skip to content

Latest commit

 

History

History
533 lines (431 loc) · 26.7 KB

File metadata and controls

533 lines (431 loc) · 26.7 KB

Tech Stack & Alternative Solutions

Purpose: This document catalogs the complete technology stack used in the AI e-Invoicing platform, along with viable alternatives for each component. It serves as a reference for architectural decisions and explores emerging solutions that could enhance or replace current implementations.


📐 Architecture Overview

graph TB
    subgraph "Frontend Layer"
        UI[Streamlit Dashboard]
        Chat[Chatbot Tab]
    end
    
    subgraph "API Layer"
        API[FastAPI]
        ChatAPI[Chatbot API]
        Auth[FastAPI Users + JWT]
    end
    
    subgraph "Processing Layer"
        Orch[Orchestrator AsyncIO]
        PDF[PDF Processor<br/>Docling/PyPDF]
        IMG[Image Processor<br/>PaddleOCR]
        XLS[Excel Processor<br/>Pandas]
    end
    
    subgraph "Intelligence Layer"
        LLM[LLM Layer<br/>DeepSeek/OpenAI/Gemini]
        RAG[RAG Framework<br/>LlamaIndex]
        ChatEngine[Chatbot Engine<br/>Cascade Query Strategy]
        Valid[Validator<br/>Pydantic]
    end
    
    subgraph "Data Layer"
        PG[(PostgreSQL)]
        Vec[(pgvector<br/>Vector Search)]
        FTS[(PostgreSQL<br/>Full-Text Search)]
        Queue[(pgqueuer)]
        Minio[MinIO S3]
    end
    
    UI --> API
    Chat --> ChatAPI
    ChatAPI --> ChatEngine
    API --> Orch
    Orch --> PDF
    Orch --> IMG
    Orch --> XLS
    PDF --> LLM
    IMG --> LLM
    XLS --> LLM
    LLM --> Valid
    LLM --> RAG
    ChatEngine --> Vec
    ChatEngine --> FTS
    RAG --> Vec
    Valid --> PG
    Orch --> Queue
    API --> Minio
Loading

🎯 Current Stack by Layer

1. Frontend / Interface Layer

Component Current Choice Version Purpose
Dashboard UI Streamlit 1.39.0+ Interactive data visualization, HITL review
PDF Viewer streamlit-pdf-viewer Latest In-app PDF preview
Charting Plotly 5.18.0+ Quality metrics, trends visualization

2. Backend / API Layer

Component Current Choice Version Purpose
Web Framework FastAPI 0.115.0+ Async REST API, dependency injection
ASGI Server Uvicorn 0.32.0+ High-performance async server
Authentication fastapi-users + JWT Latest User management, token auth
File Upload python-multipart 0.0.12+ Multipart form handling
Validation Pydantic v2 2.9.0+ Request/response schemas, data validation

3. Document Processing Layer

Component Current Choice Version Purpose
PDF Extraction Docling 1.0.0+ Layout-aware PDF to Markdown conversion
PDF Fallback PyPDF 5.0.0+ Simple text extraction
OCR Engine PaddleOCR 2.7.0+ Multi-language OCR (CPU optimized)
OCR Framework PaddlePaddle 2.6.0+ Deep learning framework for OCR
Excel/CSV Parser Pandas 2.2.0+ Tabular data ingestion
Excel Binary openpyxl 3.1.0+ .xlsx file reading
Markdown Export tabulate (via pandas) DataFrame to markdown conversion

4. AI / Intelligence Layer

Component Current Choice Version Purpose
LLM (Primary) DeepSeek-V3 / DeepSeek-Chat API Cost-effective structured extraction, chatbot responses
LLM (Fallback) OpenAI GPT-4o / Gemini API High-accuracy extraction
RAG Framework LlamaIndex 0.11.0+ Document indexing, retrieval, agentic workflows
Embeddings sentence-transformers 2.2.0+ Semantic search, vector embeddings (all-MiniLM-L6-v2)
Orchestration AsyncIO (native) Python 3.12 Parallel processing, non-blocking I/O
Chatbot Engine Custom (brain/chatbot/) - Session management, rate limiting, hybrid retrieval
Query Strategy Cascade Fallback - Vector search → SQL text search (see Query Strategy Analysis)

5. Data / Persistence Layer

Component Current Choice Version Purpose
RDBMS PostgreSQL 15+ Core relational data storage
ORM SQLAlchemy 2.0 2.0.36+ Async ORM, migrations
DB Driver asyncpg 0.30.0+ High-performance async PostgreSQL driver
Migrations Alembic 1.14.0+ Schema version control
Vector Store pgvector 0.2.0+ Embedding storage for semantic search
Job Queue pgqueuer 0.11.0+ Background task processing
Object Storage MinIO Latest S3-compatible document storage

6. DevOps / Operations Layer

Component Current Choice Version Purpose
Logging structlog 24.4.0+ Structured JSON logging
Config Management pydantic-settings 2.6.0+ Environment-based configuration
Encryption cryptography 43.0.0+ File encryption at rest
Environment python-dotenv 1.0.1+ .env file management
Container Docker Compose Latest Service orchestration

7. Testing / Quality Layer

Component Current Choice Version Purpose
Test Framework pytest 8.3.0+ Unit, integration, contract tests
Async Testing pytest-asyncio 0.24.0+ Async test support
HTTP Testing httpx 0.27.0+ API test client
Linting ruff 0.6.0+ Fast Python linter
Type Checking mypy 1.11.0+ Static type analysis

🔄 Alternative Solutions & Comparisons

Frontend / Dashboard Alternatives

Technology Pros Cons Use Case Fit
🟢 Streamlit (Current) • Rapid prototyping
• Python-native
• Built-in components
• Easy HITL workflows
• Limited customization
• Not ideal for public apps
• Session state quirks
Perfect for internal dashboards
React.js + Next.js • Highly customizable
• Better performance
• Modern UX patterns
• Public-facing ready
• Requires separate frontend team
• More boilerplate
• Slower development
⚠️ Consider for production SaaS version
Gradio • Similar to Streamlit
• Better for ML demos
• Automatic API generation
• Less mature ecosystem
• Fewer components
⚠️ Alternative for quick ML demos
Reflex • Python full-stack
• React under the hood
• Type-safe
• Very new (2023)
• Limited community
🔮 Watch for future adoption
Dash (Plotly) • Enterprise-grade
• Advanced visualizations
• Production-ready
• Steeper learning curve
• More verbose code
⚠️ If heavy analytics focus

PDF Processing Alternatives

Technology Pros Cons Current Usage
🟢 Docling (Current) • Layout preservation
• Table extraction
• Markdown output
• IBM-backed
• Newer project
• GPU-heavy (optional)
✅ Primary for complex PDFs
PyPDF (Fallback) • Lightweight
• Pure Python
• Fast for simple PDFs
• No layout understanding
• Poor table handling
✅ Fallback only
Unstructured.io • Multi-format support
• Cloud & local
• Active development
• Heavyweight dependency
• Commercial licensing
🟡 Strong alternative to Docling
PyMuPDF (fitz) • Very fast
• Image extraction
• Low memory
• C++ dependency
• License restrictions (AGPL)
🟡 Consider for speed-critical paths
pdfplumber • Table-focused
• Visual debugging
• Accurate coordinates
• Slower than PyMuPDF
• Less layout context
🟡 Good for table-heavy invoices
Apache Tika • 1000+ formats
• Battle-tested
• Enterprise support
• Requires Java runtime
• Heavier footprint
❌ Too heavyweight for our use case
Azure Document Intelligence • Excellent accuracy
• Pre-trained invoice models
• Microsoft support
High cost ($1.50/1000 pages)
• Cloud dependency
• Vendor lock-in
❌ Expensive for high volume

OCR Engine Alternatives

Technology Pros Cons Current Usage
🟢 PaddleOCR (Current) • Multi-language (80+)
• CPU-friendly
• Open-source
• Chinese text excellent
• Model size ~200MB
• Slower than Tesseract
✅ Primary OCR engine
DeepSeek-OCR • State-of-the-art accuracy
• Multimodal understanding
• Context-aware
Not publicly released yet
• Likely requires API
• Unknown pricing
🔮 Monitor for release - could replace PaddleOCR
Tesseract 5.x • Fast
• Lightweight
• Ubiquitous
• Lower accuracy on Chinese
• Requires preprocessing
🟡 Good for English-only invoices
EasyOCR • 80+ languages
• PyTorch-based
• Good accuracy
• GPU-hungry
• Slower on CPU
🟡 Alternative to PaddleOCR
TrOCR (Hugging Face) • Transformer-based
• SOTA accuracy
• Fine-tunable
• Requires GPU
• Larger models
⚠️ Consider for GPU deployment
Google Cloud Vision • Excellent accuracy
• Handles handwriting
• Managed service
$1.50/1000 images
• Privacy concerns
• Network latency
❌ Too expensive at scale
AWS Textract • Invoice-specific
• Key-value extraction
• High accuracy
$0.015/page
• Vendor lock-in
❌ Cost-prohibitive

LLM / Extraction Alternatives

Technology Pros Cons Current Usage
🟢 DeepSeek-V3 (Current) $0.14/1M tokens (input)
• 128K context
• Strong structured output
• Function calling
• New model (Dec 2024)
• Less tested than GPT
✅ Primary for cost efficiency
OpenAI GPT-4o (Fallback) • Proven reliability
• Best-in-class reasoning
• Vision support
$2.50/1M tokens
• 17x more expensive
✅ Fallback for critical extractions
Claude 3.5 Sonnet • Excellent at structured tasks
• 200K context
• Strong reasoning
$3/1M tokens
• Rate limits
🟡 Consider for complex documents
Gemini 1.5 Pro • 1M token context
• Native multimodal
• Competitive pricing
• Inconsistent quality
• Regional availability
🟡 Alternative to DeepSeek
Qwen 2.5 • Open weights
• Self-hostable
• Good Chinese support
• Requires GPU infrastructure
• Lower accuracy than GPT
⚠️ Consider for on-prem deployment
Llama 3.1 (405B) • Open weights
• Strong reasoning
• Self-hostable
405B requires 8x A100s
• High infra cost
❌ Too resource-intensive

RAG Framework Alternatives

Technology Pros Cons Current Usage
🟢 LlamaIndex (Current) • Document-centric
• Rich ecosystem
• Pydantic integration
• Strong structured extraction
• Heavy dependency tree
• Frequent breaking changes
✅ Primary RAG framework
LangChain • Most popular
• Extensive integrations
• Agent framework
• Overly complex abstractions
• Slower updates
🟡 Alternative if agent needs grow
🔥 RagFlow Open-source RAG engine
• Built-in chunking strategies
• Document parsing pipeline
• Visual workflow designer
• Multi-tenant support
• Newer project (2024)
• Smaller community
• Less mature docs
🔮 Strong alternative - purpose-built for document AI
Haystack • Production-ready
• Pipeline-focused
• Deepset support
• Steeper learning curve
• Less structured extraction
🟡 Consider for complex pipelines
txtai • Lightweight
• Embeddings-first
• Fast indexing
• Fewer integrations
• Simpler feature set
⚠️ Good for simple semantic search
🔥 LangExtract Zero-shot extraction
• No training needed
• Schema-driven
• Built on LangChain
• Less flexible than custom prompts
• Still requires LLM API
🔮 Consider for quick wins - simpler than custom prompts
Marvin • Pydantic-native
• Type-safe extraction
• Elegant API
• Smaller ecosystem
• Less documentation
🟡 Alternative for Pydantic users

Database / Storage Alternatives

Technology Pros Cons Current Usage
🟢 PostgreSQL + pgvector (Current) • All-in-one solution
• ACID guarantees
• Mature ecosystem
• Cost-effective
• Supports both vector and full-text search
• Vector search slower than specialized DBs
• Manual tuning needed
✅ "Complexity Collapse" strategy
✅ Chatbot uses cascade: vector → SQL fallback
Pinecone • Purpose-built vectors
• Managed service
• Fast similarity search
$70/month minimum
• Vendor lock-in
❌ Unnecessary with pgvector
Weaviate • Open-source vector DB
• Hybrid search built-in
• Self-hostable
• Additional infrastructure
• Overkill for our scale
⚠️ Consider for parallel hybrid upgrade (see Query Strategy)
Qdrant • Rust-based speed
• Filtering support
• Good docs
• Another service to manage ⚠️ Alternative to Weaviate
Chroma • Lightweight
• Embedded mode
• Developer-friendly
• Less production-ready
• Limited scale
⚠️ Good for prototyping

Query Strategy Details:

  • Current: Cascading fallback (vector search → SQL text search)
  • Planned: Parallel hybrid search with Reciprocal Rank Fusion (RRF)
  • See comprehensive analysis: Query Strategy Analysis

Object Storage Alternatives

Technology Pros Cons Current Usage
🟢 MinIO (Current) • S3-compatible
• Self-hosted
• No egress fees
• Infrastructure overhead
• Manual backups
✅ Cost control strategy
AWS S3 • Managed service
• 99.999999999% durability
• Global CDN
Egress costs
• Vendor lock-in
🟡 Consider for production SaaS
Local Filesystem • Zero cost
• Simple
• No redundancy
• Not scalable
❌ Development only

Background Job Processing Alternatives

Technology Pros Cons Current Usage
🟢 pgqueuer (Current) • Uses existing Postgres
• ACID guarantees
• Simple setup
• Not as feature-rich as Celery
• Postgres becomes SPOF
✅ Simplicity wins
Celery • Battle-tested
• Rich features
• Monitoring tools
• Requires Redis/RabbitMQ
• Complex setup
⚠️ If job complexity grows
Dramatiq • Simpler than Celery
• Better API
• Smaller ecosystem 🟡 Alternative to Celery
Temporal • Workflow orchestration
• Durable execution
• Enterprise-grade
• Heavy infrastructure
• Overkill for MVP
❌ Too complex for current needs

🚀 Emerging & Trending Solutions

🔥 High-Priority Alternatives to Evaluate

1. RagFlow - Open-Source RAG Engine

  • Repository: infiniflow/ragflow
  • Stars: ~20K+ (as of Jan 2025)
  • What it is: A complete RAG engine with built-in document parsing, chunking, and retrieval
  • Key Advantages:
    • Visual workflow designer - no-code pipeline building
    • Built-in document parsers - handles PDF, Word, Excel natively
    • Intelligent chunking - better than naive splitting
    • Multi-tenant support - SaaS-ready architecture
    • Integrated UI - reduces need for custom dashboard
  • When to Consider:
    • If our LlamaIndex complexity grows
    • If we need multi-tenant isolation
    • If visual workflow management becomes valuable
  • Integration Path:
    # Could replace brain/ layer entirely
    # RagFlow handles: parsing → chunking → embedding → retrieval
    # We'd keep: validation, database, API layers

2. LangExtract - Zero-Shot Extraction

  • Repository: Part of LangChain ecosystem
  • What it is: Schema-driven extraction without training examples
  • Key Advantages:
    • Simpler than custom prompts - just define Pydantic schema
    • Zero-shot - no few-shot examples needed
    • Type-safe - leverages Pydantic validation
  • When to Consider:
    • If we want to simplify brain/extractor.py
    • If prompt engineering becomes bottleneck
  • Example:
    from langextract import extract
    from pydantic import BaseModel
    
    class Invoice(BaseModel):
        vendor: str
        total: float
        date: str
    
    # That's it - no prompt engineering needed
    result = extract(Invoice, raw_text)

3. DeepSeek-OCR - Next-Gen OCR (Unreleased)

  • Status: Rumored/announced, not publicly available
  • Expected Advantages:
    • Multimodal understanding - combines vision + text reasoning
    • Context-aware - understands invoice semantics during OCR
    • Potential accuracy boost - could eliminate extraction errors
  • Risk: May require API access (not self-hostable)
  • Action: Monitor for release announcement - could be game-changer

4. Docling + Unstructured.io - Document Parsing Combo

  • Trend: Using Docling for layout + Unstructured for preprocessing
  • Advantages:
    • Best of both worlds - Docling's layout + Unstructured's robustness
    • Better table handling - especially for complex multi-page tables
  • When to Consider: If current PDF extraction quality is insufficient

5. Marvin AI - Pydantic-Native Extraction

  • Repository: PrefectHQ/marvin
  • What it is: AI engineering framework focused on type-safe extraction
  • Key Advantages:
    • Elegant API - most Pythonic extraction library
    • Zero boilerplate - uses decorators
    • Built-in validation - Pydantic integration
  • Example:
    import marvin
    
    @marvin.fn
    def extract_invoice(text: str) -> ExtractedDataSchema:
        """Extract invoice data from text"""
    
    # That's the entire implementation
    result = extract_invoice(raw_text)

📊 Cost Comparison: Cloud vs Self-Hosted

Scenario: 10,000 invoices/month

Component Cloud Solution Monthly Cost Our Stack Monthly Cost
OCR Google Vision $150 PaddleOCR (self-hosted) $0
PDF Parsing Azure Document Intelligence $150 Docling (self-hosted) $0
LLM Extraction GPT-4o only $250 DeepSeek-V3 primary $35
Vector Database Pinecone $70 pgvector (in Postgres) $0
Job Queue AWS SQS $10 pgqueuer (in Postgres) $0
Object Storage AWS S3 $50 MinIO (self-hosted) $10
Database AWS RDS $200 PostgreSQL (self-hosted) $30
Total $880/month $75/month

Cost Savings: ~90% reduction using our "Complexity Collapse" approach


🎯 Decision Framework: When to Switch

✅ Stick with Current Stack If:

  • MVP/early stage (current state)
  • Processing < 50K invoices/month
  • Team is Python-focused
  • Infrastructure budget is limited

⚠️ Consider Alternatives If:

  • Extraction accuracy < 85% (try DeepSeek-OCR when available)
  • PDF parsing fails on >10% of documents (evaluate Unstructured.io)
  • Processing time > 30s/invoice (optimize or add GPU)
  • Team grows to include frontend specialists (consider React.js)

🔴 Must Switch If:

  • Security audit requires zero-LLM-API-call mode (switch to self-hosted Llama)
  • Scaling to 1M+ invoices/month (migrate to managed services)
  • Multi-tenant SaaS launch (add RagFlow or similar isolation)

🛠️ Technology Radar (2025)

%%{init: {'theme':'base'}}%%
quadrantChart
    title AI E-Invoicing Technology Radar
    x-axis "Adopt" --> "Hold"
    y-axis "Trial" --> "Assess"
    quadrant-1 "ADOPT"
    quadrant-2 "TRIAL"
    quadrant-3 "ASSESS"
    quadrant-4 "HOLD"
    
    DeepSeek-V3: [0.9, 0.9]
    PaddleOCR: [0.85, 0.85]
    Streamlit: [0.8, 0.9]
    FastAPI: [0.95, 0.95]
    PostgreSQL: [0.9, 0.95]
    
    RagFlow: [0.4, 0.8]
    LangExtract: [0.5, 0.7]
    Marvin: [0.45, 0.6]
    Unstructured: [0.55, 0.75]
    
    DeepSeek-OCR: [0.3, 0.5]
    React-Dashboard: [0.25, 0.4]
    Temporal: [0.15, 0.3]
    
    Azure-AI: [0.05, 0.1]
    Pinecone: [0.1, 0.15]
Loading

Legend:

  • ADOPT (Top-Right): Current production stack - proven and reliable
  • TRIAL (Top-Left): Actively evaluate - RagFlow, LangExtract, Marvin
  • ASSESS (Bottom-Left): Monitor developments - DeepSeek-OCR, React dashboard
  • HOLD (Bottom-Right): Avoid - expensive cloud services

🧠 Understanding Processing

Each invoice goes through these stages:

  1. File Ingestion: File is read and hashed (SHA-256) for duplicate detection
  2. OCR/Text Extraction: Image/PDF is processed to extract text (PaddleOCR for images, Docling for PDFs)
  3. AI Extraction: Structured data is extracted using DeepSeek-chat with manual JSON parsing for reliability (vendor, amounts, dates, etc.)
  4. Validation: Business rules are checked (math validation, tax rate constraints, etc.)
  5. Self-Correction: If validation fails, the system attempts to refine extraction by capping confidence and adjusting math logic
  6. Storage: Results are stored in PostgreSQL with JSON-safe serialization

Processing Status:

  • pending - Initial state
  • queued - Added to processing queue
  • processing - Currently being processed
  • completed - Successfully processed
  • failed - Processing failed (check error_message)

🛠️ Troubleshooting

If processing fails:

  • Check backend logs for error messages (logs include processing stage information)
  • Verify the file exists in data/ directory
  • Check database connection in .env file: DATABASE_URL must be set
  • Ensure all dependencies are installed: pip install -e ".[dev]"
  • Run database migrations: alembic upgrade head
  • Check file permissions: ensure data/ directory is writable
  • Verify file is not corrupted: check file size > 0
  • Check for missing processing libraries (OCR, PDF): error messages will indicate which library is missing

If dashboard shows no invoices:

  • Make sure you've processed at least one invoice
  • Check the status filter in the sidebar (may be filtering out your invoices)
  • Verify database connection: check .env file has DATABASE_URL
  • Check dashboard logs for database query errors
  • Verify database schema is up to date: alembic current should show latest migration

API not responding:

  • Check if backend is running: curl http://localhost:${API_PORT}/health
  • Verify port 8000 is not in use by another service
  • Check API logs for startup errors
  • Verify database is accessible: health endpoint will show "degraded" if database issues exist

📊 Current Implementation Status

✅ Completed (Scaffold Phase):

  • Project structure with three-layer architecture (Sensory, Brain, Interaction)
  • PostgreSQL database with pgvector extension
  • Async SQLAlchemy 2.0 ORM models
  • File processing pipeline (PDF, Excel, CSV, Images)
  • Basic data extraction and validation framework
  • FastAPI REST API with async endpoints
  • Streamlit review dashboard
  • File-level encryption at rest
  • SHA-256 file hashing for duplicate detection
  • Database migrations with Alembic

✅ Recently Completed (Ingestion Workflow Fixes):

  • Non-blocking PaddleOCR initialization (prevents system crashes)
  • Comprehensive error handling with user-friendly messages
  • Database schema health checks and connection retry logic
  • Enhanced logging with processing stage tracking
  • OCR timeout handling with retry logic (180s default, up to 10min for large images)
  • File validation (size limits, corruption checks)
  • Background processing with proper session management
  • Status tracking with immediate database commits
  • Performance monitoring (processing time tracking)

✅ Completed (Agentic Phase):

  • Docling integration for advanced PDF processing
  • DeepSeek-chat integration as primary extraction LLM
  • pgqueuer setup for background job management
  • Enhanced validation rules (Tax rate auto-detection, Line item math fallback)
  • Chatbot engine for conversational invoice querying
  • Robust transaction management with explicit rollbacks

🚧 Future Roadmap:

  • Multi-agent coordination for complex multi-page document reconciliation
  • Integration with external ERP (Odoo/SAP) APIs
  • Enhanced local embedding model performance tuning

📚 Further Reading & Resources

Official Documentation

Open-Source Projects to Watch

Comparative Analysis

Alternative Ecosystems Worth Exploring

  • Document AI Platforms: Azure Form Recognizer, AWS Textract (for accuracy baseline)
  • Workflow Orchestration: Apache Airflow, Prefect, Temporal (if complexity grows)
  • Full-Stack Python: Reflex, FastUI, NiceGUI (alternatives to React)

🔄 Version History

Date Change Rationale
2025-01-08 Initial document creation Catalog current stack and alternatives
2025-01-08 Added RagFlow, LangExtract, Marvin Popular emerging solutions in document AI space
2025-01-08 Added cost comparison Justify "Complexity Collapse" approach

💡 Contributing

This document should be updated when:

  1. A major technology is added/replaced in the stack
  2. A new promising alternative emerges in the ecosystem
  3. Cost/performance benchmarks change significantly
  4. Team makes a technology decision (document rationale here)

Maintainer: AI E-Invoicing Team