Reference for the current AI E-Invoicing implementation: dashboard, API, ingestion, and configuration.
| Doc | Description |
|---|---|
| Technical Stack & Architecture | Stack by layer (frontend, API, processing, intelligence, data), alternatives, and processing logic |
| Setup & Scaffold | Step-by-step setup and scaffold guide |
| Dashboard Improvements | Analytics, export to CSV/PDF, filters, bulk reprocess, status/vendor/financial charts |
| Dataset Upload UI | Web upload (PDF, Excel, images), processing flow, and upload API |
| Invoice Chatbot | RAG chatbot: sessions, rate limiting, vector retriever, DeepSeek, dashboard tab |
| Query Strategy Analysis | Cascade vs Parallel Hybrid Search: comprehensive analysis, performance comparison, production upgrade path |
| Ingestion Workflow Fixes | Ingestion pipeline fixes and behavior |
| Duplicate Processing Logic | File hashing, versioning, and duplicate handling |
| Resilient Configuration | Module plugability, runtime configuration APIs, workflow diagram |
| OCR Implementation | OCR providers, configuration, and compare flow |
| OCR Switch Options | Switching and configuring OCR providers |
| CSV Implementation | CSV ingestion and processing |
| PDF Implementation | PDF processing (Docling/PyPDF) |
| Process Images | Image processing and pipeline |
- Health: GET /api/v1/health
- Invoices: process, list, detail, analytics (status-distribution, time-series, vendor-analysis, financial-summary)
- Uploads: upload files, list, status
- Chatbot: POST /api/v1/chatbot/chat, session management
- Quality: extraction quality and confidence metrics
- Configurations: module/stage configuration, activation, rollback
- OCR: providers, compare, run
- Modules / Stages: configuration metadata
Invoice List (filters, bulk actions, export) → Invoice Detail (preview, extracted data, validation analysis) → Upload Files → Chatbot → Quality Metrics → OCR Compare.
Related Documents:
- Query Strategy Analysis - Comprehensive cascade vs parallel hybrid search analysis
- Invoice Chatbot Implementation - Current chatbot implementation details
Is the implementation reasonable? Yes, highly reasonable. The current architecture leverages a "Complexity Collapse" strategy by using powerful inference-only models (DeepSeek-V3/GPT-4o) combined with LlamaIndex for RAG and pgvector for storage.
Your confusion likely stems from the fact that SFT (Supervised Fine-Tuning) and TRL (Transformer Reinforcement Learning) are NOT currently used in this project, nor are they typically required for this stage of an invoice extraction application.
Query Strategy Update: The chatbot currently uses a cascading fallback approach (vector → SQL) that is sufficient for MVP but should be upgraded to parallel hybrid search before production. See Query Strategy Analysis for details.
| Technology | Role | Status | Why it's a good choice |
|---|---|---|---|
| LlamaIndex | RAG Framework | Core | Specialized for "Data Context" and structured extraction, making it often superior to LangChain for document-heavy tasks like invoicing. |
| pgvector | Vector Store | Core | Integrated directly into PostgreSQL, reducing infrastructure complexity (no need for Pinecone/Qdrant). |
| DeepSeek-V3 | Inference Model | Core | Extremely cost-effective ($0.14/1M tokens) with performance comparable to top-tier models for extraction. |
| Docling | PDF Parser | Core | Preserves layout structure, which is critical for accurate RAG on invoices. |
| Technology | Role | Status | Recommendation |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Training | Absent | Not recommended yet. Fine-tuning is complex and expensive. Modern LLMs (like DeepSeek) are usually capable enough with just good prompting (RAG). Only consider SFT if you have 10,000+ examples and general models fail. |
| TRL (Transformer Reinforcement Learning) | RLHF Training | Absent | Overkill. This is for training models to follow instructions (like ChatGPT itself). You do not need this for an invoice extractor. |
| LangChain | Orchestration | Absent | Redundant. You are using LlamaIndex, which overlaps significantly with LangChain. Using both introduces unnecessary complexity ("dependency hell"). |
The implementation focuses on Context-Aware Extraction:
- Ingest: PDFs are parsed (Docling) and chunked.
- Retrieve: Relevant chunks are found using vector similarity (pgvector).
- Generate: The LLM extracts specific fields (Vendor, Total, Date) based on that context.
- Validate: Pydantic models ensure the output matches the required format.
Verdict: This is the industry-standard approach for modern Document AI. It is robust and easier to maintain than training custom models.
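The Validate step above can be sketched with a Pydantic model (the field names here are illustrative; the actual schema lives in the project code):

```python
from datetime import date
from decimal import Decimal

from pydantic import BaseModel, field_validator


class InvoiceExtraction(BaseModel):
    # Hypothetical schema; the project's actual fields may differ.
    vendor: str
    invoice_date: date
    subtotal: Decimal
    tax: Decimal
    total: Decimal

    @field_validator("total")
    @classmethod
    def total_matches_parts(cls, v, info):
        # Reject LLM output where Total != Subtotal + Tax.
        subtotal = info.data.get("subtotal")
        tax = info.data.get("tax")
        if subtotal is not None and tax is not None and v != subtotal + tax:
            raise ValueError(f"total {v} != subtotal + tax = {subtotal + tax}")
        return v


# Pydantic coerces the raw LLM strings into dates and Decimals.
extraction = InvoiceExtraction(
    vendor="Acme GmbH",
    invoice_date="2024-05-01",
    subtotal="100.00",
    tax="19.00",
    total="119.00",
)
```

Using `Decimal` rather than `float` avoids rounding surprises in the arithmetic check, which matters for financial data.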
The project uses LlamaIndex Agents (ReAct pattern) which can:
- Reason about what information is missing.
- Query the vector store specifically for that missing info.
- Self-correct if validation fails (e.g., if Total != Subtotal + Tax).
Verdict: This is a practical implementation of "Agency" without over-engineering.
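The agent loop described above can be sketched in plain Python; `extract` and `retrieve` are stand-ins for the project's LlamaIndex extractor and retriever, not real library calls:

```python
REQUIRED_FIELDS = {"vendor", "total", "invoice_date"}


def extract_with_self_correction(extract, retrieve, query, max_rounds=3):
    """Iteratively fill missing fields by re-querying the vector store.

    `extract(context)` returns a dict of fields found in the context;
    `retrieve(question)` returns additional context. Both are stand-ins
    for the LlamaIndex components used in the project.
    """
    context = retrieve(query)
    fields = {}
    for _ in range(max_rounds):
        fields.update(extract(context))
        missing = REQUIRED_FIELDS - fields.keys()
        if not missing:
            break  # all required fields present; stop early
        # Reason about what is missing and query specifically for it
        context = retrieve(f"Find the {', '.join(sorted(missing))} on this invoice")
    return fields


# Toy stand-ins: the first retrieval misses the date, the follow-up finds it.
chunks = {
    "invoice": {"vendor": "Acme GmbH", "total": "119.00"},
    "Find the invoice_date on this invoice": {"invoice_date": "2024-05-01"},
}

result = extract_with_self_correction(
    extract=lambda ctx: ctx,
    retrieve=lambda q: chunks.get(q, {}),
    query="invoice",
)
```

The real implementation delegates the "reason about what is missing" step to the ReAct agent; the loop structure is the same.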
While the stack is solid, here are 3 targeted improvements:
Current: Chatbot uses cascading fallback strategy (vector search → SQL text search when vector returns empty).
Problem: Vector search might return results but miss exact keyword matches (e.g., specific Invoice ID "INV-9928"), preventing SQL fallback from triggering.
Suggestion: Implement Parallel Hybrid Search with Reciprocal Rank Fusion (RRF).
- Execute pgvector (semantic) and PostgreSQL tsvector (keyword) searches in parallel
- Combine results using the RRF algorithm
- Exact matches get highest scores automatically
- Implementation time: ~3 days
- See detailed analysis: Query Strategy Analysis
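The fusion step itself is only a few lines. A minimal RRF sketch, assuming the two ranked ID lists come from the pgvector and tsvector queries respectively:

```python
def rrf_fuse(rankings, k=60):
    """Combine several ranked result lists with Reciprocal Rank Fusion.

    Each ranking is a list of document IDs, best first. A document's
    fused score is the sum of 1 / (k + rank) over every list it appears
    in, so items ranked well by either search rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Vector search missed the exact invoice ID; keyword search found it first.
vector_hits = ["inv-102", "inv-331", "inv-077"]
keyword_hits = ["INV-9928", "inv-102"]

fused = rrf_fuse([vector_hits, keyword_hits])
```

Note how "INV-9928" surfaces near the top of the fused list even though vector search never returned it, which is exactly the failure mode the cascade cannot catch.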
Status:
- ✅ Cascade implementation complete (MVP-ready)
- 📋 Parallel hybrid documented and architected
- 🔴 Recommended before production for financial data reliability
Current: Reliance on "vibes" or manual checking.
Problem: You don't know if a prompt change improved extraction by 1% or broke it by 5%.
Suggestion: Implement RAG evaluation (Ragas / TruLens).
- Create a "Golden Dataset" of 50 perfectly extracted invoices.
- Run your pipeline against them.
- Metrics: Answer Relevancy and Faithfulness.
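A lighter-weight starting point than a full Ragas/TruLens integration is field-level accuracy against the golden dataset. A sketch, where `run_pipeline` is a stand-in for the real extraction call:

```python
def field_accuracy(golden, run_pipeline):
    """Compare pipeline output to hand-verified invoices, field by field.

    `golden` maps an invoice file to its verified field dict;
    `run_pipeline(path)` stands in for the real extraction pipeline.
    Returns the fraction of fields extracted correctly.
    """
    correct = total = 0
    for path, expected in golden.items():
        predicted = run_pipeline(path)
        for field, value in expected.items():
            total += 1
            correct += predicted.get(field) == value
    return correct / total if total else 0.0


# Toy golden set with one invoice; the fake pipeline gets the total wrong.
golden = {"inv1.pdf": {"vendor": "Acme GmbH", "total": "119.00"}}
fake_pipeline = lambda path: {"vendor": "Acme GmbH", "total": "118.99"}

score = field_accuracy(golden, fake_pipeline)
```

Running this before and after every prompt change turns "vibes" into a number you can track in CI.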
Current: Using large API models (DeepSeek/GPT-4o).
Future: If you process millions of invoices, costs will add up.
Suggestion: Distillation / SFT (only at massive scale).
- Use GPT-4o to generate training data.
- Fine-tune a small local model (Llama-3-8B) with TRL/SFT specifically for invoice extraction.
- This is where TRL/SFT fits in: it is an optimization step, not a starting point.
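The distillation step would start by converting GPT-4o extractions into the chat-message format that TRL's `SFTTrainer` accepts for conversational datasets. A sketch with illustrative field names:

```python
import json


def to_sft_example(invoice_text, extracted_fields):
    """Turn one teacher-model extraction into a chat-format training
    example, the structure TRL's SFTTrainer consumes for SFT datasets."""
    return {
        "messages": [
            {
                "role": "user",
                "content": f"Extract vendor, date and total from:\n{invoice_text}",
            },
            {
                "role": "assistant",
                "content": json.dumps(extracted_fields),
            },
        ]
    }


example = to_sft_example(
    "Acme GmbH ... Total: 119.00 EUR",
    {"vendor": "Acme GmbH", "total": "119.00"},
)
```

A few thousand such examples, written out as JSONL, would be the input to the fine-tuning run; the training itself only becomes worthwhile at the scale described above.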
You are not missing out by excluding LangChain or TRL. You have chosen a focused, high-performance stack (LlamaIndex) that is perfectly suited for your problem domain.
Recommendation: Stay the course. Focus on Hybrid Search and Evaluation rather than adding training complexity.
- omniparser (GitHub): Universal ETL engine for tabular data (CSV, Excel, JSON, XML, etc.) with streaming parsing and a rich type system.
- TRL (GitHub): Library by Hugging Face for training and fine-tuning language models with reinforcement learning and supervised fine-tuning (SFT) techniques.
- Supervised Fine Tuning (SFT) (Docs): Train language models using supervised learning techniques for efficient fine-tuning.