An AI-native financial automation platform for processing heterogeneous invoice formats (PDF, Excel, images) into structured data. It leverages agentic AI for "Zero-Template" extraction and self-correcting validation.
| Overview | Invoice List & Bulk Actions |
|---|---|
| ![]() | ![]() |

| Invoice Detail & Extracted Data | Validation Analysis |
|---|---|
| ![]() | ![]() |

| Upload Files | Chat with Invoices |
|---|---|
| ![]() | ![]() |

| Quality Metrics | Financial Summary |
|---|---|
| ![]() | ![]() |
- Python 3.12.2+
- Docker and Docker Compose
- PostgreSQL (Automated via Docker)
```bash
# Install dependencies
pip install -e ".[dev]"

# Configure environment — create .env with:
# DATABASE_URL=postgresql+asyncpg://einvoice:einvoice_dev@localhost:${PGDB_PORT:-5432}/einvoicing
# ENCRYPTION_KEY=your-key (generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
# API_PORT=8000
# UI_PORT=8501

# Start the database
docker-compose up -d

# Run migrations
alembic upgrade head

# Start the API
python interface/api/main.py --reload

# Start the dashboard (port 8501)
streamlit run interface/dashboard/app.py
```

Run the consolidated script to process files in the `data/` directory:

```bash
$ python scripts/process_invoices.py
$ python scripts/process_invoices.py --recursive --dir data/ --force --concurrency 2
$ python scripts/process_invoices.py --dir data/jimeng --pattern "invoice-1.png" --force --background --api-url "http://127.0.0.1:8800"
```

Or via the API:

```bash
curl -X POST "http://localhost:8000/api/v1/invoices/process" \
  -H "Content-Type: application/json" \
  -d '{"file_path": "invoice-1.png"}'
```

- Dashboard: http://localhost:8501
- API Docs: http://localhost:8000/docs
| Layer | Technology |
|---|---|
| Persistence | PostgreSQL (pgvector, pgqueuer) |
| Logic | LlamaIndex, DeepSeek, Docling |
| Interface | FastAPI, Streamlit |
Documents are processed once during ingestion, with extracted data stored for later querying:
```mermaid
graph TB
    subgraph "Ingestion Sources"
        A1[PDF Files]
        A2[Excel/CSV Files]
        A3[Images: PNG/JPG/WEBP]
    end

    subgraph "Universal Ingestion Funnel"
        B[File Discovery & Hashing]
        C{File Type Router}
    end

    subgraph "Format-Specific Processing"
        D1[PDF Processor<br/>Docling/PyPDF]
        D2[Excel Processor<br/>Pandas Agent]
        D3[Image Processor<br/>PaddleOCR/Docling]
    end

    subgraph "AI Extraction Layer"
        E[LlamaIndex Agentic AI<br/>Structured Extraction]
        F[Pydantic Schema<br/>Validation Agent]
    end

    subgraph "Storage & Indexing"
        G[(PostgreSQL<br/>Invoices + ExtractedData)]
        H[(pgvector<br/>Embeddings)]
        I[(MinIO<br/>File Storage)]
    end

    A1 --> B
    A2 --> B
    A3 --> B
    B --> C
    C -->|PDF| D1
    C -->|Excel/CSV| D2
    C -->|Image| D3
    D1 --> E
    D2 --> E
    D3 --> E
    E --> F
    F -->|Valid| G
    F -->|Invalid| J[Human Review Queue]
    G --> H
    G --> I

    style E fill:#e1f5ff
    style F fill:#fff4e1
    style G fill:#e8f5e9
    style H fill:#f3e5f5
```
Key Points:
- Zero-Template Extraction: AI reads and reasons about layout variations without hardcoded templates
- Validation with Auto-Retry: Failed validations trigger alternative extraction strategies before human review
- Embeddings: Generated during ingestion for semantic search (optional, chatbot falls back to SQL if unavailable)
The chatbot queries already-processed data using a hybrid retrieval strategy:
```mermaid
graph TB
    subgraph "User Interface"
        U[User Natural Language Query]
    end

    subgraph "Session & Rate Limiting"
        S1[Session Manager<br/>Context: Last 10 Messages]
        S2[Rate Limiter<br/>20 queries/min]
    end

    subgraph "Query Processing"
        Q1[Intent Classification<br/>FIND_INVOICE / AGGREGATE / LIST]
        Q2{Query Type?}
    end

    subgraph "Hybrid Retrieval Strategy"
        R1[Vector Search RAG<br/>pgvector + sentence-transformers]
        R2[SQL Text Search FALLBACK<br/>UUID / Filename / Vendor]
        R3[SQL Aggregate DIRECT<br/>Year/Month/Vendor Filters]
    end

    subgraph "Data Retrieval"
        D[(PostgreSQL<br/>Invoices + ExtractedData)]
    end

    subgraph "Response Generation"
        L[DeepSeek Chat LLM<br/>Natural Language Response]
    end

    U --> S1
    S1 --> S2
    S2 --> Q1
    Q1 --> Q2
    Q2 -->|Semantic Query| R1
    Q2 -->|Aggregate Query| R3
    R1 -->|No Results| R2
    R1 -->|Found| D
    R2 --> D
    R3 --> D
    D --> L
    L --> U

    style Q1 fill:#fff4e1
    style R1 fill:#f3e5f5
    style R2 fill:#e8f5e9
    style R3 fill:#e8f5e9
    style L fill:#e1f5ff
```
Key Points:
- Cascading Fallback Strategy: Vector search (RAG) → SQL text search → SQL aggregates
- Intent-Based Routing: Different query types use optimal retrieval methods
- No Re-Processing: Queries only read stored data; no re-extraction happens
- Future Enhancement: True parallel hybrid search (vector + SQL with RRF) documented but not yet implemented
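A minimal sketch of the cascading fallback, with stub retrievers standing in for the pgvector and SQL layers (all names here are hypothetical; aggregate-intent queries would bypass this path and hit SQL directly):

```python
from typing import Callable


def cascading_retrieve(
    query: str,
    vector_search: Callable[[str], list],
    sql_text_search: Callable[[str], list],
) -> tuple[list, str]:
    """RAG-first retrieval: vector search, then SQL text search on miss.

    Each retriever returns a (possibly empty) list of invoice records;
    the second element of the result tags which stage answered.
    """
    hits = vector_search(query)
    if hits:                        # pgvector semantic match found
        return hits, "vector"
    hits = sql_text_search(query)   # fall back to UUID/filename/vendor match
    return hits, "sql_fallback"


# Stub retrievers for illustration only
vector = lambda q: [] if "unknown" in q else [{"id": 1, "vendor": "Acme"}]
sql = lambda q: [{"id": 2, "vendor": "Globex"}]

hits, source = cascading_retrieve("invoices from unknown vendor", vector, sql)
```

Because the vector stub misses on this query, the SQL fallback supplies the hits; a semantic query that matched would return straight from the vector stage.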
- Technical Stack & Architecture — Stack by layer, alternatives, and processing logic.
- Setup & Scaffold — Step-by-step implementation guide.
- Dashboard Improvements — Analytics, export, filters, and bulk actions.
- Dataset Upload UI — Web upload and processing flow.
- Invoice Chatbot — RAG-backed chat over invoice data.
- Duplicate Processing Logic — Hashing and versioning.
- Resilient Configuration — Module pluggability and runtime configuration APIs.
- Docs Index — Full documentation index and RAG stack analysis.