This document details the workflow and technology stack for processing PDF invoices within the application.
The PDF processing pipeline follows a multi-stage approach to ensure high-quality data extraction even from complex or image-based PDFs.
```mermaid
graph TD
    A[File Discovery] --> B{File Type?}
    B -- PDF --> C[PDF Processor]
    C --> D{Docling Available?}
    D -- Yes --> E[Docling Extraction]
    D -- No --> F[PyPDF Fallback]
    E --> G{Insufficient Text?}
    G -- Yes --> H[OCR Fallback]
    G -- No --> I[Raw Text]
    F --> I
    H --> I
    I --> J[LLM Data Extraction]
    J --> K[Validation Framework]
    K --> L{Critical Failure?}
    L -- Yes --> M[Self-Correction Refinement]
    L -- No --> N[Data Persistence]
    M --> N
```
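The decision flow in the diagram can be sketched as plain Python. The function below wires the stages together; every name here (`process_pdf`, the stage callables, the error-list return from `validate`) is illustrative, not the project's actual API:

```python
OCR_THRESHOLD = 100  # fewer extracted characters than this suggests a scanned image


def process_pdf(extract_docling, extract_pypdf, extract_ocr,
                llm_extract, validate, refine_extraction):
    """Run one PDF through extraction, validation, and optional refinement.

    Each pipeline stage is passed in as a callable so the control flow
    itself stays visible; this mirrors the diagram, not the real code.
    """
    # Stage 1: raw text extraction with layered fallbacks.
    text = extract_docling() if extract_docling else extract_pypdf()
    if len(text.strip()) < OCR_THRESHOLD:
        text = extract_ocr()                      # OCR fallback for scans

    # Stage 2: the LLM maps raw text to a structured record.
    data = llm_extract(text)

    # Stage 3: validation, with one self-correction pass on critical failure.
    errors = validate(data)
    if errors:
        data = refine_extraction(text, data, errors)
    return data


# Toy run with stub stages to show the wiring:
result = process_pdf(
    extract_docling=lambda: "",                   # pretend Docling found no text
    extract_pypdf=None,
    extract_ocr=lambda: "Invoice #42 total 10.00",
    llm_extract=lambda text: {"total": 10.00, "source": text},
    validate=lambda data: [],                     # no validation errors
    refine_extraction=lambda text, data, errors: data,
)
print(result["total"])  # -> 10.0
```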
This layer is responsible for handling file types and raw text extraction.
- `pdf_processor.py`:
  - Primary (Docling): Uses IBM's `docling` to convert PDFs to Markdown, preserving layout and tables.
  - Fallback (PyPDF): Slower/simpler extraction if Docling is unavailable.
  - OCR Support: If text extraction yields fewer than 100 characters (indicating a scanned image), it triggers an OCR fallback.
- `orchestrator.py`: Coordinates the sequential flow from file hashing to database storage.
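The orchestrator's "file hashing" step can be sketched with the standard library; a content hash lets the pipeline skip files it has already processed. The function name and chunk size are assumptions, not the module's real interface:

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 16) -> str:
    """Content hash of a file, read in chunks so large PDFs don't load into RAM.

    Hypothetical helper: the orchestrator could compare this digest against
    stored Invoice records to make processing idempotent.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the hash covers the file's bytes rather than its name, a re-uploaded invoice dedupes correctly even if the filename changes.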
This layer handles the intelligence: extraction and validation.
- `extractor.py`:
  - Uses OpenAI/DeepSeek LLMs to map raw text to structured JSON.
  - Implements `refine_extraction` for self-correction when validation fails.
- `schemas.py`: Defines `ExtractedDataSchema` using Pydantic, including per-field confidence scores.
- `validator.py`: Runs math checks (Subtotal + Tax = Total), date consistency, and vendor sanity rules.
These modules manage state, models, and storage.
- `models.py`: SQLAlchemy models for `Invoice`, `ExtractedData`, and `ValidationResult`.
- Storage:
  - Original files stored in `data/pdf/` or `data/uploads/`.
  - Encrypted copies (if enabled) in `data/encrypted/`.
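The relationships among the three models can be sketched with plain dataclasses standing in for the SQLAlchemy classes. The field names and the one-invoice-to-one-extraction shape are assumptions for illustration; the real schema in `models.py` may differ:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ValidationResult:
    """One validation rule's outcome for an extraction."""
    rule: str
    passed: bool
    message: str = ""


@dataclass
class ExtractedData:
    """Structured fields pulled from the raw text, with per-field confidence."""
    total: float
    confidence: dict = field(default_factory=dict)   # e.g. {"total": 0.98}
    validations: list = field(default_factory=list)  # list[ValidationResult]


@dataclass
class Invoice:
    """One source document, keyed by content hash."""
    file_hash: str                                   # hash from the orchestrator
    source_path: str                                 # e.g. under data/pdf/
    extracted: Optional[ExtractedData] = None
```

Keying `Invoice` on the content hash (rather than the path) is what makes re-uploads of the same file idempotent at the persistence layer.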
| Component | Technology | Description |
|---|---|---|
| PDF Conversion | Docling | Advanced document layout analysis and markdown export. |
| OCR Fallback | PaddleOCR | High-accuracy OCR for scanned documents. |
| Extraction | LlamaIndex / OpenAI | Agentic RAG and structured data extraction. |
| LLM | DeepSeek-V3 / GPT-4o | Specialized models for high-accuracy extraction. |
| Validation | Pydantic v2 | Strict schema validation and data normalization. |
| Database | SQLAlchemy 2.0 | Modern ORM with PostgreSQL backend. |
> [!IMPORTANT]
> Layout Preservation: Docling ensures that tables and headers are represented as Markdown, which significantly improves LLM extraction accuracy compared to raw text.
> [!TIP]
> Self-Correction: If mathematical validations fail (e.g., tax doesn't add up), the system automatically performs a "Refinement" pass with the LLM, feeding the error message back as a prompt hint.
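The refinement pass amounts to re-prompting the model with the validation error attached. A minimal sketch, assuming a generic `call_llm` callable and an invented prompt template (neither is the project's real client or prompt):

```python
# Hypothetical prompt template: the validation error and the previous output
# are fed back so the model can correct itself in one pass.
REFINE_TEMPLATE = (
    "The previous extraction failed validation:\n{error}\n\n"
    "Previous JSON:\n{previous}\n\n"
    "Re-extract the fields from the document below, fixing the issue:\n{text}"
)


def refine_extraction(call_llm, text, previous_json, error):
    """One self-correction pass: rebuild the prompt with the error as a hint."""
    prompt = REFINE_TEMPLATE.format(error=error, previous=previous_json, text=text)
    return call_llm(prompt)
```

Feeding back the specific failing rule (e.g. the math-check message) rather than a generic "try again" gives the model a concrete target to fix, which is the point of the refinement pass described above.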