This document describes the workflow, tech stack, and internal mechanics of the OCR (Optical Character Recognition) implementation within the AI-eInvoicing application.
The OCR system handles several image formats (JPG, PNG, WebP, AVIF) and extracts structured invoice data using a combination of local OCR engines and Large Language Models (LLMs).
graph TD
A[File Upload / Discovery] --> B[Ingestion Orchestrator]
B --> C{File Type?}
C -- Image --> D[Image Processor]
C -- PDF --> E[PDF Processor]
D --> F[PaddleOCR Engine]
F --> G[Extract Raw Text]
G --> H[Brain: Data Extractor]
H --> I[DeepSeek/OpenAI LLM]
I --> J[Structured JSON]
J --> K[Validation Framework]
K --> L[Database Storage]
K -- Fail --> M[Self-Correction Refinement]
M --> H
- OCR Engine: PaddleOCR (CPU-optimized mode).
- Extraction: DeepSeek-Chat (primary) or OpenAI GPT-4.
- Framework: FastAPI (Interface), SQLAlchemy (Core), LlamaIndex/Direct OpenAI (Brain).
- Storage: PostgreSQL with JSONB for flexible schema storage.
- image_processor.py:
  - Lazy initialization of `PaddleOCR` to save memory.
  - Resource monitoring: Checks available RAM before processing.
  - Image Pre-processing: Resizes large images (max 1500px) to prevent OOM (Out Of Memory) errors.
  - Thread-pool execution: OCR runs in a background thread to prevent blocking the async event loop.
- orchestrator.py:
  - Coordinates the flow from file hashing to database commits.
  - Implements retry logic for OCR timeouts.
- extractor.py:
  - Uses LLMs (DeepSeek/OpenAI) to map raw OCR text to a structured Pydantic schema (`ExtractedDataSchema`).
  - Implements "Self-Correction": If validation fails, it re-prompts the LLM with error feedback to refine the results.
- validator.py:
  - Mathematical consistency checks (Subtotal + Tax = Total).
  - Line item sum verification.
  - Vendor sanity checks (especially for Chinese '增值税' invoices).
- models.py: Defines the `Invoice`, `ExtractedData`, and `ValidationResult` tables.
- config.py: Manages API keys and model settings (DeepSeek vs OpenAI).
- api/routes/uploads.py: Handles multipart file uploads and triggers background processing tasks.
- dashboard/: Streamlit-based UI for visualizing extraction quality and confidence scores.
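
The consistency checks listed for validator.py above (subtotal plus tax equals total, line item sum, vendor presence) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the dataclasses stand in for the real `ExtractedDataSchema`, and the field names, the 0.01 tolerance, and the assumption that line items sum to the pre-tax subtotal are all hypothetical.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class LineItem:
    # Hypothetical shape; the real schema may differ.
    description: str
    amount: Decimal


@dataclass
class ExtractedInvoice:
    # Hypothetical stand-in for ExtractedDataSchema.
    vendor_name: str
    subtotal: Decimal
    tax: Decimal
    total: Decimal
    line_items: List[LineItem] = field(default_factory=list)


TOLERANCE = Decimal("0.01")  # assumed rounding tolerance


def validate_invoice(inv: ExtractedInvoice) -> List[str]:
    """Return human-readable validation errors (empty list if all checks pass)."""
    errors: List[str] = []

    # Vendor sanity check: the vendor (销售方) name must be present.
    if not inv.vendor_name.strip():
        errors.append("vendor_name is missing")

    # Mathematical consistency: subtotal + tax must equal total.
    if abs(inv.subtotal + inv.tax - inv.total) > TOLERANCE:
        errors.append(
            f"subtotal + tax = {inv.subtotal + inv.tax}, but total = {inv.total}"
        )

    # Line item sum verification (assumes items are pre-tax amounts).
    if inv.line_items:
        items_sum = sum((item.amount for item in inv.line_items), Decimal("0"))
        if abs(items_sum - inv.subtotal) > TOLERANCE:
            errors.append(
                f"line items sum to {items_sum}, but subtotal = {inv.subtotal}"
            )

    return errors
```

Returning a list of messages rather than raising on the first failure means every failed check can be reported in one pass, which suits a re-prompting step that needs the full error picture.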
To ensure stability in resource-constrained environments:
- Memory Guards: The system aborts OCR if free memory is below 300MB.
- Downscaling: Images larger than 2MB or 1500px are automatically resized before being fed to PaddleOCR.
- Async Isolation: Blocking CPU-bound OCR work is offloaded to `ThreadPoolExecutor`.
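
The three guards above might fit together roughly as in the sketch below. It uses only the standard library, so the free-memory probe (`free_mb_fn`) and the OCR call (`ocr_fn`) are passed in as hypothetical callables rather than tied to a specific library. Reading "2MB" as 2 × 1024 × 1024 bytes is also an assumption.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

MIN_FREE_MB = 300                   # memory guard threshold from the text
MAX_SIDE_PX = 1500                  # maximum image side from the text
MAX_FILE_BYTES = 2 * 1024 * 1024    # "2MB", interpreted as MiB (assumption)

# Single worker so only one OCR job holds memory at a time (assumed policy).
_executor = ThreadPoolExecutor(max_workers=1)


def needs_downscale(file_bytes: int, width: int, height: int) -> bool:
    """True if the image exceeds either the file-size or the pixel limit."""
    return file_bytes > MAX_FILE_BYTES or max(width, height) > MAX_SIDE_PX


def target_size(width: int, height: int, max_side: int = MAX_SIDE_PX) -> Tuple[int, int]:
    """Scale dimensions so the longer side is at most max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


async def run_ocr_guarded(
    image_path: str,
    ocr_fn: Callable[[str], str],
    free_mb_fn: Callable[[], float],
) -> str:
    """Abort if memory is low; otherwise run the blocking OCR call off the event loop."""
    free_mb = free_mb_fn()
    if free_mb < MIN_FREE_MB:
        raise MemoryError(f"only {free_mb:.0f} MB free, need {MIN_FREE_MB} MB")
    loop = asyncio.get_running_loop()
    # run_in_executor keeps the event loop responsive while OCR runs in a thread.
    return await loop.run_in_executor(_executor, ocr_fn, image_path)
```

In a real deployment `free_mb_fn` would wrap whatever memory probe the project uses, and `ocr_fn` would wrap the lazily initialized PaddleOCR instance.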
- Confidence Scoring: Each extraction returns a confidence level (calculated by the LLM and PaddleOCR).
- Validation Rules: Automated checks catch mathematical errors common in OCR "hallucinations".
- Refinement Loop: Failed validations trigger a secondary "Refine" stage where the LLM is given the error trace to fix specific fields.
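
The refinement loop described above can be pictured as a bounded retry in which validation errors are fed back into the next extraction attempt. The sketch below is illustrative only: `extract_fn` and `validate_fn` are hypothetical callables standing in for the LLM call and the validator, and the limit of three attempts is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

# extract_fn(raw_text, feedback) -> extracted fields; feedback is None on the first try.
ExtractFn = Callable[[str, Optional[List[str]]], Dict]
# validate_fn(data) -> list of error messages (empty when valid).
ValidateFn = Callable[[Dict], List[str]]


def extract_with_refinement(
    raw_text: str,
    extract_fn: ExtractFn,
    validate_fn: ValidateFn,
    max_attempts: int = 3,  # assumed cap to avoid unbounded LLM calls
) -> Tuple[Dict, List[str]]:
    """Extract, validate, and re-prompt with the errors until valid or out of attempts."""
    feedback: Optional[List[str]] = None
    data: Dict = {}
    errors: List[str] = []
    for _ in range(max_attempts):
        data = extract_fn(raw_text, feedback)
        errors = validate_fn(data)
        if not errors:
            break
        # Pass the validation errors to the next attempt as corrective feedback.
        feedback = errors
    return data, errors
```

The function returns the last attempt together with any remaining errors, so the caller can decide whether to store the result, flag it for review, or mark it as failed.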
- Upload: User sends `9.png` via API.
- Discovery: `orchestrator` detects image type and calculates SHA-256.
- OCR: `image_processor` calls PaddleOCR (with `lang="ch"`).
- Extraction: `extractor` sends raw text + filename hints to DeepSeek.
- Validation: `validator` checks if the 销售方 (Vendor) name exists and if math adds up.
- Persistence: Results stored in `extracted_data` table; status set to `COMPLETED`.
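
As a rough illustration of the discovery step in this workflow, the sketch below hashes a file with SHA-256 and routes it by extension. The function names, the extension set, and the use of the file suffix (rather than content sniffing) are assumptions for illustration; the orchestrator's actual detection logic is not shown in this document.

```python
import hashlib
from pathlib import Path

# Assumed mapping of the supported image formats to file extensions.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".avif"}


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large uploads are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_file(path: str) -> str:
    """Route a file to the image or PDF processor based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix == ".pdf":
        return "pdf"
    return "unsupported"
```

A content hash of this kind is commonly used as a deduplication key, so that re-uploading the same file does not trigger a second OCR and LLM pass.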