Branch: 1-e-invoice-scaffold
Date: 2024-12-19
Status: ✅ Complete
This document describes the complete implementation of the e-invoice scaffold application. The scaffold provides a foundational structure for processing heterogeneous invoice formats (PDF, Excel, Images) into structured, validated data using an AI-native approach.
- ✅ Project structure with modular architecture (core, ingestion, brain, interface layers)
- ✅ PostgreSQL database with pgvector extension (via Docker Compose)
- ✅ Async SQLAlchemy 2.0 ORM with database models
- ✅ Alembic database migrations
- ✅ File-level encryption for sensitive data at rest
- ✅ Structured logging with sensitive data filtering
- ✅ SHA-256 file hashing for duplicate detection and versioning
- ✅ File discovery and type detection (PDF, Excel, CSV, Images)
- ✅ PDF processing with pypdf (Docling integration placeholder)
- ✅ Excel/CSV processing with Pandas
- ✅ Image processing placeholder (for future OCR integration)
- ✅ Processing orchestrator with error handling and encryption
- ✅ Pydantic schemas for structured invoice data
- ✅ Regex-based data extraction (vendor, invoice number, dates, amounts)
- ✅ Validation framework with extensible rules
- ✅ Mathematical validation (subtotal + tax = total)
- ✅ FastAPI application with async endpoints
- ✅ Health check endpoint
- ✅ Invoice processing endpoint (`POST /api/v1/invoices/process`)
- ✅ Invoice listing endpoint (`GET /api/v1/invoices`)
- ✅ Invoice detail endpoint (`GET /api/v1/invoices/{invoice_id}`)
- ✅ Structured JSON response envelope with pagination
- ✅ Streamlit dashboard for reviewing processed invoices
- ✅ Status filtering and invoice listing
- ✅ Invoice detail view with validation results
```
┌─────────────────────────────────────────────────────────┐
│                   INTERACTION LAYER                     │
│    FastAPI (REST API) + Streamlit (Review Dashboard)    │
└─────────────────────────────────────────────────────────┘
                            ↕
┌─────────────────────────────────────────────────────────┐
│                      BRAIN LAYER                        │
│         Data Extraction + Validation Framework          │
└─────────────────────────────────────────────────────────┘
                            ↕
┌─────────────────────────────────────────────────────────┐
│                     SENSORY LAYER                       │
│      File Discovery + PDF/Excel/Image Processing        │
└─────────────────────────────────────────────────────────┘
                            ↕
┌─────────────────────────────────────────────────────────┐
│                    INFRASTRUCTURE                       │
│      PostgreSQL + pgvector + Encryption + Logging       │
└─────────────────────────────────────────────────────────┘
```
```
ai-einvoicing/
├── core/                     # Infrastructure layer
│   ├── database.py           # Async SQLAlchemy engine & session management
│   ├── models.py             # ORM models (Invoice, ExtractedData, ValidationResult, ProcessingJob)
│   ├── encryption.py         # File encryption/decryption utilities
│   └── logging.py            # Structured logging configuration
│
├── ingestion/                # Sensory layer
│   ├── file_discovery.py     # Discover supported files
│   ├── file_hasher.py        # SHA-256 hash calculation
│   ├── pdf_processor.py      # PDF text extraction
│   ├── excel_processor.py    # Excel/CSV processing
│   ├── image_processor.py    # Image processing (placeholder)
│   └── orchestrator.py       # Main processing pipeline
│
├── brain/                    # Brain layer
│   ├── schemas.py            # Pydantic models for invoice data
│   ├── extractor.py          # Data extraction from raw text
│   └── validator.py          # Validation rules framework
│
├── interface/                # Interaction layer
│   ├── api/
│   │   ├── main.py           # FastAPI application entry point
│   │   ├── routes/
│   │   │   ├── health.py     # Health check endpoint
│   │   │   └── invoices.py   # Invoice processing endpoints
│   │   └── schemas.py        # API request/response models
│   └── dashboard/
│       ├── app.py            # Streamlit dashboard
│       └── queries.py        # Database queries for dashboard
│
├── alembic/                  # Database migrations
│   ├── env.py                # Alembic environment configuration
│   └── versions/
│       └── 001_initial_schema.py  # Initial database schema
│
├── data/                     # Local storage for sample invoices
├── docker-compose.yml        # PostgreSQL with pgvector
├── pyproject.toml            # Project dependencies and metadata
└── alembic.ini               # Alembic configuration
```
- `invoices`
  - Stores invoice document metadata
  - Tracks processing status, file hash, and version
  - Supports duplicate detection via file hash
  - Includes `storage_path`, `category`, `group`, and `job_id`
- `extracted_data`
  - Stores structured invoice data extracted from documents
  - One-to-one relationship with invoices
  - Includes vendor, dates, amounts, and line items
- `validation_results`
  - Stores validation rule results
  - One-to-many relationship with invoices
  - Tracks passed/failed/warning status
- `processing_jobs`
  - Tracks processing job execution
  - One-to-many relationship with invoices
  - Records execution type (`async_coroutine` vs `cpu_process`)
- Python 3.12.2
- Docker and Docker Compose
- Conda (optional, for environment management)
```bash
# Using pip (recommended)
pip install -e ".[dev]"

# Or using conda
conda create -n ai-einvoicing-env python=3.12
conda activate ai-einvoicing-env
pip install -e ".[dev]"
```

Create a `.env` file in the project root:
```bash
# Database Configuration
DATABASE_URL=postgresql+asyncpg://einvoice:einvoice_dev@localhost:${PGDB_PORT:-5432}/einvoicing

# Encryption Key (generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
ENCRYPTION_KEY=your-generated-encryption-key-here

# Logging Configuration
LOG_LEVEL=INFO
LOG_FORMAT=json

# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
```

Generate the encryption key:

```bash
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```

Start the database:

```bash
docker-compose up -d
```

Verify the database is running:

```bash
docker ps --filter "name=ai-einvoicing-db"
```

Run the migrations:

```bash
alembic upgrade head
```

Verify the tables were created:

```bash
docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\dt"
```

Start the FastAPI API:

```bash
python interface/api/main.py --reload
```

The API will be available at http://localhost:${API_PORT}:

- API Documentation: http://localhost:${API_PORT}/docs
- Health Check: http://localhost:${API_PORT}/health

Start the Streamlit dashboard:

```bash
streamlit run interface/dashboard/app.py
```

The dashboard will be available at http://localhost:${UI_PORT:-8501}.
Error:

```
sqlalchemy.exc.InvalidRequestError: Attribute name 'metadata' is reserved when using the Declarative API.
```

Root Cause: The `ProcessingJob` model had a field named `metadata`, which conflicts with SQLAlchemy's reserved `metadata` attribute on declarative models.

Fix: Renamed `metadata` to `job_metadata` in:

- `core/models.py` (line 230)
- `alembic/versions/001_initial_schema.py` (migration)
Error:

```
sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:driver
```

Root Cause: `alembic.ini` had a placeholder database URL (`driver://user:pass@localhost/dbname`).

Fix: Updated `alembic.ini` with the correct database URL:

```ini
sqlalchemy.url = postgresql+asyncpg://einvoice:einvoice_dev@localhost:${PGDB_PORT:-5432}/einvoicing
```

Error:

```
asyncpg.exceptions.FeatureNotSupportedError: extension "pgqueuer" is not available
```

Root Cause: The migration tried to create the pgqueuer extension, which is not installed in the PostgreSQL image.

Fix: Commented out the pgqueuer extension creation in `alembic/versions/001_initial_schema.py`. It can be enabled later once the extension is installed.
Error:

```
error: Multiple top-level packages discovered in a flat-layout: ['core', 'data', 'specs', 'brain', 'alembic', 'interface', 'ingestion']
```

Root Cause: setuptools couldn't automatically discover which directories were Python packages.

Fix: Added explicit package configuration in `pyproject.toml`:

```toml
[tool.setuptools]
packages = ["core", "ingestion", "brain", "interface", "interface.api", "interface.api.routes", "interface.dashboard"]
```

Warning:

```
SetuptoolsDeprecationWarning: project.license as a TOML table is deprecated
```

Fix: Changed the license format in `pyproject.toml` from:

```toml
license = {text = "MIT"}
```

to:

```toml
license = "MIT"
```

Error:
```
asyncpg.exceptions.InterfaceError: cannot perform operation: another operation is in progress
TCPTransport closed=True reading=False
```

Root Cause: Streamlit's synchronous execution model conflicted with async database operations. Multiple `asyncio.run()` calls created conflicting event loops, and database sessions weren't being properly closed, leaving connections open.

Fix: Implemented proper session lifecycle management:

- Added explicit session creation, commit, rollback, and close in try/finally blocks
- Configured connection pooling with `pool_pre_ping=True` and `pool_recycle=3600`
- Created a new event loop for each Streamlit request and properly closed it
- Ensured sessions are always closed even if errors occur

Files Changed:

- `interface/dashboard/queries.py`: Added proper session lifecycle management
- `interface/dashboard/app.py`: Fixed event loop handling for Streamlit's sync context
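The event-loop part of this fix can be sketched as a small helper (the function name is illustrative, not the actual code in `app.py`):

```python
import asyncio
from collections.abc import Coroutine
from typing import Any

def run_async(coro: Coroutine[Any, Any, Any]) -> Any:
    """Bridge Streamlit's synchronous context to async code: give each
    call a fresh event loop, and always close it afterwards so pooled
    DB connections aren't left attached to a dead loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()
```

Each Streamlit rerun then calls `run_async(fetch_invoices(...))` instead of `asyncio.run`, avoiding the "another operation is in progress" conflict between overlapping loops.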
The application uses async PostgreSQL connections via asyncpg:
- Host: `localhost` (when connecting from the host machine)
- Port: `5432`
- Database: `einvoicing`
- User: `einvoice`
- Password: `einvoice_dev`

Connection string format:

```
postgresql+asyncpg://USER:PASSWORD@HOST:PORT/DATABASE
```
Supported File Types:
- PDF (`.pdf`)
- Excel (`.xlsx`, `.xls`)
- CSV (`.csv`)
- Images (`.jpg`, `.jpeg`, `.png`) - placeholder implementation
File Storage:
- Files are encrypted at rest using Fernet symmetric encryption
- Original storage path stored in database
- Encrypted file path stored separately
- I/O Operations: Async/await coroutines (file reading, database, API calls)
- CPU-Intensive Tasks: Separate processes (OCR, image processing, AI inference)
- Rationale: Maximizes I/O throughput while preventing GIL blocking
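A hedged sketch of this routing (names are illustrative; a thread pool stands in for the production process pool so the example is self-contained):

```python
import asyncio

def cpu_heavy_step(path: str) -> str:
    """Stand-in for CPU-bound work (OCR, AI inference). In production
    this would run in a ProcessPoolExecutor to sidestep the GIL; a
    thread pool is used below purely to keep the sketch runnable."""
    return f"text from {path}"

async def process_invoice(path: str) -> str:
    # I/O-bound steps (file reads, DB queries, API calls) stay on the
    # event loop as awaited coroutines.
    await asyncio.sleep(0)  # stand-in for an awaited I/O call
    # CPU-bound steps are handed off to an executor so they don't
    # block the event loop.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, cpu_heavy_step, path)

async def process_batch(paths: list[str]) -> list[str]:
    # Invoices are processed concurrently; throughput is bounded by
    # I/O, not by Python-level serialization.
    return await asyncio.gather(*(process_invoice(p) for p in paths))
```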
core/database.py
- Async SQLAlchemy engine and session management
- Database initialization and cleanup
- Session dependency for FastAPI
core/models.py
- SQLAlchemy ORM models:
  - `Invoice`: Document metadata and status
  - `ExtractedData`: Structured invoice data
  - `ValidationResult`: Validation rule results
  - `ProcessingJob`: Job tracking and execution
core/encryption.py
- File encryption/decryption utilities
- Uses `cryptography.fernet` for symmetric encryption
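The at-rest encryption can be sketched with `cryptography.fernet` roughly like this (the function names are illustrative, not the module's actual API):

```python
from pathlib import Path
from cryptography.fernet import Fernet

def encrypt_file(key: bytes, src: Path, dst: Path) -> None:
    """Encrypt a file at rest; the Fernet token bundles IV and HMAC."""
    dst.write_bytes(Fernet(key).encrypt(src.read_bytes()))

def decrypt_file(key: bytes, src: Path) -> bytes:
    """Recover the plaintext of a previously encrypted file."""
    return Fernet(key).decrypt(src.read_bytes())
```

The key is the `ENCRYPTION_KEY` value from `.env`, produced by `Fernet.generate_key()`.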
core/logging.py
- Structured logging configuration with `structlog`
- Sensitive data filtering (invoice numbers, amounts, etc.)
ingestion/orchestrator.py
- Main processing pipeline coordinator
- Handles file discovery, hashing, encryption, processing, extraction, validation
- Error handling and job tracking
ingestion/file_hasher.py
- SHA-256 hash calculation for file identity
- Enables duplicate detection and versioning
ingestion/pdf_processor.py
- PDF text extraction using `pypdf`
- Placeholder for Docling integration
ingestion/excel_processor.py
- Excel/CSV processing using Pandas
- Converts to text representation for extraction
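The actual processor uses Pandas; the text-flattening idea it implements can be sketched with the stdlib `csv` module (function name illustrative):

```python
import csv
import io

def csv_to_text(raw: str) -> str:
    """Flatten tabular rows into pipe-delimited lines of plain text,
    so the downstream extractor can treat every input format
    uniformly as raw text."""
    rows = csv.reader(io.StringIO(raw))
    return "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)
```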
brain/extractor.py
- Regex-based data extraction from raw text
- Extracts vendor, invoice number, dates, amounts
- Returns a structured `ExtractedDataSchema`
brain/validator.py
- Validation framework with extensible rules
- Implements mathematical checks (subtotal + tax = total)
- Returns validation results with status (passed/failed/warning)
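The mathematical check can be sketched like this (the function name and tolerance are illustrative):

```python
from decimal import Decimal

def check_totals(subtotal: Decimal, tax: Decimal, total: Decimal,
                 tolerance: Decimal = Decimal("0.01")) -> str:
    """Mathematical validation rule: subtotal + tax must equal total.
    Decimal avoids binary-float drift; the tolerance absorbs rounding
    differences in the source document."""
    difference = abs((subtotal + tax) - total)
    return "passed" if difference <= tolerance else "failed"
```

Keeping each rule as a small function returning a status string is what makes the framework extensible: new rules slot in without touching the orchestrator.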
interface/api/main.py
- FastAPI application entry point
- Lifespan management (database init/cleanup)
- CORS configuration
- Route registration
interface/api/routes/invoices.py
- Invoice processing endpoints
- List, retrieve, and process invoices
- Structured JSON responses with pagination
interface/dashboard/app.py
- Streamlit dashboard for reviewing invoices
- Status filtering and invoice detail views
- Proper async event loop management for Streamlit's synchronous context
interface/dashboard/queries.py
- Database query utilities for dashboard
- Proper async session lifecycle management
- Connection pooling configuration
GET /health
Returns API health status.
POST /api/v1/invoices/process
Processes an invoice file from a local path.
Request Body:

```json
{
  "file_path": "data/invoice-1.png",
  "category": "Invoice",
  "group": "manual",
  "force_reprocess": false
}
```

Response:
```json
{
  "status": "success",
  "timestamp": "2024-12-19T12:00:00Z",
  "data": {
    "invoice_id": "uuid",
    "processing_status": "processing"
  }
}
```

GET /api/v1/invoices?status=completed&page=1&page_size=20
Returns paginated list of invoices with optional status filter.
GET /api/v1/invoices/{invoice_id}
Returns detailed invoice information including extracted data and validation results.
```bash
# Check tables exist
docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\dt"

# Check table structure
docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\d invoices"
```

```bash
# Health check
curl http://localhost:${API_PORT}/health

# List invoices
curl http://localhost:${API_PORT}/api/v1/invoices
```

- Place a sample invoice file in the `data/` directory
- Process it via the API:

  ```bash
  curl -X POST http://localhost:${API_PORT}/api/v1/invoices/process \
    -H "Content-Type: application/json" \
    -d '{"file_path": "data/sample_invoice.pdf"}'
  ```

- Check the results in the Streamlit dashboard or via the API
- Docling Integration: Replace pypdf with Docling for better PDF extraction
- OCR Integration: Implement DeepSeek-OCR or similar for image processing
- pgqueuer Setup: Install and configure pgqueuer extension for job queue management
- LlamaIndex Integration: Add RAG capabilities for intelligent extraction
- Enhanced Validation: Add more validation rules (date ranges, vendor matching, etc.)
- Batch Processing: Process multiple files in parallel
- Webhook Support: Notify external systems on processing completion
- Advanced Encryption: Add encryption in transit for production
- User Authentication: Add FastAPI-Users for multi-user support
- Vector Search: Use pgvector for semantic search of invoice data
Key dependencies from pyproject.toml:
- FastAPI: Async web framework
- SQLAlchemy 2.0: Async ORM
- Pydantic v2: Data validation
- Alembic: Database migrations
- asyncpg: PostgreSQL async driver
- pypdf: PDF text extraction
- pandas: Excel/CSV processing
- cryptography: File encryption
- structlog: Structured logging
- streamlit: Dashboard UI
- uvicorn: ASGI server
The e-invoice scaffold implementation provides a solid foundation for processing heterogeneous invoice formats. All 67 implementation tasks have been completed, including:
- ✅ Complete project structure
- ✅ Database schema and migrations
- ✅ File processing pipeline
- ✅ Data extraction and validation
- ✅ REST API endpoints
- ✅ Review dashboard
- ✅ Error handling and logging
- ✅ File encryption
The scaffold is ready for further development and enhancement with AI capabilities, advanced validation, and production-ready features.