Skip to content

Latest commit

 

History

History
572 lines (443 loc) · 18.3 KB

File metadata and controls

572 lines (443 loc) · 18.3 KB

E-Invoice Scaffold Implementation Documentation

Branch: 1-e-invoice-scaffold
Date: 2024-12-19
Status: ✅ Complete

Overview

This document describes the complete implementation of the e-invoice scaffold application. The scaffold provides a foundational structure for processing heterogeneous invoice formats (PDF, Excel, Images) into structured, validated data using an AI-native approach.

What Was Implemented

Core Infrastructure

  • ✅ Project structure with modular architecture (core, ingestion, brain, interface layers)
  • ✅ PostgreSQL database with pgvector extension (via Docker Compose)
  • ✅ Async SQLAlchemy 2.0 ORM with database models
  • ✅ Alembic database migrations
  • ✅ File-level encryption for sensitive data at rest
  • ✅ Structured logging with sensitive data filtering
  • ✅ SHA-256 file hashing for duplicate detection and versioning

File Processing Pipeline

  • ✅ File discovery and type detection (PDF, Excel, CSV, Images)
  • ✅ PDF processing with pypdf (Docling integration placeholder)
  • ✅ Excel/CSV processing with Pandas
  • ✅ Image processing placeholder (for future OCR integration)
  • ✅ Processing orchestrator with error handling and encryption

Data Extraction & Validation

  • ✅ Pydantic schemas for structured invoice data
  • ✅ Regex-based data extraction (vendor, invoice number, dates, amounts)
  • ✅ Validation framework with extensible rules
  • ✅ Mathematical validation (subtotal + tax = total)

API Layer

  • ✅ FastAPI application with async endpoints
  • ✅ Health check endpoint
  • ✅ Invoice processing endpoint (POST /api/v1/invoices/process)
  • ✅ Invoice listing endpoint (GET /api/v1/invoices)
  • ✅ Invoice detail endpoint (GET /api/v1/invoices/{invoice_id})
  • ✅ Structured JSON response envelope with pagination

Review Dashboard

  • ✅ Streamlit dashboard for reviewing processed invoices
  • ✅ Status filtering and invoice listing
  • ✅ Invoice detail view with validation results

Architecture

Three-Layer Architecture

┌─────────────────────────────────────────────────────────┐
│              INTERACTION LAYER                           │
│  FastAPI (REST API) + Streamlit (Review Dashboard)     │
└─────────────────────────────────────────────────────────┘
                         ↕
┌─────────────────────────────────────────────────────────┐
│                  BRAIN LAYER                             │
│  Data Extraction + Validation Framework                  │
└─────────────────────────────────────────────────────────┘
                         ↕
┌─────────────────────────────────────────────────────────┐
│              SENSORY LAYER                               │
│  File Discovery + PDF/Excel/Image Processing            │
└─────────────────────────────────────────────────────────┘
                         ↕
┌─────────────────────────────────────────────────────────┐
│              INFRASTRUCTURE                              │
│  PostgreSQL + pgvector + Encryption + Logging           │
└─────────────────────────────────────────────────────────┘

Project Structure

ai-einvoicing/
├── core/                    # Infrastructure layer
│   ├── database.py         # Async SQLAlchemy engine & session management
│   ├── models.py           # ORM models (Invoice, ExtractedData, ValidationResult, ProcessingJob)
│   ├── encryption.py       # File encryption/decryption utilities
│   └── logging.py          # Structured logging configuration
│
├── ingestion/              # Sensory layer
│   ├── file_discovery.py   # Discover supported files
│   ├── file_hasher.py      # SHA-256 hash calculation
│   ├── pdf_processor.py    # PDF text extraction
│   ├── excel_processor.py # Excel/CSV processing
│   ├── image_processor.py # Image processing (placeholder)
│   └── orchestrator.py    # Main processing pipeline
│
├── brain/                  # Brain layer
│   ├── schemas.py          # Pydantic models for invoice data
│   ├── extractor.py        # Data extraction from raw text
│   └── validator.py        # Validation rules framework
│
├── interface/              # Interaction layer
│   ├── api/
│   │   ├── main.py         # FastAPI application entry point
│   │   ├── routes/
│   │   │   ├── health.py   # Health check endpoint
│   │   │   └── invoices.py # Invoice processing endpoints
│   │   └── schemas.py      # API request/response models
│   └── dashboard/
│       ├── app.py          # Streamlit dashboard
│       └── queries.py      # Database queries for dashboard
│
├── alembic/                # Database migrations
│   ├── env.py             # Alembic environment configuration
│   └── versions/
│       └── 001_initial_schema.py  # Initial database schema
│
├── data/                   # Local storage for sample invoices
├── docker-compose.yml      # PostgreSQL with pgvector
├── pyproject.toml          # Project dependencies and metadata
└── alembic.ini            # Alembic configuration

Database Schema

Tables

  1. invoices

    • Stores invoice document metadata
    • Tracks processing status, file hash, version
    • Supports duplicate detection via file hash
    • Includes storage_path, category, group, and job_id
  2. extracted_data

    • Stores structured invoice data extracted from documents
    • One-to-one relationship with invoices
    • Includes vendor, dates, amounts, line items
  3. validation_results

    • Stores validation rule results
    • One-to-many relationship with invoices
    • Tracks passed/failed/warning status
  4. processing_jobs

    • Tracks processing job execution
    • One-to-many relationship with invoices
    • Records execution type (async_coroutine vs cpu_process)

Setup & Installation

Prerequisites

  • Python 3.12.2
  • Docker and Docker Compose
  • Conda (optional, for environment management)

Step 1: Install Dependencies

# Using pip (recommended)
pip install -e ".[dev]"

# Or using conda
conda create -n ai-einvoicing-env python=3.12
conda activate ai-einvoicing-env
pip install -e ".[dev]"

Step 2: Configure Environment

Create a .env file in the project root:

# Database Configuration
DATABASE_URL=postgresql+asyncpg://einvoice:einvoice_dev@localhost:${PGDB_PORT:-5432}/einvoicing

# Encryption Key (generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
ENCRYPTION_KEY=your-generated-encryption-key-here

# Logging Configuration
LOG_LEVEL=INFO
LOG_FORMAT=json

# API Configuration
API_HOST=0.0.0.0
API_PORT=8000

Generate Encryption Key:

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

Step 3: Start PostgreSQL

docker-compose up -d

Verify database is running:

docker ps --filter "name=ai-einvoicing-db"

Step 4: Run Database Migrations

alembic upgrade head

Verify tables were created:

docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\dt"

Step 5: Start the Application

Start FastAPI API:

# Start API
python interface/api/main.py --reload

API will be available at: http://localhost:${API_PORT}

  • API Documentation: http://localhost:${API_PORT}/docs
  • Health Check: http://localhost:${API_PORT}/health

Start Streamlit Dashboard:

streamlit run interface/dashboard/app.py

Dashboard will be available at: http://localhost:${UI_PORT:-8501}

Issues Encountered & Fixes

Issue 1: SQLAlchemy Reserved Name Conflict

Error:

sqlalchemy.exc.InvalidRequestError: Attribute name 'metadata' is reserved when using the Declarative API.

Root Cause: The ProcessingJob model had a field named metadata, which conflicts with SQLAlchemy's reserved metadata attribute.

Fix: Renamed metadata to job_metadata in:

  • core/models.py (line 230)
  • alembic/versions/001_initial_schema.py (migration)

Issue 2: Database URL Configuration

Error:

sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:driver

Root Cause: alembic.ini had a placeholder database URL (driver://user:pass@localhost/dbname).

Fix: Updated alembic.ini with correct database URL:

sqlalchemy.url = postgresql+asyncpg://einvoice:einvoice_dev@localhost:${PGDB_PORT:-5432}/einvoicing

Issue 3: pgqueuer Extension Not Available

Error:

asyncpg.exceptions.FeatureNotSupportedError: extension "pgqueuer" is not available

Root Cause: The migration tried to create the pgqueuer extension, which is not installed in the PostgreSQL image.

Fix: Commented out pgqueuer extension creation in alembic/versions/001_initial_schema.py. Can be enabled later when the extension is installed.

Issue 4: Package Discovery Error

Error:

error: Multiple top-level packages discovered in a flat-layout: ['core', 'data', 'specs', 'brain', 'alembic', 'interface', 'ingestion']

Root Cause: setuptools couldn't automatically discover which directories were Python packages.

Fix: Added explicit package configuration in pyproject.toml:

[tool.setuptools]
packages = ["core", "ingestion", "brain", "interface", "interface.api", "interface.api.routes", "interface.dashboard"]

Issue 5: License Format Deprecation

Warning:

SetuptoolsDeprecationWarning: project.license as a TOML table is deprecated

Fix: Changed license format in pyproject.toml from:

license = {text = "MIT"}

to:

license = "MIT"

Issue 6: Streamlit Dashboard Database Connection Errors

Error:

asyncpg.exceptions.InterfaceError: cannot perform operation: another operation is in progress
TCPTransport closed=True reading=False

Root Cause: Streamlit's synchronous execution model conflicted with async database operations. Multiple asyncio.run() calls created conflicting event loops, and database sessions weren't being properly closed, leaving connections open.

Fix: Implemented proper session lifecycle management:

  • Added explicit session creation, commit, rollback, and close in try/finally blocks
  • Configured connection pooling with pool_pre_ping=True and pool_recycle=3600
  • Created new event loops for each Streamlit request and properly closed them
  • Ensured sessions are always closed even if errors occur

Files Changed:

  • interface/dashboard/queries.py: Added proper session lifecycle management
  • interface/dashboard/app.py: Fixed event loop handling for Streamlit's sync context

Configuration Details

Database Connection

The application uses async PostgreSQL connections via asyncpg:

  • Host: localhost (when connecting from host machine)
  • Port: 5432
  • Database: einvoicing
  • User: einvoice
  • Password: einvoice_dev

Connection string format:

postgresql+asyncpg://USER:PASSWORD@HOST:PORT/DATABASE

File Processing

Supported File Types:

  • PDF (.pdf)
  • Excel (.xlsx, .xls)
  • CSV (.csv)
  • Images (.jpg, .jpeg, .png) - placeholder implementation

File Storage:

  • Files are encrypted at rest using Fernet symmetric encryption
  • Original storage path stored in database
  • Encrypted file path stored separately

Concurrency Model

  • I/O Operations: Async/await coroutines (file reading, database, API calls)
  • CPU-Intensive Tasks: Separate processes (OCR, image processing, AI inference)
  • Rationale: Maximizes I/O throughput while preventing GIL blocking

Key Files & Their Purposes

Core Infrastructure

core/database.py

  • Async SQLAlchemy engine and session management
  • Database initialization and cleanup
  • Session dependency for FastAPI

core/models.py

  • SQLAlchemy ORM models:
    • Invoice: Document metadata and status
    • ExtractedData: Structured invoice data
    • ValidationResult: Validation rule results
    • ProcessingJob: Job tracking and execution

core/encryption.py

  • File encryption/decryption utilities
  • Uses cryptography.fernet for symmetric encryption

core/logging.py

  • Structured logging configuration with structlog
  • Sensitive data filtering (invoice numbers, amounts, etc.)

Ingestion Layer

ingestion/orchestrator.py

  • Main processing pipeline coordinator
  • Handles file discovery, hashing, encryption, processing, extraction, validation
  • Error handling and job tracking

ingestion/file_hasher.py

  • SHA-256 hash calculation for file identity
  • Enables duplicate detection and versioning

ingestion/pdf_processor.py

  • PDF text extraction using pypdf
  • Placeholder for Docling integration

ingestion/excel_processor.py

  • Excel/CSV processing using Pandas
  • Converts to text representation for extraction

Brain Layer

brain/extractor.py

  • Regex-based data extraction from raw text
  • Extracts vendor, invoice number, dates, amounts
  • Returns structured ExtractedDataSchema

brain/validator.py

  • Validation framework with extensible rules
  • Implements mathematical checks (subtotal + tax = total)
  • Returns validation results with status (passed/failed/warning)

Interface Layer

interface/api/main.py

  • FastAPI application entry point
  • Lifespan management (database init/cleanup)
  • CORS configuration
  • Route registration

interface/api/routes/invoices.py

  • Invoice processing endpoints
  • List, retrieve, and process invoices
  • Structured JSON responses with pagination

interface/dashboard/app.py

  • Streamlit dashboard for reviewing invoices
  • Status filtering and invoice detail views
  • Proper async event loop management for Streamlit's synchronous context

interface/dashboard/queries.py

  • Database query utilities for dashboard
  • Proper async session lifecycle management
  • Connection pooling configuration

API Endpoints

Health Check

GET /health

Returns API health status.

Process Invoice

POST /api/v1/invoices/process

Processes an invoice file from a local path.

Request Body:

{
  "file_path": "data/invoice-1.png",
  "category": "Invoice",
  "group": "manual",
  "force_reprocess": false
}

Response:

{
  "status": "success",
  "timestamp": "2024-12-19T12:00:00Z",
  "data": {
    "invoice_id": "uuid",
    "processing_status": "processing"
  }
}

List Invoices

GET /api/v1/invoices?status=completed&page=1&page_size=20

Returns paginated list of invoices with optional status filter.

Get Invoice Details

GET /api/v1/invoices/{invoice_id}

Returns detailed invoice information including extracted data and validation results.

Testing & Verification

Verify Database Setup

# Check tables exist
docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\dt"

# Check table structure
docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\d invoices"

Verify API

# Health check
curl http://localhost:${API_PORT}/health

# List invoices
curl http://localhost:${API_PORT}/api/v1/invoices

Verify Processing

  1. Place a sample invoice file in the data/ directory
  2. Process via API:
curl -X POST http://localhost:${API_PORT}/api/v1/invoices/process \
  -H "Content-Type: application/json" \
  -d '{"file_path": "data/sample_invoice.pdf"}'
  1. Check results in Streamlit dashboard or via API

Next Steps

Immediate Enhancements

  1. Docling Integration: Replace pypdf with Docling for better PDF extraction
  2. OCR Integration: Implement DeepSeek-OCR or similar for image processing
  3. pgqueuer Setup: Install and configure pgqueuer extension for job queue management
  4. LlamaIndex Integration: Add RAG capabilities for intelligent extraction
  5. Enhanced Validation: Add more validation rules (date ranges, vendor matching, etc.)

Future Features

  1. Batch Processing: Process multiple files in parallel
  2. Webhook Support: Notify external systems on processing completion
  3. Advanced Encryption: Add encryption in transit for production
  4. User Authentication: Add FastAPI-Users for multi-user support
  5. Vector Search: Use pgvector for semantic search of invoice data

Dependencies

Key dependencies from pyproject.toml:

  • FastAPI: Async web framework
  • SQLAlchemy 2.0: Async ORM
  • Pydantic v2: Data validation
  • Alembic: Database migrations
  • asyncpg: PostgreSQL async driver
  • pypdf: PDF text extraction
  • pandas: Excel/CSV processing
  • cryptography: File encryption
  • structlog: Structured logging
  • streamlit: Dashboard UI
  • uvicorn: ASGI server

References

Summary

The e-invoice scaffold implementation provides a solid foundation for processing heterogeneous invoice formats. All 67 implementation tasks have been completed, including:

  • ✅ Complete project structure
  • ✅ Database schema and migrations
  • ✅ File processing pipeline
  • ✅ Data extraction and validation
  • ✅ REST API endpoints
  • ✅ Review dashboard
  • ✅ Error handling and logging
  • ✅ File encryption

The scaffold is ready for further development and enhancement with AI capabilities, advanced validation, and production-ready features.