E-Invoice Scaffold Implementation Documentation

Branch: 1-e-invoice-scaffold
Date: 2024-12-19
Status: ✅ Complete

Overview

This document describes the complete implementation of the e-invoice scaffold application. The scaffold provides a foundational structure for processing heterogeneous invoice formats (PDF, Excel, Images) into structured, validated data using an AI-native approach.

What Was Implemented

Core Infrastructure

✅ Project structure with modular architecture (core, ingestion, brain, interface layers)
✅ PostgreSQL database with pgvector extension (via Docker Compose)
✅ Async SQLAlchemy 2.0 ORM with database models
✅ Alembic database migrations
✅ File-level encryption for sensitive data at rest
✅ Structured logging with sensitive data filtering
✅ SHA-256 file hashing for duplicate detection and versioning

File Processing Pipeline

✅ File discovery and type detection (PDF, Excel, CSV, Images)
✅ PDF processing with pypdf (Docling integration placeholder)
✅ Excel/CSV processing with Pandas
✅ Image processing placeholder (for future OCR integration)
✅ Processing orchestrator with error handling and encryption

Data Extraction & Validation

✅ Pydantic schemas for structured invoice data
✅ Regex-based data extraction (vendor, invoice number, dates, amounts)
✅ Validation framework with extensible rules
✅ Mathematical validation (subtotal + tax = total)

API Layer

✅ FastAPI application with async endpoints
✅ Health check endpoint
✅ Invoice processing endpoint (POST /api/v1/invoices/process)
✅ Invoice listing endpoint (GET /api/v1/invoices)
✅ Invoice detail endpoint (GET /api/v1/invoices/{invoice_id})
✅ Structured JSON response envelope with pagination

Review Dashboard

✅ Streamlit dashboard for reviewing processed invoices
✅ Status filtering and invoice listing
✅ Invoice detail view with validation results

Architecture

Three-Layer Architecture

┌─────────────────────────────────────────────────────────┐
│              INTERACTION LAYER                           │
│  FastAPI (REST API) + Streamlit (Review Dashboard)     │
└─────────────────────────────────────────────────────────┘
                         ↕
┌─────────────────────────────────────────────────────────┐
│                  BRAIN LAYER                             │
│  Data Extraction + Validation Framework                  │
└─────────────────────────────────────────────────────────┘
                         ↕
┌─────────────────────────────────────────────────────────┐
│              SENSORY LAYER                               │
│  File Discovery + PDF/Excel/Image Processing            │
└─────────────────────────────────────────────────────────┘
                         ↕
┌─────────────────────────────────────────────────────────┐
│              INFRASTRUCTURE                              │
│  PostgreSQL + pgvector + Encryption + Logging           │
└─────────────────────────────────────────────────────────┘

Project Structure

ai-einvoicing/
├── core/                    # Infrastructure layer
│   ├── database.py         # Async SQLAlchemy engine & session management
│   ├── models.py           # ORM models (Invoice, ExtractedData, ValidationResult, ProcessingJob)
│   ├── encryption.py       # File encryption/decryption utilities
│   └── logging.py          # Structured logging configuration
│
├── ingestion/              # Sensory layer
│   ├── file_discovery.py   # Discover supported files
│   ├── file_hasher.py      # SHA-256 hash calculation
│   ├── pdf_processor.py    # PDF text extraction
│   ├── excel_processor.py # Excel/CSV processing
│   ├── image_processor.py # Image processing (placeholder)
│   └── orchestrator.py    # Main processing pipeline
│
├── brain/                  # Brain layer
│   ├── schemas.py          # Pydantic models for invoice data
│   ├── extractor.py        # Data extraction from raw text
│   └── validator.py        # Validation rules framework
│
├── interface/              # Interaction layer
│   ├── api/
│   │   ├── main.py         # FastAPI application entry point
│   │   ├── routes/
│   │   │   ├── health.py   # Health check endpoint
│   │   │   └── invoices.py # Invoice processing endpoints
│   │   └── schemas.py      # API request/response models
│   └── dashboard/
│       ├── app.py          # Streamlit dashboard
│       └── queries.py      # Database queries for dashboard
│
├── alembic/                # Database migrations
│   ├── env.py             # Alembic environment configuration
│   └── versions/
│       └── 001_initial_schema.py  # Initial database schema
│
├── data/                   # Local storage for sample invoices
├── docker-compose.yml      # PostgreSQL with pgvector
├── pyproject.toml          # Project dependencies and metadata
└── alembic.ini            # Alembic configuration

Database Schema

Tables

invoices
- Stores invoice document metadata
- Tracks processing status, file hash, version
- Supports duplicate detection via file hash
- Includes storage_path, category, group, and job_id
extracted_data
- Stores structured invoice data extracted from documents
- One-to-one relationship with invoices
- Includes vendor, dates, amounts, line items
validation_results
- Stores validation rule results
- One-to-many relationship with invoices
- Tracks passed/failed/warning status
processing_jobs
- Tracks processing job execution
- One-to-many relationship with invoices
- Records execution type (async_coroutine vs cpu_process)

Setup & Installation

Prerequisites

Python 3.12.2
Docker and Docker Compose
Conda (optional, for environment management)

Step 1: Install Dependencies

# Using pip (recommended)
pip install -e ".[dev]"

# Or using conda
conda create -n ai-einvoicing-env python=3.12
conda activate ai-einvoicing-env
pip install -e ".[dev]"

Step 2: Configure Environment

Create a .env file in the project root:

# Database Configuration
DATABASE_URL=postgresql+asyncpg://einvoice:einvoice_dev@localhost:${PGDB_PORT:-5432}/einvoicing

# Encryption Key (generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
ENCRYPTION_KEY=your-generated-encryption-key-here

# Logging Configuration
LOG_LEVEL=INFO
LOG_FORMAT=json

# API Configuration
API_HOST=0.0.0.0
API_PORT=8000

Generate Encryption Key:

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

Step 3: Start PostgreSQL

docker-compose up -d

Verify database is running:

docker ps --filter "name=ai-einvoicing-db"

Step 4: Run Database Migrations

alembic upgrade head

Verify tables were created:

docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\dt"

Step 5: Start the Application

Start FastAPI API:

# Start API
python interface/api/main.py --reload

API will be available at: http://localhost:${API_PORT}

API Documentation: http://localhost:${API_PORT}/docs
Health Check: http://localhost:${API_PORT}/health

Start Streamlit Dashboard:

streamlit run interface/dashboard/app.py

Dashboard will be available at: http://localhost:${UI_PORT:-8501}

Issues Encountered & Fixes

Issue 1: SQLAlchemy Reserved Name Conflict

Error:

sqlalchemy.exc.InvalidRequestError: Attribute name 'metadata' is reserved when using the Declarative API.

Root Cause: The ProcessingJob model had a field named metadata, which conflicts with SQLAlchemy's reserved metadata attribute.

Fix: Renamed metadata to job_metadata in:

core/models.py (line 230)
alembic/versions/001_initial_schema.py (migration)

Issue 2: Database URL Configuration

Error:

sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:driver

Root Cause: alembic.ini had a placeholder database URL (driver://user:pass@localhost/dbname).

Fix: Updated alembic.ini with correct database URL:

sqlalchemy.url = postgresql+asyncpg://einvoice:einvoice_dev@localhost:${PGDB_PORT:-5432}/einvoicing

Issue 3: pgqueuer Extension Not Available

Error:

asyncpg.exceptions.FeatureNotSupportedError: extension "pgqueuer" is not available

Root Cause: The migration tried to create the pgqueuer extension, which is not installed in the PostgreSQL image.

Fix: Commented out pgqueuer extension creation in alembic/versions/001_initial_schema.py. Can be enabled later when the extension is installed.

Issue 4: Package Discovery Error

Error:

error: Multiple top-level packages discovered in a flat-layout: ['core', 'data', 'specs', 'brain', 'alembic', 'interface', 'ingestion']

Root Cause: setuptools couldn't automatically discover which directories were Python packages.

Fix: Added explicit package configuration in pyproject.toml:

[tool.setuptools]
packages = ["core", "ingestion", "brain", "interface", "interface.api", "interface.api.routes", "interface.dashboard"]

Issue 5: License Format Deprecation

Warning:

SetuptoolsDeprecationWarning: project.license as a TOML table is deprecated

Fix: Changed license format in pyproject.toml from:

license = {text = "MIT"}

to:

license = "MIT"

Issue 6: Streamlit Dashboard Database Connection Errors

Error:

asyncpg.exceptions.InterfaceError: cannot perform operation: another operation is in progress
TCPTransport closed=True reading=False

Root Cause: Streamlit's synchronous execution model conflicted with async database operations. Multiple asyncio.run() calls created conflicting event loops, and database sessions weren't being properly closed, leaving connections open.

Fix: Implemented proper session lifecycle management:

Added explicit session creation, commit, rollback, and close in try/finally blocks
Configured connection pooling with pool_pre_ping=True and pool_recycle=3600
Created new event loops for each Streamlit request and properly closed them
Ensured sessions are always closed even if errors occur

Files Changed:

interface/dashboard/queries.py: Added proper session lifecycle management
interface/dashboard/app.py: Fixed event loop handling for Streamlit's sync context

Configuration Details

Database Connection

The application uses async PostgreSQL connections via asyncpg:

Host: localhost (when connecting from host machine)
Port: 5432
Database: einvoicing
User: einvoice
Password: einvoice_dev

Connection string format:

postgresql+asyncpg://USER:PASSWORD@HOST:PORT/DATABASE

File Processing

Supported File Types:

PDF (.pdf)
Excel (.xlsx, .xls)
CSV (.csv)
Images (.jpg, .jpeg, .png) - placeholder implementation

File Storage:

Files are encrypted at rest using Fernet symmetric encryption
Original storage path stored in database
Encrypted file path stored separately

Concurrency Model

I/O Operations: Async/await coroutines (file reading, database, API calls)
CPU-Intensive Tasks: Separate processes (OCR, image processing, AI inference)
Rationale: Maximizes I/O throughput while preventing GIL blocking

Key Files & Their Purposes

Core Infrastructure

core/database.py

Async SQLAlchemy engine and session management
Database initialization and cleanup
Session dependency for FastAPI

core/models.py

SQLAlchemy ORM models:
- Invoice: Document metadata and status
- ExtractedData: Structured invoice data
- ValidationResult: Validation rule results
- ProcessingJob: Job tracking and execution

core/encryption.py

File encryption/decryption utilities
Uses cryptography.fernet for symmetric encryption

core/logging.py

Structured logging configuration with structlog
Sensitive data filtering (invoice numbers, amounts, etc.)

Ingestion Layer

ingestion/orchestrator.py

Main processing pipeline coordinator
Handles file discovery, hashing, encryption, processing, extraction, validation
Error handling and job tracking

ingestion/file_hasher.py

SHA-256 hash calculation for file identity
Enables duplicate detection and versioning

ingestion/pdf_processor.py

PDF text extraction using pypdf
Placeholder for Docling integration

ingestion/excel_processor.py

Excel/CSV processing using Pandas
Converts to text representation for extraction

Brain Layer

brain/extractor.py

Regex-based data extraction from raw text
Extracts vendor, invoice number, dates, amounts
Returns structured ExtractedDataSchema

brain/validator.py

Validation framework with extensible rules
Implements mathematical checks (subtotal + tax = total)
Returns validation results with status (passed/failed/warning)

Interface Layer

interface/api/main.py

FastAPI application entry point
Lifespan management (database init/cleanup)
CORS configuration
Route registration

interface/api/routes/invoices.py

Invoice processing endpoints
List, retrieve, and process invoices
Structured JSON responses with pagination

interface/dashboard/app.py

Streamlit dashboard for reviewing invoices
Status filtering and invoice detail views
Proper async event loop management for Streamlit's synchronous context

interface/dashboard/queries.py

Database query utilities for dashboard
Proper async session lifecycle management
Connection pooling configuration

API Endpoints

Health Check

GET /health

Returns API health status.

Process Invoice

POST /api/v1/invoices/process

Processes an invoice file from a local path.

Request Body:

{
  "file_path": "data/invoice-1.png",
  "category": "Invoice",
  "group": "manual",
  "force_reprocess": false
}

Response:

{
  "status": "success",
  "timestamp": "2024-12-19T12:00:00Z",
  "data": {
    "invoice_id": "uuid",
    "processing_status": "processing"
  }
}

List Invoices

GET /api/v1/invoices?status=completed&page=1&page_size=20

Returns paginated list of invoices with optional status filter.

Get Invoice Details

GET /api/v1/invoices/{invoice_id}

Returns detailed invoice information including extracted data and validation results.

Testing & Verification

Verify Database Setup

# Check tables exist
docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\dt"

# Check table structure
docker exec ai-einvoicing-db psql -U einvoice -d einvoicing -c "\d invoices"

Verify API

# Health check
curl http://localhost:${API_PORT}/health

# List invoices
curl http://localhost:${API_PORT}/api/v1/invoices

Verify Processing

Place a sample invoice file in the data/ directory
Process via API:

curl -X POST http://localhost:${API_PORT}/api/v1/invoices/process \
  -H "Content-Type: application/json" \
  -d '{"file_path": "data/sample_invoice.pdf"}'

Check results in Streamlit dashboard or via API

Next Steps

Immediate Enhancements

Docling Integration: Replace pypdf with Docling for better PDF extraction
OCR Integration: Implement DeepSeek-OCR or similar for image processing
pgqueuer Setup: Install and configure pgqueuer extension for job queue management
LlamaIndex Integration: Add RAG capabilities for intelligent extraction
Enhanced Validation: Add more validation rules (date ranges, vendor matching, etc.)

Future Features

Batch Processing: Process multiple files in parallel
Webhook Support: Notify external systems on processing completion
Advanced Encryption: Add encryption in transit for production
User Authentication: Add FastAPI-Users for multi-user support
Vector Search: Use pgvector for semantic search of invoice data

Dependencies

Key dependencies from pyproject.toml:

FastAPI: Async web framework
SQLAlchemy 2.0: Async ORM
Pydantic v2: Data validation
Alembic: Database migrations
asyncpg: PostgreSQL async driver
pypdf: PDF text extraction
pandas: Excel/CSV processing
cryptography: File encryption
structlog: Structured logging
streamlit: Dashboard UI
uvicorn: ASGI server

References

Summary

The e-invoice scaffold implementation provides a solid foundation for processing heterogeneous invoice formats. All 67 implementation tasks have been completed, including:

✅ Complete project structure
✅ Database schema and migrations
✅ File processing pipeline
✅ Data extraction and validation
✅ REST API endpoints
✅ Review dashboard
✅ Error handling and logging
✅ File encryption

The scaffold is ready for further development and enhancement with AI capabilities, advanced validation, and production-ready features.

FilesExpand file tree

setup-scaffold-1.md

Latest commit

History

setup-scaffold-1.md

File metadata and controls

E-Invoice Scaffold Implementation Documentation

Overview

What Was Implemented

Core Infrastructure

File Processing Pipeline

Data Extraction & Validation

API Layer

Review Dashboard

Architecture

Three-Layer Architecture

Project Structure

Database Schema

Tables

Setup & Installation

Prerequisites

Step 1: Install Dependencies

Step 2: Configure Environment

Step 3: Start PostgreSQL

Step 4: Run Database Migrations

Step 5: Start the Application

Issues Encountered & Fixes

Issue 1: SQLAlchemy Reserved Name Conflict

Issue 2: Database URL Configuration

Issue 3: pgqueuer Extension Not Available

Issue 4: Package Discovery Error

Issue 5: License Format Deprecation

Issue 6: Streamlit Dashboard Database Connection Errors

Configuration Details

Database Connection

File Processing

Concurrency Model

Key Files & Their Purposes

Core Infrastructure

Ingestion Layer

Brain Layer

Interface Layer

API Endpoints

Health Check

Process Invoice

List Invoices

Get Invoice Details

Testing & Verification

Verify Database Setup

Verify API

Verify Processing

Next Steps

Immediate Enhancements

Future Features

Dependencies

References

Summary