Vision SDK - Document Processing Pipeline
Summary
Build a unified Vision Pipeline SDK for document processing, OCR, and structured data extraction. Consolidates fragmented vision capabilities across GAIA (400+ lines duplicated in EMR agent) into a clean, developer-friendly API.
Value: Process any document type (medical forms, legal logs, technical manuals) with VLM-powered OCR, running 100% locally on AMD hardware.
Problem
Current state:
- Vision code duplicated across agents (EMR has 400+ lines of boilerplate)
- No unified API for document processing
- Cannot process multi-page documents efficiently
- Missing features: table extraction, visual element extraction, RAG integration
Impact:
- Each vision agent starts from scratch (400+ lines per agent)
- Cannot handle legal documents (1,200+ pages)
- Manual preprocessing required for RAG indexing
Solution
One SDK for all document types:
from gaia.vision import VisionSDK, ExtractionSchema
vision = VisionSDK()
# Medical forms → structured data
result = vision.extract("form.pdf", schema=medical_schema)
# Legal logs → text + tables + visuals
result = vision.extract("logs.pdf", extract_tables=True, extract_visuals=True)
# Technical manuals → RAG indexing
result = vision.extract("manual.pdf", pages="all")
# Batch receipts → expense report
results = vision.extract_batch(files=receipts, schema=receipt_schema)
results.to_excel("expenses.xlsx")
Implementation Milestones
M1: Consolidation & Reorganization (1 week)
Goal: Extract existing code from EMR/utils into gaia.vision module
- Move image preprocessing from EMR (114 lines) → vision/preprocessor.py
- Move PDF loading from utils → vision/loaders.py
- Create basic Document/Page models (see the sketch at the end of this milestone)
- Refactor EMR agent to use new module
Success: EMR agent uses gaia.vision, no code duplication
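The Document/Page models are not pinned down yet; below is a minimal sketch of what they could look like as plain dataclasses. All field names are assumptions, not the final gaia.vision API.

```python
# Hypothetical sketch of the basic Document/Page models planned for M1.
# All field names are placeholders; the real gaia.vision models may differ.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    number: int                       # 1-based page index
    text: str = ""                    # raw OCR text for this page
    image_path: Optional[str] = None  # rendered page image, if cached


@dataclass
class Document:
    source: str                       # original file path (PDF, image, ...)
    pages: list[Page] = field(default_factory=list)

    @property
    def text(self) -> str:
        """Concatenated text of all pages."""
        return "\n\n".join(p.text for p in self.pages)
```

Keeping these models dependency-free would make them easy to share between the refactored EMR agent and future vision agents.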
M2: Structured Extraction (1-1.5 weeks)
Goal: EMR-complete with multi-page and batch support
- Multi-page document processing
- ExtractionSchema with flexible fields (see the sketch at the end of this milestone)
- Prompt template library (medical, legal, invoice, receipt)
- Batch file processing
- Validation rules
- Agent mixin integration
Success: EMR agent code reduced by ~60% (1,500 → 600 lines), batch receipt processing working
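A sketch of how ExtractionSchema with flexible fields and validation rules might fit together, assuming simple dataclasses; the class names, field names, and validation hook are illustrative, not the final API.

```python
# Hypothetical sketch of ExtractionSchema with flexible fields and validation rules.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Field:
    name: str
    description: str                               # hint included in the VLM prompt
    required: bool = True
    validate: Callable[[Any], bool] | None = None  # optional per-field validation rule


@dataclass
class ExtractionSchema:
    name: str
    fields: list[Field] = field(default_factory=list)

    def validation_errors(self, record: dict[str, Any]) -> list[str]:
        """Return validation errors for one extracted record (empty list = valid)."""
        errors = []
        for f in self.fields:
            value = record.get(f.name)
            if value is None:
                if f.required:
                    errors.append(f"missing required field: {f.name}")
                continue
            if f.validate is not None and not f.validate(value):
                errors.append(f"invalid value for {f.name}: {value!r}")
        return errors


# Example: a minimal medical-form schema in the spirit of the medical_schema
# referenced in the Solution snippet (fields chosen for illustration only).
medical_schema = ExtractionSchema(
    name="medical_form",
    fields=[
        Field("patient_name", "Full name of the patient"),
        Field("date_of_birth", "Date of birth, YYYY-MM-DD"),
        Field("blood_pressure", "Systolic/diastolic, e.g. 120/80", required=False),
    ],
)
```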
M3: Tables & Visuals (1.5-2 weeks)
Goal: Driver logs prototype + RAG integration
- Table extraction and parsing
- Visual element extraction (charts, timelines)
- Multi-document detection (multiple receipts in one image)
- RAG integration helpers
- Table export to CSV and DataFrame (see the sketch at the end of this milestone)
Success: Driver logs prototype (10-20 pages), RAG indexing with tables, receipt batch processing
CRITICAL: VLM visual extraction must be validated early (week 1 of M3)
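For the table export item, a minimal sketch assuming extracted tables arrive as a header row plus data rows; pandas and the csv module are real dependencies, while ExtractedTable is a hypothetical name.

```python
# Hypothetical sketch of M3 table export: header + rows in, CSV or DataFrame out.
import csv
from dataclasses import dataclass

import pandas as pd


@dataclass
class ExtractedTable:
    header: list[str]
    rows: list[list[str]]

    def to_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame(self.rows, columns=self.header)

    def to_csv(self, path: str) -> None:
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(self.header)
            writer.writerows(self.rows)


# Usage: export a table pulled from one driver-log page (values are made up).
table = ExtractedTable(
    header=["date", "hours_driving", "hours_off_duty"],
    rows=[["2026-02-10", "9.5", "14.5"], ["2026-02-11", "11.0", "13.0"]],
)
table.to_csv("driver_log_day1.csv")
df = table.to_dataframe()
```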
M4: Optimization & Scale (1-1.5 weeks)
Goal: Production-ready for large documents
- Parallel processing (multi-threaded)
- Checkpoint/resume for long jobs (see the sketch at the end of this milestone)
- Async job API
- Memory management
- Batch optimization
Success: Process 1,200-page driver logs successfully, optimized batch receipt processing
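The checkpoint/resume item is sketched below: per-page results stream to a JSONL file so an interrupted 1,200-page run can pick up where it left off. The function name, checkpoint format, and process_page callable are assumptions.

```python
# Hypothetical checkpoint/resume loop for long document-processing jobs.
import json
import os
from typing import Callable


def process_with_checkpoint(pages: list[str],
                            process_page: Callable[[str], dict],
                            checkpoint_path: str) -> list[dict]:
    """pages are page identifiers (e.g. rendered image paths); results are dicts."""
    done: dict[str, dict] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            for line in f:                       # reload results from a previous run
                record = json.loads(line)
                done[record["page"]] = record["result"]

    results = []
    with open(checkpoint_path, "a") as ckpt:
        for page in pages:
            if page in done:                     # already processed: skip the VLM call
                results.append(done[page])
                continue
            result = process_page(page)          # expensive VLM call
            results.append(result)
            ckpt.write(json.dumps({"page": page, "result": result}) + "\n")
            ckpt.flush()                         # survive interruption mid-run
    return results
```

Parallelism (the other M4 goal) could wrap the process_page calls in a thread pool, with the checkpoint file serving as the shared record of completed pages.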
M5: Superset - Complete SDK (1.5-2 weeks)
Goal: All advanced features
- Layout analysis
- Form field detection
- Temporal analysis tools
- Pattern detection framework
- Report generation templates
- Specialized processors (invoice, receipt, ID card, etc.)
- Complete documentation
Success: Production-ready SDK with all features, 95%+ test coverage
Validation Strategy
Two-Part Evaluation
Part 1: OCR Extraction Quality
- Text accuracy, table structure, visual detection
- Compared to ground truth dataset (70 documents)
- Metrics: CER < 5%, table accuracy > 70%, visual detection > 70%
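For reference, character error rate (CER) is the edit distance between the OCR output and the ground truth, divided by the ground-truth length. A self-contained sketch follows; an existing library such as jiwer could be used instead.

```python
# Plain-Python CER: Levenshtein distance normalized by reference length.
def character_error_rate(prediction: str, reference: str) -> float:
    m, n = len(prediction), len(reference)
    dp = list(range(n + 1))                 # distances for the previous row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i              # prev holds the diagonal value
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if prediction[i - 1] == reference[j - 1] else 1
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + cost)        # substitution / match
            prev = cur
    return dp[n] / max(n, 1)


# Two substituted characters over an 11-character reference: CER ≈ 0.18.
assert character_error_rate("recieved 42", "received 42") < 0.20
```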
Part 2: Agent Q&A & Analysis
- Agents reason about extracted data
- RAG retrieval quality over extracted content
- Analysis tasks (violations, summaries)
Validation Use Cases
- EMR Medical Forms - Single-page structured (validated in M2)
- Driver Logs (Legal) - Multi-page with visuals (validated in M3-M4)
- Oil & Gas Manual (RAG) - Technical content indexing (validated in M3)
- Batch Receipts - Expense reporting automation (validated in M2-M4)
Evaluation Framework Extension
- Extend src/gaia/eval/ with vision-specific evaluators
- Ground truth dataset (20 tier-1, 70 comprehensive)
- Automated accuracy measurement
- CLI: gaia vision-eval for continuous validation
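One possible shape for a vision-specific evaluator behind gaia vision-eval: compare extracted fields against a ground-truth JSON record and report field-level accuracy. The file layout and record format here are assumptions.

```python
# Hypothetical field-accuracy check against the ground truth dataset.
import json


def field_accuracy(extracted_path: str, ground_truth_path: str) -> float:
    """Fraction of ground-truth fields the extraction got right (case-insensitive)."""
    with open(extracted_path) as f:
        extracted = json.load(f)
    with open(ground_truth_path) as f:
        truth = json.load(f)

    total = correct = 0
    for key, expected in truth.items():
        total += 1
        got = str(extracted.get(key, "")).strip().lower()
        if got == str(expected).strip().lower():
            correct += 1
    return correct / total if total else 0.0
```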
Success Criteria
After M2 (Week 3)
- ✅ EMR agent refactored with 60% code reduction
- ✅ Batch receipt processing working
- ✅ Can process multi-page documents
After M3 (Week 5)
- ✅ Driver logs prototype validated (10-20 pages)
- ✅ Table/visual extraction working (70%+ accuracy)
- ✅ RAG integration seamless
- ✅ Multi-receipt detection working
After M4 (Week 6.5)
- ✅ Full 1,200-page driver logs processed
- ✅ Checkpoint/resume working
- ✅ Parallel processing 2-4x faster
- ✅ Batch receipts optimized
After M5 (Week 8)
- ✅ 95%+ test coverage
- ✅ Complete documentation with 20+ examples
- ✅ All 4 use cases validated
- ✅ Production-ready
Technical Requirements
Performance:
- Single page: < 30 seconds
- 100 pages: < 1 hour
- 1,000 pages: < 12 hours (parallel)
- Memory: < 4GB peak
Accuracy:
- Text OCR: 95%+ on clean scans
- Table extraction: 80%+ structure
- Visual extraction: 70%+ data or description
- Form extraction: 90%+ fields
Dependencies:
- VLMClient (Qwen3-VL-4B-Instruct-GGUF)
- Lemonade Server
- PyMuPDF, PIL/Pillow
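As an illustration of how the listed dependencies fit together, here is a loader sketch that renders PDF pages to PIL images with PyMuPDF before they are handed to the VLM; the function name and 2x zoom are assumptions, not the SDK's actual loader API.

```python
# Render every page of a PDF to a PIL image using PyMuPDF + Pillow.
import io

import fitz  # PyMuPDF
from PIL import Image


def load_pdf_pages(path: str, zoom: float = 2.0) -> list[Image.Image]:
    """Return one PIL image per page, rendered at the given zoom factor."""
    images = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            images.append(Image.open(io.BytesIO(pix.tobytes("png"))))
    return images
```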
Timeline
Start: Week of February 10, 2026
M2 Complete (EMR-ready): ~March 3, 2026
M4 Complete (Driver Logs): ~April 4, 2026
M5 Complete (Superset): ~April 21, 2026
Total: 6-8 weeks
Related Documentation
- 📋 Detailed Plan: docs/plans/vision-sdk.mdx
- 🎯 Milestone Breakdown: VISION_SDK_MILESTONES.md (comprehensive spec)
- ✅ Validation Strategy: VISION_SDK_VALIDATION.md
- 📊 Use Case Matrix: VISION_SDK_USE_CASES.md
- 🎨 Prompt Templates: VISION_SDK_PROMPTS.md
- 📄 Complex Documents: VISION_SDK_COMPLEX_DOCS.md
Risks & Mitigation
Risk: VLM visual extraction may not work
Mitigation: Test early (week 1 of M3), fall back to textual descriptions, adjust scope if needed
Risk: 1,200-page processing too slow
Mitigation: Parallel processing (M4), checkpoint/resume, set user expectations up front
Risk: Scope creep in M5
Mitigation: Ship M4 as "complete", defer M5 features based on demand
Next Steps
- ✅ Review and approve milestones
- 🔨 M1: Consolidation (extract EMR code)
- 🔨 M2: Structured extraction (EMR-complete)
- 🔬 M3: Validate VLM visual capabilities (CRITICAL)
- ⚙️ M4-M5: Based on M3 validation results