Vision SDK - Document Processing Pipeline #325

@kovtcharov

Summary

Build a unified Vision Pipeline SDK for document processing, OCR, and structured data extraction. It consolidates the fragmented vision capabilities across GAIA (400+ lines are duplicated in the EMR agent alone) into a clean, developer-friendly API.

Value: Process any document type (medical forms, legal logs, technical manuals) with VLM-powered OCR, 100% locally on AMD hardware.


Problem

Current state:

  • Vision code duplicated across agents (EMR has 400+ lines of boilerplate)
  • No unified API for document processing
  • Cannot process multi-page documents efficiently
  • Missing features: table extraction, visual element extraction, RAG integration

Impact:

  • Each vision agent starts from scratch (400+ lines per agent)
  • Cannot handle legal documents (1,200+ pages)
  • Manual preprocessing required for RAG indexing

Solution

One SDK for all document types:

from gaia.vision import VisionSDK, ExtractionSchema

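vision = VisionSDK()

# Assumed to be defined elsewhere: medical_schema and receipt_schema
# (ExtractionSchema instances) and receipts (a list of file paths).
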
# Medical forms → structured data
result = vision.extract("form.pdf", schema=medical_schema)

# Legal logs → text + tables + visuals
result = vision.extract("logs.pdf", extract_tables=True, extract_visuals=True)

# Technical manuals → RAG indexing
result = vision.extract("manual.pdf", pages="all")

# Batch receipts → expense report
results = vision.extract_batch(files=receipts, schema=receipt_schema)
results.to_excel("expenses.xlsx")

Implementation Milestones

M1: Consolidation & Reorganization (1 week)

Goal: Extract existing code from EMR/utils into gaia.vision module

  • Move image preprocessing from EMR (114 lines) → vision/preprocessor.py
  • Move PDF loading from utils → vision/loaders.py
  • Create basic Document/Page models
  • Refactor EMR agent to use new module

Success: EMR agent uses gaia.vision, no code duplication
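
As a rough illustration of the "basic Document/Page models" item, here is a minimal sketch using standard-library dataclasses. The class names match the milestone wording, but every field and method shown (number, text, image_path, full_text) is an assumption made for illustration, not the actual gaia.vision design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """One page of a loaded document (hypothetical model)."""
    number: int                       # 1-based page number
    text: str = ""                    # extracted text, empty until OCR has run
    image_path: Optional[str] = None  # location of the rendered page image, if any


@dataclass
class Document:
    """A source file plus its pages (hypothetical model)."""
    source: str
    pages: List[Page] = field(default_factory=list)

    @property
    def page_count(self) -> int:
        return len(self.pages)

    def full_text(self, separator: str = "\n\n") -> str:
        """Join the text of all pages in page-number order."""
        ordered = sorted(self.pages, key=lambda p: p.number)
        return separator.join(p.text for p in ordered)


# Pages may arrive out of order (for example from parallel workers).
doc = Document(source="form.pdf", pages=[Page(2, "second"), Page(1, "first")])
print(doc.page_count)
print(doc.full_text(" | "))
```

Keeping the models free of any VLM or PDF dependency would let agents and tests construct documents directly.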


M2: Structured Extraction (1-1.5 weeks)

Goal: EMR-complete with multi-page and batch support

  • Multi-page document processing
  • ExtractionSchema with flexible fields
  • Prompt template library (medical, legal, invoice, receipt)
  • Batch file processing
  • Validation rules
  • Agent mixin integration

Success: EMR agent code reduced 60% (1500 → 600 lines), batch receipt processing working
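
One possible shape for "ExtractionSchema with flexible fields" combined with "Validation rules" is sketched below. FieldSpec, SchemaSketch, and validate are hypothetical names chosen for this example; the real ExtractionSchema API is not specified in this issue.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FieldSpec:
    """A single field to extract (hypothetical)."""
    name: str
    required: bool = True
    # Returns True when the extracted value is acceptable.
    validator: Optional[Callable[[Any], bool]] = None


@dataclass
class SchemaSketch:
    """A set of fields plus a validation pass (hypothetical)."""
    fields: List[FieldSpec] = field(default_factory=list)

    def validate(self, record: Dict[str, Any]) -> List[str]:
        """Return a list of problems; an empty list means the record is valid."""
        problems = []
        for spec in self.fields:
            value = record.get(spec.name)
            if value is None or value == "":
                if spec.required:
                    problems.append(f"missing required field: {spec.name}")
                continue
            if spec.validator is not None and not spec.validator(value):
                problems.append(f"invalid value for {spec.name}: {value!r}")
        return problems


# Illustrative receipt schema: vendor and total are required, tip is optional.
receipt_schema_sketch = SchemaSketch(fields=[
    FieldSpec("vendor"),
    FieldSpec("total", validator=lambda v: isinstance(v, (int, float)) and v >= 0),
    FieldSpec("tip", required=False),
])

ok = receipt_schema_sketch.validate({"vendor": "Cafe", "total": 12.5})
bad = receipt_schema_sketch.validate({"total": -1})
print(ok)
print(bad)
```

Returning a list of problems rather than raising on the first one lets a batch job report every issue in a record at once.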


M3: Tables & Visuals (1.5-2 weeks)

Goal: Driver logs prototype + RAG integration

  • Table extraction and parsing
  • Visual element extraction (charts, timelines)
  • Multi-document detection (multiple receipts in one image)
  • RAG integration helpers
  • Table export (CSV, DataFrame)

Success: Driver logs prototype (10-20 pages), RAG indexing with tables, receipt batch processing

CRITICAL: Validate VLM visual extraction early (week 1 of M3)
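
For the "Table export (CSV, DataFrame)" item, the CSV half could look roughly like the sketch below. It assumes an extracted table arrives as a list of rows of strings, which is an assumption about the internal representation. The function name table_to_csv and the sample rows are made up for illustration.

```python
import csv
import io
from typing import List


def table_to_csv(rows: List[List[str]]) -> str:
    """Serialize a table to CSV text, padding short rows to the widest row."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Extracted tables can have ragged rows; pad so every line has the same column count.
        writer.writerow(list(row) + [""] * (width - len(row)))
    return buffer.getvalue()


# Made-up sample rows; the last row is missing its final cell.
rows = [
    ["Date", "Status", "Hours"],
    ["2026-01-05", "Driving", "7.5"],
    ["2026-01-06", "Off duty"],
]
csv_text = table_to_csv(rows)
print(csv_text)
```

Using the csv module rather than joining on commas means cells that contain commas or quotes are escaped correctly.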


M4: Optimization & Scale (1-1.5 weeks)

Goal: Production-ready for large documents

  • Parallel processing (multi-threaded)
  • Checkpoint/resume for long jobs
  • Async job API
  • Memory management
  • Batch optimization

Success: Process 1,200-page driver logs successfully, optimized batch receipt processing
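
The "Parallel processing (multi-threaded)" and "Checkpoint/resume" items could fit together along the lines of the sketch below. It uses only the standard library. process_with_checkpoint, the JSON checkpoint layout, and the fake_ocr stand-in are assumptions for this example, not the planned implementation.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def process_with_checkpoint(
    page_numbers: Iterable[int],
    process_page: Callable[[int], str],
    checkpoint_path: str,
    max_workers: int = 4,
) -> Dict[int, str]:
    """Process pages in a thread pool, skipping pages a previous run already finished."""
    done: Dict[int, str] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as fh:
            # JSON object keys are strings, so convert them back to page numbers.
            done = {int(k): v for k, v in json.load(fh).items()}

    pending = [n for n in page_numbers if n not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_page, n): n for n in pending}
        for future in as_completed(futures):
            done[futures[future]] = future.result()
            # Write to a temporary file, then rename, so a crash cannot leave
            # a half-written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w", encoding="utf-8") as fh:
                json.dump(done, fh)
            os.replace(tmp_path, checkpoint_path)
    return done


calls = []


def fake_ocr(page_number: int) -> str:
    """Stand-in for a VLM call; records which pages were actually processed."""
    calls.append(page_number)
    return f"text of page {page_number}"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "job.json")
    first = process_with_checkpoint(range(1, 4), fake_ocr, path)
    # The second run covers pages 1-5 but only pages 4 and 5 are new.
    second = process_with_checkpoint(range(1, 6), fake_ocr, path)

print(sorted(calls))
```

Rewriting the whole checkpoint after every page is simple but costs O(n) per write. For a 1,200-page job an append-only log might be a better fit.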


M5: Superset - Complete SDK (1.5-2 weeks)

Goal: All advanced features

  • Layout analysis
  • Form field detection
  • Temporal analysis tools
  • Pattern detection framework
  • Report generation templates
  • Specialized processors (invoice, receipt, ID card, etc.)
  • Complete documentation

Success: Production-ready SDK with all features, 95%+ test coverage


Validation Strategy

Two-Part Evaluation

Part 1: OCR Extraction Quality

  • Text accuracy, table structure, visual detection
  • Compared to ground truth dataset (70 documents)
  • Metrics: CER < 5%, table accuracy > 70%, visual detection > 70%
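
The CER (character error rate) metric above is conventionally computed as the Levenshtein edit distance between the extracted text and the ground truth, divided by the ground-truth length. The sketch below assumes that standard definition. The issue does not say how the evaluator will normalize whitespace or case, so none is applied here.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong character out of eleven, so this sample would miss a 5% target.
cer = character_error_rate("hello world", "hallo world")
print(f"CER = {cer:.3f}, passes 5% target: {cer < 0.05}")
```

Because CER is normalized by the reference length, it can exceed 1.0 when the OCR output is much longer than the ground truth.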

Part 2: Agent Q&A & Analysis

  • Agents reason about extracted data
  • RAG retrieval quality over extracted content
  • Analysis tasks (violations, summaries)

Validation Use Cases

  1. EMR Medical Forms - Single-page structured (validated in M2)
  2. Driver Logs (Legal) - Multi-page with visuals (validated in M3-M4)
  3. Oil & Gas Manual (RAG) - Technical content indexing (validated in M3)
  4. Batch Receipts - Expense reporting automation (validated in M2-M4)

Evaluation Framework Extension

  • Extend src/gaia/eval/ with vision-specific evaluators
  • Ground truth dataset (20 tier-1, 70 comprehensive)
  • Automated accuracy measurement
  • CLI: gaia vision-eval for continuous validation

Success Criteria

After M2 (Week 3)

  • ✅ EMR agent refactored with 60% code reduction
  • ✅ Batch receipt processing working
  • ✅ Can process multi-page documents

After M3 (Week 5)

  • ✅ Driver logs prototype validated (10-20 pages)
  • ✅ Table/visual extraction working (70%+ accuracy)
  • ✅ RAG integration seamless
  • ✅ Multi-receipt detection working

After M4 (Week 6.5)

  • ✅ Full 1,200-page driver logs processed
  • ✅ Checkpoint/resume working
  • ✅ Parallel processing 2-4x faster than sequential processing
  • ✅ Batch receipts optimized

After M5 (Week 8)

  • ✅ 95%+ test coverage
  • ✅ Complete documentation with 20+ examples
  • ✅ All 4 use cases validated
  • ✅ Production-ready

Technical Requirements

Performance:

  • Single page: < 30 seconds
  • 100 pages: < 1 hour
  • 1,000 pages: < 12 hours (parallel)
  • Memory: < 4GB peak
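
As a quick arithmetic check of the time targets above, the snippet below converts each limit into a per-page budget and works out how long a 1,200-page job would take if run sequentially at the single-page limit. It uses only the numbers stated in this issue and treats each as a worst-case bound.

```python
# Time limits from the performance list, in seconds, keyed by page count.
targets_seconds = {1: 30, 100: 60 * 60, 1000: 12 * 60 * 60}

# Per-page budget implied by each target.
budgets = {pages: limit / pages for pages, limit in targets_seconds.items()}
for pages, budget in budgets.items():
    print(f"{pages:>5} pages: {budget:.1f} s per page")

# Sequential worst case for the 1,200-page driver logs at 30 s per page.
sequential_hours = 1200 * targets_seconds[1] / 3600
print(f"1,200 pages sequentially at 30 s each: {sequential_hours:.1f} h")
```

The per-page budgets come out to 30.0, 36.0, and 43.2 seconds, and the sequential worst case for 1,200 pages is 10.0 hours.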

Accuracy:

  • Text OCR: 95%+ on clean scans
  • Table extraction: 80%+ structure
  • Visual extraction: 70%+ data or description
  • Form extraction: 90%+ fields

Dependencies:

  • VLMClient (Qwen3-VL-4B-Instruct-GGUF)
  • Lemonade Server
  • PyMuPDF, PIL/Pillow

Timeline

Start: Week of February 10, 2026
M2 Complete (EMR-ready): ~March 3, 2026
M4 Complete (Driver Logs): ~April 4, 2026
M5 Complete (Superset): ~April 21, 2026

Total: 6-8 weeks


Related Documentation

  • 📋 Detailed Plan: docs/plans/vision-sdk.mdx
  • 🎯 Milestone Breakdown: VISION_SDK_MILESTONES.md (comprehensive spec)
  • ✅ Validation Strategy: VISION_SDK_VALIDATION.md
  • 📊 Use Case Matrix: VISION_SDK_USE_CASES.md
  • 🎨 Prompt Templates: VISION_SDK_PROMPTS.md
  • 📄 Complex Documents: VISION_SDK_COMPLEX_DOCS.md

Risks & Mitigation

Risk: VLM visual extraction may not work
Mitigation: Test early (M3 week 1), fall back to textual descriptions of visuals, adjust scope if needed

Risk: 1,200-page processing too slow
Mitigation: Parallel processing (M4), checkpoint/resume, set user expectations up front

Risk: Scope creep in M5
Mitigation: Ship M4 as "complete", defer M5 features based on demand


Next Steps

  1. ✅ Review and approve milestones
  2. 🔨 M1: Consolidation (extract EMR code)
  3. 🔨 M2: Structured extraction (EMR-complete)
  4. 🔬 M3: Validate VLM visual capabilities (CRITICAL)
  5. ⚙️ M4-M5: Based on M3 validation results
