Vision SDK - Document Processing Pipeline

# Vision SDK - Document Processing Pipeline

## Summary

Build a unified Vision Pipeline SDK for document processing, OCR, and structured data extraction. It consolidates the vision capabilities currently fragmented across GAIA (400+ lines are duplicated in the EMR agent) into a clean, developer-friendly API.

**Value:** Process any document type (medical forms, legal logs, technical manuals) with VLM-powered OCR, running 100% locally on AMD hardware.

---

## Problem

**Current state:**
- Vision code duplicated across agents (EMR has 400+ lines of boilerplate)
- No unified API for document processing
- Cannot process multi-page documents efficiently
- Missing features: table extraction, visual element extraction, RAG integration

**Impact:**
- Each vision agent starts from scratch (400+ lines per agent)
- Cannot handle legal documents (1,200+ pages)
- Manual preprocessing required for RAG indexing

---

## Solution

**One SDK for all document types:**

```python
from gaia.vision import VisionSDK, ExtractionSchema

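vision = VisionSDK()

# medical_schema and receipt_schema are ExtractionSchema instances and
# receipts is a list of file paths; all are assumed to be defined elsewhere.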

# Medical forms → structured data
result = vision.extract("form.pdf", schema=medical_schema)

# Legal logs → text + tables + visuals
result = vision.extract("logs.pdf", extract_tables=True, extract_visuals=True)

# Technical manuals → RAG indexing
result = vision.extract("manual.pdf", pages="all")

# Batch receipts → expense report
results = vision.extract_batch(files=receipts, schema=receipt_schema)
results.to_excel("expenses.xlsx")
```

---

## Implementation Milestones

### M1: Consolidation & Reorganization (1 week)
**Goal:** Extract existing code from EMR/utils into `gaia.vision` module

- Move image preprocessing from EMR (114 lines) → `vision/preprocessor.py`
- Move PDF loading from utils → `vision/loaders.py`
- Create basic Document/Page models
- Refactor EMR agent to use new module

**Success:** EMR agent uses `gaia.vision`, no code duplication
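
One possible shape for the basic Document/Page models, as a minimal sketch. The class layout and field names (`number`, `image_bytes`, `text`, `source`) are assumptions for illustration and are not taken from the existing EMR or utils code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """One rendered page of a document (hypothetical model)."""
    number: int                  # 1-based page number
    image_bytes: bytes = b""     # rendered page image, e.g. PNG data
    text: Optional[str] = None   # filled in after OCR


@dataclass
class Document:
    """An ordered collection of pages loaded from one source file."""
    source: str
    pages: List[Page] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.pages)

    def full_text(self) -> str:
        # Join the text of pages that have been processed, in page order.
        return "\n\n".join(p.text for p in self.pages if p.text)


doc = Document("form.pdf", [Page(1, text="Patient intake"), Page(2)])
```

Keeping `text` optional lets a loader create pages before any OCR has run, so loading and extraction can stay in separate modules.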

---

### M2: Structured Extraction (1-1.5 weeks)
**Goal:** EMR-complete with multi-page and batch support

- Multi-page document processing
- ExtractionSchema with flexible fields
- Prompt template library (medical, legal, invoice, receipt)
- Batch file processing
- Validation rules
- Agent mixin integration

**Success:** EMR agent code reduced 60% (1500 → 600 lines), batch receipt processing working
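
To make "flexible fields" and "validation rules" concrete, here is a minimal sketch of how a schema could check an extracted record. `FieldSpec`, `SketchSchema`, and `validate` are hypothetical names chosen for this example; the actual `ExtractionSchema` design is not specified in this issue.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FieldSpec:
    name: str
    kind: type = str                              # expected Python type
    required: bool = True
    rule: Optional[Callable[[Any], bool]] = None  # optional validation rule


@dataclass
class SketchSchema:
    fields: List[FieldSpec]

    def validate(self, record: Dict[str, Any]) -> List[str]:
        """Return a list of problems; an empty list means the record is valid."""
        errors = []
        for f in self.fields:
            value = record.get(f.name)
            if value is None:
                if f.required:
                    errors.append(f"{f.name}: missing")
                continue
            if not isinstance(value, f.kind):
                errors.append(f"{f.name}: expected {f.kind.__name__}")
            elif f.rule is not None and not f.rule(value):
                errors.append(f"{f.name}: failed validation rule")
        return errors


receipt_schema = SketchSchema([
    FieldSpec("vendor"),
    FieldSpec("total", float, rule=lambda v: v >= 0),
    FieldSpec("notes", required=False),
])

ok = receipt_schema.validate({"vendor": "Acme", "total": 12.5})
bad = receipt_schema.validate({"total": -1.0})
```

Returning a list of problems rather than raising on the first one suits batch processing, where a single bad receipt should be reported without stopping the run.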

---

### M3: Tables & Visuals (1.5-2 weeks)
**Goal:** Driver logs prototype + RAG integration

- Table extraction and parsing
- Visual element extraction (charts, timelines)
- Multi-document detection (multiple receipts in one image)
- RAG integration helpers
- Table export (CSV, DataFrame)

**Success:** Driver logs prototype (10-20 pages), RAG indexing with tables, receipt batch processing

**CRITICAL:** Validate VLM visual extraction early (week 1 of M3).
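
For the table export item, a minimal sketch using only the standard library. It assumes an extracted table is represented as a list of rows with the header row first; that representation and the `table_to_csv` name are illustrative, not part of the planned API.

```python
import csv
import io
from typing import List


def table_to_csv(rows: List[List[str]]) -> str:
    """Serialize a table (header row first) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)  # fields containing commas are quoted automatically
    return buf.getvalue()


rows = [
    ["date", "hours_driven", "note"],
    ["2026-01-05", "9.5", "fuel stop, 20 min"],
]
csv_text = table_to_csv(rows)
```

The same list-of-rows shape converts to a DataFrame with `pandas.DataFrame(rows[1:], columns=rows[0])`.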

---

### M4: Optimization & Scale (1-1.5 weeks)
**Goal:** Production-ready for large documents

- Parallel processing (multi-threaded)
- Checkpoint/resume for long jobs
- Async job API
- Memory management
- Batch optimization

**Success:** Process 1,200-page driver logs successfully, optimized batch receipt processing
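
Parallel processing and checkpoint/resume can be combined in a small loop, sketched below. The `process_pages` function, the JSON checkpoint format, and the `fake_ocr` stand-in for a VLM call are all assumptions made for this example.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def process_pages(
    pages: List[int],
    process_page: Callable[[int], str],
    checkpoint_path: str,
    max_workers: int = 4,
) -> Dict[int, str]:
    """Process pages in parallel, skipping pages already in the checkpoint."""
    done: Dict[int, str] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            # JSON object keys are strings, so convert them back to int.
            done = {int(k): v for k, v in json.load(f).items()}

    pending = [p for p in pages if p not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order as they become available.
        for page, text in zip(pending, pool.map(process_page, pending)):
            done[page] = text
            # Rewrite the checkpoint after each page so a crash loses little.
            with open(checkpoint_path, "w") as f:
                json.dump(done, f)
    return done


calls: List[int] = []


def fake_ocr(page: int) -> str:
    """Stand-in for a VLM call; records which pages were processed."""
    calls.append(page)
    return f"text of page {page}"


path = os.path.join(tempfile.mkdtemp(), "job.json")
first = process_pages([1, 2, 3], fake_ocr, path)
resumed = process_pages([1, 2, 3, 4], fake_ocr, path)  # only page 4 runs
```

A production version would likely write the checkpoint atomically (write to a temporary file, then rename) and store per-page results outside a single JSON file to keep memory bounded on 1,200-page jobs.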

---

### M5: Superset - Complete SDK (1.5-2 weeks)
**Goal:** All advanced features

- Layout analysis
- Form field detection
- Temporal analysis tools
- Pattern detection framework
- Report generation templates
- Specialized processors (invoice, receipt, ID card, etc.)
- Complete documentation

**Success:** Production-ready SDK with all features, 95%+ test coverage

---

## Validation Strategy

### Two-Part Evaluation

**Part 1: OCR Extraction Quality**
- Text accuracy, table structure, visual detection
- Compared to ground truth dataset (70 documents)
- Metrics: CER < 5%, table accuracy > 70%, visual detection > 70%
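
The issue does not spell out how CER is computed. The sketch below uses the common definition, character-level edit distance divided by the length of the ground-truth text, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a 17-character reference.
score = cer("Patient: Jane Doe", "Patient: Jane Dae")
```

On short strings a single wrong character already exceeds the 5% threshold (1/17 is about 5.9%), so CER is more meaningful when aggregated over whole pages or documents.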

**Part 2: Agent Q&A & Analysis**
- Agents reason about extracted data
- RAG retrieval quality over extracted content
- Analysis tasks (violations, summaries)

### Validation Use Cases
1. **EMR Medical Forms** - Single-page structured (validated in M2)
2. **Driver Logs (Legal)** - Multi-page with visuals (validated in M3-M4)
3. **Oil & Gas Manual (RAG)** - Technical content indexing (validated in M3)
4. **Batch Receipts** - Expense reporting automation (validated in M2-M4)

### Evaluation Framework Extension
- Extend `src/gaia/eval/` with vision-specific evaluators
- Ground truth dataset (20 tier-1, 70 comprehensive)
- Automated accuracy measurement
- CLI: `gaia vision-eval` for continuous validation

---

## Success Criteria

### After M2 (Week 3)
- ✅ EMR agent refactored with 60% code reduction
- ✅ Batch receipt processing working
- ✅ Can process multi-page documents

### After M3 (Week 5)
- ✅ Driver logs prototype validated (10-20 pages)
- ✅ Table/visual extraction working (70%+ accuracy)
- ✅ RAG integration seamless
- ✅ Multi-receipt detection working

### After M4 (Week 6.5)
- ✅ Full 1,200-page driver logs processed
- ✅ Checkpoint/resume working
- ✅ Parallel processing 2-4x faster
- ✅ Batch receipts optimized

### After M5 (Week 8)
- ✅ 95%+ test coverage
- ✅ Complete documentation with 20+ examples
- ✅ All 4 use cases validated
- ✅ Production-ready

---

## Technical Requirements

**Performance:**
- Single page: < 30 seconds
- 100 pages: < 1 hour
- 1,000 pages: < 12 hours (parallel)
- Memory: < 4GB peak
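
As a quick arithmetic check, the targets above imply the following per-page budgets. The figures are derived only from the stated limits and the 2-4x parallel speedup in the M4 success criteria; they are upper bounds, not measurements.

```python
SINGLE_PAGE_BUDGET_S = 30                   # "Single page: < 30 seconds"
budgets = {100: 1 * 3600, 1000: 12 * 3600}  # pages -> total seconds allowed

# Average seconds available per page under each multi-page target.
per_page = {pages: total / pages for pages, total in budgets.items()}

# Sequential time for 1,200 pages if every page used the full 30 s budget.
sequential_hours = 1200 * SINGLE_PAGE_BUDGET_S / 3600

# Range under the 2-4x parallel speedup listed in the M4 success criteria.
parallel_hours = (sequential_hours / 4, sequential_hours / 2)

print(per_page)          # {100: 36.0, 1000: 43.2}
print(sequential_hours)  # 10.0
print(parallel_hours)    # (2.5, 5.0)
```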

**Accuracy:**
- Text OCR: 95%+ on clean scans
- Table extraction: 80%+ structure
- Visual extraction: 70%+ data or description
- Form extraction: 90%+ fields

**Dependencies:**
- VLMClient (Qwen3-VL-4B-Instruct-GGUF)
- Lemonade Server
- PyMuPDF, PIL/Pillow

---

## Timeline

**Start:** Week of February 10, 2026
**M2 Complete (EMR-ready):** ~March 3, 2026
**M4 Complete (Driver Logs):** ~April 4, 2026
**M5 Complete (Superset):** ~April 21, 2026

**Total:** 6-8 weeks

---

## Related Documentation

- **📋 Detailed Plan:** [docs/plans/vision-sdk.mdx](/plans/vision-sdk)
- **🎯 Milestone Breakdown:** `VISION_SDK_MILESTONES.md` (comprehensive spec)
- **✅ Validation Strategy:** `VISION_SDK_VALIDATION.md`
- **📊 Use Case Matrix:** `VISION_SDK_USE_CASES.md`
- **🎨 Prompt Templates:** `VISION_SDK_PROMPTS.md`
- **📄 Complex Documents:** `VISION_SDK_COMPLEX_DOCS.md`

---

## Risks & Mitigation

**Risk:** VLM visual extraction may not be accurate enough for charts and timelines
**Mitigation:** Test early (M3 week 1), fall back to text descriptions of visuals, adjust scope if needed

**Risk:** 1,200-page processing is too slow
**Mitigation:** Parallel processing (M4), checkpoint/resume, and setting user expectations up front

**Risk:** Scope creep in M5
**Mitigation:** Ship M4 as "complete", defer M5 features based on demand

---

## Next Steps

1. ✅ Review and approve milestones
2. 🔨 M1: Consolidation (extract EMR code)
3. 🔨 M2: Structured extraction (EMR-complete)
4. 🔬 M3: Validate VLM visual capabilities (CRITICAL)
5. ⚙️ M4-M5: Based on M3 validation results

