A complete workflow to transform World Bank Country and Climate Development Reports (CCDRs) from PDF format into a structured graph database format suitable for semantic search and retrieval.
This project processes PDF documents from PostgreSQL/AWS S3 storage through a multi-stage pipeline that extracts layout information, images, and text content, then uses Large Language Models (LLMs) to structure the content hierarchically as an HTML DOM before uploading structured nodes back to the database. See schema_revision.md for the database schema and a discussion of schema design.
The pipeline consists of 11 main stages, plus two planned stages:
- Document Discovery - Identifies unprocessed documents from the database
- PDF Acquisition - Downloads PDFs from S3 or directly from World Bank URLs
- Layout Analysis - Extracts bounding boxes and element labels using Layout Extractor API
- Logical Page Mapping - Maps physical pages to logical page numbers using LLM analysis
- Content Block Reclassification - Reclassifies content blocks to improve accuracy
- Image Extraction - Extracts images using PyMuPDF
- Image Description - Describes images using Vision Language Models
- Text Styling - Applies formatting information from PDF to text blocks
- Top-Level Structure Detection - Identifies front, body, and back matter using LLM analysis
- Nested Structure Detection - Converts the top-level structure into a nested HTML DOM structure using LLM analysis
- Database Ingestion - Converts structured content to database directed graph nodes and uploads to PostgreSQL
- Relation Enrichment - Generates relationships from anchor tags and references (TODO)
- Embedding Generation - Generates embeddings for each ContentData record (TODO)
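To make the flow concrete, here is a minimal sketch of how these stages chain together for one document. The function names and signatures are illustrative stubs, not the actual APIs of the `transform/` modules:

```python
import asyncio

# Illustrative stubs only -- the real implementations live in the
# transform/ modules; names and signatures here are assumptions.
async def extract_layout(pdf: bytes) -> dict: return {"blocks": []}
async def map_page_numbers(doc: dict) -> dict: return doc
async def reclassify_blocks(doc: dict) -> dict: return doc
async def extract_images(pdf: bytes) -> list[dict]: return []
async def describe_images(images: list[dict]) -> list[dict]: return images
async def style_text_blocks(pdf: bytes, doc: dict) -> dict: return doc
async def detect_top_level_structure(doc: dict) -> dict: return doc
async def detect_structure(doc: dict) -> str: return "<body></body>"
async def upload_to_db(dom: str) -> None: pass

async def process_document(pdf: bytes) -> None:
    """Chain the transformation stages for a single document."""
    doc = await extract_layout(pdf)               # bounding boxes + labels
    doc = await map_page_numbers(doc)             # physical -> logical pages
    doc = await reclassify_blocks(doc)            # fix block type labels
    doc["images"] = await describe_images(await extract_images(pdf))
    doc = await style_text_blocks(pdf, doc)       # carry over PDF styling
    doc = await detect_top_level_structure(doc)   # front/body/back matter
    dom = await detect_structure(doc)             # nested HTML DOM
    await upload_to_db(dom)                       # graph nodes to PostgreSQL

if __name__ == "__main__":
    asyncio.run(process_document(b"%PDF-1.7 ..."))
```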
The project has evolved from a complex multi-stage schema (see schema_legacy.md) to a simplified DOM-based approach (see schema_revision.md). The current schema closely follows HTML DOM structure while adding semantic enrichments through data attributes and relationships.
- DOM-based Structure: Mirrors HTML element hierarchy for easy reconstruction
- Semantic Enrichment: Uses `data-section-type` attributes for rich semantic labeling
- Positional Data: Maintains PDF page numbers and bounding boxes
- Content Relationships: Captures citations, footnotes, and cross-references
- Multi-modal Support: Handles text, images, and vector graphics uniformly
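As a concrete illustration, a single node might be modeled roughly like this; the class and field names below are assumptions for the sake of example, not the actual definitions in utils/schema.py or transform/models.py:

```python
from pydantic import BaseModel

class ContentNode(BaseModel):
    """Illustrative shape of one DOM node -- field names are assumptions,
    not the actual schema from utils/schema.py."""
    tag: str                                # HTML element name, e.g. "section" or "p"
    section_type: str | None = None         # mirrors the data-section-type attribute
    text: str | None = None                 # text content, if any
    page: int | None = None                 # logical PDF page number
    bbox: tuple[float, float, float, float] | None = None  # x0, y0, x1, y1
    children: list["ContentNode"] = []      # nested DOM hierarchy

# e.g. a chapter heading located on logical page 12
node = ContentNode(
    tag="h1",
    section_type="chapter-title",
    text="1. Climate and Development Context",
    page=12,
    bbox=(72.0, 96.5, 540.0, 128.0),
)
```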
This project uses uv for dependency management:

```bash
# Install dependencies
uv sync

# Or add new dependencies
uv add package_name
```

Create a `.env` file:

```bash
# LLM API credentials
GEMINI_API_KEY=
DEEPSEEK_API_KEY=
OPENROUTER_API_KEY=
OPENAI_API_KEY=
# Temporary bug fix to prevent litellm resource leakage
DISABLE_AIOHTTP_TRANSPORT=True
# Experimental ML document layout extraction service
LAYOUT_EXTRACTOR_API_URL=
LAYOUT_EXTRACTOR_API_KEY=
# AWS S3 credentials
S3_BUCKET_NAME=
AWS_REGION=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
# Database instance
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=ccdr-explorer-db
```
Run the complete pipeline to process unprocessed documents from the database:
```bash
uv run ingest_ccdrs.py
```

This will:
- Query the database for unprocessed documents (documents with no child nodes)
- Download PDFs from S3 or World Bank URLs
- Process documents through the complete pipeline
- Upload structured nodes back to the database
- Output all intermediate artifacts to `./artifacts/wkdir` for debugging
The pipeline processes documents in batches (configurable via the `LIMIT` variable) and includes comprehensive error handling, with fail-fast validation for required environment variables and database schema synchronization.
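A minimal sketch of what that fail-fast validation can look like, assuming the variable list from the `.env` sample above; the helper itself is illustrative, not the project's actual startup code:

```python
import os
import sys

REQUIRED_ENV_VARS = [
    "GEMINI_API_KEY", "DEEPSEEK_API_KEY", "OPENROUTER_API_KEY",
    "OPENAI_API_KEY", "LAYOUT_EXTRACTOR_API_URL", "LAYOUT_EXTRACTOR_API_KEY",
    "S3_BUCKET_NAME", "AWS_REGION", "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY", "POSTGRES_USER", "POSTGRES_PASSWORD",
    "POSTGRES_HOST", "POSTGRES_PORT", "POSTGRES_DB",
]

def validate_env() -> None:
    """Exit immediately if any required variable is missing or empty."""
    missing = [v for v in REQUIRED_ENV_VARS if not os.environ.get(v)]
    if missing:
        sys.exit(f"Missing required environment variables: {', '.join(missing)}")
```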
You can also test individual transformation components:
```bash
uv run -m transform.extract_layout document.pdf output.json
uv run -m transform.map_page_numbers
uv run -m transform.describe_images
uv run -m transform.detect_structure
```

The repository is organized as follows:

```
ccdr-ingestion-workflow/
├── ingest_ccdrs.py                    # Main pipeline orchestrator
├── transform/                         # Core transformation modules
│   ├── extract_layout.py              # PDF layout extraction using Layout Extractor API
│   ├── map_page_numbers.py            # Logical page number mapping using LLM router
│   ├── reclassify_blocks.py           # Content block type reclassification
│   ├── extract_images.py              # Image extraction from PDF
│   ├── describe_images.py             # Image description using Vision Language Models
│   ├── style_text_blocks.py           # Text styling from PDF formatting
│   ├── detect_top_level_structure.py  # Top-level document structure detection
│   ├── detect_structure.py            # Nested structure detection with concurrency control
│   ├── upload_to_db.py                # Database upload functionality
│   └── models.py                      # Pydantic data models
├── utils/                             # Utility modules
│   ├── db.py                          # Database connection and schema validation
│   ├── schema.py                      # Database schema definitions
│   ├── aws.py                         # S3 and AWS operations
│   └── html.py                        # HTML processing utilities
├── artifacts/                         # Working directory for pipeline outputs
└── pyproject.toml                     # Project dependencies
```
- Queries PostgreSQL database for unprocessed documents (see the query sketch after this list)
- Downloads PDFs from S3 or World Bank URLs as fallback
- Uploads structured content back to database as graph nodes
- Fail-fast validation for environment variables and schema sync
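As referenced in the first item, "unprocessed" means a document with no child nodes. A hedged sketch of that discovery query with SQLModel, using the credentials from the sample `.env`; the `Document` and `Node` tables here are simplified stand-ins for the real definitions in utils/schema.py:

```python
from sqlmodel import Field, Session, SQLModel, create_engine, select

# Hypothetical, simplified tables -- real definitions live in utils/schema.py.
class Document(SQLModel, table=True):
    id: int | None = Field(default=None, primary_key=True)
    title: str

class Node(SQLModel, table=True):
    id: int | None = Field(default=None, primary_key=True)
    document_id: int = Field(foreign_key="document.id")

engine = create_engine(
    "postgresql://postgres:postgres@localhost:5432/ccdr-explorer-db"
)

def find_unprocessed(limit: int = 10) -> list[Document]:
    """Documents that have no child nodes yet, i.e. nothing ingested."""
    with Session(engine) as session:
        node_docs = select(Node.document_id).scalar_subquery()
        stmt = select(Document).where(~Document.id.in_(node_docs)).limit(limit)
        return list(session.exec(stmt))
```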
- Uses dedicated Layout Extractor API for precise bounding box detection (see the request sketch after this list)
- Intelligent logical page number mapping using LLM analysis
- Content block reclassification to improve accuracy
- Header/footer filtering based on logical page analysis
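As referenced above, layout analysis calls an external service. A sketch of what such a request might look like with httpx; the endpoint, payload shape, and response format are assumptions, since the Layout Extractor API contract is not documented here:

```python
import os
import httpx

def request_layout(pdf_bytes: bytes) -> dict:
    """Hypothetical request shape -- the real payload and response are
    defined by the Layout Extractor service, not documented here."""
    resp = httpx.post(
        os.environ["LAYOUT_EXTRACTOR_API_URL"],
        headers={"Authorization": f"Bearer {os.environ['LAYOUT_EXTRACTOR_API_KEY']}"},
        files={"file": ("document.pdf", pdf_bytes, "application/pdf")},
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()  # assumed: bounding boxes + element labels per page
```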
- Text: Preserves styling and formatting from original PDF
- Images: Automatic extraction with context-aware AI descriptions (see the PyMuPDF sketch after this list)
- Structure: Hierarchical document organization with nested sections
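The extraction side follows the standard PyMuPDF pattern; a minimal sketch (the project's real logic lives in transform/extract_images.py and may differ):

```python
import fitz  # PyMuPDF

def extract_images(pdf_path: str, out_dir: str) -> list[str]:
    """Save each embedded image once and return the written file paths."""
    paths: list[str] = []
    seen: set[int] = set()
    doc = fitz.open(pdf_path)
    for page in doc:
        for img in page.get_images(full=True):
            xref = img[0]                   # image's PDF cross-reference id
            if xref in seen:                # same image can recur across pages
                continue
            seen.add(xref)
            info = doc.extract_image(xref)  # raw bytes plus metadata
            path = f"{out_dir}/p{page.number + 1}_x{xref}.{info['ext']}"
            with open(path, "wb") as f:
                f.write(info["image"])
            paths.append(path)
    doc.close()
    return paths
```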
- Two-stage structure detection (top-level and nested)
- Uses multiple LLM providers with router-based load balancing
- Concurrency control for efficient API usage
- Context-aware section identification
- LiteLLM Router with advanced load balancing and fallbacks
- Multiple provider support (Gemini, OpenAI, DeepSeek, OpenRouter)
- Built-in retry logic and error handling
- Configurable rate limiting and concurrency control
- Async/await patterns throughout the pipeline (see the sketch after this list)
- Batch processing with configurable limits
- Comprehensive error handling and recovery
- Intermediate artifact preservation for debugging
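A minimal sketch of the semaphore pattern behind these points; the concurrency limit and stage function are illustrative placeholders:

```python
import asyncio

MAX_CONCURRENT_CALLS = 5  # illustrative limit, not the project's setting
semaphore = asyncio.Semaphore(MAX_CONCURRENT_CALLS)

async def call_stage(item: str) -> str:
    """Stand-in for an LLM-backed stage; at most 5 run at once."""
    async with semaphore:
        await asyncio.sleep(0.1)  # simulate an API call
        return f"processed:{item}"

async def run_batch(items: list[str]) -> list[str]:
    # gather schedules everything at once; the semaphore throttles execution
    return await asyncio.gather(*(call_stage(i) for i in items))

if __name__ == "__main__":
    print(asyncio.run(run_batch([f"block-{n}" for n in range(20)])))
```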
- SQLModel: Database ORM and schema definitions
- PostgreSQL: Primary database for document storage
- PyMuPDF: PDF parsing and content extraction
- Pillow: Image processing and conversion
- LiteLLM: Unified LLM API interface with router support
- Pydantic: Data validation and serialization
- Tenacity: Retry logic for API calls (see the sketch after this list)
- Boto3: AWS S3 integration for PDF storage
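As an example of the Tenacity usage noted above, a retry decorator with exponential backoff; the wrapped function is a hypothetical stand-in for one of the pipeline's API calls:

```python
import random
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=30))
def call_flaky_api() -> str:
    """Hypothetical stand-in for a layout or LLM API call."""
    if random.random() < 0.5:  # simulate a transient failure
        raise ConnectionError("transient network error")
    return "ok"

print(call_flaky_api())  # up to 3 attempts, exponential backoff capped at 30s
```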
The project uses multiple LLM providers through LiteLLM Router:
- Gemini: Image description and top-level structure detection
- OpenAI: Available through router for various tasks
- DeepSeek: Page number mapping and nested structure detection
- OpenRouter: Alternative provider access
The router provides load balancing, fallbacks, and automatic retry logic with configurable concurrency limits.
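A hedged sketch of a Router setup along these lines; the deployment names, model identifiers, and fallback mapping are illustrative assumptions, not the project's actual configuration:

```python
import os
from litellm import Router

router = Router(
    model_list=[
        {   # primary deployment for structure detection
            "model_name": "structure-detection",
            "litellm_params": {
                "model": "deepseek/deepseek-chat",
                "api_key": os.environ["DEEPSEEK_API_KEY"],
            },
        },
        {   # alternate route to the same model via OpenRouter
            "model_name": "structure-detection-fallback",
            "litellm_params": {
                "model": "openrouter/deepseek/deepseek-chat",
                "api_key": os.environ["OPENROUTER_API_KEY"],
            },
        },
    ],
    fallbacks=[{"structure-detection": ["structure-detection-fallback"]}],
    num_retries=2,
)

async def detect(prompt: str) -> str:
    resp = await router.acompletion(
        model="structure-detection",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```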
- ✅ Database-driven document discovery and processing
- ✅ PDF download from S3 and World Bank URLs
- ✅ Layout extraction using dedicated API
- ✅ Logical page number mapping with LLM analysis
- ✅ Content block reclassification and filtering
- ✅ Image extraction and description with VLM
- ✅ Text styling preservation from PDF formatting
- ✅ Two-stage hierarchical structure detection
- ✅ Database ingestion of structured content
- ✅ Concurrent processing with semaphore control
- ✅ LiteLLM Router integration with multiple providers
- 🚧 Relationship extraction from anchor tags (`create_relations.py`)
- 🚧 Vector embedding generation for semantic search
- 📋 Enhanced error recovery
- 📋 Performance optimization and batch size tuning
- 📋 Enhanced LLM response validation/evaluation
- 📋 Fine-tuning dataset prep and/or similar-example injection
This project uses modern Python practices:
- Type hints throughout
- Pydantic models for data validation
- Async/await for concurrent operations
- Comprehensive error handling and retry logic
License: MIT
This project processes World Bank Country and Climate Development Reports (CCDRs) to make them more accessible for research and analysis through semantic search and retrieval systems.