
Commit 2ed4a1a

Merge pull request #38 from Teal-Insights/27-make-sure-the-extraction-workflow-is-repeatable
Automate the extraction workflow to update the CCDR database once per month
2 parents 325607f + a8ce022 commit 2ed4a1a

28 files changed: +3340 -4251 lines changed

.cursor/rules/general.mdc
Lines changed: 44 additions & 34 deletions

@@ -3,60 +3,70 @@ description:
globs:
alwaysApply: true
---
-This repository contains workflows and scripts for:
+This repository contains the extraction workflow for scraping World Bank Country and Climate Development Reports (CCDRs) from the Internet. This is part of the data preparation pipeline for the [CCDR Explorer](mdc:README.md), created by Teal Insights for Nature Finance.

-1. Scraping the World Bank Country and Climate Development Reports (CCDRs) from the Internet
-2. Transforming the scraped documents into structured JSON data
-3. Uploading to a PostgreSQL database
-4. Generating embeddings for semantic search
+The repository focuses specifically on:
+
+1. Scraping CCDR publication metadata from the World Bank Open Knowledge Repository
+2. Downloading PDF files
+3. Uploading metadata to a PostgreSQL database
+4. Uploading PDFs to OpenAI vector store for AI-powered search

Working documents and helper files not meant to be retained may be placed in the `artifacts` folder, which is ignored by git.

To avoid confusion, keep your bash console opened to the project root and specify all file paths relative to root.

-# Workflow
-
-There are currently 61 CCDRs, comprising 126 PDF files and a total of 8,185 pages. (Additional CCDRs are published regularly, so ultimately we will need to automate the end-to-end workflow to continually ingest newly published documents. But for now, the goal is proof of concept.)
+# Workflow

-The ETL workflow is organized into `extract`, `transform`, and `load` folders. Each folder contains Python and Bash scripts with filenames numbered in a logical sequence (the order they were developed or are meant to be run).
+There are currently at least 67 CCDRs (including two Tajikistan reports with the same title), comprising on the order of 200 PDF files. Additional CCDRs are published regularly, so ultimately we will need to automate the end-to-end workflow to continually ingest newly published documents.

-## Extraction
+The extraction workflow is organized in the `extract` folder and can be run as a complete pipeline using [extract_ccdrs.py](mdc:extract_ccdrs.py) or as individual steps.

-The `extract` folder contains scripts for scraping the CCDRs from the World Bank Open Knowledge Repository. Each CCDR ("publication") may consist of one or more PDF files ("documents").
+## Extraction Steps

-First we collect links to the CCDR detail pages in `extract/data/publication_links.json`, an array with keys for publication "title", details page "url", "source" repository title, and "page_found" in the paginated CCDR collection results. Then we enrich this array—with data scraped from the publication details page, generated publication and document IDs, and LLM-generated document classifications—and write the result to an array in `extract/data/publication_details.json` with the following schema: `extract/data/publication_details_schema.json`.
+The extraction workflow consists of 9 sequential steps:

-Finally, we download all PDF files to `extract/data/pub_*/dl_*.pdf`, where the asterisks are incrementing integers. (The document IDs skip some integers, corresponding to skipped plaintext versions of the PDF files.) Some downloaded files are compressed as `.bin`, so we unzip those.
+1. **Extract Publication Links** - Scrapes publication links from the World Bank repository, creating `data/publication_links.json`
+2. **Extract Publication Details** - Extracts detailed information from each publication page, creating `data/publication_details.json`
+3. **Add IDs** - Adds unique IDs to publications and download links
+4. **Classify File Types** - Classifies file types for each download link using LLM
+5. **Filter Download Links** - Filters and classifies which links to download
+6. **Download Files** - Downloads the selected PDF files to `data/pub_*/doc_*.pdf`
+7. **Convert BIN Files** - Converts .bin files to .pdf if they are PDF documents
+8. **Upload to Database** - Uploads publications and documents to PostgreSQL database using [schema.py](mdc:extract/schema.py)
+9. **Upload PDFs to OpenAI** - Uploads PDF files to OpenAI vector store for AI-powered search

-## Transformation
+Each CCDR ("publication") may consist of one or more PDF files ("documents"). The workflow enriches publication data with scraped details, generated IDs, and LLM-generated document classifications.

-In the `transform` folder, we first define our desired database schema and map it to the above JSON schema in [1_db_schema_design.md](mdc:transform/1_db_schema_design.md).
+## Usage

-From here, we explore various methods for transforming the downloaded PDF files into hierarchically structured data matching our schema. Ideally, we could do this in a single pass with a model specialized for the purpose. In practice, we will probably need a multi-step workflow. Perhaps the cumulative results from the multi-step workflow can eventually be used to create synthetic data for training a model.
+Run the complete workflow:
+```bash
+uv run extract_ccdrs.py
+```

-Currently, we have scripts to:
+Or run individual steps:
+```bash
+uv run -m extract.extract_publication_links
+uv run -m extract.extract_publication_details
+# ... etc
+```

-1. Convert each page to an image and use a VLM to generate bounding box coordinates for contentful images such as charts and tables, writing the result to `transform/images/document_regions.json` (schema in `transform/images/document_regions_schema.json`)
-2. Mechanically extract text with pyMuPDF and write to `transform/text/text_content.json`
-3. Clean up extracted text and convert to markdown with an LLM, writing to `transform/text/text_content_processed.json`
-4. Identify page ranges of [document components](mdc:https:/sparontologies.github.io/doco/current/doco.html) with an LLM and write to `transform/text/sections.json` (schema in `transform/text/sections_schema.json`)
-5. Extract hierarchical section headings from table of contents text using an LLM and write to `transform/text/hierarchical_headings.json`
-6. Extract hierarchical section headings from a document without ToC using an LLM and write to `transform/text/headings.json`
+## Document Content Processing

-I would like to create some evals to measure workflow performance both on the individual subtasks and on the end-to-end workflow. We should create the most naive possible version of the end-to-end workflow and start with that. Then I'll see if I can improve on that. To measure performance, I can probably just use cosine similarity to a human-produced result.
+Note that this repository handles extraction and basic metadata processing only. The detailed parsing and transformation of document content (text extraction, section identification, hierarchical structuring, etc.) happens in separate repositories as part of the broader CCDR analysis pipeline.

-The steps, as I see them, roughly in order of priority/practicality:
+- This project uses `uv` to manage dependencies and run Python files.

-1. Create, for a few PDF pages, my own ideal human-produced example and a simple eval scoring script.
-2. Try giving Google Gemini a few PDF pages with a prompt to see what it can do out of the box. Try with both the actual PDF pages extracted from file and then with images of the pages.
-3. Run a PDF through a few commercial and open-source tools to see what outputs we get.
-4. Try using an LLM to draw bounding boxes around both text and images with an array to indicate reading order, and then mechanically extract. (Section hierarchy a challenge here; can we enrich with hierarchical heading extraction?)
-5. Try mechanical text extraction with markdown and see if maybe we can de mechanical image extraction as well, even though the images are SVGs and their boundaries are therefore hard to detect. (Reading order and section hierarchy a challenge here; can we guess them mechanically?)
-6. Try mechanical text extraction plus VLM image extraction. (Reading order and section hierarchy a challenge here; can we guide with an LLM?)
+Add Python dependencies like:

-## Loading
+```bash
+uv add pydantic # add --dev flag for development dependencies
+```

-Currently we have scripts for creating a PostgreSQL database matching our schema, for uploading `extract/data/publication_details.json` to our PostgreSQL tables, and for uploading PDFs to S3 (though this may ultimately be unnecessary, given that the PDFs are available online from the World Bank).
+Then run the file with `uv run`, like:

-In future, we will need to upload each document's extracted content nodes.
+```bash
+uv run -m extract.extract_publication_links
+```
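The nine steps above map one-to-one onto modules in the `extract` package. For orientation, here is a minimal sketch of how an orchestrator such as `extract_ccdrs.py` could chain them; it assumes each module exposes a no-argument `main()` callable, which this diff does not show:

```python
# Hypothetical orchestrator sketch (not from this commit). Assumes each
# extract.* module exposes a no-argument main() callable.
from extract import (
    extract_publication_links,
    extract_publication_details,
    add_ids,
    classify_file_types,
    filter_download_links,
    download_files,
    convert_bin_files,
    upload_pubs_to_db,
    upload_pdfs_to_openai,
)

STEPS = [
    extract_publication_links.main,
    extract_publication_details.main,
    add_ids.main,
    classify_file_types.main,
    filter_download_links.main,
    download_files.main,
    convert_bin_files.main,
    upload_pubs_to_db.main,
    upload_pdfs_to_openai.main,
]


def run_pipeline() -> None:
    """Run the nine extraction steps in order; any exception stops the run."""
    for step in STEPS:
        step()


if __name__ == "__main__":
    run_pipeline()
```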

.cursor/rules/python.mdc
Lines changed: 7 additions & 5 deletions

@@ -3,16 +3,18 @@ description:
globs:
alwaysApply: true
---
+# Python dependency management
+
This project uses `uv` to manage dependencies and run Python files.

-Add Python dependencies like:
+Add Python dependencies to the workspace like:

```bash
uv add pydantic # add --dev flag for development dependencies
```

-Then run the file with with `uv run`, like:
+Then run the file as a module with `uv run -m`.

-```bash
-uv run -m transform.1_extract_images
-```
+# Python programming patterns
+
+Write type-annotated Python code that will pass a mypy check. Objects returned from functions should generally be Pydantic models rather than dictionaries; this helps enforce the interface between modules.
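As an illustration of the pattern this rule asks for, a small type-annotated function that returns a Pydantic model rather than a dict; the `PublicationLink` model is hypothetical, though its fields mirror the `publication_links.json` keys mentioned in the old `general.mdc`:

```python
# Illustrative only: a mypy-friendly function that returns a Pydantic model
# instead of a dict. The PublicationLink model is hypothetical; its fields
# mirror the keys of publication_links.json (title, url, source, page_found).
from pydantic import BaseModel, HttpUrl


class PublicationLink(BaseModel):
    title: str
    url: HttpUrl
    source: str
    page_found: int


def parse_publication_link(raw: dict[str, object]) -> PublicationLink:
    """Validate one scraped record into a typed model."""
    return PublicationLink.model_validate(raw)
```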

.cursor/rules/scraping.mdc
Lines changed: 1 addition & 3 deletions

@@ -3,6 +3,4 @@ description: Data discovery and scraping
globs:
alwaysApply: false
---
-For deep Internet research and question-answering, you may find it helpful to send queries to a Perplexity research agent using your `perplexity` tool.
-
-To find and scrape data from the Internet, you will probably want to use your web search tool with `wget`, ask the user to enable the `playwright` tool, or make scripted requests to an official API.
+To find and scrape data from the Internet, you will probably want to use your web search in combination with command-line tools for fetching the content of a specific page, such as `wget` for static pages or `playwright` for dynamic ones.
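A hedged sketch of the two fetch paths the revised rule describes, static pages via `wget` and dynamic ones via Playwright's sync API; the URLs and output paths would be placeholders, not values from this repository:

```python
# Placeholder example of the two fetch paths: wget for static pages,
# Playwright's sync API for pages that need JavaScript rendering.
import subprocess

from playwright.sync_api import sync_playwright


def fetch_static(url: str, out_path: str) -> None:
    """Download a static page with wget."""
    subprocess.run(["wget", "-O", out_path, url], check=True)


def fetch_dynamic(url: str) -> str:
    """Render a dynamic page in headless Chromium and return its HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```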

.env.example
Lines changed: 12 additions & 4 deletions

@@ -4,8 +4,16 @@ POSTGRES_PASSWORD=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=ccdr-explorer-db
-ASSISTANT_ID=
-OPENAI_API_KEY=
+
+# AWS S3 bucket
S3_BUCKET_NAME=
-AWS_REGION=us-east-1
-AWS_PROFILE=
+AWS_REGION=
+
+# AWS S3 IAM user credentials
+AWS_ACCESS_KEY_ID=
+AWS_SECRET_ACCESS_KEY=
+# or `AWS_PROFILE=` for single-sign on (SSO) authentication method
+
+# OpenAI Configuration (optional)
+OPENAI_API_KEY=
+ASSISTANT_ID=
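For illustration only, one way the variables above could be consumed in Python, assuming `python-dotenv` is used to load the local `.env` file (in CI the same names would come from GitHub secrets instead):

```python
# Illustration only: reading the variables defined in .env.example.
# Assumes python-dotenv is available for local runs; in GitHub Actions the
# values are injected from repository secrets, so load_dotenv finds no file.
import os

from dotenv import load_dotenv

load_dotenv()

postgres_url = (
    f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}"
    f"/{os.environ['POSTGRES_DB']}"
)
s3_bucket = os.getenv("S3_BUCKET_NAME", "")
openai_api_key = os.getenv("OPENAI_API_KEY")  # optional, per the comment above
```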
Lines changed: 43 additions & 0 deletions

@@ -0,0 +1,43 @@
+name: Monthly CCDR Extraction Pipeline
+
+on:
+  schedule:
+    # Runs at 2:00 AM UTC on the 1st day of every month
+    - cron: '0 2 1 * *'
+  workflow_dispatch:
+    # Allow manual triggering
+
+jobs:
+  extract-ccdrs:
+    runs-on: ubuntu-latest
+    timeout-minutes: 30  # 30 minutes timeout for the full pipeline
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v6
+
+      - name: Install Playwright browsers
+        run: |
+          uv sync
+          uv run playwright install chromium
+          uv run playwright install-deps
+
+      - name: Run CCDR extraction pipeline
+        env:
+          POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
+          POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
+          POSTGRES_HOST: ${{ secrets.POSTGRES_HOST }}
+          POSTGRES_PORT: ${{ secrets.POSTGRES_PORT }}
+          POSTGRES_DB: ${{ secrets.POSTGRES_DB }}
+          S3_BUCKET_NAME: ${{ secrets.S3_BUCKET_NAME }}
+          AWS_REGION: ${{ secrets.AWS_REGION }}
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+          ASSISTANT_ID: ${{ secrets.ASSISTANT_ID }}
+        run: |
+          echo "Running full CCDR extraction pipeline with OpenAI upload and cleanup"
+          uv run extract_ccdrs.py --openai --cleanup

.gitignore
Lines changed: 1 addition & 0 deletions

@@ -1,4 +1,5 @@
.env
+.env.production
.specstory
.cursorindexingignore
repomix-output.*

README.md
Lines changed: 58 additions & 50 deletions

@@ -2,84 +2,92 @@

This directory contains the complete workflow for extracting World Bank Country and Climate Development Reports (CCDRs). It is part of the data preparation pipeline for the [CCDR Explorer](https://github.com/Teal-Insights/ccdr-explorer-client), created by Teal Insights for Nature Finance.

+## Automated Pipeline
+
+This workflow runs automatically **once a month** via GitHub Actions to discover and process newly published CCDRs. The automated pipeline:
+
+- Runs on the 1st day of each month at 2:00 AM UTC
+- Identifies new publications not yet in the database
+- Downloads and processes only new content
+- Uploads PDFs to AWS S3 and metadata to PostgreSQL
+- Includes OpenAI vector store integration for AI-powered search
+- Can also be triggered manually when needed
+
+The pipeline is designed to be incremental - it only processes new CCDRs that haven't been seen before, making monthly runs efficient and cost-effective.
+
## Quick Start

-To run the complete extraction workflow:
+To run the complete extraction workflow locally:

```bash
uv run extract_ccdrs.py
```

-## Individual Steps
+## Workflow Architecture

-The workflow consists of 9 steps that can also be run individually:
+The extraction workflow uses a **two-stage architecture**:

-1. **Extract Publication Links** (`extract_publication_links.py`)
-   - Scrapes publication links from the World Bank repository
-   - Output: `data/publication_links.json`
+### Stage 1: Metadata Ingestion & Persistence

-2. **Extract Publication Details** (`extract_publication_details.py`)
-   - Extracts detailed information from each publication page
-   - Output: `data/publication_details.json`
+- Scrapes publication links from the World Bank repository
+- Extracts detailed metadata from each publication page
+- Classifies download links and file types
+- Persists publication and document metadata to PostgreSQL database

-3. **Add IDs** (`add_ids.py`)
-   - Adds unique IDs to publications and download links
-   - Modifies: `data/publication_details.json`
+### Stage 2: File Processing & Record Enrichment

-4. **Classify File Types** (`classify_file_types.py`)
-   - Classifies file types for each download link
-   - Modifies: `data/publication_details.json`
+- Downloads PDF files for documents stored in the database
+- Converts file formats when necessary (e.g., .bin to .pdf)
+- Uploads files to AWS S3 storage
+- Updates database records with S3 URLs and file metadata

-5. **Filter Download Links** (`filter_download_links.py`)
-   - Filters and classifies which links to download
-   - Modifies: `data/publication_details.json`
+The database serves as the handoff point between stages, enabling better error recovery and allowing stages to be run independently.

-6. **Download Files** (`download_files.py`)
-   - Downloads the selected PDF files
-   - Output: `data/pub_*/dl_*.pdf`
+## Running Specific Stages

-7. **Convert BIN Files** (`convert_bin_files.py`)
-   - Converts .bin files to .pdf if they are PDF documents
-   - Modifies: Files in `data/` directory
-
-8. **Upload to Database** (`upload_pubs_to_db.py`)
-   - Uploads publications and documents to the PostgreSQL database
-   - Uses: `data/publication_details.json`
+```bash
+# Run only metadata ingestion
+uv run extract_ccdrs.py --stage1

-9. **Upload PDFs to OpenAI** (`upload_pdfs_to_openai.py`)
-   - Uploads PDF files to OpenAI vector store for AI-powered search
-   - Uses: PDF files from `data/pub_*/` directories
-   - Includes deduplication to avoid uploading existing files
+# Run only file processing
+uv run extract_ccdrs.py --stage2

-## Running Individual Steps
+# Include OpenAI upload
+uv run extract_ccdrs.py --openai

-```bash
-# Run a specific step
-uv run -m extract.extract_publication_links
-uv run -m extract.extract_publication_details
-uv run -m extract.add_ids
-uv run -m extract.classify_file_types
-uv run -m extract.filter_download_links
-uv run -m extract.download_files
-uv run -m extract.convert_bin_files
-uv run -m extract.upload_pubs_to_db
-uv run -m extract.upload_pdfs_to_openai
+# Clean up local files after processing
+uv run extract_ccdrs.py --cleanup
```

## Configuration

-For the OpenAI upload step, you'll need to set up environment variables in `extract/.env`:
+The workflow requires several environment variables. For local development, create an `.env` file in the project root:

```bash
+# Database Configuration
+POSTGRES_USER=your_postgres_user
+POSTGRES_PASSWORD=your_postgres_password
+POSTGRES_HOST=your_postgres_host
+POSTGRES_PORT=5432
+POSTGRES_DB=your_database_name
+
+# AWS S3 Configuration
+S3_BUCKET_NAME=your_s3_bucket_name
+AWS_REGION=your_aws_region
+AWS_ACCESS_KEY_ID=your_aws_access_key
+AWS_SECRET_ACCESS_KEY=your_aws_secret_key
+
+# OpenAI Configuration (optional)
OPENAI_API_KEY=your_openai_api_key_here
ASSISTANT_ID=your_openai_assistant_id_here
```

## Output

After running the complete workflow, you'll have:
-- JSON files with publication metadata in `data/`
-- Downloaded PDF files organized in `data/pub_*/` directories
-- Publications and documents uploaded to the PostgreSQL database
-- PDF files uploaded to OpenAI vector store for AI-powered search
-- Currently processes 61 CCDRs comprising 126+ PDF files
+
+- Publication and document metadata stored in PostgreSQL database
+- PDF files uploaded to AWS S3 storage with organized naming
+- PDF files uploaded to OpenAI vector store for AI-powered search (if configured)
+- Local temporary files cleaned up automatically
+- Currently processes 67+ CCDRs comprising 198+ PDF files, with new publications added monthly
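To make the flag behavior concrete, here is a hypothetical sketch of the CLI dispatch described in the new README; the flag names come from the README, while the `argparse` wiring and stage functions are assumptions, not code from `extract_ccdrs.py`:

```python
# Hypothetical sketch of the CLI described in the README; flag names are from
# the README, and the stage functions below are stand-ins rather than real code.
import argparse


def run_stage1() -> None:
    """Stand-in: scrape metadata and persist it to PostgreSQL."""


def run_stage2() -> None:
    """Stand-in: download files, upload to S3, enrich database records."""


def upload_to_openai() -> None:
    """Stand-in: push PDFs to the OpenAI vector store."""


def cleanup_local_files() -> None:
    """Stand-in: delete temporary downloads after processing."""


def main() -> None:
    parser = argparse.ArgumentParser(description="CCDR extraction pipeline")
    parser.add_argument("--stage1", action="store_true", help="metadata ingestion only")
    parser.add_argument("--stage2", action="store_true", help="file processing only")
    parser.add_argument("--openai", action="store_true", help="also upload PDFs to OpenAI")
    parser.add_argument("--cleanup", action="store_true", help="remove local files afterwards")
    args = parser.parse_args()

    run_all = not (args.stage1 or args.stage2)
    if args.stage1 or run_all:
        run_stage1()
    if args.stage2 or run_all:
        run_stage2()
    if args.openai:
        upload_to_openai()
    if args.cleanup:
        cleanup_local_files()


if __name__ == "__main__":
    main()
```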
