
Commit 2ed4a1a

Merge pull request #38 from Teal-Insights/27-make-sure-the-extraction-workflow-is-repeatable
Automate the extraction workflow to update the CCDR database once per month
2 parents 325607f + a8ce022 commit 2ed4a1a

28 files changed: +3340 -4251 lines changed

.cursor/rules/general.mdc
Lines changed: 44 additions & 34 deletions

@@ -3,60 +3,70 @@ description:
globs:
alwaysApply: true
---
-This repository contains workflows and scripts for:
+This repository contains the extraction workflow for scraping World Bank Country and Climate Development Reports (CCDRs) from the Internet. This is part of the data preparation pipeline for the [CCDR Explorer](mdc:README.md), created by Teal Insights for Nature Finance.

-1. Scraping the World Bank Country and Climate Development Reports (CCDRs) from the Internet
-2. Transforming the scraped documents into structured JSON data
-3. Uploading to a PostgreSQL database
-4. Generating embeddings for semantic search
+The repository focuses specifically on:
+
+1. Scraping CCDR publication metadata from the World Bank Open Knowledge Repository
+2. Downloading PDF files
+3. Uploading metadata to a PostgreSQL database
+4. Uploading PDFs to OpenAI vector store for AI-powered search

Working documents and helper files not meant to be retained may be placed in the `artifacts` folder, which is ignored by git.

To avoid confusion, keep your bash console opened to the project root and specify all file paths relative to root.

-# Workflow
-
-There are currently 61 CCDRs, comprising 126 PDF files and a total of 8,185 pages. (Additional CCDRs are published regularly, so ultimately we will need to automate the end-to-end workflow to continually ingest newly published documents. But for now, the goal is proof of concept.)
+# Workflow

-The ETL workflow is organized into `extract`, `transform`, and `load` folders. Each folder contains Python and Bash scripts with filenames numbered in a logical sequence (the order they were developed or are meant to be run).
+There are currently at least 67 CCDRs (including two Tajikistan reports with the same title), comprising on the order of 200 PDF files. Additional CCDRs are published regularly, so ultimately we will need to automate the end-to-end workflow to continually ingest newly published documents.

-## Extraction
+The extraction workflow is organized in the `extract` folder and can be run as a complete pipeline using [extract_ccdrs.py](mdc:extract_ccdrs.py) or as individual steps.

-The `extract` folder contains scripts for scraping the CCDRs from the World Bank Open Knowledge Repository. Each CCDR ("publication") may consist of one or more PDF files ("documents").
+## Extraction Steps

-First we collect links to the CCDR detail pages in `extract/data/publication_links.json`, an array with keys for publication "title", details page "url", "source" repository title, and "page_found" in the paginated CCDR collection results. Then we enrich this array—with data scraped from the publication details page, generated publication and document IDs, and LLM-generated document classifications—and write the result to an array in `extract/data/publication_details.json` with the following schema: `extract/data/publication_details_schema.json`.
+The extraction workflow consists of 9 sequential steps:

-Finally, we download all PDF files to `extract/data/pub_*/dl_*.pdf`, where the asterisks are incrementing integers. (The document IDs skip some integers, corresponding to skipped plaintext versions of the PDF files.) Some downloaded files are compressed as `.bin`, so we unzip those.
+1. **Extract Publication Links** - Scrapes publication links from the World Bank repository, creating `data/publication_links.json`
+2. **Extract Publication Details** - Extracts detailed information from each publication page, creating `data/publication_details.json`
+3. **Add IDs** - Adds unique IDs to publications and download links
+4. **Classify File Types** - Classifies file types for each download link using LLM
+5. **Filter Download Links** - Filters and classifies which links to download
+6. **Download Files** - Downloads the selected PDF files to `data/pub_*/doc_*.pdf`
+7. **Convert BIN Files** - Converts .bin files to .pdf if they are PDF documents
+8. **Upload to Database** - Uploads publications and documents to PostgreSQL database using [schema.py](mdc:extract/schema.py)
+9. **Upload PDFs to OpenAI** - Uploads PDF files to OpenAI vector store for AI-powered search

-## Transformation
+Each CCDR ("publication") may consist of one or more PDF files ("documents"). The workflow enriches publication data with scraped details, generated IDs, and LLM-generated document classifications.

-In the `transform` folder, we first define our desired database schema and map it to the above JSON schema in [1_db_schema_design.md](mdc:transform/1_db_schema_design.md).
+## Usage

-From here, we explore various methods for transforming the downloaded PDF files into hierarchically structured data matching our schema. Ideally, we could do this in a single pass with a model specialized for the purpose. In practice, we will probably need a multi-step workflow. Perhaps the cumulative results from the multi-step workflow can eventually be used to create synthetic data for training a model.
+Run the complete workflow:
+```bash
+uv run extract_ccdrs.py
+```

-Currently, we have scripts to:
+Or run individual steps:
+```bash
+uv run -m extract.extract_publication_links
+uv run -m extract.extract_publication_details
+# ... etc
+```

-1. Convert each page to an image and use a VLM to generate bounding box coordinates for contentful images such as charts and tables, writing the result to `transform/images/document_regions.json` (schema in `transform/images/document_regions_schema.json`)
-2. Mechanically extract text with pyMuPDF and write to `transform/text/text_content.json`
-3. Clean up extracted text and convert to markdown with an LLM, writing to `transform/text/text_content_processed.json`
-4. Identify page ranges of [document components](mdc:https:/sparontologies.github.io/doco/current/doco.html) with an LLM and write to `transform/text/sections.json` (schema in `transform/text/sections_schema.json`)
-5. Extract hierarchical section headings from table of contents text using an LLM and write to `transform/text/hierarchical_headings.json`
-6. Extract hierarchical section headings from a document without ToC using an LLM and write to `transform/text/headings.json`
+## Document Content Processing

-I would like to create some evals to measure workflow performance both on the individual subtasks and on the end-to-end workflow. We should create the most naive possible version of the end-to-end workflow and start with that. Then I'll see if I can improve on that. To measure performance, I can probably just use cosine similarity to a human-produced result.
+Note that this repository handles extraction and basic metadata processing only. The detailed parsing and transformation of document content (text extraction, section identification, hierarchical structuring, etc.) happens in separate repositories as part of the broader CCDR analysis pipeline.

-The steps, as I see them, roughly in order of priority/practicality:
+- This project uses `uv` to manage dependencies and run Python files.

-1. Create, for a few PDF pages, my own ideal human-produced example and a simple eval scoring script.
-2. Try giving Google Gemini a few PDF pages with a prompt to see what it can do out of the box. Try with both the actual PDF pages extracted from file and then with images of the pages.
-3. Run a PDF through a few commercial and open-source tools to see what outputs we get.
-4. Try using an LLM to draw bounding boxes around both text and images with an array to indicate reading order, and then mechanically extract. (Section hierarchy a challenge here; can we enrich with hierarchical heading extraction?)
-5. Try mechanical text extraction with markdown and see if maybe we can de mechanical image extraction as well, even though the images are SVGs and their boundaries are therefore hard to detect. (Reading order and section hierarchy a challenge here; can we guess them mechanically?)
-6. Try mechanical text extraction plus VLM image extraction. (Reading order and section hierarchy a challenge here; can we guide with an LLM?)
+Add Python dependencies like:

-## Loading
+```bash
+uv add pydantic # add --dev flag for development dependencies
+```

-Currently we have scripts for creating a PostgreSQL database matching our schema, for uploading `extract/data/publication_details.json` to our PostgreSQL tables, and for uploading PDFs to S3 (though this may ultimately be unnecessary, given that the PDFs are available online from the World Bank).
+Then run the file with `uv run`, like:

-In future, we will need to upload each document's extracted content nodes.
+```bash
+uv run -m extract.extract_publication_links
+```
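The nine steps above map one-to-one onto modules in the `extract` package. For orientation, here is a minimal sketch of how an orchestrator such as `extract_ccdrs.py` could chain them; it assumes each module exposes a no-argument `main()` callable, which this diff does not show:

```python
# Hypothetical orchestrator sketch (not from this commit). Assumes each
# extract.* module exposes a no-argument main() callable.
from extract import (
    extract_publication_links,
    extract_publication_details,
    add_ids,
    classify_file_types,
    filter_download_links,
    download_files,
    convert_bin_files,
    upload_pubs_to_db,
    upload_pdfs_to_openai,
)

STEPS = [
    extract_publication_links.main,
    extract_publication_details.main,
    add_ids.main,
    classify_file_types.main,
    filter_download_links.main,
    download_files.main,
    convert_bin_files.main,
    upload_pubs_to_db.main,
    upload_pdfs_to_openai.main,
]


def run_pipeline() -> None:
    """Run the nine extraction steps in order; any exception stops the run."""
    for step in STEPS:
        step()


if __name__ == "__main__":
    run_pipeline()
```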

.cursor/rules/python.mdc
Lines changed: 7 additions & 5 deletions

@@ -3,16 +3,18 @@ description:
globs:
alwaysApply: true
---
+# Python dependency management
+
This project uses `uv` to manage dependencies and run Python files.

-Add Python dependencies like:
+Add Python dependencies to the workspace like:

```bash
uv add pydantic # add --dev flag for development dependencies
```

-Then run the file with with `uv run`, like:
+Then run the file as a module with `uv run -m`.

-```bash
-uv run -m transform.1_extract_images
-```
+# Python programming patterns
+
+Write type-annotated Python code that will pass a mypy check. Objects returned from functions should generally be Pydantic models rather than dictionaries; this helps enforce the interface between modules.
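As an illustration of the pattern this rule asks for, a small type-annotated function that returns a Pydantic model rather than a dict; the `PublicationLink` model is hypothetical, though its fields mirror the `publication_links.json` keys mentioned in the old `general.mdc`:

```python
# Illustrative only: a mypy-friendly function that returns a Pydantic model
# instead of a dict. The PublicationLink model is hypothetical; its fields
# mirror the keys of publication_links.json (title, url, source, page_found).
from pydantic import BaseModel, HttpUrl


class PublicationLink(BaseModel):
    title: str
    url: HttpUrl
    source: str
    page_found: int


def parse_publication_link(raw: dict[str, object]) -> PublicationLink:
    """Validate one scraped record into a typed model."""
    return PublicationLink.model_validate(raw)
```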

.cursor/rules/scraping.mdc
Lines changed: 1 addition & 3 deletions

@@ -3,6 +3,4 @@ description: Data discovery and scraping
globs:
alwaysApply: false
---
-For deep Internet research and question-answering, you may find it helpful to send queries to a Perplexity research agent using your `perplexity` tool.
-
-To find and scrape data from the Internet, you will probably want to use your web search tool with `wget`, ask the user to enable the `playwright` tool, or make scripted requests to an official API.
+To find and scrape data from the Internet, you will probably want to use your web search in combination with command-line tools for fetching the content of a specific page, such as `wget` for static pages or `playwright` for dynamic ones.
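A hedged sketch of the two fetch paths the revised rule describes, static pages via `wget` and dynamic ones via Playwright's sync API; the URLs and output paths would be placeholders, not values from this repository:

```python
# Placeholder example of the two fetch paths: wget for static pages,
# Playwright's sync API for pages that need JavaScript rendering.
import subprocess

from playwright.sync_api import sync_playwright


def fetch_static(url: str, out_path: str) -> None:
    """Download a static page with wget."""
    subprocess.run(["wget", "-O", out_path, url], check=True)


def fetch_dynamic(url: str) -> str:
    """Render a dynamic page in headless Chromium and return its HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```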

.env.example
Lines changed: 12 additions & 4 deletions

@@ -4,8 +4,16 @@ POSTGRES_PASSWORD=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=ccdr-explorer-db
-ASSISTANT_ID=
-OPENAI_API_KEY=
+
+# AWS S3 bucket
S3_BUCKET_NAME=
-AWS_REGION=us-east-1
-AWS_PROFILE=
+AWS_REGION=
+
+# AWS S3 IAM user credentials
+AWS_ACCESS_KEY_ID=
+AWS_SECRET_ACCESS_KEY=
+# or `AWS_PROFILE=` for single-sign on (SSO) authentication method
+
+# OpenAI Configuration (optional)
+OPENAI_API_KEY=
+ASSISTANT_ID=
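For illustration only, one way the variables above could be consumed in Python, assuming `python-dotenv` is used to load the local `.env` file (in CI the same names would come from GitHub secrets instead):

```python
# Illustration only: reading the variables defined in .env.example.
# Assumes python-dotenv is available for local runs; in GitHub Actions the
# values are injected from repository secrets, so load_dotenv finds no file.
import os

from dotenv import load_dotenv

load_dotenv()

postgres_url = (
    f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}"
    f"/{os.environ['POSTGRES_DB']}"
)
s3_bucket = os.getenv("S3_BUCKET_NAME", "")
openai_api_key = os.getenv("OPENAI_API_KEY")  # optional, per the comment above
```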
Lines changed: 43 additions & 0 deletions

@@ -0,0 +1,43 @@
+name: Monthly CCDR Extraction Pipeline
+
+on:
+  schedule:
+    # Runs at 2:00 AM UTC on the 1st day of every month
+    - cron: '0 2 1 * *'
+  workflow_dispatch:
+    # Allow manual triggering
+
+jobs:
+  extract-ccdrs:
+    runs-on: ubuntu-latest
+    timeout-minutes: 30  # 30 minutes timeout for the full pipeline
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v6
+
+      - name: Install Playwright browsers
+        run: |
+          uv sync
+          uv run playwright install chromium
+          uv run playwright install-deps
+
+      - name: Run CCDR extraction pipeline
+        env:
+          POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
+          POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
+          POSTGRES_HOST: ${{ secrets.POSTGRES_HOST }}
+          POSTGRES_PORT: ${{ secrets.POSTGRES_PORT }}
+          POSTGRES_DB: ${{ secrets.POSTGRES_DB }}
+          S3_BUCKET_NAME: ${{ secrets.S3_BUCKET_NAME }}
+          AWS_REGION: ${{ secrets.AWS_REGION }}
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+          ASSISTANT_ID: ${{ secrets.ASSISTANT_ID }}
+        run: |
+          echo "Running full CCDR extraction pipeline with OpenAI upload and cleanup"
+          uv run extract_ccdrs.py --openai --cleanup

.gitignore
Lines changed: 1 addition & 0 deletions

@@ -1,4 +1,5 @@
.env
+.env.production
.specstory
.cursorindexingignore
repomix-output.*

README.md
Lines changed: 58 additions & 50 deletions

@@ -2,84 +2,92 @@

This directory contains the complete workflow for extracting World Bank Country and Climate Development Reports (CCDRs). It is part of the data preparation pipeline for the [CCDR Explorer](https://github.com/Teal-Insights/ccdr-explorer-client), created by Teal Insights for Nature Finance.

+## Automated Pipeline
+
+This workflow runs automatically **once a month** via GitHub Actions to discover and process newly published CCDRs. The automated pipeline:
+
+- Runs on the 1st day of each month at 2:00 AM UTC
+- Identifies new publications not yet in the database
+- Downloads and processes only new content
+- Uploads PDFs to AWS S3 and metadata to PostgreSQL
+- Includes OpenAI vector store integration for AI-powered search
+- Can also be triggered manually when needed
+
+The pipeline is designed to be incremental - it only processes new CCDRs that haven't been seen before, making monthly runs efficient and cost-effective.
+
## Quick Start

-To run the complete extraction workflow:
+To run the complete extraction workflow locally:

```bash
uv run extract_ccdrs.py
```

-## Individual Steps
+## Workflow Architecture

-The workflow consists of 9 steps that can also be run individually:
+The extraction workflow uses a **two-stage architecture**:

-1. **Extract Publication Links** (`extract_publication_links.py`)
-   - Scrapes publication links from the World Bank repository
-   - Output: `data/publication_links.json`
+### Stage 1: Metadata Ingestion & Persistence

-2. **Extract Publication Details** (`extract_publication_details.py`)
-   - Extracts detailed information from each publication page
-   - Output: `data/publication_details.json`
+- Scrapes publication links from the World Bank repository
+- Extracts detailed metadata from each publication page
+- Classifies download links and file types
+- Persists publication and document metadata to PostgreSQL database

-3. **Add IDs** (`add_ids.py`)
-   - Adds unique IDs to publications and download links
-   - Modifies: `data/publication_details.json`
+### Stage 2: File Processing & Record Enrichment

-4. **Classify File Types** (`classify_file_types.py`)
-   - Classifies file types for each download link
-   - Modifies: `data/publication_details.json`
+- Downloads PDF files for documents stored in the database
+- Converts file formats when necessary (e.g., .bin to .pdf)
+- Uploads files to AWS S3 storage
+- Updates database records with S3 URLs and file metadata

-5. **Filter Download Links** (`filter_download_links.py`)
-   - Filters and classifies which links to download
-   - Modifies: `data/publication_details.json`
+The database serves as the handoff point between stages, enabling better error recovery and allowing stages to be run independently.

-6. **Download Files** (`download_files.py`)
-   - Downloads the selected PDF files
-   - Output: `data/pub_*/dl_*.pdf`
+## Running Specific Stages

-7. **Convert BIN Files** (`convert_bin_files.py`)
-   - Converts .bin files to .pdf if they are PDF documents
-   - Modifies: Files in `data/` directory
-
-8. **Upload to Database** (`upload_pubs_to_db.py`)
-   - Uploads publications and documents to the PostgreSQL database
-   - Uses: `data/publication_details.json`
+```bash
+# Run only metadata ingestion
+uv run extract_ccdrs.py --stage1

-9. **Upload PDFs to OpenAI** (`upload_pdfs_to_openai.py`)
-   - Uploads PDF files to OpenAI vector store for AI-powered search
-   - Uses: PDF files from `data/pub_*/` directories
-   - Includes deduplication to avoid uploading existing files
+# Run only file processing
+uv run extract_ccdrs.py --stage2

-## Running Individual Steps
+# Include OpenAI upload
+uv run extract_ccdrs.py --openai

-```bash
-# Run a specific step
-uv run -m extract.extract_publication_links
-uv run -m extract.extract_publication_details
-uv run -m extract.add_ids
-uv run -m extract.classify_file_types
-uv run -m extract.filter_download_links
-uv run -m extract.download_files
-uv run -m extract.convert_bin_files
-uv run -m extract.upload_pubs_to_db
-uv run -m extract.upload_pdfs_to_openai
+# Clean up local files after processing
+uv run extract_ccdrs.py --cleanup
```

## Configuration

-For the OpenAI upload step, you'll need to set up environment variables in `extract/.env`:
+The workflow requires several environment variables. For local development, create an `.env` file in the project root:

```bash
+# Database Configuration
+POSTGRES_USER=your_postgres_user
+POSTGRES_PASSWORD=your_postgres_password
+POSTGRES_HOST=your_postgres_host
+POSTGRES_PORT=5432
+POSTGRES_DB=your_database_name
+
+# AWS S3 Configuration
+S3_BUCKET_NAME=your_s3_bucket_name
+AWS_REGION=your_aws_region
+AWS_ACCESS_KEY_ID=your_aws_access_key
+AWS_SECRET_ACCESS_KEY=your_aws_secret_key
+
+# OpenAI Configuration (optional)
OPENAI_API_KEY=your_openai_api_key_here
ASSISTANT_ID=your_openai_assistant_id_here
```

## Output

After running the complete workflow, you'll have:
-- JSON files with publication metadata in `data/`
-- Downloaded PDF files organized in `data/pub_*/` directories
-- Publications and documents uploaded to the PostgreSQL database
-- PDF files uploaded to OpenAI vector store for AI-powered search
-- Currently processes 61 CCDRs comprising 126+ PDF files
+
+- Publication and document metadata stored in PostgreSQL database
+- PDF files uploaded to AWS S3 storage with organized naming
+- PDF files uploaded to OpenAI vector store for AI-powered search (if configured)
+- Local temporary files cleaned up automatically
+- Currently processes 67+ CCDRs comprising 198+ PDF files, with new publications added monthly
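To make the flag behavior concrete, here is a hypothetical sketch of the CLI dispatch described in the new README; the flag names come from the README, while the `argparse` wiring and stage functions are assumptions, not code from `extract_ccdrs.py`:

```python
# Hypothetical sketch of the CLI described in the README; flag names are from
# the README, and the stage functions below are stand-ins rather than real code.
import argparse


def run_stage1() -> None:
    """Stand-in: scrape metadata and persist it to PostgreSQL."""


def run_stage2() -> None:
    """Stand-in: download files, upload to S3, enrich database records."""


def upload_to_openai() -> None:
    """Stand-in: push PDFs to the OpenAI vector store."""


def cleanup_local_files() -> None:
    """Stand-in: delete temporary downloads after processing."""


def main() -> None:
    parser = argparse.ArgumentParser(description="CCDR extraction pipeline")
    parser.add_argument("--stage1", action="store_true", help="metadata ingestion only")
    parser.add_argument("--stage2", action="store_true", help="file processing only")
    parser.add_argument("--openai", action="store_true", help="also upload PDFs to OpenAI")
    parser.add_argument("--cleanup", action="store_true", help="remove local files afterwards")
    args = parser.parse_args()

    run_all = not (args.stage1 or args.stage2)
    if args.stage1 or run_all:
        run_stage1()
    if args.stage2 or run_all:
        run_stage2()
    if args.openai:
        upload_to_openai()
    if args.cleanup:
        cleanup_local_files()


if __name__ == "__main__":
    main()
```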
