.cursor/rules/general.mdc: 44 additions & 34 deletions
@@ -3,60 +3,70 @@ description:
globs:
alwaysApply: true
---
- This repository contains workflows and scripts for:
+ This repository contains the extraction workflow for scraping World Bank Country and Climate Development Reports (CCDRs) from the Internet. This is part of the data preparation pipeline for the [CCDR Explorer](mdc:README.md), created by Teal Insights for Nature Finance.
- 1. Scraping the World Bank Country and Climate Development Reports (CCDRs) from the Internet
- 2. Transforming the scraped documents into structured JSON data
- 3. Uploading to a PostgreSQL database
- 4. Generating embeddings for semantic search
+ The repository focuses specifically on:
+ 1. Scraping CCDR publication metadata from the World Bank Open Knowledge Repository
+ 2. Downloading PDF files
+ 3. Uploading metadata to a PostgreSQL database
+ 4. Uploading PDFs to an OpenAI vector store for AI-powered search
Working documents and helper files not meant to be retained may be placed in the `artifacts` folder, which is ignored by git.
To avoid confusion, keep your bash console opened to the project root and specify all file paths relative to root.
- # Workflow
- There are currently 61 CCDRs, comprising 126 PDF files and a total of 8,185 pages. (Additional CCDRs are published regularly, so ultimately we will need to automate the end-to-end workflow to continually ingest newly published documents. But for now, the goal is proof of concept.)
+ # Workflow
- The ETL workflow is organized into `extract`, `transform`, and `load` folders. Each folder contains Python and Bash scripts with filenames numbered in a logical sequence (the order they were developed or are meant to be run).
+ There are currently at least 67 CCDRs (including two Tajikistan reports with the same title), comprising on the order of 200 PDF files. Additional CCDRs are published regularly, so ultimately we will need to automate the end-to-end workflow to continually ingest newly published documents.
- ## Extraction
+ The extraction workflow is organized in the `extract` folder and can be run as a complete pipeline using [extract_ccdrs.py](mdc:extract_ccdrs.py) or as individual steps.
- The `extract` folder contains scripts for scraping the CCDRs from the World Bank Open Knowledge Repository. Each CCDR ("publication") may consist of one or more PDF files ("documents").
+ ## Extraction Steps
- First we collect links to the CCDR detail pages in `extract/data/publication_links.json`, an array with keys for publication "title", details page "url", "source" repository title, and "page_found" in the paginated CCDR collection results. Then we enrich this array—with data scraped from the publication details page, generated publication and document IDs, and LLM-generated document classifications—and write the result to an array in `extract/data/publication_details.json` with the following schema: `extract/data/publication_details_schema.json`.
+ The extraction workflow consists of 9 sequential steps:
- Finally, we download all PDF files to `extract/data/pub_*/dl_*.pdf`, where the asterisks are incrementing integers. (The document IDs skip some integers, corresponding to skipped plaintext versions of the PDF files.) Some downloaded files are compressed as `.bin`, so we unzip those.
+ 1. **Extract Publication Links** - Scrapes publication links from the World Bank repository, creating `data/publication_links.json`
+ 2. **Extract Publication Details** - Extracts detailed information from each publication page, creating `data/publication_details.json`
+ 3. **Add IDs** - Adds unique IDs to publications and download links
+ 4. **Classify File Types** - Classifies file types for each download link using an LLM
+ 5. **Filter Download Links** - Filters and classifies which links to download
+ 6. **Download Files** - Downloads the selected PDF files to `data/pub_*/doc_*.pdf`
+ 7. **Convert BIN Files** - Converts `.bin` files to `.pdf` if they are PDF documents
+ 8. **Upload to Database** - Uploads publications and documents to the PostgreSQL database using [schema.py](mdc:extract/schema.py)
+ 9. **Upload PDFs to OpenAI** - Uploads PDF files to an OpenAI vector store for AI-powered search (see the sketch below)
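As an illustration of step 9, here is a minimal, hypothetical sketch of uploading a single PDF to an OpenAI vector store. The vector store ID and file path are placeholders, and depending on the SDK version the endpoint may live under `client.vector_stores` rather than `client.beta.vector_stores`; the real script may also batch files and handle retries differently.

```python
# Hypothetical sketch only: upload one PDF to an existing OpenAI vector store.
# The vector store ID and file path below are placeholders, not real values.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def upload_pdf(vector_store_id: str, pdf_path: Path) -> None:
    with pdf_path.open("rb") as f:
        # upload_and_poll sends the file and waits for indexing to complete
        client.beta.vector_stores.files.upload_and_poll(
            vector_store_id=vector_store_id,
            file=f,
        )


if __name__ == "__main__":
    upload_pdf("vs_example_id", Path("extract/data/pub_1/doc_1.pdf"))
```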
- ## Transformation
+ Each CCDR ("publication") may consist of one or more PDF files ("documents"). The workflow enriches publication data with scraped details, generated IDs, and LLM-generated document classifications.
- In the `transform` folder, we first define our desired database schema and map it to the above JSON schema in [1_db_schema_design.md](mdc:transform/1_db_schema_design.md).
+ ## Usage
- From here, we explore various methods for transforming the downloaded PDF files into hierarchically structured data matching our schema. Ideally, we could do this in a single pass with a model specialized for the purpose. In practice, we will probably need a multi-step workflow. Perhaps the cumulative results from the multi-step workflow can eventually be used to create synthetic data for training a model.
+ Run the complete workflow:
+ ```bash
+ uv run extract_ccdrs.py
+ ```
- Currently, we have scripts to:
+ Or run individual steps:
+ ```bash
+ uv run -m extract.extract_publication_links
+ uv run -m extract.extract_publication_details
+ # ... etc
+ ```
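For orientation, a minimal, hypothetical sketch of how a runner like `extract_ccdrs.py` might chain these steps as module invocations follows; only the two module names shown above come from this document, and the actual script may be organized quite differently.

```python
# Hypothetical orchestration sketch: run each extraction step as a module, in order.
# Only the first two module names appear in this repo's docs; the remaining steps
# of the real pipeline may be named and sequenced differently.
import subprocess
import sys

STEPS = [
    "extract.extract_publication_links",
    "extract.extract_publication_details",
    # ... remaining steps would follow in sequence
]


def main() -> None:
    for module in STEPS:
        print(f"Running {module} ...")
        result = subprocess.run(["uv", "run", "-m", module])
        if result.returncode != 0:
            sys.exit(f"Step {module} failed with exit code {result.returncode}")


if __name__ == "__main__":
    main()
```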
- 1. Convert each page to an image and use a VLM to generate bounding box coordinates for contentful images such as charts and tables, writing the result to `transform/images/document_regions.json` (schema in `transform/images/document_regions_schema.json`)
- 2. Mechanically extract text with pyMuPDF and write to `transform/text/text_content.json`
- 3. Clean up extracted text and convert to markdown with an LLM, writing to `transform/text/text_content_processed.json`
- 4. Identify page ranges of [document components](mdc:https:/sparontologies.github.io/doco/current/doco.html) with an LLM and write to `transform/text/sections.json` (schema in `transform/text/sections_schema.json`)
- 5. Extract hierarchical section headings from table of contents text using an LLM and write to `transform/text/hierarchical_headings.json`
- 6. Extract hierarchical section headings from a document without ToC using an LLM and write to `transform/text/headings.json`
+ ## Document Content Processing
- I would like to create some evals to measure workflow performance both on the individual subtasks and on the end-to-end workflow. We should create the most naive possible version of the end-to-end workflow and start with that. Then I'll see if I can improve on that. To measure performance, I can probably just use cosine similarity to a human-produced result.
+ Note that this repository handles extraction and basic metadata processing only. The detailed parsing and transformation of document content (text extraction, section identification, hierarchical structuring, etc.) happens in separate repositories as part of the broader CCDR analysis pipeline.
- The steps, as I see them, roughly in order of priority/practicality:
+ This project uses `uv` to manage dependencies and run Python files.
- 1. Create, for a few PDF pages, my own ideal human-produced example and a simple eval scoring script.
- 2. Try giving Google Gemini a few PDF pages with a prompt to see what it can do out of the box. Try with both the actual PDF pages extracted from file and then with images of the pages.
- 3. Run a PDF through a few commercial and open-source tools to see what outputs we get.
- 4. Try using an LLM to draw bounding boxes around both text and images with an array to indicate reading order, and then mechanically extract. (Section hierarchy a challenge here; can we enrich with hierarchical heading extraction?)
- 5. Try mechanical text extraction with markdown and see if maybe we can do mechanical image extraction as well, even though the images are SVGs and their boundaries are therefore hard to detect. (Reading order and section hierarchy a challenge here; can we guess them mechanically?)
- 6. Try mechanical text extraction plus VLM image extraction. (Reading order and section hierarchy a challenge here; can we guide with an LLM?)
+ Add Python dependencies like:
- ## Loading
+ ```bash
+ uv add pydantic # add --dev flag for development dependencies
+ ```
- Currently we have scripts for creating a PostgreSQL database matching our schema, for uploading `extract/data/publication_details.json` to our PostgreSQL tables, and for uploading PDFs to S3 (though this may ultimately be unnecessary, given that the PDFs are available online from the World Bank).
+ Then run the file with `uv run`, like:
- In future, we will need to upload each document's extracted content nodes.
.cursor/rules/python.mdc: 7 additions & 5 deletions
@@ -3,16 +3,18 @@ description:
globs:
alwaysApply: true
---
+ # Python dependency management
This project uses `uv` to manage dependencies and run Python files.
- Add Python dependencies like:
+ Add Python dependencies to the workspace like:
```bash
uv add pydantic # add --dev flag for development dependencies
```
- Then run the file with `uv run`, like:
+ Then run the file as a module with `uv run -m`.
- ```bash
- uv run -m transform.1_extract_images
- ```
+ # Python programming patterns
+ Write type-annotated Python code that will pass a mypy check. Objects returned from functions should generally be Pydantic models rather than dictionaries; this helps enforce the interface between modules.
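As an illustration of that pattern, a module might expose a typed parser that returns a Pydantic model instead of a dict. The field names below mirror the publication-link keys described in general.mdc, but the model and function are hypothetical, not part of this repository.

```python
# Illustrative only: the model and function names are hypothetical, not this repo's schema.
from pydantic import BaseModel, HttpUrl


class PublicationLink(BaseModel):
    title: str
    url: HttpUrl
    source: str
    page_found: int


def parse_publication_link(raw: dict[str, str | int]) -> PublicationLink:
    # Validation happens here; downstream modules can rely on the typed fields.
    return PublicationLink.model_validate(raw)
```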
.cursor/rules/scraping.mdc: 1 addition & 3 deletions
@@ -3,6 +3,4 @@ description: Data discovery and scraping
globs:
alwaysApply: false
---
- For deep Internet research and question-answering, you may find it helpful to send queries to a Perplexity research agent using your `perplexity` tool.
- To find and scrape data from the Internet, you will probably want to use your web search tool with `wget`, ask the user to enable the `playwright` tool, or make scripted requests to an official API.
+ To find and scrape data from the Internet, you will probably want to use your web search in combination with command-line tools for fetching the content of a specific page, such as `wget` for static pages or `playwright` for dynamic ones.
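For dynamic pages, a minimal Python sketch of the Playwright approach might look like the following; the URL is a placeholder, and Chromium must already be installed via `playwright install`.

```python
# Minimal sketch: render a JavaScript-heavy page and return its HTML.
# The URL below is a placeholder, not a real target from this project.
from playwright.sync_api import sync_playwright


def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html


if __name__ == "__main__":
    print(fetch_rendered_html("https://example.org")[:500])
```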
README.md: 58 additions & 50 deletions
@@ -2,84 +2,92 @@
This directory contains the complete workflow for extracting World Bank Country and Climate Development Reports (CCDRs). It is part of the data preparation pipeline for the [CCDR Explorer](https://github.com/Teal-Insights/ccdr-explorer-client), created by Teal Insights for Nature Finance.
+ ## Automated Pipeline
+ This workflow runs automatically **once a month** via GitHub Actions to discover and process newly published CCDRs. The automated pipeline:
+ - Runs on the 1st day of each month at 2:00 AM UTC
+ - Identifies new publications not yet in the database
+ - Downloads and processes only new content
+ - Uploads PDFs to AWS S3 and metadata to PostgreSQL
+ - Includes OpenAI vector store integration for AI-powered search
+ - Can also be triggered manually when needed
+ The pipeline is designed to be incremental - it only processes new CCDRs that haven't been seen before, making monthly runs efficient and cost-effective.
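A rough illustration of that incremental check follows; it is not the actual implementation. The table and column names and the `POSTGRES_DSN` variable are assumptions, and only the `url` key comes from the scraped publication-links data described in this repo.

```python
# Hypothetical sketch of the incremental filter: keep only publications whose
# detail-page URL is not already recorded in the database. The table/column
# names and POSTGRES_DSN are assumptions, not this project's real schema.
import json
import os

import psycopg2


def find_new_publications(links_path: str = "extract/data/publication_links.json") -> list[dict]:
    with open(links_path) as f:
        scraped = json.load(f)

    conn = psycopg2.connect(os.environ["POSTGRES_DSN"])
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT source_url FROM publications")
            known = {row[0] for row in cur.fetchall()}
    finally:
        conn.close()

    return [pub for pub in scraped if pub["url"] not in known]
```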
## Quick Start
- To run the complete extraction workflow:
+ To run the complete extraction workflow locally:
```bash
uv run extract_ccdrs.py
```
- ## Individual Steps
+ ## Workflow Architecture
- The workflow consists of 9 steps that can also be run individually:
+ The extraction workflow uses a **two-stage architecture**: