Funderbolt Eval

Data processing pipeline for creating high-precision funder name variants for fulltext matching.

Overview

This repository contains scripts to process OpenAlex funder data and create an unambiguous list of funder name variants optimized for precision in fulltext matching.

Pipeline

1. Data Collection

download_funders.py - Download funder data from OpenAlex API
fetch_nih_acknowledgements.py - Fetch works with NIH acknowledgements
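
As a rough illustration of the download step, the sketch below pages through the OpenAlex funders endpoint using cursor pagination. It is not the code in download_funders.py: the function names `fetch_page` and `iter_funders` are hypothetical, and the standard library is used instead of requests so the example has no dependencies.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.openalex.org/funders"


def fetch_page(cursor, per_page=200):
    """Fetch one page of funders from OpenAlex (hypothetical helper)."""
    query = urllib.parse.urlencode({"per-page": per_page, "cursor": cursor})
    with urllib.request.urlopen(f"{API_URL}?{query}") as resp:
        return json.load(resp)


def iter_funders(fetch=fetch_page):
    """Yield funder records page by page until no next cursor is returned.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    cursor = "*"  # OpenAlex uses "*" to start cursor paging
    while cursor:
        page = fetch(cursor)
        yield from page["results"]
        cursor = page["meta"].get("next_cursor")
```

Writing each yielded record as one JSON line would produce a file in the shape of funders.jsonl; whether the real script does exactly that is an assumption.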

2. Funder Name Processing

create_funder_list.py - Convert funders.jsonl to funder_names.csv
create_duplicates_csv.py - Identify duplicate funder names (names used by multiple funders)
explode_duplicates.py - Expand duplicate names to show all name+funder pairs
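
The sketch below shows one way the name processing could work: collect every name variant per funder, then keep the names claimed by more than one funder. It assumes each funder record carries the OpenAlex fields `id`, `display_name`, `alternate_titles`, and `works_count`, and that names are compared by exact, case-sensitive match. The function names and sample funders are hypothetical.

```python
from collections import defaultdict


def name_variants(funder):
    """Yield (name, funder_id, works_count) for each distinct name of a funder."""
    names = [funder["display_name"], *(funder.get("alternate_titles") or [])]
    seen = set()
    for name in names:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            yield name, funder["id"], funder["works_count"]


def find_duplicates(funders):
    """Return {name: [(funder_id, works_count), ...]} for names used by 2+ funders."""
    by_name = defaultdict(list)
    for funder in funders:
        for name, funder_id, works_count in name_variants(funder):
            by_name[name].append((funder_id, works_count))
    return {name: pairs for name, pairs in by_name.items() if len(pairs) > 1}


# Illustrative records only; ids and counts are made up.
sample = [
    {"id": "F1", "display_name": "Example Science Foundation",
     "alternate_titles": ["ESF"], "works_count": 400_000},
    {"id": "F2", "display_name": "Example Sailing Federation",
     "alternate_titles": ["ESF"], "works_count": 12},
]
duplicates = find_duplicates(sample)
```

Here `duplicates` holds the "exploded" view for the shared name "ESF": one (funder, works_count) pair per funder that uses it.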

3. Filtering for Precision

pick_winning_funders.py - Apply 10x works_count advantage rule to pick winners among duplicates
- Generates winning_funders.csv (winning name+funder pairs)
- Generates losing_funder_names.csv (losing name+funder pairs)
- Rejects name variants ≤2 letters (too short for reliable matching)
filter_unambiguous_names.py - Final filtering to create unambiguous names list
- Removes losing name+funder pairs
- Removes names ≤2 letters
- Removes 3-letter names for funders with ≤10k works (not famous enough)
- Keeps 3-letter names only for high-impact funders (>10k works, e.g., NIH, NSF, DFG)

4. Additional Tools

convert_to_csv.py - General CSV conversion utilities
convert_works_to_csv.py - Convert works JSONL to CSV
search_fulltext_mentions.py - Search for funder mentions in fulltext
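
To show why short names are risky in fulltext matching, here is a minimal matching sketch. It assumes case-sensitive, whole-word matching with a single compiled regular expression; search_fulltext_mentions.py may work differently, and `build_matcher` and `find_mentions` are hypothetical names.

```python
import re


def build_matcher(names):
    """Compile one regex that matches any of the names as a whole word."""
    # Longest names first, so a long name is preferred over a shorter one
    # that starts at the same position.
    ordered = sorted(names, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    # Lookarounds reject matches embedded inside a longer word.
    return re.compile(rf"(?<!\w)(?:{pattern})(?!\w)")


def find_mentions(text, matcher):
    """Return every matched name in order of appearance."""
    return [m.group(0) for m in matcher.finditer(text)]


matcher = build_matcher(["NIH", "National Institutes of Health"])
mentions = find_mentions(
    "Funded by the National Institutes of Health (NIH).", matcher
)
```

Even with word boundaries, a two-letter name such as "EU" would still match any standalone "EU" in the text regardless of whether a funder is meant, which is the kind of false positive the filtering rules aim to avoid.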

Output Files

Primary Output

unambiguous_funder_names.csv - Final filtered list of unambiguous funder name variants
- 91,434 name+funder pairs (87.8% retention)
- Each name maps to exactly one funder (a funder may still have several names)
- No names ≤2 letters
- Only 48 three-letter names (all from funders with >10k works)
- Columns: name, works_count, display_name, id
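
Because each name maps to a single funder, the file can be loaded straight into a dictionary keyed by name. The sketch below assumes the four columns listed above and a header row; `load_name_lookup` is a hypothetical helper, and the sample row uses a made-up id and count.

```python
import csv
import io


def load_name_lookup(csv_file):
    """Build {name: funder info} from an open CSV file with a header row."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        lookup[row["name"]] = {
            "id": row["id"],
            "display_name": row["display_name"],
            "works_count": int(row["works_count"]),
        }
    return lookup


# In practice: open("unambiguous_funder_names.csv", newline="", encoding="utf-8")
sample_csv = io.StringIO(
    "name,works_count,display_name,id\n"
    "NIH,395000,National Institutes of Health,F0000000001\n"
)
lookup = load_name_lookup(sample_csv)
```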

Intermediate Files

funders.csv - All funders from OpenAlex
funder_names.csv - All name variants with works_count
duplicate_names.csv - Summary of duplicate names
duplicate_names_exploded.csv - Exploded view of all name+funder pairs for duplicates
winning_funders.csv - Name+funder pairs that won (1,163 pairs)
losing_funder_names.csv - Name+funder pairs that lost (11,042 pairs)

Filtering Rules

10x Advantage Rule

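A name variant is assigned to a "winning" funder if either condition holds:

The funder is the only one using that name variant, OR
The funder has at least 10 times as many works as the funder with the second-highest works_count using that name

Names with no winner are treated as ambiguous.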
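
The rule can be sketched as a small function. This is an illustration rather than the code in pick_winning_funders.py: the name `pick_winner`, the `(funder_id, works_count)` input shape, and returning `None` when nobody wins are all assumptions, as is the guard against a leader with zero works.

```python
def pick_winner(candidates, factor=10):
    """Return the winning funder id for one name, or None if the name is ambiguous.

    candidates: list of (funder_id, works_count) pairs that share the name.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]  # sole user of the name wins outright
    (top_id, top_works), (_, second_works) = ranked[0], ranked[1]
    # The leader must have at least `factor` times the runner-up's works.
    # Requiring top_works > 0 avoids declaring a winner when all counts are 0.
    if top_works > 0 and top_works >= factor * second_works:
        return top_id
    return None
```

For example, a funder with 1,000 works beats one with 100 works for a shared name (exactly 10x), while 999 versus 100 leaves the name without a winner.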

Short Name Rules

≤2 letters: Always rejected (e.g., "EU", "UK")
3 letters: Rejected unless funder has >10k works
- Kept: "NIH" (395k works), "NSF" (404k works), "DFG" (249k works)
- Removed: "ESA" (8.5k works), "TRF" (9.4k works)
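
A minimal sketch of the short-name rules follows. It assumes "letters" means the character length of the name and uses a hypothetical function name, `keep_name`; the thresholds come from the rules above.

```python
def keep_name(name, works_count, min_works_for_3_letters=10_000):
    """Return True if a name passes the short-name rules."""
    length = len(name)
    if length <= 2:
        return False  # always rejected, regardless of works_count
    if length == 3:
        # kept only when the funder has strictly more than 10k works
        return works_count > min_works_for_3_letters
    return True  # names of 4+ characters are not affected by these rules
```

With the figures quoted above, `keep_name("NIH", 395_000)` is kept and `keep_name("ESA", 8_500)` is rejected.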

Statistics

Final Output

Starting rows: 104,082
Final rows: 91,434
Retention rate: 87.8%

Filtering Breakdown

9,006 losing pairs (ambiguous names or non-winners)
914 names with ≤2 letters
7,148 3-letter names with ≤10k works
Total removed: 12,648 rows (the three counts above sum to 17,068, so the categories evidently overlap)

Name Length Distribution

3 letters: 48 names
4 letters: 5,510 names
5 letters: 3,124 names
6+ letters: 82,752 names
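
The figures in this section can be cross-checked with a few lines of arithmetic. The numbers below are copied from the statistics above; only the variable names are introduced here.

```python
starting_rows = 104_082
final_rows = 91_434
length_counts = {"3": 48, "4": 5_510, "5": 3_124, "6+": 82_752}

removed = starting_rows - final_rows
retention = final_rows / starting_rows

# The per-length counts should add up to the final row count.
assert sum(length_counts.values()) == final_rows
# Rows removed should match the stated total.
assert removed == 12_648

print(f"Removed: {removed:,} rows, retention: {retention:.1%}")
```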

Requirements

requests

Install with:

pip install -r requirements.txt

Usage

To rerun the full pipeline and regenerate the output files:

# 1. Download funders (optional, funders.jsonl already included)
python download_funders.py

# 2. Create funder names list
python create_funder_list.py

# 3. Identify duplicates
python create_duplicates_csv.py

# 4. Explode duplicates
python explode_duplicates.py

# 5. Pick winners
python pick_winning_funders.py

# 6. Filter to unambiguous names
python filter_unambiguous_names.py
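
The same sequence could be driven from a single script. The sketch below is a hypothetical convenience wrapper, not a file in the repository; it assumes the scripts take no arguments and are run from the repository root.

```python
import subprocess
import sys

PIPELINE_STEPS = [
    "create_funder_list.py",
    "create_duplicates_csv.py",
    "explode_duplicates.py",
    "pick_winning_funders.py",
    "filter_unambiguous_names.py",
]


def build_commands(steps=PIPELINE_STEPS, include_download=False):
    """Return the command lines for each step, in pipeline order."""
    scripts = (["download_funders.py"] if include_download else []) + list(steps)
    return [[sys.executable, script] for script in scripts]


def run_pipeline(include_download=False):
    """Run each step in order, stopping at the first one that fails."""
    for command in build_commands(include_download=include_download):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)  # raises CalledProcessError on failure
```

Calling `run_pipeline()` skips the optional download step; pass `include_download=True` to refresh the funder data first.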

Goal

The goal is to maximize precision in funder matching by removing ambiguous name variants. This is optimized for fulltext matching where false positives are costly.

Trade-off: Lower recall (fewer name variants) for higher precision (fewer false matches).

ourresearch/funderbolt-eval
