Data processing pipeline for creating high-precision funder name variants for fulltext matching.
This repository contains scripts to process OpenAlex funder data and create an unambiguous list of funder name variants optimized for precision in fulltext matching.
Scripts:
- download_funders.py - Download funder data from OpenAlex API
- fetch_nih_acknowledgements.py - Fetch works with NIH acknowledgements
- create_funder_list.py - Convert funders.jsonl to funder_names.csv
- create_duplicates_csv.py - Identify duplicate funder names (names used by multiple funders)
- explode_duplicates.py - Expand duplicate names to show all name+funder pairs
- pick_winning_funders.py - Apply 10x works_count advantage rule to pick winners among duplicates
  - Generates winning_funders.csv (winning name+funder pairs)
  - Generates losing_funder_names.csv (losing name+funder pairs)
  - Rejects name variants ≤2 letters (too short for reliable matching)
- filter_unambiguous_names.py - Final filtering to create unambiguous names list
  - Removes losing name+funder pairs
  - Removes names ≤2 letters
  - Removes 3-letter names for funders with ≤10k works (not famous enough)
  - Keeps 3-letter names only for high-impact funders (>10k works, e.g., NIH, NSF, DFG)
- convert_to_csv.py - General CSV conversion utilities
- convert_works_to_csv.py - Convert works JSONL to CSV
- search_fulltext_mentions.py - Search for funder mentions in fulltext
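The duplicate-detection step (names used by multiple funders) can be sketched roughly as below. This is a minimal illustration, not the code in create_duplicates_csv.py: the function name `find_duplicate_names` and the row shape (dicts with `name` and `id` keys) are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicate_names(rows):
    """Return {name: sorted funder ids} for names shared by more than one funder."""
    funders_by_name = defaultdict(set)
    for row in rows:
        funders_by_name[row["name"]].add(row["id"])
    return {
        name: sorted(ids)
        for name, ids in funders_by_name.items()
        if len(ids) > 1
    }


# Hypothetical rows; the ids are placeholders, not real OpenAlex ids.
rows = [
    {"name": "NSF", "id": "F1"},
    {"name": "NSF", "id": "F2"},
    {"name": "National Science Foundation", "id": "F1"},
]
duplicates = find_duplicate_names(rows)
print(duplicates)
```

Only names that map to two or more distinct funder ids are reported; a name repeated for the same funder does not count as a duplicate in this sketch.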
Output files:
- unambiguous_funder_names.csv - Final filtered list of unambiguous funder name variants
  - 91,434 name+funder pairs (87.8% retention)
  - Each name maps to exactly one funder
  - No names ≤2 letters
  - Only 48 three-letter names (all >10k works)
  - Columns: name, works_count, display_name, id
- funders.csv - All funders from OpenAlex
- funder_names.csv - All name variants with works_count
- duplicate_names.csv - Summary of duplicate names
- duplicate_names_exploded.csv - Exploded view of all name+funder pairs for duplicates
- winning_funders.csv - Name+funder pairs that won (1,163 pairs)
- losing_funder_names.csv - Name+funder pairs that lost (11,042 pairs)
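As an illustration of how the final list might be consumed, the sketch below loads rows with the columns name, works_count, display_name, id and looks for whole-token mentions in a text. It is not the logic of search_fulltext_mentions.py: the helper names, the case-sensitive matching, and the word-boundary rule are all assumptions, and the sample rows use placeholder ids.

```python
import csv
import io
import re


def load_name_index(csv_file):
    """Map each name variant to its funder id (names are unique in the final list)."""
    return {row["name"]: row["id"] for row in csv.DictReader(csv_file)}


def find_funder_ids(text, name_index):
    """Return the sorted ids of funders whose names occur as whole tokens in text."""
    if not name_index:
        return []
    # Longest names first, so a longer variant is preferred over a shorter prefix.
    names = sorted(name_index, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    # Case-sensitive match that must not be flanked by word characters (an assumption).
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return sorted({name_index[m.group(0)] for m in pattern.finditer(text)})


# Hypothetical sample rows; the ids are placeholders, not real OpenAlex ids.
sample_csv = io.StringIO(
    "name,works_count,display_name,id\n"
    "NIH,395000,National Institutes of Health,F_NIH\n"
    "DFG,249000,Deutsche Forschungsgemeinschaft,F_DFG\n"
)
name_index = load_name_index(sample_csv)
found = find_funder_ids("Funded by the NIH and the DFG. UNIHAN is unrelated.", name_index)
print(found)
```

A single alternation over tens of thousands of names is simple but slow; for the full list, a trie or Aho-Corasick matcher would likely be a better fit.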
A name variant belongs to a "winning" funder if:
- The funder is the only one using that name variant, OR
- The funder has ≥10x more works than the second-highest funder using that name
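The rule above can be sketched as a small function. This is a minimal sketch under stated assumptions, not the code in pick_winning_funders.py: the name `pick_winner`, the `(funder_id, works_count)` tuple format, and the handling of ties at zero works are all hypothetical.

```python
def pick_winner(candidates, advantage=10):
    """Return the winning funder id for one name variant, or None if ambiguous.

    candidates: list of (funder_id, works_count) tuples sharing the same name.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) == 1:
        # The funder is the only one using this name variant.
        return ranked[0][0]
    top_id, top_count = ranked[0]
    second_count = ranked[1][1]
    # Assumption: the leader must be strictly ahead, so a 0-vs-0 tie has no winner.
    if top_count > second_count and top_count >= advantage * second_count:
        return top_id
    return None


print(pick_winner([("F1", 500)]))
print(pick_winner([("F1", 1000), ("F2", 100)]))
print(pick_winner([("F1", 999), ("F2", 100)]))
```

The first two calls return "F1" (sole user, then exactly 10x the runner-up); the third returns None because 999 is below 10x the runner-up's 100 works.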
Name variants are also filtered by length:
- ≤2 letters: Always rejected (e.g., "EU", "UK")
- 3 letters: Rejected unless funder has >10k works
  - Kept: "NIH" (395k works), "NSF" (404k works), "DFG" (249k works)
  - Removed: "ESA" (8.5k works), "TRF" (9.4k works)
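The length rules can be expressed as a single predicate. The sketch below is illustrative only: the function name `keep_name` and the threshold parameter are hypothetical, and it assumes "letters" means the character count of the name (`len(name)`).

```python
def keep_name(name, works_count, min_works_for_three_letters=10_000):
    """Return True if a name variant passes the length filter."""
    length = len(name)  # assumption: length is counted in characters
    if length <= 2:
        return False  # always rejected, e.g. "EU", "UK"
    if length == 3:
        # 3-letter names survive only for funders with more than 10k works.
        return works_count > min_works_for_three_letters
    return True


print(keep_name("NIH", 395_000))
print(keep_name("ESA", 8_500))
print(keep_name("EU", 1_000_000))
```

With the figures quoted above, "NIH" is kept, "ESA" is removed, and "EU" is rejected regardless of its works count.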
Filtering statistics:
- Starting rows: 104,082
- Final rows: 91,434
- Retention rate: 87.8%

Rows removed:
- 9,006 losing pairs (ambiguous names or non-winners)
- 914 names with ≤2 letters
- 7,148 3-letter names with ≤10k works
- Total removed: 12,648 rows
Name length distribution in the final list:
- 3 letters: 48 names
- 4 letters: 5,510 names
- 5 letters: 3,124 names
- 6+ letters: 82,752 names
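The headline figures can be cross-checked with a few lines of arithmetic. The snippet below only uses numbers quoted in this README; the variable names are illustrative.

```python
starting_rows = 104_082
final_rows = 91_434

# Rows removed and retention rate, derived from the two totals.
removed_rows = starting_rows - final_rows
retention_pct = round(100 * final_rows / starting_rows, 1)

# Name-length buckets of the final list.
length_buckets = {"3": 48, "4": 5_510, "5": 3_124, "6+": 82_752}
bucket_total = sum(length_buckets.values())

print(removed_rows, retention_pct, bucket_total)
```

The difference matches the stated 12,648 removed rows, the ratio rounds to 87.8%, and the length buckets sum to the 91,434 final rows.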
Requirements:
- requests
Install with:
pip install -r requirements.txt

To regenerate the full pipeline:
# 1. Download funders (optional, funders.jsonl already included)
python download_funders.py
# 2. Create funder names list
python create_funder_list.py
# 3. Identify duplicates
python create_duplicates_csv.py
# 4. Explode duplicates
python explode_duplicates.py
# 5. Pick winners
python pick_winning_funders.py
# 6. Filter to unambiguous names
python filter_unambiguous_names.py

The goal is to maximize precision in funder matching by removing ambiguous name variants. This is optimized for fulltext matching, where false positives are costly.
Trade-off: Lower recall (fewer name variants) for higher precision (fewer false matches).