ourresearch/funderbolt-eval

Funderbolt Eval

Data processing pipeline for creating high-precision funder name variants for fulltext matching.

Overview

This repository contains scripts to process OpenAlex funder data and create an unambiguous list of funder name variants optimized for precision in fulltext matching.

Pipeline

1. Data Collection

  • download_funders.py - Download funder data from OpenAlex API
  • fetch_nih_acknowledgements.py - Fetch works with NIH acknowledgements
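A minimal sketch of how the funder download could work, assuming cursor paging against the public OpenAlex `/funders` endpoint. The function names `http_get_json`, `iter_funders`, and `write_jsonl` are hypothetical and are not taken from `download_funders.py`; the actual script may page or store results differently.

```python
import json

API_URL = "https://api.openalex.org/funders"


def http_get_json(url, params):
    """Fetch one page of JSON. requests is imported lazily so the paging
    logic below can be exercised offline with a stand-in fetcher."""
    import requests

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_funders(get_json=http_get_json, per_page=200):
    """Yield funder records page by page until the API stops returning a cursor."""
    cursor = "*"  # OpenAlex uses "*" to request the first cursor page
    while cursor:
        page = get_json(API_URL, {"per-page": per_page, "cursor": cursor})
        yield from page["results"]
        cursor = page["meta"].get("next_cursor")


def write_jsonl(funders, path):
    """Write one JSON object per line (the file name is chosen by the caller)."""
    with open(path, "w", encoding="utf-8") as out:
        for funder in funders:
            out.write(json.dumps(funder) + "\n")
```

Passing the fetcher as a parameter keeps the paging loop testable without network access; `write_jsonl(iter_funders(), "funders.jsonl")` would perform a real download.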

2. Funder Name Processing

  • create_funder_list.py - Convert funders.jsonl to funder_names.csv
  • create_duplicates_csv.py - Identify duplicate funder names (names used by multiple funders)
  • explode_duplicates.py - Expand duplicate names to show all name+funder pairs
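The duplicate detection and explode steps can be sketched as follows. This is an illustration only: the row shape `(name, funder_id, works_count)`, the placeholder IDs, and the exact-string comparison are assumptions, and the real scripts may normalize names before comparing them.

```python
from collections import defaultdict

# Hypothetical rows in the spirit of funder_names.csv: (name, funder_id, works_count).
rows = [
    ("NSF", "F_PLACEHOLDER_1", 404_000),
    ("NSF", "F_PLACEHOLDER_2", 1_200),
    ("Wellcome Trust", "F_PLACEHOLDER_3", 90_000),
]


def find_duplicates(rows):
    """Group rows by name and keep only names used by more than one funder."""
    by_name = defaultdict(list)
    for name, funder_id, works_count in rows:
        by_name[name].append((funder_id, works_count))
    return {
        name: funders
        for name, funders in by_name.items()
        if len({funder_id for funder_id, _ in funders}) > 1
    }


def explode(duplicates):
    """Flatten grouped duplicates into one row per name+funder pair."""
    return [
        (name, funder_id, works_count)
        for name, funders in duplicates.items()
        for funder_id, works_count in funders
    ]


duplicates = find_duplicates(rows)
exploded = explode(duplicates)
```

Here only "NSF" is reported as a duplicate, and the exploded view contains one row for each of the two funders that use it.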

3. Filtering for Precision

  • pick_winning_funders.py - Apply the 10x works_count advantage rule to pick winners among duplicates

    • Generates winning_funders.csv (winning name+funder pairs)
    • Generates losing_funder_names.csv (losing name+funder pairs)
    • Rejects name variants ≤2 letters (too short for reliable matching)
  • filter_unambiguous_names.py - Final filtering to create unambiguous names list

    • Removes losing name+funder pairs
    • Removes names ≤2 letters
    • Removes 3-letter names for funders with ≤10k works (not famous enough)
    • Keeps 3-letter names only for high-impact funders (>10k works, e.g., NIH, NSF, DFG)

4. Additional Tools

  • convert_to_csv.py - General CSV conversion utilities
  • convert_works_to_csv.py - Convert works JSONL to CSV
  • search_fulltext_mentions.py - Search for funder mentions in fulltext

Output Files

Primary Output

  • unambiguous_funder_names.csv - Final filtered list of unambiguous funder name variants
    • 91,434 name+funder pairs (87.8% retention)
    • Perfect 1:1 name-to-funder mapping
    • No names ≤2 letters
    • Only 48 three-letter names (all >10k works)
    • Columns: name, works_count, display_name, id
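Because each name maps to exactly one funder, the primary output can be loaded into a plain dictionary. The sketch below assumes the file has a header row with the four documented columns; the sample rows and placeholder IDs are invented for illustration, and `load_lookup` is a hypothetical helper rather than part of this repository.

```python
import csv
import io

# Invented sample using the documented columns; real data comes from
# unambiguous_funder_names.csv.
sample = io.StringIO(
    "name,works_count,display_name,id\n"
    "NIH,395000,National Institutes of Health,F_PLACEHOLDER_1\n"
    "Wellcome Trust,90000,Wellcome Trust,F_PLACEHOLDER_2\n"
)


def load_lookup(handle):
    """Build a name -> funder dict, failing loudly if the 1:1 mapping is violated."""
    lookup = {}
    for row in csv.DictReader(handle):
        name = row["name"]
        if name in lookup:
            raise ValueError(f"name maps to more than one funder: {name!r}")
        lookup[name] = {
            "id": row["id"],
            "display_name": row["display_name"],
            "works_count": int(row["works_count"]),
        }
    return lookup


lookup = load_lookup(sample)
```

For the real file, replace `sample` with `open("unambiguous_funder_names.csv", newline="", encoding="utf-8")`.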

Intermediate Files

  • funders.csv - All funders from OpenAlex
  • funder_names.csv - All name variants with works_count
  • duplicate_names.csv - Summary of duplicate names
  • duplicate_names_exploded.csv - Exploded view of all name+funder pairs for duplicates
  • winning_funders.csv - Name+funder pairs that won (1,163 pairs)
  • losing_funder_names.csv - Name+funder pairs that lost (11,042 pairs)

Filtering Rules

10x Advantage Rule

A name variant belongs to a "winning" funder if:

  • The funder is the only one using that name variant, OR
  • The funder has at least 10x as many works as the second-highest funder using that name
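The rule above can be sketched as a small function. The name `pick_winner`, the input shape, and the handling of ties are assumptions made for illustration; `pick_winning_funders.py` may implement the comparison differently.

```python
def pick_winner(candidates, advantage=10):
    """Return the winning funder_id for one name variant, or None if ambiguous.

    candidates: list of (funder_id, works_count) pairs that share the name.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]  # sole user of the name wins outright
    top, runner_up = ranked[0], ranked[1]
    # Assumption: an exact tie (for example 0 vs 0 works) never produces a winner.
    if top[1] > runner_up[1] and top[1] >= advantage * runner_up[1]:
        return top[0]
    return None
```

For example, `pick_winner([("A", 404_000), ("B", 1_200)])` returns `"A"`, while `pick_winner([("A", 5_000), ("B", 1_000)])` returns `None` because the lead is only 5x.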

Short Name Rules

  • ≤2 letters: Always rejected (e.g., "EU", "UK")
  • 3 letters: Rejected unless funder has >10k works
    • Kept: "NIH" (395k works), "NSF" (404k works), "DFG" (249k works)
    • Removed: "ESA" (8.5k works), "TRF" (9.4k works)
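A minimal sketch of the short name rules, using the examples listed above. It assumes length is measured with `len()` on the raw string; the README says "letters", so the real script may count characters differently. `keep_name` is a hypothetical helper.

```python
def keep_name(name, works_count, min_works_for_three=10_000):
    """Apply the short name rules to a single name+funder pair."""
    length = len(name)
    if length <= 2:
        return False  # always rejected, regardless of funder size
    if length == 3:
        return works_count > min_works_for_three  # strictly more than 10k works
    return True


examples = [("EU", 1_000_000), ("NIH", 395_000), ("ESA", 8_500), ("TRF", 9_400)]
kept = [name for name, works in examples if keep_name(name, works)]
```

With these inputs, only "NIH" survives: "EU" is too short, and "ESA" and "TRF" belong to funders with fewer than 10k works.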

Statistics

Final Output

  • Starting rows: 104,082
  • Final rows: 91,434
  • Retention rate: 87.8%

Filtering Breakdown

  • 9,006 losing pairs (ambiguous names or non-winners)
  • 914 names with ≤2 letters
  • 7,148 3-letter names with ≤10k works
  • Total removed: 12,648 rows

Name Length Distribution

  • 3 letters: 48 names
  • 4 letters: 5,510 names
  • 5 letters: 3,124 names
  • 6+ letters: 82,752 names
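The figures in this section can be cross-checked with a few lines of arithmetic. All numbers below are copied from the lists above. The three breakdown categories sum to more than the total removed, which would be consistent with some rows falling into more than one category; that reading is an assumption and is not confirmed here.

```python
starting_rows = 104_082
final_rows = 91_434

removed = starting_rows - final_rows      # rows dropped by all filters combined
retention = final_rows / starting_rows    # fraction of rows kept

# Per-category counts from the Filtering Breakdown list.
category_counts = [9_006, 914, 7_148]

# Counts from the Name Length Distribution list.
length_counts = {"3": 48, "4": 5_510, "5": 3_124, "6+": 82_752}

print(removed)                      # 12648
print(f"{retention:.1%}")           # 87.8%
print(sum(category_counts))         # 17068
print(sum(length_counts.values()))  # 91434
```

The name length distribution accounts for every row in the final output, and the retention rate matches the reported 87.8%.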

Requirements

requests

Install with:

pip install -r requirements.txt

Usage

To rerun the full pipeline and regenerate all output files:

# 1. Download funders (optional; funders.jsonl is already included)
python download_funders.py

# 2. Create funder names list
python create_funder_list.py

# 3. Identify duplicates
python create_duplicates_csv.py

# 4. Explode duplicates
python explode_duplicates.py

# 5. Pick winners
python pick_winning_funders.py

# 6. Filter to unambiguous names
python filter_unambiguous_names.py

Goal

The goal is to maximize precision in funder matching by removing ambiguous name variants. The pipeline is optimized for fulltext matching, where false positives are costly.

Trade-off: Lower recall (fewer name variants) for higher precision (fewer false matches).
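To illustrate why short or shared names hurt precision, here is a minimal sketch of whole-token, case-sensitive matching. It is not the logic of `search_fulltext_mentions.py`; the function `find_mentions` and the sample sentence are invented for illustration.

```python
import re


def find_mentions(text, names):
    """Return the names that occur in text as whole, case-sensitive tokens."""
    hits = []
    for name in names:
        # Lookarounds stop "NIH" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text):
            hits.append(name)
    return hits


text = "This work was supported by the NIH and by a Wellcome Trust fellowship."
mentions = find_mentions(text, ["NIH", "Wellcome Trust", "DFG"])
```

Even with token boundaries, a two-letter name such as "EU" would match many sentences that have nothing to do with funding, which is why such names are dropped from the list.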

About

Learning about funders, making them more better!
