This repository contains the official evaluation pipeline used to compute the ReXrank leaderboard metrics for radiology report generation. It evaluates model-generated reports against ground-truth references using 7 complementary metrics.
git clone git@github.com:rajpurkarlab/ReXrank-metric.git
cd ReXrank-metric| Metric | Column Name | Range | Higher is Better | Description |
|---|---|---|---|---|
| BLEU-2 | bleu_score |
[0, 1] | Yes | Bigram overlap (weights: 1/2, 1/2) via fast_bleu |
| BERTScore | bertscore |
[-1, 1] | Yes | Contextual embedding F1 using distilroberta-base with baseline rescaling; IDF disabled |
| SembScore | semb_score |
[-1, 1] | Yes | Cosine similarity of CheXbert embeddings |
| RadGraph | radgraph_combined |
[0, 1] | Yes | Mean of entity F1 and relation F1 from RadGraph (DyGIE++) |
| 1/RadCliQ-v1 | 1/RadCliQ-v1 |
(0, +inf) | Yes | Inverse of RadCliQ-v1 composite score (primary ranking metric) |
| RaTEScore | ratescore |
[0, 1] | Yes | Factual/temporal consistency score |
| GREEN | green_score |
[0, 1] | Yes | LLM-based clinical error analysis using StanfordAIMI/GREEN-radllama2-7b |
Submission JSON ({model}.json)
│
┌─────────────────────────────────────────────────────────────────┐
│ run_cxr_metrics.sh (conda: radgraph, GPU) │
│ │
│ Step 0: json_to_csv.py │
│ Converts JSON → gt_reports_{model}.csv │
│ + predicted_reports_{model}.csv │
│ │
│ Step 1: run_cxr_metrics.py │
│ Computes: bleu_score, bertscore, semb_score, │
│ radgraph_combined, RadCliQ-v0, RadCliQ-v1 │
│ Tool: CXR-Report-Metric │
│ https://github.com/rajpurkarlab/CXR-Report-Metric │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ run_green_ratescore.sh (conda: green_score, GPU) │
│ │
│ Step 2: run_ratescore.py │
│ Computes: ratescore │
│ Tool: RaTEScore │
│ https://github.com/MAGIC-AI4Med/RaTEScore │
│ │
│ Step 3: run_green.py │
│ Computes: green_score │
│ Tool: GREEN (StanfordAIMI/GREEN-radllama2-7b) │
│ https://github.com/Stanford-AIMI/GREEN │
│ │
│ Step 4: aggregate_metrics.py │
│ Averages per-sample scores → one summary CSV per model │
│ Computes 1/RadCliQ-v1 = 1 / mean(RadCliQ-v1) │
│ │
│ Step 5: summarize_leaderboard.py │
│ Merges all models, ranks by 1/RadCliQ-v1, outputs leaderboard│
└─────────────────────────────────────────────────────────────────┘
rexrank-metric/
├── README.md
├── requirements.txt
├── data/ # Input data
│ └── submission_example.json # Example submission JSON (5 samples)
├── results/ # Example outputs (derived from data/submission_example.json)
│ ├── output_report_scores.csv # Step 1: per-sample BLEU/BERTScore/SembScore/RadGraph/RadCliQ
│ ├── output_ratescore.csv # Step 2: per-sample RaTEScore
│ ├── output_green.csv # Step 3: per-sample GREEN with error breakdown
│ ├── output_aggregated.csv # Step 4: single-row model summary
│ └── output_leaderboard.csv # Step 5: ranked leaderboard
├── scripts/
│ ├── CXR-Report-Metric/ # Bundled: BLEU, BERTScore, SembScore, RadGraph, RadCliQ
│ │ ├── CXRMetric/ # Core metric implementations
│ │ │ ├── run_eval.py # calc_metric() entry point
│ │ │ ├── radcliq-v1.pkl # RadCliQ-v1 model weights
│ │ │ ├── normalizer.pkl # RadCliQ-v0 MinMaxScaler
│ │ │ └── composite_metric_model.pkl
│ │ ├── CheXbert/models/chexbert.pth # ← download required
│ │ ├── radgraph/.../model.tar.gz # ← download required (PhysioNet)
│ │ └── config.py # Checkpoint path configuration
│ ├── json_to_csv.py # Step 0: Convert submission JSON → gt/predicted CSVs
│ ├── run_cxr_metrics.py # Step 1: BLEU, BERTScore, SembScore, RadGraph, RadCliQ
│ ├── run_ratescore.py # Step 2: RaTEScore
│ ├── run_green.py # Step 3: GREEN score
│ ├── aggregate_metrics.py # Step 4: Average per-sample → per-model summary
│ ├── summarize_leaderboard.py # Step 5: Build ranked leaderboard CSVs
│ ├── run_cxr_metrics.sh # SLURM script: Steps 0-1 (conda: radgraph, GPU)
│ ├── run_green_ratescore.sh # SLURM script: Steps 2-5 (conda: green_score, GPU)
│ └── run_all.sh # Convenience wrapper: runs both .sh scripts
All data and results live inside rexrank-metric/:
rexrank-metric/
├── data/
│ ├── findings/{dataset}/gt_reports_{model}.csv # Ground truth (findings only)
│ ├── findings/{dataset}/predicted_reports_{model}.csv # Model predictions (findings)
│ ├── reports/{dataset}/gt_reports_{model}.csv # Ground truth (full reports)
│ └── reports/{dataset}/predicted_reports_{model}.csv # Model predictions (reports)
├── results/
│ ├── {dataset}_{split}/ # Per-sample outputs from Steps 1-3
│ │ ├── report_scores_{model}.csv
│ │ ├── ratescore_{model}.csv
│ │ └── results_green_{model}.csv
│ ├── metric/{dataset}/{split}/{model}.csv # Per-model aggregated summary (Step 4)
│ └── metric/results_summary/{dataset}.csv # Final leaderboard (Step 5)
Datasets: mimic-cxr, iu_xray, chexpert_plus, gradient_health
See data/submission_example.json for the input and results/ for the corresponding outputs. Each output file is derived from the same 5-sample input.
The default submission format is a JSON file keyed by case_id. Each entry contains ground-truth fields and the model's prediction:
{
"patient64545_study1": {
"image_path": ["valid/patient64545/study1/view1_frontal.png"],
"context": "Age:82.0.Gender:F.Indication:nanComparison: None.",
"section_clinical_history": null,
"section_findings": "There are low lung volumes. The cardiomediastinal silhouette is within normal limits. There is evidence of trace pulmonary edema with a left pleural effusion. Left retrocardiac atelectasis is noted. There are old bilateral rib fractures.",
"section_impression": "1. Low lung volumes with trace pulmonary edema and left pleural effusion. 2. Left retrocardiac atelectasis. 3. Old bilateral rib fractures.",
"key_image_path": "valid/patient64545/study1/view1_frontal.png",
"model_prediction": "Findings: Heart size is normal. The mediastinal contours are within normal limits. There are low lung volumes. There is a small left pleural effusion and associated opacity in the left base, likely representing atelectasis. There is mild pulmonary edema. No right pleural effusion is identified. No pneumothorax."
},
"patient64661_study1": { ... },
"patient64604_study1": { ... },
"patient64700_study1": { ... },
"patient64523_study1": { ... }
}Key fields:
model_prediction(required): Model-generated report. For thefindingssplit this is the Findings section only; for thereportsplit it includes "Findings: ... Impression: ..." (both sections).section_findings/section_impression: Ground-truth reference text.context: Patient demographics and clinical indication.image_path/key_image_path: Paths to the input chest X-ray images.
Note: The exact metadata fields vary by dataset (e.g., CheXpert-Plus includes
race/ethnicity/ap_pa; MIMIC-CXR includessubject_id/split; ReXGradient includesimage_description). The metric pipeline only usessection_findings,section_impression, andmodel_prediction.
Before metric computation, the JSON is converted to two CSVs (ground-truth and predicted). The conversion maps:
JSON key → case_id
enumerated index → study_id (0, 1, 2, ...)
section_findings → report (gt, findings split)
section_impression → report (gt, reports split)
model_prediction → report (predicted)
Resulting CSVs (gt_reports_{model}.csv / predicted_reports_{model}.csv):
study_id,case_id,report
0,patient64545_study1,"There are low lung volumes. The cardiomediastinal silhouette is within normal limits..."
1,patient64661_study1,"The transesophageal echo probe has been removed..."study_id,case_id,report
0,patient64545_study1,"Heart size is normal. The mediastinal contours are within normal limits..."
1,patient64661_study1,"Endotracheal tube tip is 4 cm above the carina..."One row per sample. Contains the 4 base metrics plus two composite scores.
,index,study_id,case_id,report,bleu_score,bertscore,semb_score,radgraph_combined,RadCliQ-v0,RadCliQ-v1
0,0,0,patient64545_study1,"Heart size is normal...",0.1466,-0.1302,0.7308,0.1667,4.6436,1.9381
1,1,1,patient64661_study1,"Endotracheal tube...",0.0522,-0.1357,0.1648,0.0833,5.6532,2.5102
2,2,2,patient64604_study1,"Left chest tube...",0.0943,-0.0917,0.1329,0.0000,5.6103,2.4692
3,3,3,patient64700_study1,"The cardiac silhouette...",0.3162,0.2788,0.8521,0.4583,2.1957,0.7296
4,4,4,patient64523_study1,"Heart size is normal...",0.3536,0.4075,0.9105,0.5000,1.6257,0.5512Appends a ratescore column to the predicted reports.
study_id,case_id,report,ratescore
0,patient64545_study1,"Heart size is normal...",0.5518
1,patient64661_study1,"Endotracheal tube...",0.4485
2,patient64604_study1,"Left chest tube...",0.5422
3,patient64700_study1,"The cardiac silhouette...",0.6123
4,patient64523_study1,"Heart size is normal...",0.7235Includes the LLM error analysis breakdown by category.
reference,predictions,green_analysis,green_score,(a) False report of a finding in the candidate,(b) Missing a finding present in the reference,(c) Misidentification of a finding's anatomic location/position,(d) Misassessment of the severity of a finding,(e) Mentioning a comparison that isn't in the reference,(f) Omitting a comparison detailing a change from a prior study,Matched Findings
"There are low lung volumes...","Heart size is normal...","The candidate report correctly identifies low lung volumes, left pleural effusion, atelectasis, and pulmonary edema. However, it misses the old bilateral rib fractures (b). The severity of pulmonary edema is described as 'mild' vs 'trace' (d).",0.5,0,1,0,1,0,0,4Error categories:
- (a) False finding reported in candidate
- (b) Missing finding from reference
- (c) Wrong anatomic location
- (d) Wrong severity assessment
- (e) Extra comparison not in reference
- (f) Missing comparison from reference
Single row with mean scores across all samples, rounded to 3 decimals.
bleu_score,bertscore,semb_score,radgraph_combined,1/RadCliQ-v1,ratescore,green_score
0.19,0.428,0.46,0.236,1.008,0.564,0.279Note: 1/RadCliQ-v1 = 1.0 / mean(RadCliQ-v1) (inverted so higher = better).
Ranked table with model metadata. Primary ranking metric: 1/RadCliQ-v1 (descending).
Rank,Date,Model Name,Institution,Model URL,BLEU,BertScore,SembScore,RadGraph,1/RadCliQ-v1,RaTEScore,GREEN
1,2025,UniRGCXR,Microsoft Research,,0.262,0.492,0.496,0.269,1.233,0.602,0.36
2,2025,Flora Microsoft,Microsoft Research,,0.248,0.493,0.487,0.265,1.217,0.596,0.352
3,2025,CheXOne-R1,Stanford,https://github.com/YBZh/CheXOne-R1,0.218,0.461,0.455,0.235,1.06,0.519,0.314
4,2025,RadPhi4VisionCXR,Microsoft Research,,0.234,0.444,0.439,0.251,1.033,0.584,0.351
5,2025,zzy-RGv1-8b,Individual,,0.213,0.432,0.455,0.253,1.031,0.579,0.339
6,2025,CX-Mind,SJTU,https://arxiv.org/pdf/2508.03733,0.15,0.385,0.319,0.174,0.782,0.543,0.267The pipeline uses 2 separate conda environments because Step 1 (CXR-Report-Metric) requires older dependencies that conflict with Steps 2-3 (RaTEScore, GREEN).
conda create -n radgraph python=3.8 -y
conda activate radgraph
pip install torch==1.13.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==4.25.1
pip install bert-score==0.3.13
pip install fast-bleu==0.0.4
pip install scikit-learn==1.2.2
pip install pandas numpy scipy tqdmCXR-Report-Metric is already bundled in scripts/CXR-Report-Metric/. You only need to download two checkpoints:
| Checkpoint | Path (relative to scripts/CXR-Report-Metric/) |
Source |
|---|---|---|
| CheXbert | CheXbert/models/chexbert.pth |
stanfordmlgroup/CheXbert |
| RadGraph | radgraph/physionet.org/files/radgraph/1.0.0/models/model_checkpoint/model.tar.gz |
PhysioNet RadGraph v1.0.0 (requires credentialed access) |
RadCliQ-v1 (CXRMetric/radcliq-v1.pkl), RadCliQ-v0 normalizer and model are already included.
Config file (scripts/CXR-Report-Metric/config.py):
CHEXBERT_PATH = "CheXbert/models/chexbert.pth"
RADGRAPH_PATH = "radgraph/physionet.org/files/radgraph/1.0.0/models/model_checkpoint/model.tar.gz"
USE_IDF = FalseRun (from rexrank-metric/):
conda activate radgraph
python scripts/run_cxr_metrics.py \
--models MyModel \
--datasets mimic-cxr iu_xray chexpert_plus gradient_health \
--splits findings reports \
--data-root data --results-root resultsconda create -n green_score python=3.10 -y
conda activate green_score
pip install torch==2.1.0
pip install transformers==4.36.0
pip install pandas numpy
# RaTEScore (https://github.com/MAGIC-AI4Med/RaTEScore)
pip install ratescore
# GREEN (https://github.com/Stanford-AIMI/GREEN)
git clone https://github.com/Stanford-AIMI/GREEN.git
cd GREEN && pip install -e . && cd ..The GREEN model (StanfordAIMI/GREEN-radllama2-7b) is downloaded automatically from HuggingFace on first run. Requires ~14 GB disk and a GPU with >= 16 GB VRAM.
Run (from rexrank-metric/):
conda activate green_score
# Step 2: RaTEScore
python scripts/run_ratescore.py \
--models MyModel \
--datasets mimic-cxr iu_xray chexpert_plus gradient_health \
--splits findings reports \
--data-root data --results-root results
# Step 3: GREEN
python scripts/run_green.py \
--models MyModel \
--datasets mimic-cxr iu_xray chexpert_plus gradient_health \
--splits findings reports \
--data-root data --results-root resultsSteps 4 and 5 are pure pandas operations. They work in either of the above environments or a minimal one:
# Any env with pandas works (from rexrank-metric/)
pip install pandas numpy
# Step 4: Aggregate per-sample → per-model summary
python scripts/aggregate_metrics.py \
--models MyModel \
--datasets mimic-cxr iu_xray chexpert_plus gradient_health \
--splits findings reports \
--results-root results --output-root results/metric
# Step 5: Build ranked leaderboard
python scripts/summarize_leaderboard.py \
--datasets mimic-cxr iu_xray chexpert_plus gradient_health \
--splits findings reports \
--metric-root results/metric --output-dir results/metric/results_summary- Library:
fast_bleu - Weights:
(1/2, 1/2)(bigram) - Preprocessing: lowercase, add space around periods, tokenize on whitespace
- Computed per-sample by matching
study_id
- Library:
bert_score - Model:
distilroberta-base - Settings:
rescale_with_baseline=True,idf=False,batch_size=256,lang="en" - Preprocessing: collapse multiple spaces
- Reports the F1 score
- Embedding model: CheXbert (
chexbert.pth) - Encodes both ground-truth and predicted reports via
CheXbert/src/encode.py - Score = cosine similarity between embedding vectors
- Inference: DyGIE++ model from PhysioNet RadGraph v1.0.0
- Extracts entities and relations from both ground-truth and predicted reports
- Entity F1 = standard F1 over entity sets
- Relation F1 = standard F1 over relation sets
- Combined = (Entity F1 + Relation F1) / 2
- Input features:
[radgraph_combined, bertscore, semb_score, bleu_score] - Model: pre-trained linear regression loaded from
CXRMetric/radcliq-v1.pkl - No separate normalization (unlike RadCliQ-v0 which uses MinMaxScaler from
normalizer.pkl) - Lower RadCliQ-v1 values indicate better reports
- Computed as:
1.0 / mean(RadCliQ-v1)across all samples in a dataset - This inversion makes higher values = better, suitable for leaderboard ranking
- Library:
RaTEScore - Usage:
ratescore.compute_score(hypotheses, references) - Evaluates factual and temporal consistency
- Model:
StanfordAIMI/GREEN-radllama2-7b(LLaMA2-based) - Settings:
cpu=False,compute_summary_stats=False - Analyzes clinical errors in 6 categories:
- (a) False finding reported
- (b) Missing finding
- (c) Wrong anatomic location
- (d) Wrong severity
- (e) Extra comparison mentioned
- (f) Missing comparison
This walks through evaluating a model called MyModel on one dataset (mimic-cxr, findings split).
Place CSVs under data/:
data/findings/mimic-cxr/gt_reports_MyModel.csv
data/findings/mimic-cxr/predicted_reports_MyModel.csv
Both must have columns: study_id, case_id, report. The study_id values must match between the two files.
All commands below assume you are in the rexrank-metric/ directory.
# --- Steps 0-1: CXR metrics (conda: radgraph, GPU) ---
conda activate radgraph
python scripts/run_cxr_metrics.py \
--models MyModel --datasets mimic-cxr --splits findings \
--data-root data --results-root results
# Output: results/mimic-cxr_findings/report_scores_MyModel.csv
# --- Steps 2-3: RaTEScore + GREEN (conda: green_score, GPU) ---
conda activate green_score
python scripts/run_ratescore.py \
--models MyModel --datasets mimic-cxr --splits findings \
--data-root data --results-root results
# Output: results/mimic-cxr_findings/ratescore_MyModel.csv
python scripts/run_green.py \
--models MyModel --datasets mimic-cxr --splits findings \
--data-root data --results-root results
# Output: results/mimic-cxr_findings/results_green_MyModel.csv
# --- Steps 4-5: Aggregate & leaderboard (any env, no GPU) ---
python scripts/aggregate_metrics.py \
--models MyModel --datasets mimic-cxr --splits findings \
--results-root results --output-root results/metric
# Output: results/metric/mimic-cxr/findings/MyModel.csv
python scripts/summarize_leaderboard.py \
--datasets mimic-cxr --splits findings \
--metric-root results/metric --output-dir results/metric/results_summary
# Output: results/metric/results_summary/mimic-cxr.csvThe pipeline is split into two SLURM-compatible shell scripts, each with its own conda env and GPU requirement:
| Script | Conda Env | GPU | Steps |
|---|---|---|---|
run_cxr_metrics.sh |
radgraph |
Yes | 0 (JSON→CSV) + 1 (BLEU, BERTScore, SembScore, RadGraph, RadCliQ) |
run_green_ratescore.sh |
green_score |
Yes | 2 (RaTEScore) + 3 (GREEN) + 4 (Aggregate) + 5 (Leaderboard) |
# --- Option A: Submit as two separate SLURM jobs ---
sbatch scripts/run_cxr_metrics.sh \
--json data/submission_example.json \
--model MyModel --dataset iu_xray
# (after job 1 completes)
sbatch scripts/run_green_ratescore.sh \
--model MyModel --dataset iu_xray
# --- Option B: Run both interactively ---
bash scripts/run_all.sh \
--json data/submission_example.json \
--model MyModel --dataset iu_xrayrun_all.sh is a convenience wrapper that runs both scripts sequentially (switches conda envs automatically).
| Step | GPU | VRAM | Notes |
|---|---|---|---|
| Step 1: CXR-Report-Metric | Required | >= 8 GB | RadGraph (DyGIE++) inference |
| Step 2: RaTEScore | Optional | — | CPU is fine |
| Step 3: GREEN | Required | >= 16 GB | GREEN-radllama2-7b (7B LLM) |
| Steps 4-5: Aggregate | No | — | Pure pandas |
- RAM: >= 64 GB recommended
- Storage: ~15 GB for model checkpoints
If you use this evaluation pipeline, please cite:
@article{zhang2024rexrank,
title={ReXrank: A Public Leaderboard for AI-Powered Radiology Report Generation},
author={Zhang, Xiaoman and Zhou, Hong-Yu and Yang, Xiaoli and Banerjee, Oishi and Acosta, Juli{\'a}n N and Miller, Josh and Huang, Ouwen and Rajpurkar, Pranav},
journal={AAAI Bridge Program AIMedHealth},
year={2025}
}
@inproceedings{zhang2025rexgradient,
title={ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports},
author={Zhang, Xiaoman and Acosta, Julián N. and Miller, Josh and Huang, Ouwen and Rajpurkar, Pranav},
booktitle={arXiv:2505.00228v1},
year={2025}
}