# rusty-llm-jury

A Rust-based CLI tool for estimating success rates when using LLM judges for evaluation.
- Overview
- Installation
- Quick Start
- How It Works
- CLI Reference
- Examples
- Building from Source
- Testing
- Contributing
- License
## Overview

When using Large Language Models (LLMs) as judges to evaluate other models or systems, the judge's own biases and errors can significantly undermine the reliability of the evaluation. rusty-llm-jury provides a command-line tool that estimates the true success rate of your system by correcting for judge bias and quantifying the remaining uncertainty with bootstrap confidence intervals.
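For intuition (illustrative numbers, not output from the tool): suppose the judge's TPR is 0.95 and its TNR is 0.60, and the true pass rate is 0.50. The judge then reports passes at a rate of 0.95 × 0.50 + (1 − 0.60) × 0.50 = 0.675, overstating quality by 17.5 percentage points. Inverting that relationship, which is what the correction described below does, recovers (0.675 + 0.60 − 1) / (0.95 + 0.60 − 1) = 0.50.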
## Installation

Install from crates.io:

```bash
cargo install llm-jury
```

Or install from source:

```bash
git clone https://github.com/udapy/rusty-llm-jury.git
cd rusty-llm-jury
cargo install --path .
```

## Quick Start

```bash
# Estimate true success rate with bias correction
llm-jury estimate \
--test-labels "1,1,0,0,1,0,1,0" \
--test-preds "1,0,0,1,1,0,1,0" \
--unlabeled-preds "1,1,0,1,0,1,0,1" \
--bootstrap-iterations 20000 \
--confidence-level 0.95
# Output:
# Estimated true pass rate: 0.625
# 95% Confidence interval: [0.234, 0.891]
```

```bash
# Load data from CSV files
llm-jury estimate \
--test-labels-file test_labels.csv \
--test-preds-file test_preds.csv \
--unlabeled-preds-file unlabeled_preds.csv
```

```bash
# Run TPR/TNR sensitivity analysis
llm-jury synth-experiment \
--true-failure-rate 0.1 \
--tpr-range 0.5,0.95 \
--tnr-range 0.5,0.95 \
--n-points 10 \
--output results.json
```

## How It Works

The tool implements a bias correction method based on the following steps:
- Judge Accuracy Estimation: Calculate the LLM judge's True Positive Rate (TPR) and True Negative Rate (TNR) using labeled test data
- Correction: Apply the Rogan-Gladen correction formula to account for judge bias:

  θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1)

  where p_obs is the observed pass rate from the judge
- Bootstrap Confidence Intervals: Use bootstrap resampling to quantify uncertainty in the estimate (see the sketch below)
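The sketch below walks through these three steps in plain Rust: estimate TPR and TNR from the labeled test set, apply the Rogan-Gladen correction to the observed pass rate on the unlabeled predictions, and use a percentile bootstrap for the interval. It is an illustrative outline, not the crate's actual implementation, and it assumes the `rand` crate (0.8-style `thread_rng`/`gen_range` API).

```rust
// Illustrative sketch only; not the crate's internal implementation.
use rand::Rng;

/// Fraction of 1s in a 0/1 slice (NaN if the slice is empty).
fn rate(xs: &[u8]) -> f64 {
    xs.iter().map(|&x| x as f64).sum::<f64>() / xs.len() as f64
}

/// P(judge == pred_value | human label == label_value); e.g. TPR = cond_rate(l, p, 1, 1).
fn cond_rate(labels: &[u8], preds: &[u8], label_value: u8, pred_value: u8) -> f64 {
    let hits: Vec<u8> = labels
        .iter()
        .zip(preds.iter())
        .filter(|&(&l, _)| l == label_value)
        .map(|(_, &p)| u8::from(p == pred_value))
        .collect();
    rate(&hits)
}

/// Rogan-Gladen correction; only meaningful when TPR + TNR > 1.
fn rogan_gladen(p_obs: f64, tpr: f64, tnr: f64) -> f64 {
    ((p_obs + tnr - 1.0) / (tpr + tnr - 1.0)).clamp(0.0, 1.0)
}

/// Point estimate plus a 95% percentile-bootstrap confidence interval.
fn estimate(test_labels: &[u8], test_preds: &[u8], unlabeled: &[u8], iters: usize) -> (f64, f64, f64) {
    let point = rogan_gladen(
        rate(unlabeled),
        cond_rate(test_labels, test_preds, 1, 1), // TPR
        cond_rate(test_labels, test_preds, 0, 0), // TNR
    );

    let mut rng = rand::thread_rng();
    let mut draws = Vec::with_capacity(iters);
    for _ in 0..iters {
        // Resample test (label, prediction) pairs jointly and the unlabeled
        // predictions independently, both with replacement.
        let idx: Vec<usize> =
            (0..test_labels.len()).map(|_| rng.gen_range(0..test_labels.len())).collect();
        let labels: Vec<u8> = idx.iter().map(|&i| test_labels[i]).collect();
        let preds: Vec<u8> = idx.iter().map(|&i| test_preds[i]).collect();
        let unl: Vec<u8> =
            (0..unlabeled.len()).map(|_| unlabeled[rng.gen_range(0..unlabeled.len())]).collect();
        draws.push(rogan_gladen(
            rate(&unl),
            cond_rate(&labels, &preds, 1, 1),
            cond_rate(&labels, &preds, 0, 0),
        ));
    }
    // Resamples that contain no positives (or no negatives) produce NaN; drop
    // them before reading off the 2.5% and 97.5% percentiles.
    draws.retain(|x| x.is_finite());
    draws.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = draws[(draws.len() as f64 * 0.025) as usize];
    let hi = draws[(draws.len() as f64 * 0.975) as usize];
    (point, lo, hi)
}

fn main() {
    // Hypothetical data, for illustration only.
    let test_labels = [1u8, 1, 1, 1, 1, 0, 0, 0, 0, 0];
    let test_preds  = [1u8, 1, 1, 0, 1, 0, 0, 1, 0, 0];
    let unlabeled   = [1u8, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1];
    let (theta, lo, hi) = estimate(&test_labels, &test_preds, &unlabeled, 20_000);
    println!("estimated true pass rate: {theta:.3}, 95% CI: [{lo:.3}, {hi:.3}]");
}
```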
## CLI Reference

### `llm-jury estimate`

Estimate true pass rate with bias correction and confidence intervals.
Options:
- `--test-labels <VALUES>`: Comma-separated 0/1 values (human labels on test set)
- `--test-preds <VALUES>`: Comma-separated 0/1 values (judge predictions on test set)
- `--unlabeled-preds <VALUES>`: Comma-separated 0/1 values (judge predictions on unlabeled data)
- `--test-labels-file <FILE>`: Load test labels from CSV file
- `--test-preds-file <FILE>`: Load test predictions from CSV file
- `--unlabeled-preds-file <FILE>`: Load unlabeled predictions from CSV file
- `--bootstrap-iterations <N>`: Number of bootstrap iterations (default: 20000)
- `--confidence-level <LEVEL>`: Confidence level between 0 and 1 (default: 0.95)
- `--output <FILE>`: Save results to JSON file
- `--format <FORMAT>`: Output format: text, json, csv (default: text)
### `llm-jury synth-experiment`

Run synthetic sensitivity experiments over a grid of judge TPR/TNR values (a sketch of one grid point is shown after the options list).
Options:
- `--true-failure-rate <RATE>`: True failure rate in unlabeled data (default: 0.1)
- `--tpr-range <MIN,MAX>`: TPR range to test (default: 0.5,1.0)
- `--tnr-range <MIN,MAX>`: TNR range to test (default: 0.5,1.0)
- `--n-points <N>`: Number of points in each range (default: 10)
- `--n-test-positive <N>`: Number of positive test examples (default: 100)
- `--n-test-negative <N>`: Number of negative test examples (default: 100)
- `--n-unlabeled <N>`: Number of unlabeled samples (default: 1000)
- `--bootstrap-iterations <N>`: Bootstrap iterations (default: 2000)
- `--seed <SEED>`: Random seed for reproducibility
- `--output <FILE>`: Output file (JSON or CSV based on extension)
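As a rough illustration of what one grid point in such an experiment involves (this is a hypothetical sketch, not the crate's implementation, and again assumes the `rand` crate): draw items at the chosen true pass rate, simulate a judge with the given TPR/TNR, and compare the naive observed rate to the corrected estimate.

```rust
// Hypothetical sketch of one grid point of a sensitivity experiment.
use rand::Rng;

/// Simulate judge verdicts on `n_unlabeled` items whose true pass rate is
/// 1 - true_failure_rate, using a judge characterized by (tpr, tnr), and
/// return (naive observed pass rate, Rogan-Gladen corrected estimate).
fn simulate_point(true_failure_rate: f64, tpr: f64, tnr: f64, n_unlabeled: usize) -> (f64, f64) {
    let mut rng = rand::thread_rng();
    let true_pass_rate = 1.0 - true_failure_rate;

    let observed: f64 = (0..n_unlabeled)
        .map(|_| {
            let truly_passes = rng.gen_bool(true_pass_rate);
            // The judge says "pass" with probability TPR on true passes,
            // and with probability (1 - TNR) on true failures.
            let judge_says_pass =
                if truly_passes { rng.gen_bool(tpr) } else { rng.gen_bool(1.0 - tnr) };
            if judge_says_pass { 1.0 } else { 0.0 }
        })
        .sum::<f64>()
        / n_unlabeled as f64;

    // In the real tool TPR/TNR are themselves estimated from simulated test
    // sets; here we plug in the known values to keep the sketch short.
    let corrected = ((observed + tnr - 1.0) / (tpr + tnr - 1.0)).clamp(0.0, 1.0);
    (observed, corrected)
}

fn main() {
    let (observed, corrected) = simulate_point(0.1, 0.8, 0.7, 1_000);
    println!("observed = {observed:.3}, corrected = {corrected:.3} (true = 0.900)");
}
```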
## Examples

```bash
# Step 1: Collect your data
echo "1,0,1,1,0,0,1,0" > test_labels.csv # Human evaluation
echo "1,0,0,1,1,0,1,0" > test_preds.csv # LLM judge on same data
echo "1,1,0,1,0,1,0,1,1,0" > unlabeled.csv # LLM judge on target data
# Step 2: Estimate true success rate
llm-jury estimate \
--test-labels-file test_labels.csv \
--test-preds-file test_preds.csv \
--unlabeled-preds-file unlabeled.csv \
--format json \
--output results.json
# Step 3: View results
cat results.json
```

```bash
# Analyze how estimation varies with judge accuracy
llm-jury synth-experiment \
--true-failure-rate 0.2 \
--tpr-range 0.6,0.95 \
--tnr-range 0.6,0.95 \
--n-points 15 \
--seed 42 \
--output sensitivity_analysis.json
```

## Building from Source

Prerequisites:

- Rust 1.70+ (2021 edition)
- Cargo
```bash
# Clone repository
git clone https://github.com/ai-evals-course/rusty-llm-jury.git
cd rusty-llm-jury
# Build release version
make build
# Or using cargo directly
cargo build --release
# The binary will be at target/release/llm-jury
```

Development commands:

```bash
# Format code
make fmt
# Run lints
make clippy
# Run tests
make test
# All checks
make check
```

## Testing

Run the test suite:
```bash
cargo test
```

Run with coverage (requires cargo-tarpaulin):

```bash
cargo install cargo-tarpaulin
cargo tarpaulin --out html
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Inspired by the Python judgy package, which I am learning about during the AI evals course with Shreya and Hamel
- The Rogan-Gladen correction method for bias correction in diagnostic tests
- Bootstrap methodology for confidence interval estimation
- The Rust ecosystem for excellent tooling and libraries
Note: This tool assumes that your LLM judge performs better than random chance (TPR + TNR > 1). At TPR + TNR = 1 the judge is no more informative than a coin flip and the correction's denominator is zero, so if your judge's accuracy is too low the correction method may not be applicable.