# rusty-llm-jury

A Rust-based CLI tool for estimating success rates when using LLM judges for evaluation.
- Overview
- Installation
- Quick Start
- How It Works
- CLI Reference
- Examples
- Building from Source
- Testing
- Contributing
- License
## Overview

When using Large Language Models (LLMs) as judges to evaluate other models or systems, the judge's own biases and errors can significantly undermine the reliability of the evaluation. rusty-llm-jury provides a command-line tool that estimates the true success rate of your system by correcting for judge bias and quantifying the remaining uncertainty with bootstrap confidence intervals.
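For intuition (illustrative numbers, not output from the tool): suppose the judge's TPR is 0.95 and its TNR is 0.60, and the true pass rate is 0.50. The judge then reports passes at a rate of 0.95 × 0.50 + (1 − 0.60) × 0.50 = 0.675, overstating quality by 17.5 percentage points. Inverting that relationship, which is what the correction described below does, recovers (0.675 + 0.60 − 1) / (0.95 + 0.60 − 1) = 0.50.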
## Installation

Install from crates.io:

```bash
cargo install llm-jury
```

Or install from source:

```bash
git clone https://github.com/udapy/rusty-llm-jury.git
cd rusty-llm-jury
cargo install --path .
```

## Quick Start

```bash
# Estimate true success rate with bias correction
llm-jury estimate \
--test-labels "1,1,0,0,1,0,1,0" \
--test-preds "1,0,0,1,1,0,1,0" \
--unlabeled-preds "1,1,0,1,0,1,0,1" \
--bootstrap-iterations 20000 \
--confidence-level 0.95
# Output:
# Estimated true pass rate: 0.625
# 95% Confidence interval: [0.234, 0.891]
```

```bash
# Load data from CSV files
llm-jury estimate \
--test-labels-file test_labels.csv \
--test-preds-file test_preds.csv \
--unlabeled-preds-file unlabeled_preds.csv
```

```bash
# Run TPR/TNR sensitivity analysis
llm-jury synth-experiment \
--true-failure-rate 0.1 \
--tpr-range 0.5,0.95 \
--tnr-range 0.5,0.95 \
--n-points 10 \
--output results.json
```

## How It Works

The tool implements a bias correction method based on the following steps:
- Judge Accuracy Estimation: Calculate the LLM judge's True Positive Rate (TPR) and True Negative Rate (TNR) using labeled test data
- Correction: Apply the Rogan-Gladen correction formula to account for judge bias:

  θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1)

  where p_obs is the observed pass rate from the judge
- Bootstrap Confidence Intervals: Use bootstrap resampling to quantify uncertainty in the estimate (see the sketch below)
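The sketch below walks through these three steps in plain Rust: estimate TPR and TNR from the labeled test set, apply the Rogan-Gladen correction to the observed pass rate on the unlabeled predictions, and use a percentile bootstrap for the interval. It is an illustrative outline, not the crate's actual implementation, and it assumes the `rand` crate (0.8-style `thread_rng`/`gen_range` API).

```rust
// Illustrative sketch only; not the crate's internal implementation.
use rand::Rng;

/// Fraction of 1s in a 0/1 slice (NaN if the slice is empty).
fn rate(xs: &[u8]) -> f64 {
    xs.iter().map(|&x| x as f64).sum::<f64>() / xs.len() as f64
}

/// P(judge == pred_value | human label == label_value); e.g. TPR = cond_rate(l, p, 1, 1).
fn cond_rate(labels: &[u8], preds: &[u8], label_value: u8, pred_value: u8) -> f64 {
    let hits: Vec<u8> = labels
        .iter()
        .zip(preds.iter())
        .filter(|&(&l, _)| l == label_value)
        .map(|(_, &p)| u8::from(p == pred_value))
        .collect();
    rate(&hits)
}

/// Rogan-Gladen correction; only meaningful when TPR + TNR > 1.
fn rogan_gladen(p_obs: f64, tpr: f64, tnr: f64) -> f64 {
    ((p_obs + tnr - 1.0) / (tpr + tnr - 1.0)).clamp(0.0, 1.0)
}

/// Point estimate plus a 95% percentile-bootstrap confidence interval.
fn estimate(test_labels: &[u8], test_preds: &[u8], unlabeled: &[u8], iters: usize) -> (f64, f64, f64) {
    let point = rogan_gladen(
        rate(unlabeled),
        cond_rate(test_labels, test_preds, 1, 1), // TPR
        cond_rate(test_labels, test_preds, 0, 0), // TNR
    );

    let mut rng = rand::thread_rng();
    let mut draws = Vec::with_capacity(iters);
    for _ in 0..iters {
        // Resample test (label, prediction) pairs jointly and the unlabeled
        // predictions independently, both with replacement.
        let idx: Vec<usize> =
            (0..test_labels.len()).map(|_| rng.gen_range(0..test_labels.len())).collect();
        let labels: Vec<u8> = idx.iter().map(|&i| test_labels[i]).collect();
        let preds: Vec<u8> = idx.iter().map(|&i| test_preds[i]).collect();
        let unl: Vec<u8> =
            (0..unlabeled.len()).map(|_| unlabeled[rng.gen_range(0..unlabeled.len())]).collect();
        draws.push(rogan_gladen(
            rate(&unl),
            cond_rate(&labels, &preds, 1, 1),
            cond_rate(&labels, &preds, 0, 0),
        ));
    }
    // Resamples that contain no positives (or no negatives) produce NaN; drop
    // them before reading off the 2.5% and 97.5% percentiles.
    draws.retain(|x| x.is_finite());
    draws.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = draws[(draws.len() as f64 * 0.025) as usize];
    let hi = draws[(draws.len() as f64 * 0.975) as usize];
    (point, lo, hi)
}

fn main() {
    // Hypothetical data, for illustration only.
    let test_labels = [1u8, 1, 1, 1, 1, 0, 0, 0, 0, 0];
    let test_preds  = [1u8, 1, 1, 0, 1, 0, 0, 1, 0, 0];
    let unlabeled   = [1u8, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1];
    let (theta, lo, hi) = estimate(&test_labels, &test_preds, &unlabeled, 20_000);
    println!("estimated true pass rate: {theta:.3}, 95% CI: [{lo:.3}, {hi:.3}]");
}
```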
## CLI Reference

### `llm-jury estimate`

Estimate true pass rate with bias correction and confidence intervals.
Options:
- `--test-labels <VALUES>`: Comma-separated 0/1 values (human labels on test set)
- `--test-preds <VALUES>`: Comma-separated 0/1 values (judge predictions on test set)
- `--unlabeled-preds <VALUES>`: Comma-separated 0/1 values (judge predictions on unlabeled data)
- `--test-labels-file <FILE>`: Load test labels from CSV file
- `--test-preds-file <FILE>`: Load test predictions from CSV file
- `--unlabeled-preds-file <FILE>`: Load unlabeled predictions from CSV file
- `--bootstrap-iterations <N>`: Number of bootstrap iterations (default: 20000)
- `--confidence-level <LEVEL>`: Confidence level between 0 and 1 (default: 0.95)
- `--output <FILE>`: Save results to JSON file
- `--format <FORMAT>`: Output format: text, json, csv (default: text)
### `llm-jury synth-experiment`

Run synthetic sensitivity experiments over a grid of judge TPR/TNR values (a sketch of one grid point is shown after the options list).
Options:
- `--true-failure-rate <RATE>`: True failure rate in unlabeled data (default: 0.1)
- `--tpr-range <MIN,MAX>`: TPR range to test (default: 0.5,1.0)
- `--tnr-range <MIN,MAX>`: TNR range to test (default: 0.5,1.0)
- `--n-points <N>`: Number of points in each range (default: 10)
- `--n-test-positive <N>`: Number of positive test examples (default: 100)
- `--n-test-negative <N>`: Number of negative test examples (default: 100)
- `--n-unlabeled <N>`: Number of unlabeled samples (default: 1000)
- `--bootstrap-iterations <N>`: Bootstrap iterations (default: 2000)
- `--seed <SEED>`: Random seed for reproducibility
- `--output <FILE>`: Output file (JSON or CSV based on extension)
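As a rough illustration of what one grid point in such an experiment involves (this is a hypothetical sketch, not the crate's implementation, and again assumes the `rand` crate): draw items at the chosen true pass rate, simulate a judge with the given TPR/TNR, and compare the naive observed rate to the corrected estimate.

```rust
// Hypothetical sketch of one grid point of a sensitivity experiment.
use rand::Rng;

/// Simulate judge verdicts on `n_unlabeled` items whose true pass rate is
/// 1 - true_failure_rate, using a judge characterized by (tpr, tnr), and
/// return (naive observed pass rate, Rogan-Gladen corrected estimate).
fn simulate_point(true_failure_rate: f64, tpr: f64, tnr: f64, n_unlabeled: usize) -> (f64, f64) {
    let mut rng = rand::thread_rng();
    let true_pass_rate = 1.0 - true_failure_rate;

    let observed: f64 = (0..n_unlabeled)
        .map(|_| {
            let truly_passes = rng.gen_bool(true_pass_rate);
            // The judge says "pass" with probability TPR on true passes,
            // and with probability (1 - TNR) on true failures.
            let judge_says_pass =
                if truly_passes { rng.gen_bool(tpr) } else { rng.gen_bool(1.0 - tnr) };
            if judge_says_pass { 1.0 } else { 0.0 }
        })
        .sum::<f64>()
        / n_unlabeled as f64;

    // In the real tool TPR/TNR are themselves estimated from simulated test
    // sets; here we plug in the known values to keep the sketch short.
    let corrected = ((observed + tnr - 1.0) / (tpr + tnr - 1.0)).clamp(0.0, 1.0);
    (observed, corrected)
}

fn main() {
    let (observed, corrected) = simulate_point(0.1, 0.8, 0.7, 1_000);
    println!("observed = {observed:.3}, corrected = {corrected:.3} (true = 0.900)");
}
```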
## Examples

```bash
# Step 1: Collect your data
echo "1,0,1,1,0,0,1,0" > test_labels.csv # Human evaluation
echo "1,0,0,1,1,0,1,0" > test_preds.csv # LLM judge on same data
echo "1,1,0,1,0,1,0,1,1,0" > unlabeled.csv # LLM judge on target data
# Step 2: Estimate true success rate
llm-jury estimate \
--test-labels-file test_labels.csv \
--test-preds-file test_preds.csv \
--unlabeled-preds-file unlabeled.csv \
--format json \
--output results.json
# Step 3: View results
cat results.json
```

```bash
# Analyze how estimation varies with judge accuracy
llm-jury synth-experiment \
--true-failure-rate 0.2 \
--tpr-range 0.6,0.95 \
--tnr-range 0.6,0.95 \
--n-points 15 \
--seed 42 \
--output sensitivity_analysis.json
```

## Building from Source

Prerequisites:

- Rust 1.70+ (2021 edition)
- Cargo
```bash
# Clone repository
git clone https://github.com/ai-evals-course/rusty-llm-jury.git
cd rusty-llm-jury
# Build release version
make build
# Or using cargo directly
cargo build --release
# The binary will be at target/release/llm-jury
```

Development commands:

```bash
# Format code
make fmt
# Run lints
make clippy
# Run tests
make test
# All checks
make check
```

## Testing

Run the test suite:
```bash
cargo test
```

Run with coverage (requires cargo-tarpaulin):

```bash
cargo install cargo-tarpaulin
cargo tarpaulin --out html
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Inspired by the Python judgy package, which I am learning about during the AI evals course with Shreya and Hamel
- The Rogan-Gladen correction method for bias correction in diagnostic tests
- Bootstrap methodology for confidence interval estimation
- The Rust ecosystem for excellent tooling and libraries
Note: This tool assumes that your LLM judge performs better than random chance (TPR + TNR > 1). At TPR + TNR = 1 the judge is no more informative than a coin flip and the correction's denominator is zero, so if your judge's accuracy is too low the correction method may not be applicable.