AutoMetrics: Automatically discover, generate, and aggregate evaluation metrics for NLP tasks.
Autometrics helps you evaluate text generation systems by:
- Generating task-specific candidate metrics with LLMs (LLM-as-a-judge, rubric/code generated metrics)
- Retrieving the most relevant metrics from a bank of 40+ built-in metrics
- Evaluating all metrics on your dataset (reference-free and reference-based)
- Selecting the top metrics using regression
- Aggregating them into a single, optimized metric and producing a report card
The repository includes simple scripts, examples, notebooks, and a full library to run the end-to-end pipeline.
Install the published package (recommended):
pip install autometrics-ai
Install with extras (examples):
pip install "autometrics-ai[mauve]"
pip install "autometrics-ai[bleurt,bert-score,rouge]"
pip install "autometrics-ai[reward-models,gpu]" # reward models + GPU accelDeveloper install (from source):
pip install -e .
Optional extras (summary):
- fasttext: FastText classifiers — metrics: FastTextEducationalValue, FastTextToxicity, FastTextNSFW
- lens: LENS metrics — metrics: LENS, LENS_SALSA
- parascore: Paraphrase metrics — metrics: ParaScore, ParaScoreFree
- bert-score: metrics: BERTScore
- bleurt: metrics: BLEURT
- moverscore: metrics: MOVERScore (adds pyemd)
- rouge: metrics: ROUGE, UpdateROUGE
- meteor: metrics: METEOR (adds beautifulsoup4)
- infolm: metrics: InfoLM (adds torchmetrics)
- mauve: metrics: MAUVE (evaluate + mauve-text)
- spacy: metrics: SummaQA (requires a spaCy model; install with python -m spacy download en_core_web_sm)
- hf-evaluate: HF evaluate wrappers — metrics: Toxicity; also used by some wrappers
- reward-models: Large HF reward models — metrics: PRMRewardModel, INFORMRewardModel, LDLRewardModel, GRMRewardModel
- readability: metrics: FKGL (textstat)
- gpu: FlashAttention + NV libs (optional acceleration; benefits large reward models)
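If you want to check which optional dependencies are actually importable in your environment before relying on the corresponding metrics, here is a minimal sketch using only the standard library (the extra-to-import-name mapping below is an assumption for the less common packages; adjust it to the extras you use):

from importlib.util import find_spec

# Map a few optional extras to the module name they are imported as.
# These import names are assumptions for illustration.
optional_modules = {
    "bert-score": "bert_score",
    "rouge": "rouge_score",
    "mauve": "mauve",
    "spacy": "spacy",
    "readability": "textstat",
}

for extra, module in optional_modules.items():
    status = "available" if find_spec(module) is not None else "missing"
    print(f"{extra:12s} -> {status}")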
- Install dependencies:
pip install autometrics-ai
- Ensure Java 21 is installed (required by some retrieval components). See the Java section below.
- Set an API key for an OpenAI-compatible endpoint (for LLM-based generation/judging):
export OPENAI_API_KEY="your-api-key-here"
- Run the simplest end-to-end example with sensible defaults:
python autometrics_simple_example.py
This will:
- load the HelpSteer dataset
- generate and retrieve metrics
- select top-k via regression
- print a summary and report card
For a power-user example with customization, run:
python autometrics_example.py
- Simple script with all defaults: examples/autometrics_simple_example.py
- Power-user/custom configuration: examples/autometrics_example.py
- Notebook tutorials: examples/tutorial.ipynb, demo.ipynb
- Text walkthrough tutorial: examples/TUTORIAL.md and the runnable examples/tutorial.py
If you prefer an experiments-style entry point with CLI arguments, see:
python analysis/main_experiments/run_main_autometrics.py <dataset_name> <target_name> <seed> <output_dir>
There are also convenience scripts in analysis/ for ablations and scaling.
- autometrics/dataset/datasets: Built-in datasets (e.g., helpsteer, simplification, evalgen, iclr, ...). The main dataset interface lives in autometrics/dataset/Dataset.py.
- autometrics/metrics: Metric implementations and utilities. See autometrics/metrics/README.md for how to write new metrics.
- autometrics/metrics/llm_judge: LLM-as-a-judge rubric generators (e.g., G-Eval, Prometheus-style, example-based).
- autometrics/aggregator/regression: Regression-based selection/aggregation (Lasso, Ridge, ElasticNet, PLS, etc.).
- autometrics/recommend: Metric retrieval modules (BM25/ColBERT/LLMRec and PipelinedRec).
- autometrics/test: Unit and integration tests, including caching behavior and generator tests.
- analysis/: Experiment drivers (CLI), ablations, robustness/scaling studies, and utilities.
import os
import dspy
from autometrics.autometrics import Autometrics
from autometrics.dataset.datasets.helpsteer.helpsteer import HelpSteer
os.environ["OPENAI_API_KEY"] = "your-key-here"
dataset = HelpSteer()
generator_llm = dspy.LM("openai/gpt-4o-mini")
judge_llm = dspy.LM("openai/gpt-4o-mini")
autometrics = Autometrics()
results = autometrics.run(
dataset=dataset,
target_measure="helpfulness",
generator_llm=generator_llm,
judge_llm=judge_llm,
)
print([m.get_name() for m in results['top_metrics']])
print(results['regression_metric'].get_name())
For more advanced configuration (custom generators, retrieval pipelines, priors, parallelism), see TUTORIAL.md.
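If your key is for an OpenAI-compatible endpoint other than api.openai.com, you can usually point dspy.LM at it directly. A sketch, assuming your DSPy version forwards api_base/api_key to the underlying client (the URL below is an example):

import dspy

# Route requests through a self-hosted or proxied OpenAI-compatible API.
judge_llm = dspy.LM(
    "openai/gpt-4o-mini",
    api_base="https://my-gateway.example.com/v1",  # example endpoint URL
    api_key="your-api-key-here",
)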
Install dependencies:
pip install -r requirements.txt
Some metrics require GPUs. You can inspect GPU memory needs by checking gpu_mem on metric classes. Many metrics run on CPU.
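For example, reusing the results dict from the quick-start example above, a small sketch for checking the selected metrics looks like this (gpu_mem is the attribute mentioned above; its exact format is not assumed here):

# Print approximate GPU memory needs for the metrics Autometrics selected.
# `results` is the dict returned by autometrics.run(...) in the example above.
for metric in results["top_metrics"]:
    print(metric.get_name(), getattr(metric, "gpu_mem", "n/a"))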
This package requires Java Development Kit (JDK) 21 for some of its search functionality.
Linux (Debian/Ubuntu):
sudo apt update
sudo apt install openjdk-21-jdk
macOS (Homebrew):
brew install openjdk@21
Windows: download and install from https://www.oracle.com/java/technologies/downloads/#java21, or use Chocolatey:
choco install openjdk21
Verify the installation:
java -version
You should see something like:
openjdk version "21.0.x"
OpenJDK Runtime Environment ...
OpenJDK 64-Bit Server VM ...
Note: Java 17 and earlier will not work, as Pyserini requires Java 21.
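If you want your own scripts to fail fast when the wrong Java is on the PATH, a small standard-library check along these lines works (a sketch; it only parses the major version from the java -version banner):

import re
import subprocess

def java_major_version() -> int:
    # `java -version` writes its banner to stderr, e.g. openjdk version "21.0.2"
    proc = subprocess.run(["java", "-version"], capture_output=True, text=True)
    match = re.search(r'version "(\d+)', proc.stderr or proc.stdout)
    if match is None:
        raise RuntimeError("Could not determine the Java version; is Java installed?")
    return int(match.group(1))

if java_major_version() < 21:
    raise RuntimeError("Java 21 or newer is required (Pyserini).")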
Built-in datasets are in autometrics/dataset/datasets (e.g., HelpSteer, SimpDA, ICLR, RealHumanEval, etc.). You can also construct your own via the Dataset class.
Minimal custom dataset example:
import pandas as pd
from autometrics.dataset.Dataset import Dataset
df = pd.DataFrame({
'id': ['1', '2'],
'input': ['prompt 1', 'prompt 2'],
'output': ['response 1', 'response 2'],
'reference': ['ref 1', 'ref 2'],
'human_score': [4.5, 3.2]
})
dataset = Dataset(
dataframe=df,
target_columns=['human_score'],
ignore_columns=['id'],
metric_columns=[],
name="MyCustomDataset",
data_id_column="id",
input_column="input",
output_column="output",
reference_columns=['reference'],
task_description="Evaluate response quality",
)
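Once constructed, this dataset can be passed to Autometrics just like the built-in ones. A sketch reusing the quick-start setup (here target_measure is assumed to name one of the dataset's target columns):

import dspy
from autometrics.autometrics import Autometrics

generator_llm = dspy.LM("openai/gpt-4o-mini")
judge_llm = dspy.LM("openai/gpt-4o-mini")

results = Autometrics().run(
    dataset=dataset,                 # the custom Dataset built above
    target_measure="human_score",    # assumed to match one of target_columns
    generator_llm=generator_llm,
    judge_llm=judge_llm,
)
print([m.get_name() for m in results["top_metrics"]])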
The library implements disk caching for all metrics to improve performance when running scripts multiple times. Key features:
- All metrics cache results by default in the ./autometrics_cache directory (configurable via AUTOMETRICS_CACHE_DIR)
- Cache keys include input/output/references and all initialization parameters
- Non-behavioral parameters (name, description, cache config) are excluded automatically
- You can exclude additional parameters via self.exclude_from_cache_key()
- Disable caching per metric with use_cache=False
- Very fast metrics like BLEU/SARI may disable caching by default
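For example, to redirect the cache and opt a single metric out of caching (the metric class below is a placeholder, shown only to illustrate the use_cache flag):

import os

# Point the on-disk cache somewhere else before constructing any metrics.
os.environ["AUTOMETRICS_CACHE_DIR"] = "/path/to/my_cache"

# Per-metric opt-out (SomeMetric stands in for any metric class):
# metric = SomeMetric(use_cache=False)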
See examples in autometrics/test/custom_metric_caching_example.py. For guidance on writing new metrics, see autometrics/metrics/README.md.
- Read the tutorial: examples/TUTORIAL.md (and examples/tutorial.ipynb)
- Browse built-in metrics under autometrics/metrics/
- Explore experiment drivers in analysis/
If you use this software, please cite it as below.
@software{Ryan_Autometrics_2025,
author = {Ryan, Michael J. and Zhang, Yanzhe and Salunkhe, Amol and Chu, Yi and Xu, Di and Yang, Diyi},
license = {MIT},
title = {{Autometrics}},
url = {https://github.com/XenonMolecule/autometrics},
version = {1.0.0},
year = {2025}
}