# Evaluation Results

> [!WARNING]
> This feature is a work in progress.

The Hub provides a decentralized system for tracking model evaluation results. Benchmark datasets host leaderboards, and model repos store evaluation scores that automatically appear on both the model page and the benchmark's leaderboard.

## Benchmark Datasets

Dataset repos can be defined as **Benchmarks** (e.g., [AIME](https://huggingface.co/datasets/aime-ai/aime), [HLE](https://huggingface.co/datasets/cais/hle), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa)). These repos display a "Benchmark" tag, automatically aggregate evaluation results from model repos across the Hub, and show a leaderboard of top models.

### Registering a Benchmark

To register your dataset as a benchmark:

1. Create a dataset repo containing your evaluation data.
2. Add an `eval.yaml` file to the repo root with your benchmark configuration.
3. The file is validated at push time.
4. (**Beta**) Get in touch so we can add it to the allow-list.

The `eval.yaml` format is based on [Inspect AI](https://inspect.aisi.org.uk/), enabling reproducible evaluations. See the [Evaluating models with Inspect](https://huggingface.co/docs/inference-providers/guides/evaluation-inspect-ai) guide for details on running evaluations.

<!-- TODO: Add example of eval.yaml file -->

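
As a rough illustration only, pending the official example: the field names below are assumptions, not the documented schema, but an Inspect AI-based `eval.yaml` might look something like this:

```yaml
# Hypothetical sketch — field names are illustrative, not the official schema.
name: hle                      # benchmark identifier
tasks:
  - id: default                # task_id that model eval results can reference
    inspect:
      task: inspect_evals/hle  # Inspect AI task used to run the evaluation
    metric: accuracy           # metric reported on the leaderboard
```
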
## Model Evaluation Results

Evaluation scores are stored in model repos as YAML files in the `.eval_results/` folder. These results:

- Appear on the model page with links to the benchmark leaderboard
- Are aggregated into the benchmark dataset's leaderboards
- Can be submitted via PRs and marked as "community-provided"

### Adding Evaluation Results

To add evaluation results to a model, create a YAML file under `.eval_results/` in the model repo (or submit one via a PR if you don't have write access):

```yaml
- dataset:
    id: cais/hle              # Required. Hub dataset ID (must be a Benchmark)
    task_id: default          # Optional, in case there are multiple tasks or leaderboards for this dataset
    revision: <hash>          # Optional. Dataset revision hash
  value: 20.90                # Required. Metric value
  verifyToken: <token>        # Optional. Cryptographic proof of auditable evaluation
  date: 2025-01-15T10:30:00Z  # Optional. ISO-8601 datetime (defaults to git commit time)
  source:                     # Optional. Attribution for the result
    url: https://huggingface.co/datasets/cais/hle  # Required if source provided
    name: CAIS HLE            # Optional. Display name
    user: cais                # Optional. HF username/org
```

Or, a minimal entry with the required attributes (plus the optional `task_id` to select a specific leaderboard):

```yaml
- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
```
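
As a sanity check before committing, a small script can verify the required fields. This is an unofficial sketch, not the Hub's own validation (which runs at push time); it only checks that each entry has a `dataset.id` and a numeric `value`:

```python
def check_eval_results(entries):
    """Check parsed eval-result entries (e.g. loaded with yaml.safe_load).

    Unofficial sketch: verifies only the two required fields described
    above, dataset.id and a numeric value. Returns a list of error strings.
    """
    errors = []
    for i, entry in enumerate(entries):
        dataset = entry.get("dataset", {})
        if not isinstance(dataset, dict) or not dataset.get("id"):
            errors.append(f"entry {i}: missing required field dataset.id")
        if not isinstance(entry.get("value"), (int, float)):
            errors.append(f"entry {i}: missing or non-numeric required field value")
    return errors


# The minimal GPQA entry above, as Python data.
ok = [{"dataset": {"id": "Idavidrein/gpqa", "task_id": "gpqa_diamond"}, "value": 0.412}]
bad = [{"dataset": {"task_id": "default"}}]  # no dataset.id, no value

print(check_eval_results(ok))   # []
print(check_eval_results(bad))  # two error messages
```

An empty list means the entries are structurally plausible; the Hub may still enforce additional constraints (e.g. that the dataset is a registered Benchmark).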

Results display badges based on their metadata in the YAML file:

| Badge | Condition |
|-------|-----------|
| verified | A valid `verifyToken` is present (evaluation ran in HF Jobs with inspect-ai) |
| community-provided | Result submitted via an open PR (not merged to `main`) |
| leaderboard | Links to the benchmark dataset |
| source | Links to evaluation logs or an external source |

For more details on how to format this data, check out the [Eval Results](https://github.com/huggingface/hub-docs/blob/main/eval_results.yaml) specification.

### Community Contributions

Anyone can submit evaluation results to any model via a Pull Request:

1. Go to the model page, open the "Community" tab, and create a Pull Request.
2. Add a `.eval_results/*.yaml` file with your results.
3. While the PR is open, the result shows as "community-provided" on the model page.
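
The steps above can also be scripted. The sketch below uses `huggingface_hub` (requires `pip install huggingface_hub` and an authenticated token, e.g. via `huggingface-cli login`); the repo ID and filename are illustrative placeholders:

```python
def build_eval_result_yaml(dataset_id, value, task_id=None):
    """Compose a minimal .eval_results YAML entry as a string.

    Emits the two required fields (dataset.id and value), plus task_id
    when given. Plain string formatting keeps this dependency-free.
    """
    lines = ["- dataset:", f"    id: {dataset_id}"]
    if task_id:
        lines.append(f"    task_id: {task_id}")
    lines.append(f"  value: {value}")
    return "\n".join(lines) + "\n"


def submit_as_pr(model_repo_id, yaml_text, filename="results.yaml"):
    """Open a community PR adding the file to a model repo's .eval_results/ folder."""
    from huggingface_hub import CommitOperationAdd, HfApi  # third-party dependency

    api = HfApi()
    return api.create_commit(
        repo_id=model_repo_id,
        repo_type="model",
        operations=[
            CommitOperationAdd(
                path_in_repo=f".eval_results/{filename}",
                path_or_fileobj=yaml_text.encode(),
            )
        ],
        commit_message="Add evaluation results",
        create_pr=True,  # open a PR instead of pushing to main
    )


yaml_text = build_eval_result_yaml("Idavidrein/gpqa", 0.412, task_id="gpqa_diamond")
print(yaml_text)
# submit_as_pr("some-org/some-model", yaml_text)  # uncomment to open the PR
```

With `create_pr=True`, the commit lands on a PR branch rather than `main`, so the result appears as "community-provided" until the repo owner merges it.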

For help evaluating a model, see the [Evaluating models with Inspect](https://huggingface.co/docs/inference-providers/guides/evaluation-inspect-ai) guide.