
Commit 3a34c0f

julien-c, lhoestq, and burtenshaw authored

Metadata proposal for eval_results (#2107)

* Metadata proposal for `eval_results`
* Apply suggestion from @lhoestq
* Ok let's do this!!
* [enhancement] eval-results - add a page and menu (#2109)
  * add a page and menu
  * Update docs/hub/_toctree.yml
  * move eval results in menu to after model cards
  * add beta warning step
  * add link to eval results spec
  * use branch link for yaml spec
* Better TOC
* rm metric mention
* rm mention of metrics here too
* more correct dataset ids?
* link from previous doc to new doc!
* Also link from modelcard.md
* move this as it's a bit stale
* Apply suggestion from @burtenshaw
* consistency for datasets as well
* Final tweaks to spec

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>
1 parent 447d977 commit 3a34c0f

File tree

8 files changed: +140 -25 lines changed

datasetcard.md

Lines changed: 1 addition & 2 deletions

@@ -1,6 +1,5 @@
 ---
 # Example metadata to be added to a dataset card.
-# Full dataset card template at https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md
 language:
 - {lang_0} # Example: fr
 - {lang_1} # Example: en
@@ -99,4 +98,4 @@ extra_gated_prompt: {extra_gated_prompt} # Example for speech datasets: By clic
 
 Valid license identifiers can be found in [our docs](https://huggingface.co/docs/hub/repositories-licenses).
 
-For the full dataset card template, see: [datasetcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md).
+For a template for the human-readable portion of the dataset card, see: [datasetcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md).

docs/hub/_toctree.yml

Lines changed: 9 additions & 6 deletions

@@ -91,16 +91,12 @@
 - local: model-cards
   title: Model Cards
   sections:
-    - local: model-card-annotated
-      title: Annotated Model Card
     - local: model-cards-co2
       title: Carbon Emissions
-    - local: model-card-guidebook
-      title: Model Card Guidebook
-    - local: model-card-landscape-analysis
-      title: Landscape Analysis
     - local: model-cards-components
       title: Card Components
+    - local: eval-results
+      title: Eval Results
 - local: models-gated
   title: Gated Models
 - local: models-uploading
@@ -185,6 +181,13 @@
   title: Local Apps
 - local: models-faq
   title: Frequently Asked Questions
+  sections:
+    - local: model-card-annotated
+      title: Annotated Model Card
+    - local: model-card-guidebook
+      title: Model Card Guidebook
+    - local: model-card-landscape-analysis
+      title: Model Card Landscape
 - local: models-advanced
   title: Advanced Topics
   sections:

docs/hub/eval-results.md

Lines changed: 85 additions & 0 deletions (new file)

# Evaluation Results

> [!WARNING]
> This feature is a work in progress.

The Hub provides a decentralized system for tracking model evaluation results. Benchmark datasets host leaderboards, and model repos store evaluation scores that automatically appear on both the model page and the benchmark's leaderboard.

## Benchmark Datasets

Dataset repos can be defined as **Benchmarks** (e.g., [AIME](https://huggingface.co/datasets/aime-ai/aime), [HLE](https://huggingface.co/datasets/cais/hle), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa)). These repos display a "Benchmark" tag, automatically aggregate evaluation results from model repos across the Hub, and display a leaderboard of top models.

![Benchmark Dataset](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/benchmark-preview.png)

### Registering a Benchmark

To register your dataset as a benchmark:

1. Create a dataset repo containing your evaluation data.
2. Add an `eval.yaml` file to the repo root with your benchmark configuration.
3. The file is validated at push time.
4. (**Beta**) Get in touch so we can add it to the allow-list.

The `eval.yaml` format is based on [Inspect AI](https://inspect.aisi.org.uk/), enabling reproducible evaluations. See the [Evaluating models with Inspect](https://huggingface.co/docs/inference-providers/guides/evaluation-inspect-ai) guide for details on running evaluations.

<!-- TODO: Add example of eval.yaml file -->
## Model Evaluation Results

Evaluation scores are stored in model repos as YAML files in the `.eval_results/` folder. These results:

- Appear on the model page with links to the benchmark leaderboard
- Are aggregated into the benchmark dataset's leaderboards
- Can be submitted via PRs and marked as "community-provided"

![Model Evaluation Results](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/eval-results-previw.png)

### Adding Evaluation Results

To add evaluation results to a model, you can submit a PR to the model repo with a YAML file in the `.eval_results/` folder.

Create a YAML file in `.eval_results/*.yaml` in your model repo:

```yaml
- dataset:
    id: cais/hle # Required. Hub dataset ID (must be a Benchmark)
    task_id: default # Optional, in case there are multiple tasks or leaderboards for this dataset.
    revision: <hash> # Optional. Dataset revision hash
  value: 20.90 # Required. Metric value
  verifyToken: <token> # Optional. Cryptographic proof of auditable evaluation
  date: 2025-01-15T10:30:00Z # Optional. ISO-8601 datetime (defaults to git commit time)
  source: # Optional. Attribution for the result
    url: https://huggingface.co/datasets/cais/hle # Required if source provided
    name: CAIS HLE # Optional. Display name
    user: cais # Optional. HF username/org
```

Or, with only the required attributes:

```yaml
- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
```
Results display badges based on their metadata in the YAML file:

| Badge | Condition |
|-------|-----------|
| verified | A `verifyToken` is valid (evaluation ran in HF Jobs with inspect-ai) |
| community-provided | Result submitted via open PR (not merged to main) |
| leaderboard | Links to the benchmark dataset |
| source | Links to evaluation logs or external source |

For more details on how to format this data, check out the [Eval Results](https://github.com/huggingface/hub-docs/blob/main/eval_results.yaml) specification.
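Before opening a PR, a results file can be sanity-checked against the required/optional fields described above. Below is a minimal sketch assuming PyYAML is installed; `check_results` is a hypothetical helper written for this doc, not a Hub or `huggingface_hub` API:

```python
# Sanity-check an .eval_results/*.yaml file against the documented schema:
# dataset.id and value are required; source.url is required if source is given.
# Hypothetical helper for illustration -- not an official Hub validator.
import yaml  # pip install pyyaml


def check_results(text: str) -> list[str]:
    """Return a list of human-readable problems (empty list = looks OK)."""
    problems = []
    entries = yaml.safe_load(text)
    if not isinstance(entries, list):
        return ["top level must be a list of results"]
    for i, entry in enumerate(entries):
        dataset = entry.get("dataset") or {}
        if not dataset.get("id"):
            problems.append(f"entry {i}: missing required dataset.id")
        if "value" not in entry:
            problems.append(f"entry {i}: missing required value")
        source = entry.get("source")
        if source is not None and not source.get("url"):
            problems.append(f"entry {i}: source given without required url")
    return problems


minimal = """
- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
"""
print(check_results(minimal))  # []
```

This mirrors the push-time checks described above, but locally, so problems surface before the PR is opened.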
### Community Contributions

Anyone can submit evaluation results to any model via Pull Request:

1. Go to the model page, click on the "Community" tab, and open a Pull Request.
2. Add a `.eval_results/*.yaml` file with your results.
3. The PR will show as "community-provided" on the model page while open.

For help evaluating a model, see the [Evaluating models with Inspect](https://huggingface.co/docs/inference-providers/guides/evaluation-inspect-ai) guide.

docs/hub/model-cards-co2.md

Lines changed: 1 addition & 1 deletion

@@ -36,7 +36,7 @@ The math is pretty simple! ➕
 
 First, you take the *carbon intensity* of the electric grid used for the training -- this is how much CO<sub>2</sub> is produced per kWh of electricity used. The carbon intensity depends on the location of the hardware and the [energy mix](https://electricitymap.org/) used at that location -- whether it's renewable energy like solar 🌞, wind 🌬️ and hydro 💧, or non-renewable energy like coal ⚫ and natural gas 💨. The more renewable energy gets used for training, the less carbon-intensive it is!
 
-Then, you take the power consumption of the GPU during training using the `pynvml` library.
+Then, you take the power consumption of the GPUs during training using the `pynvml` library.
 
 Finally, you multiply the power consumption and carbon intensity by the training time of the model, and you have an estimate of the CO<sub>2</sub> emission.
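The estimate described in this diff boils down to a single multiplication: energy used (kWh) times grid carbon intensity (kg CO<sub>2</sub>/kWh). Here is an illustrative back-of-the-envelope sketch with made-up numbers, not the Hub's actual tooling; in practice the power draw would be sampled with `pynvml` (`nvmlDeviceGetPowerUsage`, which reports milliwatts):

```python
# Back-of-the-envelope CO2 estimate: power x time x carbon intensity.
# In real tracking, GPU power draw is sampled with pynvml
# (nvmlDeviceGetPowerUsage returns milliwatts); here we plug in example numbers.

def co2_kg(avg_power_watts: float, hours: float, intensity_kg_per_kwh: float) -> float:
    """CO2 emitted (kg) = energy used (kWh) * grid carbon intensity (kg CO2 / kWh)."""
    energy_kwh = avg_power_watts / 1000.0 * hours
    return energy_kwh * intensity_kg_per_kwh


# Example: one 300 W GPU for 100 hours on a grid emitting 0.4 kg CO2 per kWh.
print(co2_kg(300, 100, 0.4))  # 12.0 kg CO2
```

For multiple GPUs, sum (or average and multiply) the per-device draw before converting to kWh.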

docs/hub/model-cards.md

Lines changed: 10 additions & 10 deletions

@@ -30,7 +30,7 @@ The metadata you add to the model card supports discovery and easier use of your
 * Displaying the model's license.
 * Adding datasets to the metadata will add a message reading `Datasets used to train:` to your model page and link the relevant datasets, if they're available on the Hub.
 
-Dataset, metric, and language identifiers are those listed on the [Datasets](https://huggingface.co/datasets), [Metrics](https://huggingface.co/metrics) and [Languages](https://huggingface.co/languages) pages.
+Dataset and language identifiers are those listed on the [Datasets](https://huggingface.co/datasets) and [Languages](https://huggingface.co/languages) pages.
 
 
 ### Adding metadata to your model card
@@ -72,9 +72,6 @@ license: "any valid license identifier"
 datasets:
 - dataset1
 - dataset2
-metrics:
-- metric1
-- metric2
 base_model: "base model Hub identifier"
 ---
 ```
@@ -101,7 +98,7 @@
 If it's not specified, the Hub will try to automatically detect the library type. However, this approach is discouraged, and repo creators should use the explicit `library_name` as much as possible.
 
 1. By looking into the presence of files such as `*.nemo` or `*.mlmodel`, the Hub can determine if a model is from NeMo or CoreML.
-2. In the past, if nothing was detected and there was a `config.json` file, it was assumed the library was `transformers`. For model repos created after August 2024, this is not the case anymore – so you need to `library_name: transformers` explicitly.
+2. In the past, if nothing was detected and there was a `config.json` file, it was assumed the library was `transformers`. For model repos created after August 2024, this is not the case anymore, so you need to set `library_name: transformers` explicitly.
 
 ### Specifying a base model
 
@@ -181,8 +178,8 @@ You can specify the datasets used to train your model in the model card metadata
 
 ```yaml
 datasets:
-- imdb
-- HuggingFaceH4/no_robots
+- stanfordnlp/imdb
+- HuggingFaceFW/fineweb
 ```
 
 ### Specifying a task (`pipeline_tag`)
@@ -217,9 +214,12 @@ You can specify your **model's evaluation results** in a structured way in the m
 <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/eval-results-v2-dark.png"/>
 </div>
 
-The metadata spec was based on Papers with code's [model-index specification](https://github.com/paperswithcode/model-index). This allow us to directly index the results into Papers with code's leaderboards when appropriate. You can also link the source from where the eval results has been computed.
+The initial metadata spec was based on Papers with code's [model-index specification](https://github.com/paperswithcode/model-index). This allowed us to directly index the results into Papers with code's leaderboards when appropriate. You could also link the source from which the eval results were computed.
 
-Here is a partial example to describe [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)'s score on the ARC benchmark. The result comes from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) which is defined as the `source`:
+> [!TIP]
+> NEW: We have a new, simpler metadata format for eval results. Check it out in [the dedicated doc page](./eval-results).
+
+Here is a partial example of a model-index describing [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)'s score on the ARC benchmark. The result came from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which is defined as the `source`:
 
 ```yaml
 ---
@@ -263,7 +263,7 @@ Read more about Paper pages [here](./paper-pages).
 
 ## Model Card text
 
-Details on how to fill out a human-readable model card without Hub-specific metadata (so that it may be printed out, cut+pasted, etc.) is available in the [Annotated Model Card](./model-card-annotated).
+Details on how to fill out the human-readable portion of the model card (so that it may be printed out, cut+pasted, etc.) are available in the [Annotated Model Card](./model-card-annotated).
 
 ## FAQ
docs/hub/models.md

Lines changed: 4 additions & 3 deletions

@@ -7,13 +7,14 @@ The Hugging Face Hub hosts many models for a [variety of machine learning tasks]
 - [The Model Hub](./models-the-hub)
 - [Model Cards](./model-cards)
 - [CO<sub>2</sub> emissions](./model-cards-co2)
-- [Gated models](./models-gated)
-- [Libraries](./models-libraries)
+- [Eval Results](./eval-results)
+- [Gated models](./models-gated)
 - [Uploading Models](./models-uploading)
 - [Downloading Models](./models-downloading)
+- [Libraries](./models-libraries)
 - [Widgets](./models-widgets)
 - [Widget Examples](./models-widgets-examples)
-- [Inference API](./models-inference)
+- [Model Inference](./models-inference)
 - [Frequently Asked Questions](./models-faq)
 - [Advanced Topics](./models-advanced)
 - [Integrating libraries with the Hub](./models-adding-libraries)

eval_results.yaml

Lines changed: 27 additions & 0 deletions (new file)

- dataset:
    id: cais/hle # Required. A valid dataset id from the Hub, which should have a "Benchmark" tag.
    # ^Basically, this is where the leaderboard lives.
    task_id: {task_id} # Optional, in case there are multiple tasks or leaderboards for this dataset.
    # It is defined in the benchmark dataset's eval.yaml file. Example: gpqa_diamond
    # It can usually be a dataset config (aka subset) or split name.
    revision: {dataset_revision} # Optional. Example: 5503434ddd753f426f4b38109466949a1217c2bb

  value: {metric_value} # Required. Example: 20.90

  verifyToken: {verify_token} # Optional. If present, this is a signature that can be used to prove that evaluation is provably auditable and reproducible.
  # (For example, was run in a HF Job using inspect-ai or lighteval)

  date: {date} # Optional. When was this eval run (ISO-8601 datetime). If not provided, can default to this file creation time in git.

  source: # Optional. The source for this result, for instance a dataset repo.
    url: {source_url} # Required if source is provided. A link to the source. Example: https://huggingface.co/spaces/SaylorTwift/smollm3-mmlu-pro.
    name: {source_name} # Optional. The name of the source. Example: Eval Logs.
    user: {username} # Optional. A HF user or org name.

# or, with only the required attributes:

- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412

modelcard.md

Lines changed: 3 additions & 3 deletions

@@ -1,6 +1,5 @@
 ---
 # Example metadata to be added to a model card.
-# Full model card template at https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md
 language:
 - {lang_0} # Example: fr
 - {lang_1} # Example: en
@@ -20,6 +19,7 @@ metrics:
 base_model: {base_model} # Example: stabilityai/stable-diffusion-xl-base-1.0. Can also be a list (for merges)
 
 # Optional. Add this if you want to encode your eval results in a structured way.
+# There is a newer, simpler version of this metadata format in ./eval_results.yaml
 model-index:
 - name: {model_id}
   results:
@@ -48,7 +48,7 @@ model-index:
       url: {source_url} # Required if source is provided. A link to the source. Example: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
 ---
 
-This markdown file contains the spec for the modelcard metadata regarding evaluation parameters. When present, and only then, 'model-index', 'datasets' and 'license' contents will be verified when git pushing changes to your README.md file.
+This markdown file contains the spec for the modelcard metadata. Properties will be validated by the Hub when git pushing changes to your README.md file.
 Valid license identifiers can be found in [our docs](https://huggingface.co/docs/hub/repositories-licenses).
 
-For the full model card template, see: [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md).
+For a template for the human-readable portion of the model card, see: [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md).
