
Commit 3a34c0f

julien-c, lhoestq, and burtenshaw authored

Metadata proposal for eval_results (#2107)

* Metadata proposal for `eval_results`
* Apply suggestion from @lhoestq
* Ok let's do this!!
* [enhancement] eval-results - add a page and menu (#2109)
  * add a page and menu
  * Update docs/hub/_toctree.yml
  * move eval results in menu to after model cards
  * add beta warning step
  * add link to eval results spec
  * use branch link for yaml spec
* Better TOC
* rm metric mention
* rm mention of metrics here too
* more correct dataset ids?
* link from previous doc to new doc!
* Also link from modelcard.md
* move this as it's a bit stale
* Apply suggestion from @burtenshaw
* consistency for datasets as well
* Final tweaks to spec

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>
1 parent 447d977 commit 3a34c0f

File tree

8 files changed: +140 -25 lines changed

datasetcard.md

Lines changed: 1 addition & 2 deletions

@@ -1,6 +1,5 @@
 ---
 # Example metadata to be added to a dataset card.
-# Full dataset card template at https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md
 language:
 - {lang_0} # Example: fr
 - {lang_1} # Example: en
@@ -99,4 +98,4 @@ extra_gated_prompt: {extra_gated_prompt} # Example for speech datasets: By clic
 
 Valid license identifiers can be found in [our docs](https://huggingface.co/docs/hub/repositories-licenses).
 
-For the full dataset card template, see: [datasetcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md).
+For a template for the human-readable portion of the dataset card, see: [datasetcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md).

docs/hub/_toctree.yml

Lines changed: 9 additions & 6 deletions

@@ -91,16 +91,12 @@
 - local: model-cards
   title: Model Cards
   sections:
-    - local: model-card-annotated
-      title: Annotated Model Card
     - local: model-cards-co2
       title: Carbon Emissions
-    - local: model-card-guidebook
-      title: Model Card Guidebook
-    - local: model-card-landscape-analysis
-      title: Landscape Analysis
     - local: model-cards-components
       title: Card Components
+    - local: eval-results
+      title: Eval Results
 - local: models-gated
   title: Gated Models
 - local: models-uploading
@@ -185,6 +181,13 @@
   title: Local Apps
 - local: models-faq
   title: Frequently Asked Questions
+  sections:
+    - local: model-card-annotated
+      title: Annotated Model Card
+    - local: model-card-guidebook
+      title: Model Card Guidebook
+    - local: model-card-landscape-analysis
+      title: Model Card Landscape
 - local: models-advanced
   title: Advanced Topics
   sections:

docs/hub/eval-results.md

Lines changed: 85 additions & 0 deletions (new file)

# Evaluation Results

> [!WARNING]
> This feature is a work in progress.

The Hub provides a decentralized system for tracking model evaluation results. Benchmark datasets host leaderboards, and model repos store evaluation scores that automatically appear on both the model page and the benchmark's leaderboard.

## Benchmark Datasets

Dataset repos can be defined as **Benchmarks** (e.g., [AIME](https://huggingface.co/datasets/aime-ai/aime), [HLE](https://huggingface.co/datasets/cais/hle), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa)). These repos display a "Benchmark" tag, automatically aggregate evaluation results from model repos across the Hub, and display a leaderboard of top models.

![Benchmark Dataset](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/benchmark-preview.png)

### Registering a Benchmark

To register your dataset as a benchmark:

1. Create a dataset repo containing your evaluation data.
2. Add an `eval.yaml` file to the repo root with your benchmark configuration.
3. The file is validated at push time.
4. (**Beta**) Get in touch so we can add it to the allow-list.

The `eval.yaml` format is based on [Inspect AI](https://inspect.aisi.org.uk/), enabling reproducible evaluations. See the [Evaluating models with Inspect](https://huggingface.co/docs/inference-providers/guides/evaluation-inspect-ai) guide for details on running evaluations.

<!-- TODO: Add example of eval.yaml file -->
## Model Evaluation Results

Evaluation scores are stored in model repos as YAML files in the `.eval_results/` folder. These results:

- Appear on the model page with links to the benchmark leaderboard
- Are aggregated into the benchmark dataset's leaderboards
- Can be submitted via PRs and marked as "community-provided"

![Model Evaluation Results](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/eval-results-previw.png)

### Adding Evaluation Results

To add evaluation results to a model, you can submit a PR to the model repo with a YAML file in the `.eval_results/` folder.

Create a YAML file in `.eval_results/*.yaml` in your model repo:

```yaml
- dataset:
    id: cais/hle # Required. Hub dataset ID (must be a Benchmark)
    task_id: default # Optional, in case there are multiple tasks or leaderboards for this dataset.
    revision: <hash> # Optional. Dataset revision hash
  value: 20.90 # Required. Metric value
  verifyToken: <token> # Optional. Cryptographic proof of auditable evaluation
  date: 2025-01-15T10:30:00Z # Optional. ISO-8601 datetime (defaults to git commit time)
  source: # Optional. Attribution for the result
    url: https://huggingface.co/datasets/cais/hle # Required if source provided
    name: CAIS HLE # Optional. Display name
    user: cais # Optional. HF username/org
```

Or, with only the required attributes:

```yaml
- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
```
Results display badges based on their metadata in the YAML file:

| Badge | Condition |
|-------|-----------|
| verified | A `verifyToken` is valid (evaluation ran in HF Jobs with inspect-ai) |
| community-provided | Result submitted via open PR (not merged to main) |
| leaderboard | Links to the benchmark dataset |
| source | Links to evaluation logs or external source |

For more details on how to format this data, check out the [Eval Results](https://github.com/huggingface/hub-docs/blob/main/eval_results.yaml) specification.
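Before opening a PR, a results file can be sanity-checked against the required/optional fields described above. Below is a minimal sketch assuming PyYAML is installed; `check_results` is a hypothetical helper written for this doc, not a Hub or `huggingface_hub` API:

```python
# Sanity-check an .eval_results/*.yaml file against the documented schema:
# dataset.id and value are required; source.url is required if source is given.
# Hypothetical helper for illustration -- not an official Hub validator.
import yaml  # pip install pyyaml


def check_results(text: str) -> list[str]:
    """Return a list of human-readable problems (empty list = looks OK)."""
    problems = []
    entries = yaml.safe_load(text)
    if not isinstance(entries, list):
        return ["top level must be a list of results"]
    for i, entry in enumerate(entries):
        dataset = entry.get("dataset") or {}
        if not dataset.get("id"):
            problems.append(f"entry {i}: missing required dataset.id")
        if "value" not in entry:
            problems.append(f"entry {i}: missing required value")
        source = entry.get("source")
        if source is not None and not source.get("url"):
            problems.append(f"entry {i}: source given without required url")
    return problems


minimal = """
- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
"""
print(check_results(minimal))  # []
```

This mirrors the push-time checks described above, but locally, so problems surface before the PR is opened.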
### Community Contributions

Anyone can submit evaluation results to any model via Pull Request:

1. Go to the model page, click on the "Community" tab, and open a Pull Request.
2. Add a `.eval_results/*.yaml` file with your results.
3. The PR will show as "community-provided" on the model page while open.

For help evaluating a model, see the [Evaluating models with Inspect](https://huggingface.co/docs/inference-providers/guides/evaluation-inspect-ai) guide.

docs/hub/model-cards-co2.md

Lines changed: 1 addition & 1 deletion

@@ -36,7 +36,7 @@ The math is pretty simple! ➕
 
 First, you take the *carbon intensity* of the electric grid used for the training -- this is how much CO<sub>2</sub> is produced per kWh of electricity used. The carbon intensity depends on the location of the hardware and the [energy mix](https://electricitymap.org/) used at that location -- whether it's renewable energy like solar 🌞, wind 🌬️ and hydro 💧, or non-renewable energy like coal ⚫ and natural gas 💨. The more renewable energy gets used for training, the less carbon-intensive it is!
 
-Then, you take the power consumption of the GPU during training using the `pynvml` library.
+Then, you take the power consumption of the GPUs during training using the `pynvml` library.
 
 Finally, you multiply the power consumption and carbon intensity by the training time of the model, and you have an estimate of the CO<sub>2</sub> emission.
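The estimate described in this diff boils down to a single multiplication: energy used (kWh) times grid carbon intensity (kg CO<sub>2</sub>/kWh). Here is an illustrative back-of-the-envelope sketch with made-up numbers, not the Hub's actual tooling; in practice the power draw would be sampled with `pynvml` (`nvmlDeviceGetPowerUsage`, which reports milliwatts):

```python
# Back-of-the-envelope CO2 estimate: power x time x carbon intensity.
# In real tracking, GPU power draw is sampled with pynvml
# (nvmlDeviceGetPowerUsage returns milliwatts); here we plug in example numbers.

def co2_kg(avg_power_watts: float, hours: float, intensity_kg_per_kwh: float) -> float:
    """CO2 emitted (kg) = energy used (kWh) * grid carbon intensity (kg CO2 / kWh)."""
    energy_kwh = avg_power_watts / 1000.0 * hours
    return energy_kwh * intensity_kg_per_kwh


# Example: one 300 W GPU for 100 hours on a grid emitting 0.4 kg CO2 per kWh.
print(co2_kg(300, 100, 0.4))  # 12.0 kg CO2
```

For multiple GPUs, sum (or average and multiply) the per-device draw before converting to kWh.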

docs/hub/model-cards.md

Lines changed: 10 additions & 10 deletions

@@ -30,7 +30,7 @@ The metadata you add to the model card supports discovery and easier use of your
 * Displaying the model's license.
 * Adding datasets to the metadata will add a message reading `Datasets used to train:` to your model page and link the relevant datasets, if they're available on the Hub.
 
-Dataset, metric, and language identifiers are those listed on the [Datasets](https://huggingface.co/datasets), [Metrics](https://huggingface.co/metrics) and [Languages](https://huggingface.co/languages) pages.
+Dataset and language identifiers are those listed on the [Datasets](https://huggingface.co/datasets) and [Languages](https://huggingface.co/languages) pages.
 
 
 ### Adding metadata to your model card
@@ -72,9 +72,6 @@ license: "any valid license identifier"
 datasets:
 - dataset1
 - dataset2
-metrics:
-- metric1
-- metric2
 base_model: "base model Hub identifier"
 ---
 ```
@@ -101,7 +98,7 @@
 If it's not specified, the Hub will try to automatically detect the library type. However, this approach is discouraged, and repo creators should use the explicit `library_name` as much as possible.
 
 1. By looking into the presence of files such as `*.nemo` or `*.mlmodel`, the Hub can determine if a model is from NeMo or CoreML.
-2. In the past, if nothing was detected and there was a `config.json` file, it was assumed the library was `transformers`. For model repos created after August 2024, this is not the case anymore – so you need to `library_name: transformers` explicitly.
+2. In the past, if nothing was detected and there was a `config.json` file, it was assumed the library was `transformers`. For model repos created after August 2024, this is not the case anymore, so you need to set `library_name: transformers` explicitly.
 
 ### Specifying a base model
 
@@ -181,8 +178,8 @@ You can specify the datasets used to train your model in the model card metadata
 
 ```yaml
 datasets:
-- imdb
-- HuggingFaceH4/no_robots
+- stanfordnlp/imdb
+- HuggingFaceFW/fineweb
 ```
 
 ### Specifying a task (`pipeline_tag`)
@@ -217,9 +214,12 @@ You can specify your **model's evaluation results** in a structured way in the m
 <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/eval-results-v2-dark.png"/>
 </div>
 
-The metadata spec was based on Papers with code's [model-index specification](https://github.com/paperswithcode/model-index). This allow us to directly index the results into Papers with code's leaderboards when appropriate. You can also link the source from where the eval results has been computed.
+The initial metadata spec was based on Papers with code's [model-index specification](https://github.com/paperswithcode/model-index). This allowed us to directly index the results into Papers with code's leaderboards when appropriate. You could also link the source from which the eval results were computed.
 
-Here is a partial example to describe [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)'s score on the ARC benchmark. The result comes from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) which is defined as the `source`:
+> [!TIP]
+> NEW: We have a new, simpler metadata format for eval results. Check it out in [the dedicated doc page](./eval-results).
+
+Here is a partial example of a model-index describing [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)'s score on the ARC benchmark. The result came from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which is defined as the `source`:
 
 ```yaml
 ---
@@ -263,7 +263,7 @@ Read more about Paper pages [here](./paper-pages).
 
 ## Model Card text
 
-Details on how to fill out a human-readable model card without Hub-specific metadata (so that it may be printed out, cut+pasted, etc.) is available in the [Annotated Model Card](./model-card-annotated).
+Details on how to fill out the human-readable portion of the model card (so that it may be printed out, cut+pasted, etc.) are available in the [Annotated Model Card](./model-card-annotated).
 
 ## FAQ
docs/hub/models.md

Lines changed: 4 additions & 3 deletions

@@ -7,13 +7,14 @@ The Hugging Face Hub hosts many models for a [variety of machine learning tasks]
 - [The Model Hub](./models-the-hub)
 - [Model Cards](./model-cards)
 - [CO<sub>2</sub> emissions](./model-cards-co2)
-- [Gated models](./models-gated)
-- [Libraries](./models-libraries)
+- [Eval Results](./eval-results)
+- [Gated models](./models-gated)
 - [Uploading Models](./models-uploading)
 - [Downloading Models](./models-downloading)
+- [Libraries](./models-libraries)
 - [Widgets](./models-widgets)
 - [Widget Examples](./models-widgets-examples)
-- [Inference API](./models-inference)
+- [Model Inference](./models-inference)
 - [Frequently Asked Questions](./models-faq)
 - [Advanced Topics](./models-advanced)
 - [Integrating libraries with the Hub](./models-adding-libraries)

eval_results.yaml

Lines changed: 27 additions & 0 deletions (new file)

- dataset:
    id: cais/hle # Required. A valid dataset id from the Hub, which should have a "Benchmark" tag.
    # ^Basically, this is where the leaderboard lives.
    task_id: {task_id} # Optional, in case there are multiple tasks or leaderboards for this dataset.
    # It is defined in the benchmark dataset's eval.yaml file. Example: gpqa_diamond
    # It can usually be a dataset config (aka subset) or split name.
    revision: {dataset_revision} # Optional. Example: 5503434ddd753f426f4b38109466949a1217c2bb

  value: {metric_value} # Required. Example: 20.90

  verifyToken: {verify_token} # Optional. If present, this is a signature that can be used to prove that evaluation is provably auditable and reproducible.
  # (For example, was run in a HF Job using inspect-ai or lighteval)

  date: {date} # Optional. When was this eval run (ISO-8601 datetime). If not provided, can default to this file creation time in git.

  source: # Optional. The source for this result, for instance a dataset repo.
    url: {source_url} # Required if source is provided. A link to the source. Example: https://huggingface.co/spaces/SaylorTwift/smollm3-mmlu-pro.
    name: {source_name} # Optional. The name of the source. Example: Eval Logs.
    user: {username} # Optional. A HF user or org name.

# or, with only the required attributes:

- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412

modelcard.md

Lines changed: 3 additions & 3 deletions

@@ -1,6 +1,5 @@
 ---
 # Example metadata to be added to a model card.
-# Full model card template at https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md
 language:
 - {lang_0} # Example: fr
 - {lang_1} # Example: en
@@ -20,6 +19,7 @@ metrics:
 base_model: {base_model} # Example: stabilityai/stable-diffusion-xl-base-1.0. Can also be a list (for merges)
 
 # Optional. Add this if you want to encode your eval results in a structured way.
+# There is a newer, simpler version of this metadata format in ./eval_results.yaml
 model-index:
 - name: {model_id}
   results:
@@ -48,7 +48,7 @@ model-index:
       url: {source_url} # Required if source is provided. A link to the source. Example: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
 ---
 
-This markdown file contains the spec for the modelcard metadata regarding evaluation parameters. When present, and only then, 'model-index', 'datasets' and 'license' contents will be verified when git pushing changes to your README.md file.
+This markdown file contains the spec for the modelcard metadata. Properties will be validated by the Hub when git pushing changes to your README.md file.
 Valid license identifiers can be found in [our docs](https://huggingface.co/docs/hub/repositories-licenses).
 
-For the full model card template, see: [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md).
+For a template for the human-readable portion of the model card, see: [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md).
