This module covers evaluation approaches for your smol model, including both standard benchmarks and domain-specific evaluation methods.
In this module we will use lighteval, a library developed at Hugging Face and integrated with the Hugging Face ecosystem. If you want to go deeper into evaluation with the authors of lighteval, check out their evaluation guidebook.
Evaluating a language model means assessing its core capabilities:
- Task Performance: How well the model performs on specific tasks like question answering, summarization, etc.
- Output Quality: Measuring factors like coherence, relevance, and factual accuracy
- Safety & Bias: Checking for harmful outputs, biases, and toxic content
- Domain Expertise: Testing specialized knowledge and capabilities in specific fields
Learn how to evaluate your model using standardized benchmarks and metrics (a minimal lighteval run is sketched after this list):
- Common benchmarks (MMLU, TruthfulQA, etc.)
- Evaluation metrics and settings
- Best practices for reproducible evaluation
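Below is a minimal sketch of a benchmark run using lighteval's Python `Pipeline` API. Import paths and argument names have shifted between lighteval releases (the model config import in particular), so treat this as a template to check against the docs for your installed version; the model name, output directory, and task choice are placeholders.

```python
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

# Version-dependent import: this path matches ~0.6-era lighteval releases.
from lighteval.models.transformers.transformers_model import TransformersModelConfig

# Where results (and optionally per-sample details) are written.
evaluation_tracker = EvaluationTracker(output_dir="./results", save_details=True)

pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.ACCELERATE,
    max_samples=10,  # cap samples while testing the setup; remove for a full run
)

model_config = TransformersModelConfig(
    pretrained="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # placeholder model
    dtype="float16",
)

# Tasks are addressed as "suite|task|num_fewshot|auto-truncate-fewshot flag".
pipeline = Pipeline(
    tasks="leaderboard|truthfulqa:mc|0|0",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)
pipeline.evaluate()
pipeline.show_results()
```

Capping `max_samples` keeps the first runs fast and cheap; once the configuration works end to end, remove the cap to get reportable scores.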
Create custom evaluation pipelines for your specific use case (a hand-rolled example follows this list):
- Designing evaluation tasks
- Implementing custom metrics
- Creating evaluation datasets
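As a library-agnostic illustration of these three steps, the sketch below hand-rolls a tiny evaluation set, a custom exact-match metric, and a generation loop using the transformers pipeline. The two records and the normalization rule are invented for the example; you would swap in your own model and data.

```python
from transformers import pipeline

# A tiny hand-written evaluation dataset (illustrative records only).
eval_set = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many continents are there?", "answer": "7"},
]

generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # placeholder: use your model
)

def exact_match(prediction: str, reference: str) -> float:
    """Custom metric: 1.0 if the normalized reference appears in the output."""
    return float(reference.strip().lower() in prediction.strip().lower())

scores = []
for example in eval_set:
    messages = [{"role": "user", "content": example["question"]}]
    output = generator(messages, max_new_tokens=64)
    # The pipeline returns the chat history; the last message is the reply.
    prediction = output[0]["generated_text"][-1]["content"]
    scores.append(exact_match(prediction, example["answer"]))

print(f"exact match: {sum(scores) / len(scores):.2f}")
```

Exact match is deliberately crude; the point is that a metric is just a function from (prediction, reference) to a score, averaged over the dataset, so you can substitute any scoring rule your domain needs.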
A complete example of building a domain-specific evaluation pipeline (an Argilla annotation sketch follows this list):
- Generate evaluation datasets
- Annotate data with Argilla
- Create standardized datasets
- Evaluate models with LightEval
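To give a feel for the annotation step, here is a sketch using Argilla's v2 SDK (which differs substantially from v1) to create an annotation dataset and log generated examples for review. The server URL, API key, dataset name, and field/question names are all hypothetical.

```python
import argilla as rg

# Hypothetical deployment: point these at your own Argilla instance.
client = rg.Argilla(
    api_url="https://your-argilla-instance.example",
    api_key="your-api-key",
)

# Define what annotators see (fields) and what they label (questions).
settings = rg.Settings(
    fields=[
        rg.TextField(name="question"),
        rg.TextField(name="model_answer"),
    ],
    questions=[
        rg.LabelQuestion(name="quality", labels=["good", "bad"]),
    ],
)

dataset = rg.Dataset(name="domain_eval", settings=settings, client=client)
dataset.create()

# Log generated question/answer pairs for human review.
dataset.records.log([
    rg.Record(fields={"question": "...", "model_answer": "..."}),
])
```

Once annotated, the records can be exported as a standardized dataset and wired into LightEval as a custom task, closing the loop from data generation to model evaluation.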