Benchmark evaluation of pretrained and fine-tuned models using OpenLLM-France's lighteval fork.
| Script | Description |
|---|---|
| `evaluate_experiment.py` | Evaluate all checkpoints of an experiment (submits a SLURM array job) |
| `auto_eval.py` | Full pipeline: convert → evaluate → plot (with SLURM job chaining) |
| `plot_results.py` | Plot and compare evaluation results across models |
| `agg_score.py` | Aggregate evaluation scores |
| `run_all_tasks.sh` | Run all benchmark suites for an experiment |
Set environment variables:
```bash
export SLURM_ACCOUNT_GPU="your_account@h100"
export SLURM_ACCOUNT_CPU="your_account@cpu"
```

Create a dedicated environment and install our lighteval fork:
```bash
module purge
module load anaconda-py3/2024.06
module load git

conda create -n eval-env python=3.10
conda activate eval-env

git clone git@github.com:OpenLLM-France/lighteval.git
cd lighteval/
pip install -e ".[multilingual,vllm]"
pip install language_data langdetect syllapy seaborn python-slugify
pip install "unbabel-comet>=2.2.0"
pip install jieba fuzzywuzzy rouge  # for LongBench evaluation

module load cuda/12.8.0  # or the CUDA version torch was compiled with (see `python -c "import torch; print(torch.version.cuda)"`)
pip install --no-cache-dir --no-build-isolation "mamba-ssm[causal-conv1d]"
```

Note: the `mamba-ssm` compilation can fail on front-end nodes due to memory limits. Use a compute node instead:

```bash
srun -p compil_h100 -c 24 --hint=nomultithread --pty -A $SLURM_ACCOUNT_CPU bash
```

If the installation still fails, try `module load gcc/14.2.0`.
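Once on the compute node, a minimal sketch of the retry (module names are taken from the commands above; adjust to your setup):

```bash
# On the compute node: reload the toolchain, then rebuild mamba-ssm
module purge
module load anaconda-py3/2024.06 cuda/12.8.0 gcc/14.2.0
conda activate eval-env
pip install --no-cache-dir --no-build-isolation "mamba-ssm[causal-conv1d]"
```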
Evaluations run on the H100 partition, which has no internet access, so pre-cache everything first:

```bash
bash preload_eval_assets.sh    # LM judges, nltk assets
bash preload_eval_datasets.sh  # benchmark datasets
bash preload_hf_models.sh      # baseline models for comparison
```

Task files in `tasks/` define which benchmarks to run:
| File | Content |
|---|---|
| `tasks/en.txt` | English benchmarks |
| `tasks/fr.txt` | French benchmarks |
| `tasks/multilingual.txt` | Multilingual benchmarks |
| `tasks/mmlu.txt` | MMLU |
| `tasks/gsm8k.txt` | GSM8K math |
| `tasks/ruler_*.txt` | RULER long-context benchmarks |
You can create custom task files — one lighteval task per line.
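For instance, a custom task file could look like the sketch below. The task identifiers and the `suite|task|num_fewshot|truncate_fewshot` format are illustrative assumptions; check the lighteval fork's task list for the exact names.

```bash
# Hypothetical custom task file (task names are illustrative)
cat > tasks/custom.txt <<'EOF'
lighteval|gsm8k|5|1
lighteval|mmlu|5|0
EOF
```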
Run evaluations with `evaluate_experiment.py`:

```bash
python evaluate_experiment.py $experiment_path tasks/en.txt

# With multilingual tasks
python evaluate_experiment.py $experiment_path tasks/fr.txt --custom_tasks multilingual

# Limit to specific checkpoint intervals
python evaluate_experiment.py $experiment_path tasks/en.txt --multiple_of 5000

# Limit the number of samples per task
python evaluate_experiment.py $experiment_path tasks/fr.txt --custom_tasks multilingual --max_samples 1000
```

The experiment path must contain a `huggingface_checkpoints/` folder.
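A rough sketch of the expected layout (the checkpoint folder names are illustrative assumptions, typically one folder per training step):

```text
$experiment_path/
├── huggingface_checkpoints/
│   ├── 5000/         # converted Hugging Face checkpoints
│   └── 10000/
└── evaluation/       # created by the scripts; results are written here
```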
Run all benchmark suites for an experiment:

```bash
bash run_all_tasks.sh $experiment_path --multiple_of 5000 --gpus 2
```

To compare against baseline Hugging Face models, download them first:

```bash
bash preload_hf_models.sh
```

Then evaluate:

```bash
python evaluate_experiment.py $experiment_path tasks/en.txt --hf_model OpenLLM-BPI/Luciole-7B
```

Results are saved in `$experiment_path/evaluation/`.
`auto_eval.py` chains conversion → evaluation → plotting in a single command:

```bash
python auto_eval.py $experiment_path --arch nemotronh --eval_type pretrain --email user@example.com
```

Evaluation types: `pretrain`, `finetune`, `ruler`, `context_extension`.
Compare models on evaluation benchmarks:
```bash
# Basic comparison
python plot_results.py $model1 $model2 $model3 --group all --output_path ./figs

# With FLOPS on the x-axis
python plot_results.py $model1 $model2 --group all --output_path ./figs --flops

# Limit samples
python plot_results.py $model1 $model2 --group all --output_path ./figs --flops --max_samples 1000
```

Plot groups: `all`, `en`, `fr`, `multilingual`, `ruler`, `finetune`.
models="\
$OpenLLM_OUTPUT/pretrain/luciole_serie/luciole_nemotron1b \
$OpenLLM_OUTPUT/pretrain/luciole_serie/luciole_nemotron1b_phase2 \
$OpenLLM_OUTPUT/pretrain/compared_models/OLMo-2-0425-1B \
$OpenLLM_OUTPUT/pretrain/compared_models/EuroLLM-1.7B \
"
python plot_results.py $models --group all \
--output_path $OpenLLM_OUTPUT/pretrain/luciole_serie/luciole_nemotron1b_phase2/figs --flopsSee ruler/README.md for RULER benchmark setup and usage.
Follow the BFCL README:

```bash
module purge
module load arch/h100
module load anaconda-py3/2024.06
module load gcc/14.2.0

# Create a new conda environment with Python 3.10
conda create -n BFCL python=3.10
conda activate BFCL

# Clone the Gorilla repository
git clone https://github.com/ShishirPatil/gorilla.git

# Change directory to `berkeley-function-call-leaderboard`
cd gorilla/berkeley-function-call-leaderboard

# Install the package in editable mode, with vLLM support for local models
pip install -e ".[oss_eval_vllm]"
pip install -U "vllm>=0.10" transformers
```

To run the benchmark:
```bash
module purge
# Load the modules for the partition you are on:
module load arch/a100 anaconda-py3/2024.06 gcc/14.2.0  # on A100
module load arch/h100 anaconda-py3/2024.06 gcc/14.2.0  # on H100
conda activate BFCL

# Jean Zay (JZ) paths
export MODEL_PATH=...            # path to the local model to evaluate
NAME=${MODEL_PATH##*/}           # model name = last path component
CATS=non_live,live
RESULT_DIR="$SCRATCH/result/${NAME}"
SCORE_DIR="$SCRATCH/score/${NAME}"
mkdir -p "$RESULT_DIR" "$SCORE_DIR"

# 1. Generate model responses (vLLM inference)
bfcl generate \
  --model linagora/nemotron3-1b-FC \
  --local-model-path "$MODEL_PATH" \
  --backend vllm \
  --num-gpus 1 \
  --gpu-memory-utilization 0.9 \
  --test-category "$CATS" \
  --result-dir "$RESULT_DIR"

# 2. Score the responses
bfcl evaluate \
  --model linagora/nemotron3-1b-FC \
  --test-category "$CATS" \
  --result-dir "$RESULT_DIR" \
  --score-dir "$SCORE_DIR"
```