swiss-ai/MLLM-eval-suite

MLLM Evaluation Suite

Overview

This repository is a unified orchestration layer for evaluating multimodal models with both VLMEvalKit and lmms-eval.

It does not merge, fork, or reimplement either evaluation framework. Instead, both frameworks are kept as pinned Git submodules under third_party/, while this repository owns the shared launcher interface, configuration layout, container metadata, Slurm templates, logs, results, and post-processing utilities.

Goals

  • Provide a common launcher interface for multiple evaluation frameworks.
  • Make evaluation runs reproducible through explicit configs, task lists, and metadata.
  • Centralize TOML configs for framework-specific and shared settings.
  • Manage Dockerfiles used to build or document evaluation environments.
  • Provide Slurm job templates for batch execution.
  • Abstract model serving backends for vLLM, SGLang, and Hugging Face.
  • Keep logs and results in structured, predictable locations.
  • Normalize and compare outputs across evaluation tools where possible.

Repository Structure

  • third_party/: Git submodules for upstream evaluation frameworks.
  • dockerfiles/: Dockerfile definitions and build documentation.
  • toml/: Centralized TOML configuration files, split by framework plus shared settings.
  • launchers/: Shell entrypoints for local or scripted evaluation runs.
  • launchers/backends/: Backend adapters for vLLM, SGLang, and Hugging Face.
  • slurm/: Slurm templates and shared Slurm environment snippets.
  • task_suites/: Suite files for lmms-eval and VLMEvalKit. Pass these paths directly to --tasks.
  • task_lists/: Legacy plain-text task lists for ad hoc or common subsets.
  • cache/: Local cache root. Image-token and framework data caches are split under cache/lmms-eval/ and cache/VLMEvalKit/; shared runtime caches use common folders such as cache/hf, cache/nltk_data, cache/xdg, cache/vllm, and cache/models. Generated contents are ignored.
  • results/: Evaluation outputs, separated by framework. Generated contents are ignored.
  • logs/: Runtime logs, separated by framework. Generated contents are ignored.
  • scripts/: Utility scripts for result normalization, comparison, and log collection.

Submodules

Initialize all submodules after cloning:

git submodule update --init --recursive

Update submodules to their configured branch tips:

git submodule update --remote --merge

The submodules are configured as branch-tracking submodules, so the command above moves each one to the tip of its configured branch.
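To see which branch each submodule tracks, the `branch` entries in `.gitmodules` can be read directly. The following is a minimal sketch, not part of this repository: the helper name `submodule_branches` is hypothetical, and the sample URLs and branch name are placeholders rather than the real configuration.

```python
import configparser
from typing import Dict, Optional


def submodule_branches(gitmodules_text: str) -> Dict[str, Optional[str]]:
    """Map each submodule path to its configured branch (None if unset)."""
    # .gitmodules keys are usually indented; strip every line so that
    # configparser never mistakes a key for a continuation line.
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cleaned)
    return {
        parser[section]["path"]: parser[section].get("branch")
        for section in parser.sections()
        if section.startswith("submodule ")
    }


# Placeholder content for illustration only (not the real .gitmodules).
SAMPLE = """
[submodule "third_party/lmms-eval"]
    path = third_party/lmms-eval
    url = https://example.invalid/lmms-eval.git
    branch = example-branch
[submodule "third_party/VLMEvalKit"]
    path = third_party/VLMEvalKit
    url = https://example.invalid/VLMEvalKit.git
"""

if __name__ == "__main__":
    for path, branch in submodule_branches(SAMPLE).items():
        print(f"{path}: {branch or '(no branch configured)'}")
```

In a real checkout, pass the contents of the repository's `.gitmodules` file instead of `SAMPLE`.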

Example Usage

Run lmms-eval through the unified launcher:

bash launchers/run_eval.sh --tool lmms-eval --backend vllm --config toml/lmms-eval/apertus-vllm-lmms-eval-prod.toml --tasks task_suites/lmms-eval/visual_smoke.txt --output results/lmms-eval/example_run

Run VLMEvalKit through the unified launcher:

bash launchers/run_eval.sh --tool VLMEvalKit --backend vllm --config toml/VLMEvalKit/example.toml --tasks task_suites/VLMEvalKit/smoke.txt --output results/VLMEvalKit/example_run

Submit lmms-eval production jobs through the combined production launcher:

bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite smoke
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks task_suites/lmms-eval/visual_full.txt

Run lmms-eval audio benchmarks through the same launcher by selecting an audio suite or a specific audio task:

bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite audio-smoke
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite audio-full
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks google_fleurs

For local audio task changes in third_party/lmms-eval, run interactively with LMMS_EVAL_DEV_PATH and pass vLLM/audio-tokenizer args after --:

export MODEL="/capstor/store/cscs/swissai/infra01/hf-checkpoints/Apertus-1p5-8B-sft-capfilter-lr6e-5-constant-innovator-fix-it23409"
export TOK="/capstor/store/cscs/swissai/infra01/MLLM/tokenizer/apertus_emu3.5_wavtok_instruct_thinking_token_fixed"
export VLLM_APERTUS_AUDIO_TOKENIZER_CODEBASE="/workspace/benchmark-audio-tokenizer"

LMMS_EVAL_DEV_PATH="$PWD/third_party/lmms-eval" \
ENABLE_WANDB=false \
bash launchers/eval.sh \
  --eval-framework lmms-eval \
  --model "$MODEL" \
  --tasks fleurs_en_us \
  --submit-mode interactive \
  -- \
  --tokenizer-path "$TOK" \
  --gpu-memory-utilization 0.75 \
  --trust-remote-code True \
  --extra-model-args 'allowed_local_media_path=/,limit_mm_per_prompt={"audio":1,"image":1},mm_processor_kwargs={"apertus_audio_tokenizer_path":"/capstor/store/cscs/swissai/infra01/MLLM/wavtokenizer"}'

The lmms-eval launcher automatically uses $TOK/chat_template.jinja when TOK is the default Apertus tokenizer path. For any other tokenizer, pass the template explicitly after --:

bash launchers/eval.sh --eval-framework lmms-eval --model "$MODEL" --tasks google_fleurs -- \
  --tokenizer-path "$TOK" \
  --chat-template "$TOK/chat_template.jinja"

Submit VLMEvalKit production jobs through the combined production launcher:

bash launchers/eval.sh --eval-framework VLMEvalKit --suite smoke --model Apertus-1p5-8B
bash launchers/eval.sh --eval-framework VLMEvalKit --tasks task_suites/VLMEvalKit/full.txt --model Apertus-1p5-8B

Each production launcher call creates one run directory, with the same name, under both the framework's results folder and its logs folder. All per-task Slurm jobs submitted by that call write into those two directories. Override RUN_ID to choose the directory name explicitly.
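The layout described above can be sketched as follows. This is an illustration under stated assumptions, not the launcher's actual code: the function `run_dirs` is hypothetical, RUN_ID is assumed to be read from the environment, and the timestamp fallback is an invented default.

```python
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Mapping, Optional


def run_dirs(
    framework: str,
    repo_root: Path,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Path]:
    """Return the paired results/logs directories for one launcher call."""
    env = os.environ if env is None else env
    # Assumption: fall back to a UTC timestamp when RUN_ID is not set.
    run_id = env.get("RUN_ID") or datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return {
        "results": repo_root / "results" / framework / run_id,
        "logs": repo_root / "logs" / framework / run_id,
    }


if __name__ == "__main__":
    dirs = run_dirs("lmms-eval", Path("."), {"RUN_ID": "my_run"})
    for kind, path in dirs.items():
        print(kind, path)
```

Every per-task job submitted by the same call would receive the same pair of directories, which keeps results and logs for one run together.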

Image-token cache defaults are framework-specific and persist under cache/lmms-eval/ or cache/VLMEvalKit/. Jobs use the shared cache directly: local copy is disabled, while preload, read access, and write-misses are enabled. lmms-eval defaults to --mode fill.

The default batch size is 512 for both production launchers unless overridden with --batch-size after -- or via framework-specific environment variables.

The combined launcher prefetches BAAI/Emu3.5-VisionTokenizer into cache/models/BAAI/Emu3.5-VisionTokenizer before it submits jobs, so the tokenizer files are present before evaluation starts.

Suites

Suite files live under task_suites/ and can be passed directly to the launchers with --tasks.

  • task_suites/lmms-eval/visual_smoke.txt: gqa,mmstar,pope
  • task_suites/lmms-eval/visual_full.txt: full lmms-eval visual evaluation suite.
  • task_suites/lmms-eval/audio_smoke.txt: fleurs
  • task_suites/lmms-eval/audio_full.txt: full lmms-eval audio evaluation suite.
  • task_suites/lmms-eval/audio_llm_eval.txt: audio tasks intended for LLM-eval style runs.
  • task_suites/VLMEvalKit/: suite files copied from the VLMEvalKit Apertus vLLM scripts.
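Because --tasks accepts either a suite file path or a task name, a launcher has to tell the two apart. The sketch below shows one way to do that. It assumes suite files hold task names separated by commas or newlines (as the visual_smoke.txt entry suggests), and it treats `#` comments as an invented convenience; `parse_suite` and `resolve_tasks` are hypothetical names, not functions from this repository.

```python
import tempfile
from pathlib import Path
from typing import List


def parse_suite(text: str) -> List[str]:
    """Split suite text into task names, keeping first-seen order."""
    tasks: List[str] = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # assumption: '#' starts a comment
        for name in line.split(","):
            name = name.strip()
            if name and name not in tasks:
                tasks.append(name)
    return tasks


def resolve_tasks(tasks_arg: str) -> List[str]:
    """Treat the argument as a suite file if it exists, else as task names."""
    path = Path(tasks_arg)
    if path.is_file():
        return parse_suite(path.read_text(encoding="utf-8"))
    return parse_suite(tasks_arg)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        suite = Path(tmp) / "visual_smoke.txt"
        suite.write_text("gqa,mmstar,pope\n", encoding="utf-8")
        print(resolve_tasks(str(suite)))
    print(resolve_tasks("google_fleurs"))
```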

Recommended Run Metadata

Every run should create a run_meta.json in its output directory with:

  • tool
  • tool_commit
  • launcher_commit
  • model
  • backend
  • config
  • tasks
  • container_image
  • slurm_job_id
  • date
  • output_dir

Development Notes

Changes to evaluation framework code should happen inside the corresponding submodule branch. This repository should own configs, launchers, containers, logs, results, Slurm templates, and utility scripts.

Framework-specific production launchers under launchers/lmms-eval/ and launchers/VLMEvalKit/ expect ORCH_REPO_ROOT to be set by launchers/eval.sh.

The production runtime expects lmms-eval and VLMEvalKit to be available under /workspace inside the job container. If you need a custom implementation, make the changes inside the matching third_party/ checkout, install or update that version from third_party/, and use --submit-mode interactive so the job runs with the current shell and node allocation.

For local third_party/lmms-eval changes to be used by the lmms-eval Slurm wrapper, set LMMS_EVAL_DEV_PATH to this checkout. Otherwise the job prepends /workspace/lmms-eval to PYTHONPATH and your local task/model changes may not be visible:

LMMS_EVAL_DEV_PATH="$PWD/third_party/lmms-eval" \
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks google_fleurs --submit-mode interactive

You can sanity-check task registration before launching a full run, but this requires python to be available in the active environment:

PYTHONPATH="$PWD/third_party/lmms-eval:/workspace/lmms-eval:${PYTHONPATH:-}" \
python -m lmms_eval --tasks list | grep google_fleurs
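A complementary check is to confirm which checkout Python would actually import, since a package earlier on PYTHONPATH shadows later ones. The sketch below uses only the standard library; `module_origin` and `is_loaded_from` are hypothetical helper names, and the paths in the usage example simply mirror the ones mentioned in this section.

```python
import importlib.util
from pathlib import Path
from typing import Optional, Union


def module_origin(name: str) -> Optional[Path]:
    """Return the file a top-level module would be imported from, if any."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def is_loaded_from(name: str, root: Union[str, Path]) -> bool:
    """True if the module's file lives somewhere under the given directory."""
    origin = module_origin(name)
    if origin is None:
        return False
    return Path(root).resolve() in origin.parents


if __name__ == "__main__":
    origin = module_origin("lmms_eval")
    print("lmms_eval resolves to:", origin or "(not importable)")
    print("from local checkout:", is_loaded_from("lmms_eval", "third_party/lmms-eval"))
```

Run it with the same PYTHONPATH the job will use; if the second line prints False, the local changes are not the ones being imported.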

To install the custom checkouts from this repository directly, run the following from the repository root. Each install runs in a subshell so that the second cd still resolves relative to the repository root:

(cd third_party/lmms-eval && uv pip install --python /opt/venv/bin/python --no-build-isolation --editable . ".[all]")
(cd third_party/VLMEvalKit && uv pip install --python /opt/venv/bin/python --no-deps --editable .)

The combined launcher accepts common top-level arguments such as --model, --tasks, --suite, --mode, --submit-mode, and --run-id. Framework-specific options can be passed after --.
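The split between top-level and framework-specific arguments can be illustrated with a short sketch. It shows the general `--` convention rather than the launcher's own implementation (the launchers here are shell scripts), and `split_passthrough` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def split_passthrough(argv: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first '--' into launcher args and passthrough args."""
    args = list(argv)
    if "--" in args:
        i = args.index("--")
        return args[:i], args[i + 1:]
    return args, []


if __name__ == "__main__":
    launcher_args, framework_args = split_passthrough([
        "--eval-framework", "lmms-eval",
        "--model", "/path/to/model",
        "--tasks", "google_fleurs",
        "--",
        "--tokenizer-path", "/path/to/tokenizer",
    ])
    print("launcher:", launcher_args)
    print("framework:", framework_args)
```

Only the first `--` acts as the separator, so any later `--` is forwarded to the framework untouched.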

Use --submit-mode interactive with either framework when you want the launcher to run the job script directly with bash on the current node allocation instead of submitting a new Slurm job.

About

Unified orchestration repo for running VLMEvalKit and lmms-eval benchmarks with shared configs, launchers, SQSH images, logs, and result management.
