MLLM Evaluation Suite

Overview

This repository is a unified orchestration layer for evaluating multimodal models with both VLMEvalKit and lmms-eval.

It does not merge, fork, or reimplement either evaluation framework. Instead, both frameworks are kept as pinned Git submodules under third_party/, while this repository owns the shared launcher interface, configuration layout, container metadata, Slurm templates, logs, results, and post-processing utilities.

Goals

Provide a common launcher interface for multiple evaluation frameworks.
Make evaluation runs reproducible through explicit configs, task lists, and metadata.
Centralize TOML configs for framework-specific and shared settings.
Manage Dockerfiles used to build or document evaluation environments.
Provide Slurm job templates for batch execution.
Abstract model serving backends for vLLM, SGLang, and Hugging Face.
Keep logs and results in structured, predictable locations.
Normalize and compare outputs across evaluation tools where possible.

Repository Structure

third_party/: Git submodules for upstream evaluation frameworks.
dockerfiles/: Dockerfile definitions and build documentation.
toml/: Centralized TOML configuration files, split by framework plus shared settings.
launchers/: Shell entrypoints for local or scripted evaluation runs.
launchers/backends/: Backend adapters for vLLM, SGLang, and Hugging Face.
slurm/: Slurm templates and shared Slurm environment snippets.
task_suites/: Suite files for lmms-eval and VLMEvalKit. Pass these paths directly to --tasks.
task_lists/: Legacy plain-text task lists for ad hoc or common subsets.
cache/: Local cache root. Image-token and framework data caches are split under cache/lmms-eval/ and cache/VLMEvalKit/; shared runtime caches use common folders such as cache/hf, cache/nltk_data, cache/xdg, cache/vllm, and cache/models. Generated contents are ignored.
results/: Evaluation outputs, separated by framework. Generated contents are ignored.
logs/: Runtime logs, separated by framework. Generated contents are ignored.
scripts/: Utility scripts for result normalization, comparison, and log collection.
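
The shared cache folders listed above correspond to well-known environment variables. The following is a minimal sketch of that mapping, not the launcher's actual code: the function name `cache_env` is hypothetical, and whether the launchers export exactly these variables is an assumption.

```python
from pathlib import Path


def cache_env(repo_root):
    """Map shared cache folders to the environment variables that read them.

    Assumption: the launchers point these variables at cache/; this sketch
    only illustrates the layout described in the README.
    """
    cache = Path(repo_root) / "cache"
    return {
        "HF_HOME": str(cache / "hf"),            # Hugging Face cache root
        "NLTK_DATA": str(cache / "nltk_data"),   # NLTK data search path
        "XDG_CACHE_HOME": str(cache / "xdg"),    # generic per-user cache root
    }
```

Calling `os.environ.update(cache_env("."))` before starting a local run would keep these caches inside the repository's cache/ folder.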

Submodules

Initialize all submodules after cloning:

git submodule update --init --recursive

Update submodules to their configured branch tips:

git submodule update --remote --merge

The submodules are configured as branch-tracking submodules:

third_party/lmms-eval: github.com/swiss-ai/lmms-eval, branch apertus-1p5-eval
third_party/VLMEvalKit: github.com/swiss-ai/VLMEvalKit, branch apertus-1p5-eval

Example Usage

Run lmms-eval through the unified launcher:

bash launchers/run_eval.sh --tool lmms-eval --backend vllm --config toml/lmms-eval/apertus-vllm-lmms-eval-prod.toml --tasks task_suites/lmms-eval/visual_smoke.txt --output results/lmms-eval/example_run

Run VLMEvalKit through the unified launcher:

bash launchers/run_eval.sh --tool VLMEvalKit --backend vllm --config toml/VLMEvalKit/example.toml --tasks task_suites/VLMEvalKit/smoke.txt --output results/VLMEvalKit/example_run

Submit lmms-eval production jobs through the combined production launcher:

bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite smoke
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks task_suites/lmms-eval/visual_full.txt

Run lmms-eval audio benchmarks through the same launcher by selecting an audio suite or a specific audio task:

bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite audio-smoke
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite audio-full
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks google_fleurs

To test local audio task changes in third_party/lmms-eval, set LMMS_EVAL_DEV_PATH to the checkout, run with --submit-mode interactive, and pass vLLM and audio-tokenizer arguments after --:

export MODEL="/capstor/store/cscs/swissai/infra01/hf-checkpoints/Apertus-1p5-8B-sft-capfilter-lr6e-5-constant-innovator-fix-it23409"
export TOK="/capstor/store/cscs/swissai/infra01/MLLM/tokenizer/apertus_emu3.5_wavtok_instruct_thinking_token_fixed"
export VLLM_APERTUS_AUDIO_TOKENIZER_CODEBASE="/workspace/benchmark-audio-tokenizer"

LMMS_EVAL_DEV_PATH="$PWD/third_party/lmms-eval" \
ENABLE_WANDB=false \
bash launchers/eval.sh \
  --eval-framework lmms-eval \
  --model "$MODEL" \
  --tasks fleurs_en_us \
  --submit-mode interactive \
  -- \
  --tokenizer-path "$TOK" \
  --gpu-memory-utilization 0.75 \
  --trust-remote-code True \
  --extra-model-args 'allowed_local_media_path=/,limit_mm_per_prompt={"audio":1,"image":1},mm_processor_kwargs={"apertus_audio_tokenizer_path":"/capstor/store/cscs/swissai/infra01/MLLM/wavtokenizer"}'
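
The --extra-model-args value is a comma-separated list of key=value pairs in which nested settings are written as compact JSON. Quoting it by hand is error-prone, so here is a small sketch that builds the same string from a dictionary. The helper name `format_model_args` is hypothetical and not part of the launcher.

```python
import json


def format_model_args(args):
    """Join key=value pairs with commas; dict values become compact JSON."""
    parts = []
    for key, value in args.items():
        if isinstance(value, dict):
            # Compact separators avoid spaces inside the shell argument.
            value = json.dumps(value, separators=(",", ":"))
        parts.append(f"{key}={value}")
    return ",".join(parts)


extra_model_args = format_model_args({
    "allowed_local_media_path": "/",
    "limit_mm_per_prompt": {"audio": 1, "image": 1},
    "mm_processor_kwargs": {
        "apertus_audio_tokenizer_path": "/capstor/store/cscs/swissai/infra01/MLLM/wavtokenizer",
    },
})
print(extra_model_args)
```

The printed string can be pasted, single-quoted, after --extra-model-args.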

The lmms-eval launcher automatically uses $TOK/chat_template.jinja when TOK is the default Apertus tokenizer path. For any other tokenizer, pass the template explicitly after --:

bash launchers/eval.sh --eval-framework lmms-eval --model "$MODEL" --tasks google_fleurs -- \
  --tokenizer-path "$TOK" \
  --chat-template "$TOK/chat_template.jinja"

Submit VLMEvalKit production jobs through the combined production launcher:

bash launchers/eval.sh --eval-framework VLMEvalKit --suite smoke --model Apertus-1p5-8B
bash launchers/eval.sh --eval-framework VLMEvalKit --tasks task_suites/VLMEvalKit/full.txt --model Apertus-1p5-8B

Each production launcher call creates one run directory, with the same name, under both the framework's results folder and its logs folder. All per-task Slurm jobs submitted by that call write their results and logs into that run directory. Override RUN_ID to choose the directory name explicitly.
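
The run directory layout can be sketched as follows. This is an illustration under stated assumptions, not the launcher's code: the function name `run_dirs` is hypothetical, RUN_ID is assumed to be read from the environment, and the timestamp fallback is invented for the example, so the real default name may differ.

```python
import os
from datetime import datetime
from pathlib import Path


def run_dirs(repo_root, framework, run_id=None):
    """Return the (results, logs) directories that share one run ID."""
    # Hypothetical default: a timestamp. The real launcher may name runs differently.
    run_id = run_id or os.environ.get("RUN_ID") or datetime.now().strftime("%Y%m%d_%H%M%S")
    root = Path(repo_root)
    return root / "results" / framework / run_id, root / "logs" / framework / run_id


results_dir, logs_dir = run_dirs(".", "lmms-eval", run_id="example_run")
print(results_dir, logs_dir)
```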

Image-token cache defaults are framework-specific and persistent under cache/lmms-eval/ or cache/VLMEvalKit/. Jobs use the shared cache directly with local copy disabled, preload enabled, read access enabled, and write-misses enabled. lmms-eval defaults to --mode fill.

The default batch size is 512 for both production launchers unless overridden with --batch-size after -- or via framework-specific environment variables.

The combined launcher prefetches BAAI/Emu3.5-VisionTokenizer into cache/models/BAAI/Emu3.5-VisionTokenizer before it submits jobs, so the tokenizer files are present before evaluation starts.

Suites

Suite files live under task_suites/ and can be passed directly to the launchers with --tasks.

task_suites/lmms-eval/visual_smoke.txt: gqa,mmstar,pope
task_suites/lmms-eval/visual_full.txt: full lmms-eval visual evaluation suite.
task_suites/lmms-eval/audio_smoke.txt: fleurs
task_suites/lmms-eval/audio_full.txt: full lmms-eval audio evaluation suite.
task_suites/lmms-eval/audio_llm_eval.txt: audio tasks intended for LLM-eval style runs.
task_suites/VLMEvalKit/: suite files copied from the VLMEvalKit Apertus vLLM scripts.
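
For scripting around suite files, the sketch below reads task names from suite text. It assumes the format suggested by visual_smoke.txt, that is, task names separated by commas, and additionally tolerates one name per line. The function name `parse_suite` is hypothetical, and the launchers may parse these files differently.

```python
def parse_suite(text):
    """Return task names from suite text, in order and without duplicates."""
    tasks = []
    for line in text.splitlines():
        for name in line.split(","):
            name = name.strip()
            if name and name not in tasks:
                tasks.append(name)
    return tasks


print(parse_suite("gqa,mmstar,pope\n"))
```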

Recommended Run Metadata

Every run should write a run_meta.json file to its output directory with the following fields:

tool
tool_commit
launcher_commit
model
backend
config
tasks
container_image
slurm_job_id
date
output_dir
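
A minimal sketch of writing this file is shown below. The field names come from the list above. The function name `write_run_meta`, the ISO 8601 default for date, and reading the job ID from Slurm's SLURM_JOB_ID environment variable are assumptions made for illustration.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path

# Field names taken from the recommended metadata list.
RUN_META_FIELDS = (
    "tool", "tool_commit", "launcher_commit", "model", "backend",
    "config", "tasks", "container_image", "slurm_job_id", "date", "output_dir",
)


def write_run_meta(output_dir, **fields):
    """Write run_meta.json into output_dir and return the metadata dict."""
    out = Path(output_dir)
    meta = dict(fields)
    # Assumed defaults: current UTC time, and the job ID exported by Slurm (None outside a job).
    meta.setdefault("date", datetime.now(timezone.utc).isoformat())
    meta.setdefault("slurm_job_id", os.environ.get("SLURM_JOB_ID"))
    meta["output_dir"] = str(out)
    missing = [key for key in RUN_META_FIELDS if key not in meta]
    if missing:
        raise ValueError(f"missing run metadata fields: {missing}")
    out.mkdir(parents=True, exist_ok=True)
    (out / "run_meta.json").write_text(json.dumps(meta, indent=2, sort_keys=True) + "\n")
    return meta
```

Raising on missing fields keeps incomplete metadata from being written silently, which matters when runs are compared later.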

Development Notes

Changes to evaluation framework code should happen inside the corresponding submodule branch. This repository should own configs, launchers, containers, logs, results, Slurm templates, and utility scripts.

Framework-specific production launchers under launchers/lmms-eval/ and launchers/VLMEvalKit/ expect ORCH_REPO_ROOT to be set by launchers/eval.sh.

The production runtime expects lmms-eval and VLMEvalKit to be available under /workspace inside the job container. If you need a custom implementation, make the changes inside the matching third_party/ checkout, install or update that version from third_party/, and use --submit-mode interactive so the job runs with the current shell and node allocation.

For local third_party/lmms-eval changes to be used by the lmms-eval Slurm wrapper, set LMMS_EVAL_DEV_PATH to that checkout. Otherwise, the job prepends /workspace/lmms-eval to PYTHONPATH, and your local task or model changes may not be visible:

LMMS_EVAL_DEV_PATH="$PWD/third_party/lmms-eval" \
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks google_fleurs --submit-mode interactive

You can sanity-check task registration before launching a full run, but this requires python to be available in the active environment:

PYTHONPATH="$PWD/third_party/lmms-eval:/workspace/lmms-eval:${PYTHONPATH:-}" \
python -m lmms_eval --tasks list | grep google_fleurs
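
Because two copies of lmms-eval can be on PYTHONPATH at once, it can also help to check which copy Python would actually import. The sketch below uses only the standard library. The helper name `package_origin` is hypothetical, and the expectation that a dev checkout resolves to a path under third_party/lmms-eval is an assumption about your environment.

```python
import importlib.util


def package_origin(name):
    """Return the file Python would load for a top-level package, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Prints None when lmms_eval is not importable in the active environment.
print(package_origin("lmms_eval"))
```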

To install the custom checkouts from this repository directly, run the following from the repository root. Each install runs in a subshell, so the working directory is restored afterwards:

(cd third_party/lmms-eval && uv pip install --python /opt/venv/bin/python --no-build-isolation --editable . ".[all]")

(cd third_party/VLMEvalKit && uv pip install --python /opt/venv/bin/python --no-deps --editable .)

The combined launcher accepts common top-level arguments such as --model, --tasks, --suite, --mode, --submit-mode, and --run-id. Framework-specific options can be passed after --.
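
The -- convention splits a command line into launcher arguments and pass-through arguments. The sketch below shows that split in isolation. The function name `split_passthrough` is hypothetical, and the launcher itself is a shell script that may handle this differently.

```python
def split_passthrough(argv):
    """Split argv at the first "--" into (launcher_args, framework_args)."""
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    return list(argv), []


launcher_args, framework_args = split_passthrough(
    ["--eval-framework", "lmms-eval", "--tasks", "google_fleurs", "--", "--batch-size", "64"]
)
print(launcher_args, framework_args)
```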

Use --submit-mode interactive with either framework when you want the launcher to run the job script directly with bash on the current node allocation instead of submitting a new Slurm job.
