This repository is a unified orchestration layer for evaluating multimodal models with both VLMEvalKit and lmms-eval.
It does not merge, fork, or reimplement either evaluation framework. Instead, both frameworks are kept as pinned Git submodules under third_party/, while this repository owns the shared launcher interface, configuration layout, container metadata, Slurm templates, logs, results, and post-processing utilities.
- Provide a common launcher interface for multiple evaluation frameworks.
- Make evaluation runs reproducible through explicit configs, task lists, and metadata.
- Centralize TOML configs for framework-specific and shared settings.
- Manage Dockerfiles used to build or document evaluation environments.
- Provide Slurm job templates for batch execution.
- Abstract model serving backends for vLLM, SGLang, and Hugging Face.
- Keep logs and results in structured, predictable locations.
- Normalize and compare outputs across evaluation tools where possible.
- third_party/: Git submodules for upstream evaluation frameworks.
- dockerfiles/: Dockerfile definitions and build documentation.
- toml/: Centralized TOML configuration files, split by framework plus shared settings.
- launchers/: Shell entrypoints for local or scripted evaluation runs.
- launchers/backends/: Backend adapters for vLLM, SGLang, and Hugging Face.
- slurm/: Slurm templates and shared Slurm environment snippets.
- task_suites/: Suite files for lmms-eval and VLMEvalKit. Pass these paths directly to --tasks.
- task_lists/: Legacy plain-text task lists for ad hoc or common subsets.
- cache/: Local cache root. Image-token and framework data caches are split under cache/lmms-eval/ and cache/VLMEvalKit/; shared runtime caches use common folders such as cache/hf, cache/nltk_data, cache/xdg, cache/vllm, and cache/models. Generated contents are ignored.
- results/: Evaluation outputs, separated by framework. Generated contents are ignored.
- logs/: Runtime logs, separated by framework. Generated contents are ignored.
- scripts/: Utility scripts for result normalization, comparison, and log collection.
Initialize all submodules after cloning:
git submodule update --init --recursive
Update submodules to their configured branch tips:
git submodule update --remote --merge
The submodules are configured as branch-tracking submodules:
- third_party/lmms-eval: github.com/swiss-ai/lmms-eval, branch apertus-1p5-eval
- third_party/VLMEvalKit: github.com/swiss-ai/VLMEvalKit, branch apertus-1p5-eval
Run lmms-eval through the unified launcher:
bash launchers/run_eval.sh --tool lmms-eval --backend vllm --config toml/lmms-eval/apertus-vllm-lmms-eval-prod.toml --tasks task_suites/lmms-eval/visual_smoke.txt --output results/lmms-eval/example_run
Run VLMEvalKit through the unified launcher:
bash launchers/run_eval.sh --tool VLMEvalKit --backend vllm --config toml/VLMEvalKit/example.toml --tasks task_suites/VLMEvalKit/smoke.txt --output results/VLMEvalKit/example_run
Submit lmms-eval production jobs through the combined production launcher:
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite smoke
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks task_suites/lmms-eval/visual_full.txt
Run lmms-eval audio benchmarks through the same launcher by selecting an audio suite or a specific audio task:
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite audio-smoke
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --suite audio-full
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks google_fleurs
For local audio task changes in third_party/lmms-eval, run interactively with LMMS_EVAL_DEV_PATH and pass vLLM/audio-tokenizer args after --:
export MODEL="/capstor/store/cscs/swissai/infra01/hf-checkpoints/Apertus-1p5-8B-sft-capfilter-lr6e-5-constant-innovator-fix-it23409"
export TOK="/capstor/store/cscs/swissai/infra01/MLLM/tokenizer/apertus_emu3.5_wavtok_instruct_thinking_token_fixed"
export VLLM_APERTUS_AUDIO_TOKENIZER_CODEBASE="/workspace/benchmark-audio-tokenizer"
LMMS_EVAL_DEV_PATH="$PWD/third_party/lmms-eval" \
ENABLE_WANDB=false \
bash launchers/eval.sh \
--eval-framework lmms-eval \
--model "$MODEL" \
--tasks fleurs_en_us \
--submit-mode interactive \
-- \
--tokenizer-path "$TOK" \
--gpu-memory-utilization 0.75 \
--trust-remote-code True \
--extra-model-args 'allowed_local_media_path=/,limit_mm_per_prompt={"audio":1,"image":1},mm_processor_kwargs={"apertus_audio_tokenizer_path":"/capstor/store/cscs/swissai/infra01/MLLM/wavtokenizer"}'
The lmms-eval launcher automatically uses $TOK/chat_template.jinja when TOK is the default Apertus tokenizer path. For any other tokenizer, pass the template explicitly after --:
bash launchers/eval.sh --eval-framework lmms-eval --model "$MODEL" --tasks google_fleurs -- \
--tokenizer-path "$TOK" \
--chat-template "$TOK/chat_template.jinja"
Submit VLMEvalKit production jobs through the combined production launcher:
bash launchers/eval.sh --eval-framework VLMEvalKit --suite smoke --model Apertus-1p5-8B
bash launchers/eval.sh --eval-framework VLMEvalKit --tasks task_suites/VLMEvalKit/full.txt --model Apertus-1p5-8B
Each production launcher call creates one shared run directory under both the framework results and logs folders. All per-task Slurm jobs submitted by that call write into that same result/log run directory. Override RUN_ID to choose the directory name explicitly.
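The shared run directory described above could be derived roughly as follows. This is a minimal sketch, not the launcher's actual code: the helper name run_dirs and the timestamp fallback are assumptions, and only the RUN_ID override and the results/logs split by framework come from this README.

```shell
#!/usr/bin/env bash
# Sketch only: run_dirs and the timestamp format are hypothetical.

# Print the results and logs run directories for one launcher call.
# Usage: run_dirs <framework>
run_dirs() {
  local framework="$1"
  # Honor an explicit RUN_ID; otherwise fall back to a timestamp (assumed scheme).
  local run_id="${RUN_ID:-$(date +%Y%m%d_%H%M%S)}"
  printf '%s\n' "results/${framework}/${run_id}" "logs/${framework}/${run_id}"
}

# Prints results/lmms-eval/my_run and logs/lmms-eval/my_run, one per line.
RUN_ID=my_run run_dirs lmms-eval
```

Because both paths share one run ID, every per-task job submitted by the same call can be pointed at the same pair of directories.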
Image-token cache defaults are framework-specific and persistent under cache/lmms-eval/ or cache/VLMEvalKit/. Jobs use the shared cache directly with local copy disabled, preload enabled, read access enabled, and write-misses enabled. lmms-eval defaults to --mode fill.
The default batch size is 512 for both production launchers unless overridden with --batch-size after -- or via framework-specific environment variables.
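One way such a default could be resolved is sketched below. The environment variable name EVAL_BATCH_SIZE is hypothetical (this README does not name the framework-specific variables), and the precedence of flag over environment over default is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: EVAL_BATCH_SIZE is a hypothetical variable name.

# Resolve the batch size from the framework args passed after "--".
# Usage: resolve_batch_size [framework args...]
resolve_batch_size() {
  # Start from the environment override, else the documented default of 512.
  local size="${EVAL_BATCH_SIZE:-512}"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --batch-size)
        # Take the following argument as the value when one is present.
        if [ "$#" -ge 2 ]; then size="$2"; shift; fi
        ;;
      --batch-size=*)
        size="${1#--batch-size=}"
        ;;
    esac
    shift
  done
  printf '%s\n' "$size"
}

# Prints 128: an explicit flag wins over the default.
resolve_batch_size --batch-size 128
```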
The combined launcher prefetches BAAI/Emu3.5-VisionTokenizer into cache/models/BAAI/Emu3.5-VisionTokenizer before it submits jobs, so the tokenizer files are present before evaluation starts.
Suite files live under task_suites/ and can be passed directly to the launchers with --tasks.
- task_suites/lmms-eval/visual_smoke.txt: gqa, mmstar, pope
- task_suites/lmms-eval/visual_full.txt: full lmms-eval visual evaluation suite.
- task_suites/lmms-eval/audio_smoke.txt: fleurs
- task_suites/lmms-eval/audio_full.txt: full lmms-eval audio evaluation suite.
- task_suites/lmms-eval/audio_llm_eval.txt: audio tasks intended for LLM-eval style runs.
- task_suites/VLMEvalKit/: suite files copied from the VLMEvalKit Apertus vLLM scripts.
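Since --tasks accepts either a suite file path or task names, a launcher has to tell the two apart. The sketch below shows one possible approach. The helper name expand_tasks is hypothetical, and the assumed file format (task names separated by commas or newlines) is not specified in this README.

```shell
#!/usr/bin/env bash
# Sketch only: expand_tasks and the suite file format are assumptions.

# Print one task name per line for a suite file or a literal task list.
# Usage: expand_tasks <suite-file-or-task-names>
expand_tasks() {
  local spec="$1"
  if [ -f "$spec" ]; then
    # Existing file: treat it as a suite file and split commas into lines.
    tr ',' '\n' < "$spec"
  else
    # Otherwise treat the argument itself as a comma-separated task list.
    printf '%s\n' "$spec" | tr ',' '\n'
  fi | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Prints google_fleurs, because no file with that name exists here.
expand_tasks google_fleurs
```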
Every run should create a run_meta.json in its output directory with:
- tool
- tool_commit
- launcher_commit
- model
- backend
- config
- tasks
- container_image
- slurm_job_id
- date
- output_dir
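A minimal sketch of writing such a file is shown below. The function name write_run_meta and the TOOL_COMMIT, LAUNCHER_COMMIT, and CONTAINER_IMAGE variables are hypothetical; SLURM_JOB_ID is the variable Slurm sets inside a job. The value types (for example, tasks as a single string) are assumptions, and values containing quotes or backslashes would need proper JSON escaping.

```shell
#!/usr/bin/env bash
# Sketch only: helper and variable names other than SLURM_JOB_ID are hypothetical.

# Write run_meta.json with the fields listed above.
# Usage: write_run_meta <output_dir> <tool> <backend> <model> <config> <tasks>
write_run_meta() {
  local output_dir="$1" tool="$2" backend="$3" model="$4" config="$5" tasks="$6"
  mkdir -p "$output_dir"
  # Unquoted heredoc delimiter so the variables below are expanded.
  cat > "$output_dir/run_meta.json" <<EOF
{
  "tool": "$tool",
  "tool_commit": "${TOOL_COMMIT:-unknown}",
  "launcher_commit": "${LAUNCHER_COMMIT:-unknown}",
  "model": "$model",
  "backend": "$backend",
  "config": "$config",
  "tasks": "$tasks",
  "container_image": "${CONTAINER_IMAGE:-unknown}",
  "slurm_job_id": "${SLURM_JOB_ID:-none}",
  "date": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "output_dir": "$output_dir"
}
EOF
}
```

In a real run, the commit values could be filled from git rev-parse HEAD in the submodule and in this repository before calling the function.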
Changes to evaluation framework code should happen inside the corresponding submodule branch. This repository should own configs, launchers, containers, logs, results, Slurm templates, and utility scripts.
Framework-specific production launchers under launchers/lmms-eval/ and launchers/VLMEvalKit/ expect ORCH_REPO_ROOT to be set by launchers/eval.sh.
The production runtime expects lmms-eval and VLMEvalKit to be available under /workspace inside the job container. If you need a custom implementation, make the changes inside the matching third_party/ checkout, install or update that version from third_party/, and use --submit-mode interactive so the job runs with the current shell and node allocation.
For local third_party/lmms-eval changes to be used by the lmms-eval Slurm wrapper, set LMMS_EVAL_DEV_PATH to this checkout. Otherwise the job prepends /workspace/lmms-eval to PYTHONPATH and your local task/model changes may not be visible:
LMMS_EVAL_DEV_PATH="$PWD/third_party/lmms-eval" \
bash launchers/eval.sh --eval-framework lmms-eval --model /path/to/model --tasks google_fleurs --submit-mode interactive
You can sanity-check task registration before launching a full run, but this requires python to be available in the active environment:
PYTHONPATH="$PWD/third_party/lmms-eval:/workspace/lmms-eval:${PYTHONPATH:-}" \
python -m lmms_eval --tasks list | grep google_fleurs
When you want to install the custom checkouts from this repository directly, use:
(cd third_party/lmms-eval && uv pip install --python /opt/venv/bin/python --no-build-isolation --editable . ".[all]")
(cd third_party/VLMEvalKit && uv pip install --python /opt/venv/bin/python --no-deps --editable .)
The combined launcher accepts common top-level arguments such as --model, --tasks, --suite, --mode, --submit-mode, and --run-id. Framework-specific options can be passed after --.
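The split at -- can be illustrated with a small bash sketch. The function and array names are hypothetical, and the real launcher may parse its arguments differently.

```shell
#!/usr/bin/env bash
# Sketch only: split_args, LAUNCHER_ARGS, and FRAMEWORK_ARGS are hypothetical names.

# Split "$@" at the first "--": everything before it goes into LAUNCHER_ARGS,
# everything after it into FRAMEWORK_ARGS.
split_args() {
  LAUNCHER_ARGS=()
  FRAMEWORK_ARGS=()
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--" ]; then
      shift
      FRAMEWORK_ARGS=("$@")
      return 0
    fi
    LAUNCHER_ARGS+=("$1")
    shift
  done
}

split_args --model /path/to/model --tasks google_fleurs -- --gpu-memory-utilization 0.75
echo "launcher:  ${LAUNCHER_ARGS[*]}"
echo "framework: ${FRAMEWORK_ARGS[*]}"
```

Keeping the two groups in separate arrays preserves quoting, so a value such as the --extra-model-args string can be forwarded unchanged.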
Use --submit-mode interactive with either framework when you want the launcher to run the job script directly with bash on the current node allocation instead of submitting a new Slurm job.
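The difference between the two paths can be sketched as a simple dispatch. The helper name is hypothetical, and treating every non-interactive value as an sbatch submission is an assumption, since this section does not list the other mode names.

```shell
#!/usr/bin/env bash
# Sketch only: submit_command is a hypothetical helper.

# Print the command that would run a job script for the given submit mode.
# Usage: submit_command <submit-mode> <job-script>
submit_command() {
  local mode="$1" script="$2"
  case "$mode" in
    interactive)
      # Run the job script directly with bash on the current allocation.
      printf 'bash %s\n' "$script"
      ;;
    *)
      # Assumed default: submit the script as a new Slurm job.
      printf 'sbatch %s\n' "$script"
      ;;
  esac
}

# Prints: bash job.sh
submit_command interactive job.sh
```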