NVIDIA-NeMo · Jorjeous · Mar 7, 2026 · Mar 7, 2026 · Mar 9, 2026 · melllinia
diff --git a/docs/evaluation/eval-kit.md b/docs/evaluation/eval-kit.md
@@ -0,0 +1,282 @@
+# VLMEvalKit Integration (eval_kit)
+
+This page explains how to run VLMEvalKit benchmarks through NeMo Skills using the `eval_kit` generation module. This enables evaluating Megatron multimodal models on VLMEvalKit's benchmark collection (MMBench, LibriSpeech, TedLium, etc.) without leaving the NeMo Skills pipeline.
+
+## Overview
+
+Two inference modes are available:
+
+| Mode | How it works | When to use |
+|------|-------------|-------------|
+| **mcore** | Megatron model loaded in-process via `torchrun` (no HTTP server) | Megatron checkpoints |
+| **vllm** | NeMo Skills starts a vLLM server, VLMEvalKit connects as client | HF models served by vLLM |
+
+Both modes use the same `ns eval` pipeline command. The `++model_type` flag selects the mode, and each mode takes its own model and container arguments, described in the sections below.
+
+## Prerequisites
+
+Before running eval_kit benchmarks, you need four things set up:
+
+### 1. VLMEvalKit source code (local)
+
+The `vlmeval/` directory from VLMEvalKit gets packaged and shipped to the cluster automatically. You need a local clone:
+
+```bash
+# Clone VLMEvalKit (NVIDIA internal fork with MultiModalMCore support)
+git clone <VLMEvalKitMcore-repo-url> /path/to/VLMEvalKitMcore
+```
+
+Then set the environment variable **before running any `ns eval` command**:
+
+```bash
+export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore
+```
+
+!!! important
+    This path is read **locally at submission time**. The pipeline packages the `vlmeval/` subdirectory and rsyncs it to the cluster. It does NOT need to exist on the cluster.
+
+### 2. eval_kit container on the cluster
+
+The eval_kit container must have PyTorch, Megatron, and VLMEvalKit dependencies pre-installed. The image is available in container storage.
+Add it to your cluster config:
+
+```yaml
+# cluster_configs/my_cluster.yaml
+containers:
+  eval_kit: /path/to/eval-kit-nemo-skills.sqsh
+  # ... other containers
+```
+
+### 3. Megatron path (for mcore mode)
+
+The container needs access to a Megatron-LM installation. Set it in your cluster config:
+
+```yaml
+env_vars:
+  - MEGATRON_PATH=/path/to/megatron-lm
+  - PYTHONPATH=/path/to/megatron-lm
+```
+
+And ensure the path is mounted:
+
+```yaml
+mounts:
+  - /path/to/megatron-lm:/path/to/megatron-lm
+```
+
+### 4. VLMEvalKit dataset cache (for benchmarks that download from HuggingFace)
+
+VLMEvalKit downloads benchmark data on first use. Set a persistent cache directory:
+
+```yaml
+env_vars:
+  - LMUData=/path/to/vlmevalkit_cache
+```
+
+## Running eval_kit Benchmarks
+
+### Mode 1: Megatron in-process (mcore)
+
+This is the primary mode. The model runs directly inside the `torchrun` process — no separate server.
+
+```bash
+export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore
+
+ns eval \
+    --cluster=my_cluster \
+    --output_dir=/path/to/results \
+    --benchmarks=eval_kit.LibriSpeech_test_clean \
+    --server_type=megatron \
+    --server_gpus=8 \
+    --server_container=/path/to/eval-kit-nemo-skills.sqsh \
+    ++model_type=mcore \
+    ++model_config=/path/to/config.yaml \
+    ++load_dir=/path/to/checkpoint/TP_1/
+```
+
+Key parameters:
+
+| Parameter | Purpose |
+|-----------|---------|
+| `--benchmarks=eval_kit.<dataset>` | VLMEvalKit dataset name (e.g., `LibriSpeech_test_clean`, `MMBench_DEV_EN`, `TedLium_ASR_Test`) |
+| `++model_type=mcore` | Triggers self-contained mode (no HTTP server, model loaded in-process) |
+| `++model_config=` | Path to Megatron model YAML config on the cluster |
+| `++load_dir=` | Path to Megatron checkpoint directory on the cluster |
+| `--server_gpus=8` | Number of GPUs allocated to the torchrun process |
+| `--server_container=` | Container with Megatron + VLMEvalKit dependencies |
+
+!!! note
+    `--server_gpus` controls GPU allocation even though no server is started. In mcore mode, these GPUs go directly to the `torchrun` main task.
+
+!!! note
+    `--model` is **not needed** for mcore mode — the model is specified via `++model_config` and `++load_dir`.
+
+### Mode 2: vLLM server
+
+The pipeline starts a vLLM server, and VLMEvalKit's `VLLMLocal` client connects to it.
+
+```bash
+export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore
+
+ns eval \
+    --cluster=my_cluster \
+    --output_dir=/path/to/results \
+    --benchmarks=eval_kit.MMBench_DEV_EN \
+    --model=Qwen/Qwen2-Audio-7B-Instruct \
+    --server_type=vllm \
+    --server_gpus=2 \
+    --server_container=/path/to/vllm-audio.sqsh \
+    --main_container=/path/to/eval-kit-nemo-skills.sqsh \
+    --server_args="--max-model-len 8192 --trust-remote-code" \
+    ++model_type=vllm \
+    ++model_name=qwen2-audio-7b
+```
+
+Key differences from mcore mode:
+
+| Parameter | Purpose |
+|-----------|---------|
+| `--model=` | HuggingFace model name or path (vLLM downloads/loads it) |
+| `++model_type=vllm` | VLMEvalKit uses its `VLLMLocal` client |
+| `++model_name=` | Model identifier used by VLMEvalKit for result naming |
+| `--main_container=` | Container for the eval_kit client (must have `vlmeval`). Separate from the vLLM server container |
+| `--server_container=` | Container for the vLLM server |
+
+!!! warning
+    The vLLM server container and the eval_kit client container are different. Use `--server_container` for vLLM and `--main_container` for the eval_kit client that needs `vlmeval`.
+
+## Available Benchmarks
+
+Any VLMEvalKit dataset can be used with the `eval_kit.` prefix. Examples:
+
+### Audio / ASR
+
+| Benchmark name | Dataset |
+|---|---|
+| `eval_kit.LibriSpeech_test_clean` | LibriSpeech test-clean (2,620 samples) |
+| `eval_kit.LibriSpeech_test_other` | LibriSpeech test-other |
+| `eval_kit.TedLium_ASR_Test` | TED-LIUM |
+| `eval_kit.GigaSpeech_ASR_test` | GigaSpeech |
+| `eval_kit.VoxPopuli_ASR_test` | VoxPopuli |
+| `eval_kit.AMI_ASR_Test` | AMI meeting transcription |
+| `eval_kit.SPGISpeech_ASR_test` | SPGISpeech |
+| `eval_kit.Earnings22_ASR_Test` | Earnings22 |
+
+### Vision-Language
+
+| Benchmark name | Dataset |
+|---|---|
+| `eval_kit.MMBench_DEV_EN` | MMBench English dev |
+| `eval_kit.MME` | MME perception + cognition |
+| `eval_kit.MMMU_DEV_VAL` | MMMU dev+val |
+| `eval_kit.MathVista_MINI` | MathVista mini |
+
+The full list depends on your VLMEvalKit version. Check `vlmeval/dataset/` for all supported datasets.
+
+## mcore_skills: NeMo Skills Data + Megatron In-Process
+
+For benchmarks that already have NeMo Skills JSONL data (like `asr-leaderboard`), you can use the `mcore_skills` generation type. This reads NeMo Skills data and prompts but uses MultiModalMCore for inference (no server).
+
+```bash
+export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore
+
+ns eval \
+    --cluster=my_cluster \
+    --output_dir=/path/to/results \
+    --benchmarks=asr-leaderboard \
+    --split=librispeech_clean \
+    --data_dir=/data \
+    --generation_type=mcore_skills \
+    --server_type=megatron \
+    --server_gpus=8 \
+    --server_container=/path/to/eval-kit-nemo-skills.sqsh \
+    ++model_config=/path/to/config.yaml \
+    ++load_dir=/path/to/checkpoint/TP_1/ \
+    ++tokenizer=/path/to/tokenizer
+```
+
+Key differences from eval_kit:
+
+| | eval_kit | mcore_skills |
+|---|---|---|
+| Data source | VLMEvalKit downloads from HuggingFace | NeMo Skills JSONL from `--data_dir` |
+| Prompts | VLMEvalKit's built-in prompts | NeMo Skills prompt templates |
+| Evaluation | VLMEvalKit's `dataset.evaluate()` | ASR WER via VLMEvalKit's `asr_wer()` |
+| Benchmarks | Any VLMEvalKit dataset | Any NeMo Skills benchmark with JSONL |
+| Flag | `--benchmarks=eval_kit.<name>` | `--generation_type=mcore_skills` |
+
+## Cluster Config Example
+
+Here is a complete cluster config section for eval_kit support:
+
+```yaml
+containers:
+  eval_kit: /path/to/eval-kit-nemo-skills.sqsh
+  megatron: /path/to/megatron-container.sqsh
+  vllm: /path/to/vllm-container.sqsh
+  # ... other containers
+
+mounts:
+  - /path/to/megatron-lm:/path/to/megatron-lm
+  - /path/to/data:/data
+  - /path/to/hf_cache:/workspace_hf/hf_cache
+  - /path/to/vlmevalkit_cache:/path/to/vlmevalkit_cache
+
+env_vars:
+  - MEGATRON_PATH=/path/to/megatron-lm
+  - PYTHONPATH=/path/to/megatron-lm
+  - LMUData=/path/to/vlmevalkit_cache
+  - HF_HOME=/workspace_hf/hf_cache
+  - HYDRA_FULL_ERROR=1
+  - CUDA_DEVICE_MAX_CONNECTIONS=1
+```
+
+## Understanding Results
+
+After evaluation completes, results are in `<output_dir>/eval-results/`:
+
+```text
+<output_dir>/
+└── eval-results/
+    └── eval_kit.LibriSpeech_test_clean/
+        ├── output.jsonl              # Per-sample results (generation + expected_answer)
+        ├── eval_kit_metrics.json     # Aggregate metrics from VLMEvalKit
+        └── metrics.json              # NeMo Skills summary
+```
+
+The `eval_kit_metrics.json` file contains the metrics computed by VLMEvalKit. For ASR benchmarks this is typically:
+
+```json
+{
+  "result": "              Dataset   WER (%) Metric\n0  LibriSpeechDataset  1.555811    WER"
+}
+```
+
+## Troubleshooting
+
+### `No module named 'megatron.core'`
+
+The `MEGATRON_PATH` or `PYTHONPATH` is not set correctly in the cluster config `env_vars`. Ensure both point to a Megatron-LM installation that contains `megatron/core/`.
+
+### `env variable RD_TABLEBENCH_SRC is missing`
+
+Some VLMEvalKit versions have a hard assert on this environment variable at import time. Fix: use the stable VLMEvalKitMcore version, or set `RD_TABLEBENCH_SRC=/tmp` in your cluster config env_vars.
+
+### `ModuleNotFoundError: No module named 'vlmeval'`
+
+The `NEMO_SKILLS_VLMEVALKIT_PATH` was not set when you ran `ns eval`, so the `vlmeval/` directory was not packaged. Set it and re-run:
+
+```bash
+export NEMO_SKILLS_VLMEVALKIT_PATH=/path/to/VLMEvalKitMcore
+ns eval ...
+```
+
+### Installation command for missing dependencies
+
+If the eval_kit container is missing some Python packages, use `--installation_command`:
+
+```bash
+--installation_command "pip install --no-deps pylatexenc==2.10"
+```
+
+This runs inside the container before the main task starts.
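Note on consuming `eval_kit_metrics.json`: the `result` field in the example under "Understanding Results" is a whitespace-aligned table serialized as a string, not structured JSON. A minimal sketch of pulling the WER out of it follows, assuming the layout matches that example exactly (one header line, then one row per dataset with an index, dataset name, value, and metric label). `parse_asr_wer` is a hypothetical helper, not part of NeMo Skills or VLMEvalKit, and other benchmarks may format `result` differently.

```python
import json


def parse_asr_wer(metrics_json: str) -> dict:
    """Map dataset name -> WER (%) from an eval_kit_metrics.json payload.

    Assumes the `result` string has one header line followed by rows of the
    form: <row index> <dataset name> <WER value> <metric label>.
    """
    result = json.loads(metrics_json)["result"]
    lines = [line for line in result.splitlines() if line.strip()]
    wer_by_dataset = {}
    for line in lines[1:]:  # skip the header line
        _index, dataset, value, _metric = line.split()
        wer_by_dataset[dataset] = float(value)
    return wer_by_dataset


# The payload below reproduces the example from the documentation page.
example = (
    '{"result": "              Dataset   WER (%) Metric\\n'
    '0  LibriSpeechDataset  1.555811    WER"}'
)
print(parse_asr_wer(example))
```

Row parsing relies on whitespace splitting, so it would need adjusting for dataset names that contain spaces.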
diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md
@@ -12,6 +12,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f
 - [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#flores-200), [wmt24pp](./multilingual.md#wmt24pp)
 - [**Speech & Audio**](./speech-audio.md): e.g. [asr-leaderboard](./speech-audio.md#asr-leaderboard), [mmau-pro](./speech-audio.md#mmau-pro)
 - [**Vision-Language Models (VLM)**](./vlm.md): e.g. [mmmu-pro](./vlm.md#mmmu-pro)
+- [**VLMEvalKit Integration (eval_kit)**](./eval-kit.md): Run VLMEvalKit benchmarks via Megatron in-process or vLLM
 - [**Speculative Decoding (SD)**](./speculative-decoding.md): e.g. [SPEED-Bench](./speculative-decoding.md#SPEED-Bench)
 
 See [nemo_skills/dataset](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.

diff --git a/nemo_skills/dataset/eval_kit/__init__.py b/nemo_skills/dataset/eval_kit/__init__.py
@@ -0,0 +1,45 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# VLMEvalKit integration module.
+# Benchmarks are referenced as eval_kit.<VLMEvalKit_dataset_name>, e.g. eval_kit.MMBench_DEV_EN
+# The sub-benchmark name after eval_kit. is dynamically resolved and passed to VLMEvalKit.
+
+GENERATION_MODULE = "nemo_skills.inference.eval.eval_kit"
+METRICS_TYPE = "eval_kit"
+GENERATION_ARGS = ""
+NUM_SAMPLES = 0  # VLMEvalKit inference is deterministic; no random seeds
+
+# No JSONL input file; VLMEvalKit manages its own data via build_dataset()
+SKIP_INPUT_FILE = True
+
+# Note: SELF_CONTAINED_TASK is NOT set here because it depends on model_type.
+# For mcore mode (Megatron in-process), the pipeline sets self_contained_task=True
+# at runtime based on ++model_type=mcore in extra_arguments.
+# For vllm mode, the standard NeMo Skills server/client flow is used.
+
+
+def get_extra_generation_args(benchmark):
+    """Return extra generation args for the given benchmark name.
+
+    Extracts the VLMEvalKit dataset name from the dotted benchmark name
+    (e.g. eval_kit.MMBench_DEV_EN -> ++vlm_dataset=MMBench_DEV_EN).
+    """
+    _, sep, sub = benchmark.partition(".")
+    if not sep or not sub:  # reject both "eval_kit" and "eval_kit." (empty dataset name)
+        raise ValueError(
+            f"eval_kit benchmark must be in 'eval_kit.<dataset_name>' format, got '{benchmark}'. "
+            "Example: eval_kit.MMBench_DEV_EN, eval_kit.LibriSpeech_test_clean"
+        )
+    return f" ++vlm_dataset={sub} "
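To illustrate the mapping that `get_extra_generation_args` performs, the sketch below reproduces its logic as a standalone function so it can be run without NeMo Skills installed. The name `vlm_dataset_arg` is hypothetical and used only for this illustration.

```python
def vlm_dataset_arg(benchmark: str) -> str:
    """Turn 'eval_kit.<dataset_name>' into the Hydra-style override string."""
    if "." not in benchmark:
        raise ValueError(
            f"eval_kit benchmark must be in 'eval_kit.<dataset_name>' format, got '{benchmark}'."
        )
    # Split only on the first dot so everything after 'eval_kit.' is kept intact.
    sub = benchmark.split(".", 1)[1]
    return f" ++vlm_dataset={sub} "


for name in ("eval_kit.MMBench_DEV_EN", "eval_kit.LibriSpeech_test_clean"):
    print(repr(vlm_dataset_arg(name)))

try:
    vlm_dataset_arg("eval_kit")
except ValueError as err:
    print("rejected:", err)
```

The leading and trailing spaces in the returned string let it be concatenated with other generation arguments without extra separators.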
diff --git a/nemo_skills/dataset/utils.py b/nemo_skills/dataset/utils.py
@@ -161,6 +161,13 @@ def _load_external_dataset(dataset_path):
 
 def get_default_dataset_module(dataset):
     data_path = "/nemo_run/code/nemo_skills/dataset"
+
+    # For dotted names like eval_kit.MMBench_DEV_EN, import the parent package.
+    # The sub-benchmark part is handled by the module's get_extra_generation_args().
+    if dataset.startswith("eval_kit."):
+        dataset_module = importlib.import_module("nemo_skills.dataset.eval_kit")
+        return dataset_module, data_path
+
     dataset_module = importlib.import_module(f"nemo_skills.dataset.{dataset}")
 
     return dataset_module, data_path
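The special case above exists because the `eval_kit` package in this change contains only `__init__.py`, so passing a dotted name such as `eval_kit.MMBench_DEV_EN` straight to `importlib.import_module` would look for a submodule that does not exist. A minimal sketch of the name resolution, isolated from the actual import so it runs anywhere, might look like this. `resolve_dataset_module_name` is a hypothetical helper for illustration, not a function in `nemo_skills.dataset.utils`.

```python
def resolve_dataset_module_name(dataset: str) -> str:
    """Return the module path that would be imported for a benchmark name."""
    if dataset.startswith("eval_kit."):
        # All eval_kit.<dataset_name> benchmarks share the parent package;
        # the part after the dot is handled later by get_extra_generation_args().
        return "nemo_skills.dataset.eval_kit"
    return f"nemo_skills.dataset.{dataset}"


print(resolve_dataset_module_name("eval_kit.MMBench_DEV_EN"))
print(resolve_dataset_module_name("asr-leaderboard"))
```

Only names with the `eval_kit.` prefix are redirected, so other benchmarks keep resolving exactly as before.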