
Conversation

@melllinia (Collaborator)

MMAU-Pro Evaluation and Megatron-LM Server Support

Summary

This PR adds configuration and scripts to enable MMAU-Pro evaluation for multimodal language models using the NeMo framework and the Megatron-LM inference server.

This is the first SpeechLM benchmark to be supported in NeMo Skills with the Megatron-LM server, providing a standardized evaluation pipeline for audio-language tasks.

Motivation

The MMAU-Pro benchmark provides an evaluation for multimodal audio-language models. By integrating it with NeMo-Skills and the Megatron-LM server, we enable reproducible evaluation of model performance on audio question answering tasks. This lays the groundwork for future SpeechLM benchmarks to be supported under the same evaluation infrastructure.

What's New

  • Added MMAU-Pro evaluation support in NeMo Skills.
  • Configured Megatron-LM server for inference, incorporating modifications by Steve Huang.
  • Provided an example to run evaluation via ns eval.

Implementation Notes

The benchmark consists of three main evaluation groups, each handled with its own test.jsonl file and corresponding jobs:

  1. Open-ended questions – evaluated by an LLM judge (qwen2.5-7b-instruct).
  2. Closed-form questions – evaluated via embedding matching (NV-Embed).
  3. Instruction following – evaluated with simple rule-based functions.

All subgroup scores are then combined into a final MMAU-Pro score (a sketch of the aggregation is shown below).
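For illustration only, the aggregation could look like the following minimal sketch; the sample-weighted averaging and the dictionary layout are assumptions for clarity, not the PR's exact metrics code.

def combine_scores(subgroup_results: dict[str, dict]) -> float:
    """Combine per-subgroup accuracies into one final score,
    weighting each subgroup by its number of samples."""
    total_correct = sum(r["accuracy"] * r["num_samples"] for r in subgroup_results.values())
    total_samples = sum(r["num_samples"] for r in subgroup_results.values())
    return total_correct / total_samples

# Dummy numbers, purely for illustration:
final_score = combine_scores({
    "open_ended": {"accuracy": 0.50, "num_samples": 1000},            # LLM judge
    "closed_form": {"accuracy": 0.60, "num_samples": 3000},           # NV-Embed matching
    "instruction_following": {"accuracy": 0.70, "num_samples": 500},  # rule-based checks
})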

Preparing the Dataset

  • To download the test.jsonl data and generate the subgroup __init__.py files (an example generated file follows this list), run:
python nemo_skills/dataset/mmau-pro/prepare.py
  • To also download the audio files (~50GB), run:
python nemo_skills/dataset/mmau-pro/prepare.py --with-audio --download-dir /path/to/dir
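
For reference, a generated subgroup __init__.py contains settings along these lines. The values are the ones shown for the closed_form subgroup in the review thread below; the exact file path is an assumption.

# nemo_skills/dataset/mmau-pro/closed_form/__init__.py (assumed path; generated by prepare.py)
# Closed-form questions evaluated with NV-Embed similarity matching
GENERATION_ARGS = "++prompt_format=openai"
# Split into 10 chunks for parallel processing of the large dataset
NUM_CHUNKS = 10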

Example: Running the Evaluation

export MEGATRON_PATH="/lustre/fsw/portfolios/llmservice/users/heh/code/megatron-lm-vlm2-dev"
export CKPT_PATH="/lustre/fsw/portfolios/convai/users/mmkrtchyan/avlm_checkpoints/draco-mcore-nm_5p5_h_9b-cradio-parakeet-nemo-stage2-alm-nodes16-seq16384-bz2048-tp4-vlm2-branch-1009-tp1"
export MODEL_CFG_PATH="/lustre/fsw/portfolios/convai/users/mmkrtchyan/avlm_checkpoints/draco-mcore-nm_5p5_h_9b-cradio-parakeet-nemo-stage2-alm-nodes16-seq16384-bz2048-tp4-vlm2-branch-1009-tp1/config.yaml"

export OUTPUT_DIR="mmau-pro_eval"
export SERVER_ENTRYPOINT="$MEGATRON_PATH/examples/multimodal/run_avlm_text_generation_server.py"
export SERVER_CONTAINER="/lustre/fsw/portfolios/llmservice/users/trintamaki/workspace/containers/megatron-dev-img-05142025-pytorch-dev-te-cd37379-editable-energon-mamba-fix-vlmeval-av.sqsh"

export HF_TOKEN='YOUR_HF_TOKEN'
export WANDB='YOUR_WANDB_API_KEY'
export NVIDIA_API_KEY='YOUR_NVIDIA_API_KEY' 

export WANDB_API_KEY=${WANDB} && \
export HF_TOKEN=${HF_TOKEN} && \
export NVIDIA_API_KEY=${NVIDIA_API_KEY} && \
export MEGATRON_PATH="$MEGATRON_PATH" && \
ns eval \
  --cluster oci_iad \
  --output_dir /path/to/output/dir/$OUTPUT_DIR \
  --benchmarks mmau-pro \
  --server_type megatron \
  --server_gpus 1 \
  --model $CKPT_PATH \
  --server_entrypoint $SERVER_ENTRYPOINT \
  --server_container $SERVER_CONTAINER \
  --data_dir="/dataset" \
  --installation_command "pip install sacrebleu" \
  ++prompt_suffix='/no_think' \
  --server_args "--inference-max-requests 1 \
                 --model-config ${MODEL_CFG_PATH} \
                 --num-tokens-to-generate 256 --temperature 1.0 --top_p 1.0"

Notes

  • Closed-form evaluation uses the NV-Embed model, which requires an older transformers version than Megatron-LM plus a few additional packages. The transformers downgrade does not affect generation, because it is applied only in the judging step, which runs after generation completes.
  • The evaluation logic is organized by judge_type in eval.py to allow GPU-based scoring (a routing sketch follows these notes).
  • Since the benchmark runs in subgroups, each subgroup's chunk size is specified in its __init__.py.
  • The custom Megatron-LM container used here was missing the sacrebleu package, so it is installed via --installation_command.
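
For illustration only, the judge_type routing could be organized along these lines; every function and key name below is hypothetical, not the PR's exact code.

from typing import Callable


def judge_with_llm(samples: list[dict]) -> list[dict]:
    ...  # open-ended: send each answer to the LLM judge (qwen2.5-7b-instruct)


def judge_with_embeddings(samples: list[dict]) -> list[dict]:
    ...  # closed-form: GPU-based NV-Embed similarity matching


def judge_with_rules(samples: list[dict]) -> list[dict]:
    ...  # instruction following: simple deterministic checks


JUDGES: dict[str, Callable[[list[dict]], list[dict]]] = {
    "llm": judge_with_llm,
    "nvembed": judge_with_embeddings,
    "rule": judge_with_rules,
}


def evaluate_subgroup(judge_type: str, samples: list[dict]) -> list[dict]:
    # Pick the scorer for this subgroup, or fail loudly on an unknown judge_type.
    try:
        return JUDGES[judge_type](samples)
    except KeyError:
        raise ValueError(f"Unknown judge_type: {judge_type}") from None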

@melllinia force-pushed the mmau-pro-eval branch 4 times, most recently from b8879c7 to cd6887c on November 4, 2025 at 13:44
@karpnv requested a review from Kipok on November 4, 2025 at 15:09
def _get_mmau_pro_subgroup_configs():
    """Define the __init__.py configurations for each MMAU-Pro subgroup."""
    return {
        "closed_form": """# Closed-form questions evaluated with NV Embed similarity matching

Reviewer (Collaborator):

I think it should work if we have it as an actual __init__.py inside a subfolder, not created on the fly by prepare.py. That would be a bit more explicit and easier to manage, I think.

GENERATION_ARGS = "++prompt_format=openai"
# Split into 10 chunks for parallel processing of large dataset
NUM_CHUNKS = 10

Reviewer (Collaborator):

I don't think this is a good default; we should always use 1. People can override it if needed. The number of chunks is very model dependent, e.g. for a small model 10 is likely overkill.

def main():
    parser = argparse.ArgumentParser(description="Prepare MMAU-Pro dataset for nemo-skills")
    parser.add_argument("--split", default="test", choices=["validation", "test"])
    parser.add_argument("--with-audio", action="store_true", help="Download audio files (requires HF_TOKEN)")

Reviewer (Collaborator):

Please add a new documentation page in docs/evaluation for this benchmark, describing what the --with-audio parameter does and how the benchmark works in general, along with an example command and reference eval scores.

    parser = argparse.ArgumentParser(description="Prepare MMAU-Pro dataset for nemo-skills")
    parser.add_argument("--split", default="test", choices=["validation", "test"])
    parser.add_argument("--with-audio", action="store_true", help="Download audio files (requires HF_TOKEN)")
    parser.add_argument("--download-dir", help="Directory for audio files (required with --with-audio)")

Reviewer (Collaborator):

Can we instead reuse the common data_dir parameter of the prepare_data / eval pipelines?

    }


def generate_subgroup_init_files(output_dir, subgroup_configs):

Reviewer (Collaborator):

Please test whether it works with explicit __init__.py files (not generated on the fly); if so, we can remove this new function.


# Install required packages for NVEmbed evaluation
install_cmd = (
    "pip install -q -e /nemo_run/code && pip install -q datasets einops transformers==4.42.4 && "

Reviewer (Collaborator):

Do we need pip install -q -e /nemo_run/code? The repo might not be installable (it's not always this repo that is being used; users can run this from inside another git repo, and then we upload their repo instead). If you just cd into that folder, everything should work without installation, since there will always be a nemo_skills subfolder there.

eval_cmd = (
    # First verify generation completed by checking for .done file
    f'if [ ! -f "{src_file}.done" ]; then '
    f'  echo "Error: Generation not complete: {src_file}.done not found"; '

Reviewer (Collaborator):

I don't think we need an explicit check; it can just fail with the normal error about a missing input file if the previous job isn't finished. So I'd run your command assuming the file is there, and if it isn't, the error message should make it clear that the previous step has an issue.

f" exit 1; "
f"fi && "
# Copy and evaluate only if generation succeeded
f"mkdir -p {output_dir_path} && "

Reviewer (Collaborator):

Consider moving all of this logic into a separate script and then just calling that script (including the installation commands), so that we don't build up a lot of logic in the general eval.py and can instead do something like python -m nemo_skills.evaluation.evaluator.mmau_pro_nvembed_judge <parameters>, with everything else handled inside.
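
A minimal sketch of what such a standalone entrypoint could look like, under the assumption that it reads the generation output and writes judged results back to disk; the flags and helper names below are hypothetical, not part of this PR.

# Hypothetical NV-Embed judge entrypoint; flags and helpers are illustrative only.
import argparse
import json


def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def main():
    parser = argparse.ArgumentParser(description="Score closed-form MMAU-Pro answers with an embedding judge")
    parser.add_argument("--input-file", required=True, help="generation output (.jsonl) to score")
    parser.add_argument("--output-file", required=True, help="where to write judged results (.jsonl)")
    args = parser.parse_args()

    samples = load_jsonl(args.input_file)
    # ... load NV-Embed here and attach a judgement field to each sample ...
    with open(args.output_file, "w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")


if __name__ == "__main__":
    main()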

    cmd=run_cmd,
    task_name=f"{expname}-{benchmark}-nvembed-judge",
    log_dir=log_dir + "/judge",
    container=server_parameters.get("server_container"),

Reviewer (Collaborator):

Should we pick a specific container instead? Users can provide their own containers, and it's not guaranteed they will have all the dependencies you need. Maybe we just use vllm, e.g. if that's what you tested this with, irrespective of the main server (or any other container, but I'd have it fixed).

Reviewer (Collaborator):

Even better if we can use the main nemo-skills client container, as it's the most lightweight (maybe with an extra on-the-fly install of whatever you need), unless that takes too much time.

Author (@melllinia):

At first I tried to use the main nemo-skills container, but since NV-Embed requires torch, I needed to install it there. However, even after installing the correct CUDA-compatible version of torch, CUDA still showed as unavailable.

Author (@melllinia):

In the recent changes I switched to the vllm container since it already had torch installed, and then installed the required packages there.

        super().__init__(compute_no_answer=compute_no_answer)
        self.max_k = max_k

    def _extract_judge_result(self, judgement_text: str) -> bool:

Reviewer (Collaborator):

Can we reuse the existing is_correct_judgement function here?
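
For illustration, the delegation could look like this; the import path is an assumption (the diff only confirms that an is_correct_judgement helper exists somewhere in the repo).

# The import path below is assumed for illustration, not confirmed by this diff.
from nemo_skills.evaluation.metrics.utils import is_correct_judgement


def _extract_judge_result(self, judgement_text: str) -> bool:
    # Reuse the shared parser for "Judgement: Yes/No"-style LLM-judge outputs
    # instead of re-implementing the extraction here.
    return bool(is_correct_judgement(judgement_text))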

@melllinia force-pushed the mmau-pro-eval branch 2 times, most recently from eac0395 to 3aba179 on November 6, 2025 at 10:49