
Conversation

@melllinia (Collaborator)

MMAU-Pro Evaluation and Megatron-LM Server Support

Summary

This PR adds configuration and scripts to enable MMAU-Pro evaluation for multimodal language models using the NeMo framework and the Megatron-LM inference server.

This is the first SpeechLM benchmark to be supported in NeMo Skills with the Megatron-LM server, providing a standardized evaluation pipeline for audio-language tasks.

Motivation

The MMAU-Pro benchmark provides an evaluation for multimodal audio-language models. By integrating it with NeMo-Skills and the Megatron-LM server, we enable reproducible evaluation of model performance on audio question answering tasks. This lays the groundwork for future SpeechLM benchmarks to be supported under the same evaluation infrastructure.

What's New

  • Added MMAU-Pro evaluation support in NeMo Skills.
  • Configured Megatron-LM server for inference, incorporating modifications by Steve Huang.
  • Provided an example to run evaluation via ns eval.

Implementation Notes

The benchmark consists of three main evaluation groups, each handled with its own test.jsonl file and corresponding jobs:

  1. Open-ended questions – evaluated by an LLM judge (qwen2.5-7b-instruct).
  2. Closed-form questions – evaluated via embedding matching (NV-Embed).
  3. Instruction following – evaluated with simple rule-based functions.

All subgroup scores are then combined into a final MMAU-Pro score (a sketch of the aggregation is shown below).
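For illustration only, the aggregation could look like the following minimal sketch; the sample-weighted averaging and the dictionary layout are assumptions for clarity, not the PR's exact metrics code.

def combine_scores(subgroup_results: dict[str, dict]) -> float:
    """Combine per-subgroup accuracies into one final score,
    weighting each subgroup by its number of samples."""
    total_correct = sum(r["accuracy"] * r["num_samples"] for r in subgroup_results.values())
    total_samples = sum(r["num_samples"] for r in subgroup_results.values())
    return total_correct / total_samples

# Dummy numbers, purely for illustration:
final_score = combine_scores({
    "open_ended": {"accuracy": 0.50, "num_samples": 1000},            # LLM judge
    "closed_form": {"accuracy": 0.60, "num_samples": 3000},           # NV-Embed matching
    "instruction_following": {"accuracy": 0.70, "num_samples": 500},  # rule-based checks
})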

Preparing the Dataset

  • To download the test.jsonl data and generate the subgroup __init__.py files (an example generated file follows this list), run:
python nemo_skills/dataset/mmau-pro/prepare.py
  • To also download the audio files (~50GB), run:
python nemo_skills/dataset/mmau-pro/prepare.py --with-audio --download-dir /path/to/dir
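
For reference, a generated subgroup __init__.py contains settings along these lines. The values are the ones shown for the closed_form subgroup in the review thread below; the exact file path is an assumption.

# nemo_skills/dataset/mmau-pro/closed_form/__init__.py (assumed path; generated by prepare.py)
# Closed-form questions evaluated with NV-Embed similarity matching
GENERATION_ARGS = "++prompt_format=openai"
# Split into 10 chunks for parallel processing of the large dataset
NUM_CHUNKS = 10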

Example: Running the Evaluation

export MEGATRON_PATH="/lustre/fsw/portfolios/llmservice/users/heh/code/megatron-lm-vlm2-dev"
export CKPT_PATH="/lustre/fsw/portfolios/convai/users/mmkrtchyan/avlm_checkpoints/draco-mcore-nm_5p5_h_9b-cradio-parakeet-nemo-stage2-alm-nodes16-seq16384-bz2048-tp4-vlm2-branch-1009-tp1"
export MODEL_CFG_PATH="/lustre/fsw/portfolios/convai/users/mmkrtchyan/avlm_checkpoints/draco-mcore-nm_5p5_h_9b-cradio-parakeet-nemo-stage2-alm-nodes16-seq16384-bz2048-tp4-vlm2-branch-1009-tp1/config.yaml"

export OUTPUT_DIR="mmau-pro_eval"
export SERVER_ENTRYPOINT="$MEGATRON_PATH/examples/multimodal/run_avlm_text_generation_server.py"
export SERVER_CONTAINER="/lustre/fsw/portfolios/llmservice/users/trintamaki/workspace/containers/megatron-dev-img-05142025-pytorch-dev-te-cd37379-editable-energon-mamba-fix-vlmeval-av.sqsh"

export HF_TOKEN='YOUR_HF_TOKEN'
export WANDB='YOUR_WANDB_API_KEY'
export NVIDIA_API_KEY='YOUR_NVIDIA_API_KEY' 

export WANDB_API_KEY=${WANDB} && \
export HF_TOKEN=${HF_TOKEN} && \
export NVIDIA_API_KEY=${NVIDIA_API_KEY} && \
export MEGATRON_PATH="$MEGATRON_PATH" && \
ns eval \
  --cluster oci_iad \
  --output_dir /path/to/output/dir/$OUTPUT_DIR \
  --benchmarks mmau-pro \
  --server_type megatron \
  --server_gpus 1 \
  --model $CKPT_PATH \
  --server_entrypoint $SERVER_ENTRYPOINT \
  --server_container $SERVER_CONTAINER \
  --data_dir="/dataset" \
  --installation_command "pip install sacrebleu" \
  ++prompt_suffix='/no_think' \
  --server_args "--inference-max-requests 1 \
                 --model-config ${MODEL_CFG_PATH} \
                 --num-tokens-to-generate 256 --temperature 1.0 --top_p 1.0"

Notes

  • Closed-form evaluation uses the NV-Embed model, which requires an older transformers version than Megatron-LM plus a few additional packages. The transformers downgrade does not affect generation, because it is applied only in the judging step, which runs after generation completes.
  • The evaluation logic is organized by judge_type in eval.py to allow GPU-based scoring (a routing sketch follows these notes).
  • Since the benchmark runs in subgroups, each subgroup's chunk size is specified in its __init__.py.
  • The custom Megatron-LM container used here was missing the sacrebleu package, so it is installed via --installation_command.
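
For illustration only, the judge_type routing could be organized along these lines; every function and key name below is hypothetical, not the PR's exact code.

from typing import Callable


def judge_with_llm(samples: list[dict]) -> list[dict]:
    ...  # open-ended: send each answer to the LLM judge (qwen2.5-7b-instruct)


def judge_with_embeddings(samples: list[dict]) -> list[dict]:
    ...  # closed-form: GPU-based NV-Embed similarity matching


def judge_with_rules(samples: list[dict]) -> list[dict]:
    ...  # instruction following: simple deterministic checks


JUDGES: dict[str, Callable[[list[dict]], list[dict]]] = {
    "llm": judge_with_llm,
    "nvembed": judge_with_embeddings,
    "rule": judge_with_rules,
}


def evaluate_subgroup(judge_type: str, samples: list[dict]) -> list[dict]:
    # Pick the scorer for this subgroup, or fail loudly on an unknown judge_type.
    try:
        return JUDGES[judge_type](samples)
    except KeyError:
        raise ValueError(f"Unknown judge_type: {judge_type}") from None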

@melllinia force-pushed the mmau-pro-eval branch 4 times, most recently from b8879c7 to cd6887c on November 4, 2025 at 13:44
@karpnv requested a review from Kipok on November 4, 2025 at 15:09
def _get_mmau_pro_subgroup_configs():
    """Define the __init__.py configurations for each MMAU-Pro subgroup."""
    return {
        "closed_form": """# Closed-form questions evaluated with NV Embed similarity matching

Reviewer (Collaborator):

I think it should work if we have it as an actual __init__.py inside a subfolder, not created on the fly by prepare.py. That would be a bit more explicit and easier to manage, I think.

GENERATION_ARGS = "++prompt_format=openai"
# Split into 10 chunks for parallel processing of large dataset
NUM_CHUNKS = 10

Reviewer (Collaborator):

I don't think this is a good default; we should always use 1. People can override it if needed. The number of chunks is very model dependent, e.g. for a small model 10 is likely overkill.

def main():
    parser = argparse.ArgumentParser(description="Prepare MMAU-Pro dataset for nemo-skills")
    parser.add_argument("--split", default="test", choices=["validation", "test"])
    parser.add_argument("--with-audio", action="store_true", help="Download audio files (requires HF_TOKEN)")

Reviewer (Collaborator):

Please add a new documentation page in docs/evaluation for this benchmark, describing what the --with-audio parameter does and how the benchmark works in general, along with an example command and reference eval scores.

    parser = argparse.ArgumentParser(description="Prepare MMAU-Pro dataset for nemo-skills")
    parser.add_argument("--split", default="test", choices=["validation", "test"])
    parser.add_argument("--with-audio", action="store_true", help="Download audio files (requires HF_TOKEN)")
    parser.add_argument("--download-dir", help="Directory for audio files (required with --with-audio)")

Reviewer (Collaborator):

Can we instead reuse the common data_dir parameter of the prepare_data / eval pipelines?

    }


def generate_subgroup_init_files(output_dir, subgroup_configs):

Reviewer (Collaborator):

Please test whether it works with explicit __init__.py files (not generated on the fly); if so, we can remove this new function.


# Install required packages for NVEmbed evaluation
install_cmd = (
    "pip install -q -e /nemo_run/code && pip install -q datasets einops transformers==4.42.4 && "

Reviewer (Collaborator):

Do we need pip install -q -e /nemo_run/code? The repo might not be installable (it's not always this repo that is being used; users can run this from inside another git repo, and then we upload their repo instead). If you just cd into that folder, everything should work without installation, since there will always be a nemo_skills subfolder there.

eval_cmd = (
    # First verify generation completed by checking for .done file
    f'if [ ! -f "{src_file}.done" ]; then '
    f'  echo "Error: Generation not complete: {src_file}.done not found"; '

Reviewer (Collaborator):

I don't think we need an explicit check; it can just fail with the normal error about a missing input file if the previous job isn't finished. So I'd run your command assuming the file is there, and if it isn't, the error message should make it clear that the previous step has an issue.

f" exit 1; "
f"fi && "
# Copy and evaluate only if generation succeeded
f"mkdir -p {output_dir_path} && "

Reviewer (Collaborator):

Consider moving all of this logic into a separate script and then just calling that script (including the installation commands), so that we don't build up a lot of logic in the general eval.py and can instead do something like python -m nemo_skills.evaluation.evaluator.mmau_pro_nvembed_judge <parameters>, with everything else handled inside.
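
A minimal sketch of what such a standalone entrypoint could look like, under the assumption that it reads the generation output and writes judged results back to disk; the flags and helper names below are hypothetical, not part of this PR.

# Hypothetical NV-Embed judge entrypoint; flags and helpers are illustrative only.
import argparse
import json


def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def main():
    parser = argparse.ArgumentParser(description="Score closed-form MMAU-Pro answers with an embedding judge")
    parser.add_argument("--input-file", required=True, help="generation output (.jsonl) to score")
    parser.add_argument("--output-file", required=True, help="where to write judged results (.jsonl)")
    args = parser.parse_args()

    samples = load_jsonl(args.input_file)
    # ... load NV-Embed here and attach a judgement field to each sample ...
    with open(args.output_file, "w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")


if __name__ == "__main__":
    main()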

    cmd=run_cmd,
    task_name=f"{expname}-{benchmark}-nvembed-judge",
    log_dir=log_dir + "/judge",
    container=server_parameters.get("server_container"),

Reviewer (Collaborator):

Should we pick a specific container instead? Users can provide their own containers, and it's not guaranteed they will have all the dependencies you need. Maybe we just use vllm, e.g. if that's what you tested this with, irrespective of the main server (or any other container, but I'd have it fixed).

Reviewer (Collaborator):

Even better if we can use the main nemo-skills client container, as it's the most lightweight (maybe with an extra on-the-fly install of whatever you need), unless that takes too much time.

Author (@melllinia):

At first I tried to use the main nemo-skills container, but since NV-Embed requires torch, I needed to install it there. However, even after installing the correct CUDA-compatible version of torch, CUDA still showed as unavailable.

Author (@melllinia):

In the recent changes I switched to the vllm container since it already had torch installed, and then installed the required packages there.

        super().__init__(compute_no_answer=compute_no_answer)
        self.max_k = max_k

    def _extract_judge_result(self, judgement_text: str) -> bool:

Reviewer (Collaborator):

Can we reuse the existing is_correct_judgement function here?
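
For illustration, the delegation could look like this; the import path is an assumption (the diff only confirms that an is_correct_judgement helper exists somewhere in the repo).

# The import path below is assumed for illustration, not confirmed by this diff.
from nemo_skills.evaluation.metrics.utils import is_correct_judgement


def _extract_judge_result(self, judgement_text: str) -> bool:
    # Reuse the shared parser for "Judgement: Yes/No"-style LLM-judge outputs
    # instead of re-implementing the extraction here.
    return bool(is_correct_judgement(judgement_text))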

@melllinia force-pushed the mmau-pro-eval branch 2 times, most recently from eac0395 to 3aba179 on November 6, 2025 at 10:49