Adding MMAU-Pro Benchmark Evaluation and Megatron-LM Server Support #1022
Conversation
    def _get_mmau_pro_subgroup_configs():
        """Define the __init__.py configurations for each MMAU-Pro subgroup."""
        return {
            "closed_form": """# Closed-form questions evaluated with NV Embed similarity matching
I think it should work if we have it as an actual `__init__.py` inside a subfolder, not created on the fly by prepare.py. That would be a bit more explicit and easier to manage, I think.
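For illustration, a static version of one subgroup config could be checked in as a regular file; the constants below mirror the generated template quoted in this thread, while the exact path is an assumption:

```python
# Hypothetical path: nemo_skills/dataset/mmau-pro/closed_form/__init__.py
# Closed-form questions evaluated with NV Embed similarity matching.
GENERATION_ARGS = "++prompt_format=openai"
NUM_CHUNKS = 10  # mirrors the generated template; see the discussion below about the default
```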
    GENERATION_ARGS = "++prompt_format=openai"
    # Split into 10 chunks for parallel processing of large dataset
    NUM_CHUNKS = 10
I don't think this is a good default; we should always use 1. People can override it if needed. The number of chunks is very model-dependent, e.g. for a small model 10 is likely overkill.
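A minimal sketch of the suggested default (constant name taken from the snippet above):

```python
# Process the dataset as a single chunk by default; users can override
# NUM_CHUNKS per run when a larger model or dataset benefits from splitting.
NUM_CHUNKS = 1
```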
    def main():
        parser = argparse.ArgumentParser(description="Prepare MMAU-Pro dataset for nemo-skills")
        parser.add_argument("--split", default="test", choices=["validation", "test"])
        parser.add_argument("--with-audio", action="store_true", help="Download audio files (requires HF_TOKEN)")
Please add a new documentation page under docs/evaluation for this benchmark: describe what the --with-audio parameter does and how the benchmark works in general, along with an example command and reference eval scores.
    parser = argparse.ArgumentParser(description="Prepare MMAU-Pro dataset for nemo-skills")
    parser.add_argument("--split", default="test", choices=["validation", "test"])
    parser.add_argument("--with-audio", action="store_true", help="Download audio files (requires HF_TOKEN)")
    parser.add_argument("--download-dir", help="Directory for audio files (required with --with-audio)")
Can we instead reuse the common data_dir parameter of the prepare_data / eval pipelines?
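A hypothetical sketch of what reusing a shared data directory could look like in prepare.py; the argument name and layout are assumptions, and the real pipelines may wire this differently:

```python
import argparse

parser = argparse.ArgumentParser(description="Prepare MMAU-Pro dataset for nemo-skills")
parser.add_argument("--split", default="test", choices=["validation", "test"])
parser.add_argument("--with-audio", action="store_true", help="Download audio files (requires HF_TOKEN)")
# Reuse the pipeline-wide data directory instead of a benchmark-specific --download-dir.
parser.add_argument(
    "--data_dir",
    help="Shared data directory (same value passed to prepare_data / eval); "
    "audio files would be stored under <data_dir>/mmau-pro",
)
args = parser.parse_args()
```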
nemo_skills/dataset/utils.py
    }


    def generate_subgroup_init_files(output_dir, subgroup_configs):
Please test whether it works with explicit `__init__.py` files instead of generating them on the fly; if so, we can remove this new function.
nemo_skills/pipeline/eval.py
    # Install required packages for NVEmbed evaluation
    install_cmd = (
        "pip install -q -e /nemo_run/code && pip install -q datasets einops transformers==4.42.4 && "
Do we need `pip install -q -e /nemo_run/code`? The repo might not be installable (it's not always this repo that is being used — users can run this from inside another git repo, and then we upload their repo instead). If you just cd into that folder, everything should hopefully work without installation, since there will always be a nemo_skills subfolder there.
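A sketch of the suggested shape of the command, assuming the uploaded code lands under /nemo_run/code as in the snippet above (package list copied from the original):

```python
install_cmd = (
    # cd into the uploaded code instead of pip-installing it; the nemo_skills
    # package is importable directly from that folder.
    "cd /nemo_run/code && "
    "pip install -q datasets einops transformers==4.42.4"
)
```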
nemo_skills/pipeline/eval.py
    eval_cmd = (
        # First verify generation completed by checking for .done file
        f'if [ ! -f "{src_file}.done" ]; then '
        f'  echo "Error: Generation not complete: {src_file}.done not found"; '
I don't think we need an explicit check; it can just fail with the normal error about a missing input file if the previous job isn't finished. So I'd just run your command assuming the file is there — if it isn't, the error message should make it clear that the previous step has an issue.
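For illustration, the guard could be dropped entirely and the command left to fail on a missing input file; the variable values below are placeholders standing in for the ones used in eval.py:

```python
# Placeholder values for illustration only.
src_file = "/results/mmau-pro/generation/output.jsonl"
output_dir_path = "/results/mmau-pro/judge"

# No explicit .done check: if the generation step did not finish, the cp below
# fails with a clear "No such file" error pointing at the missing input.
eval_cmd = (
    f"mkdir -p {output_dir_path} && "
    f"cp {src_file} {output_dir_path}/ && "
    "<eval invocation>"  # stands in for the actual NV Embed judging command
)
```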
nemo_skills/pipeline/eval.py
| f" exit 1; " | ||
| f"fi && " | ||
| # Copy and evaluate only if generation succeeded | ||
| f"mkdir -p {output_dir_path} && " |
Consider moving all of this logic into a separate script and just calling that script (including the installation commands), so that we don't build up a lot of logic in the general eval.py. We could then just do something like `python -m nemo_skills.evaluation.evaluator.mmau_pro_nvembed_judge <parameters>`, with everything else handled inside, or something along those lines.
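A rough skeleton of what such a standalone entry point could look like; the module path comes from the comment above, while the argument names, dependency list, and judging logic are placeholders:

```python
# Hypothetical: nemo_skills/evaluation/evaluator/mmau_pro_nvembed_judge.py
import argparse
import subprocess


def install_dependencies():
    # Keep environment setup next to the logic that needs it, instead of in eval.py.
    subprocess.run(
        ["pip", "install", "-q", "datasets", "einops", "transformers==4.42.4"],
        check=True,
    )


def main():
    parser = argparse.ArgumentParser(description="NV Embed judge for MMAU-Pro closed-form answers")
    parser.add_argument("--input-file", required=True, help="Generation output (.jsonl) to score")
    parser.add_argument("--output-dir", required=True, help="Where to write judged results")
    args = parser.parse_args()

    install_dependencies()
    # ... load generations from args.input_file, run NV Embed similarity matching,
    # and write per-sample scores into args.output_dir ...


if __name__ == "__main__":
    main()
```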
nemo_skills/pipeline/eval.py
        cmd=run_cmd,
        task_name=f"{expname}-{benchmark}-nvembed-judge",
        log_dir=log_dir + "/judge",
        container=server_parameters.get("server_container"),
Should we pick a specific container instead? Users can provide their own containers, and it's not guaranteed they will have all the dependencies you need. Maybe we just use vllm, e.g. if that's what you tested this with, irrespective of the main server (or any other container, but I'd probably have it fixed).
Even better if we can use the main nemo-skills client container, as it's the most lightweight (maybe with an extra on-the-fly install of whatever you need), unless that takes too much time.
At first, I tried to use the main nemo-skills container, but since NV Embed requires torch, I needed to install it there. However, even after installing the correct CUDA-compatible version of torch, CUDA still showed as unavailable.
In the recent changes, I switched to the vllm container, since it already has torch installed, and then installed the required packages there.
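For illustration, pinning the judge step to a fixed image rather than the user-supplied server container could look like this; the config layout and image names below are placeholders, not the project's actual values:

```python
# Placeholder cluster config: container images keyed by role.
cluster_config = {
    "containers": {
        "vllm": "example.org/vllm-image:tag",              # placeholder image reference
        "nemo-skills": "example.org/nemo-skills-image:tag",
    }
}

# Pin the NV Embed judge to the vllm image (it already ships a CUDA-enabled torch),
# instead of inheriting whatever container the main inference server happens to use.
judge_container = cluster_config["containers"]["vllm"]
```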
            super().__init__(compute_no_answer=compute_no_answer)
            self.max_k = max_k

        def _extract_judge_result(self, judgement_text: str) -> bool:
Can we reuse the existing `is_correct_judgement` function here?
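A sketch of delegating to the shared helper; the import path is an assumption, so adjust it to wherever `is_correct_judgement` actually lives:

```python
# Assumed location of the shared helper referenced in the comment above.
from nemo_skills.evaluation.metrics.utils import is_correct_judgement


def _extract_judge_result(self, judgement_text: str) -> bool:
    # Reuse the common judgement-parsing logic instead of a custom string match;
    # bool() maps a missing/unparseable judgement to False.
    return bool(is_correct_judgement(judgement_text))
```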
MMAU-Pro Evaluation and Megatron-LM Server Support
Summary
This PR adds configuration and scripts to enable MMAU-Pro evaluation for multimodal language models using the NeMo framework and the Megatron-LM inference server.
This is the first SpeechLM benchmark to be supported in NeMo Skills with the Megatron-LM server, providing a standardized evaluation pipeline for audio-language tasks.
Motivation
The MMAU-Pro benchmark provides an evaluation for multimodal audio-language models. By integrating it with NeMo-Skills and the Megatron-LM server, we enable reproducible evaluation of model performance on audio question answering tasks. This lays the groundwork for future SpeechLM benchmarks to be supported under the same evaluation infrastructure.
What's New
- MMAU-Pro benchmark evaluation and Megatron-LM server support, runnable through `ns eval`.
Implementation Notes
The benchmark consists of three main evaluation groups, each handled with separate `test.jsonl` files and corresponding jobs: some groups are judged with an LLM judge (qwen2.5-7b-instruct), while closed-form questions are scored with embedding similarity matching (NV Embed). All scores are then combined into a final score.
Preparing the Dataset
To prepare the `test.jsonl` data and generate the subgroup `__init__.py` files, run the dataset preparation script (a sketch of both steps is shown below).

Example: Running the Evaluation
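Below is a hypothetical sketch of the two steps, written as a small Python driver. The script path, benchmark name, server_type value, and all file paths are assumptions to adapt to your setup.

```python
import subprocess

# Step 1: build the subgroup test.jsonl files and their __init__.py configs.
# Flags mirror prepare.py from this PR; the dataset folder name is assumed.
subprocess.run(
    [
        "python", "nemo_skills/dataset/mmau-pro/prepare.py",
        "--split", "test",
        "--with-audio",
        "--download-dir", "/data/mmau-pro-audio",
    ],
    check=True,
)

# Step 2: run the evaluation through ns eval. Benchmark name, server_type,
# and the cluster/model/output paths are placeholders for illustration.
subprocess.run(
    [
        "ns", "eval",
        "--cluster", "local",
        "--server_type", "megatron",  # assumed name for the new Megatron-LM backend
        "--model", "/models/speechlm-checkpoint",
        "--benchmarks", "mmau-pro",
        "--output_dir", "/results/mmau-pro",
    ],
    check=True,
)
```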
Notes
- NV Embed evaluation requires a different version of `transformers` than Megatron-LM, plus some additional packages. The `transformers` version is downgraded, but this doesn't affect generation because the downgrade happens after it.
- A new `judge_type` was added in `eval.py` to allow GPU-based scoring.
- Subgroup configurations are generated as `__init__.py` files by the prepare script.
- The `sacrebleu` package was missing, so it is added via `--installation_command`.