Merged
Changes from 35 commits
Commits
46 commits
7690e27
adding eval support for lcb-c++
wasiahmad Oct 3, 2025
1b8fe33
setting language default
wasiahmad Oct 3, 2025
4931f45
prefix for release version is only needed for python
wasiahmad Oct 3, 2025
07deb25
for c++, we need python interpretor for corresponding eval harness
wasiahmad Oct 3, 2025
7e6720a
code eval docs updated
wasiahmad Oct 3, 2025
0c5ca9f
Merge branch 'main' into livecodebench_cpp
wasiahmad Oct 3, 2025
ffdad7d
minor eval docs update
wasiahmad Oct 3, 2025
ee346da
minor eval docs update
wasiahmad Oct 3, 2025
84a8a0e
debugging generate with sandbox
wasiahmad Oct 3, 2025
535c8af
debugging generate with sandbox
wasiahmad Oct 4, 2025
e29d05d
debugging generate with sandbox
wasiahmad Oct 4, 2025
d3f83ea
debugging generate with sandbox
wasiahmad Oct 4, 2025
ac2dbda
debugging generate with sandbox
wasiahmad Oct 4, 2025
a846042
debugging generate with sandbox
wasiahmad Oct 4, 2025
dc91b99
debugging generate with sandbox
wasiahmad Oct 4, 2025
05b94d4
getting back all changes
wasiahmad Oct 4, 2025
6547325
setting KEEP_MOUNTS_FOR_SANDBOX = True for lcb eval
wasiahmad Oct 4, 2025
e3cc020
Merge remote-tracking branch 'origin/main' into livecodebench_cpp
wasiahmad Oct 4, 2025
5db2f25
Merge branch 'main' into livecodebench_cpp
wasiahmad Oct 6, 2025
3bfa508
Merge branch 'main' into livecodebench_cpp
wasiahmad Oct 6, 2025
02b613f
Increase sandbox client timeouts and skip code re-execution on timeou…
i-vainn Oct 7, 2025
8c4fbea
Control max_concurrent_requests in subclasses with parallel generatio…
Kipok Oct 7, 2025
e6bdf63
Token count for BFCL (#896)
shtoshni Oct 7, 2025
96dfea3
MT datasets FLORES200 and WMT24pp (#892)
AlexGrinch Oct 7, 2025
3af8f58
Add responses api type (#889)
smahdavi4 Oct 7, 2025
bf5be80
Add concurrent semaphore control to llm base class (#907)
smahdavi4 Oct 7, 2025
8ed94f0
Add qos slurm parameter (#906)
wedu-nvidia Oct 7, 2025
2f355b2
Fix typo in default tools parameter for token count (#910)
Kipok Oct 8, 2025
4c1863e
resolving conflicts
wasiahmad Oct 9, 2025
428be52
Slurm tests for code execution timeouts (#905)
i-vainn Oct 9, 2025
b1df71b
Add copyright checks workflow (#912)
activatedgeek Oct 9, 2025
308907a
Context error recovery (#914)
shtoshni Oct 9, 2025
a8a0952
Fix MCP tests (#916)
gwarmstrong Oct 9, 2025
03ff6a9
Addding BFCL headers (#917)
shtoshni Oct 9, 2025
9c1060b
updated year in copyright message
wasiahmad Oct 9, 2025
1bc7c26
lcb-cpp eval without sandbox
wasiahmad Oct 9, 2025
f5ee7b6
minor doc update
wasiahmad Oct 9, 2025
d8db2c7
sanbodx use logic updated with retries
wasiahmad Oct 12, 2025
8b48105
resolving conflicts
wasiahmad Oct 12, 2025
f311171
minor bug fix
wasiahmad Oct 12, 2025
bd19311
removing unwanted print statement
wasiahmad Oct 13, 2025
bf31d28
Merge branch 'main' into livecodebench_cpp
wasiahmad Oct 13, 2025
02ad4b3
Merge branch 'main' into livecodebench_cpp
wasiahmad Oct 13, 2025
aea6c79
added a comment
wasiahmad Oct 13, 2025
54a2aae
Merge branch 'main' into livecodebench_cpp
wasiahmad Oct 13, 2025
e4b3cb8
ojbench uses tries when using sandbox
wasiahmad Oct 13, 2025
21 changes: 21 additions & 0 deletions .github/workflows/copyright-check.yml
@@ -0,0 +1,21 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: Copyright check

on:
pull_request:

jobs:
copyright-check:
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_copyright_check.yml@v0.2.0
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@ Here are some of the features we support:
- [**Instruction following**](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following): e.g. [ifbench](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifbench), [ifeval](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifeval)
- [**Long-context**](https://nvidia.github.io/NeMo-Skills/evaluation/long-context): e.g. [ruler](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#ruler), [mrcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#mrcr), [aalcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#aalcr)
- [**Tool-calling**](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling): e.g. [bfcl_v3](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling/#bfcl_v3)
- [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox)
- [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox), [FLORES-200](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#FLORES-200), [wmt24pp](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#wmt24pp)
- Easily parallelize each evaluation across many slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
- [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).

6 changes: 4 additions & 2 deletions docs/basics/inference.md
@@ -34,12 +34,13 @@ Click on :material-plus-circle: symbols in the snippet below to learn more detai
```python
from nemo_skills.inference.model import get_model
from nemo_skills.prompt.utils import get_prompt
import asyncio

llm = get_model(model="meta-llama/Llama-3.1-8B-Instruct", server_type="vllm") # localhost by default
prompt_obj = get_prompt('generic/default') # (1)!
prompt = prompt_obj.fill({'question': "What's 2 + 2?"})
print(prompt) # (2)!
output = llm.generate_sync(prompt=prompt)
output = asyncio.run(llm.generate_async(prompt=prompt))
print(output["generation"]) # (3)!
```
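
The change above swaps `generate_sync` for `asyncio.run(llm.generate_async(...))`. As a rough illustration of that calling pattern only, here is a minimal sketch that uses a hypothetical `EchoModel` stand-in instead of the real object returned by `get_model`; the class, its return value, and the `generate_many` helper are assumptions made for this example and are not part of NeMo-Skills.

```python
import asyncio


class EchoModel:
    """Hypothetical stand-in for the model object (not the real NeMo-Skills class)."""

    async def generate_async(self, prompt: str) -> dict:
        # Yield control once to mimic waiting on a server without blocking the loop.
        await asyncio.sleep(0)
        return {"generation": f"echo: {prompt}"}


async def generate_many(llm: EchoModel, prompts: list[str]) -> list[dict]:
    # Run all requests concurrently; gather returns results in input order.
    return await asyncio.gather(*(llm.generate_async(prompt=p) for p in prompts))


llm = EchoModel()

# Single request issued from synchronous code, mirroring the updated snippet.
single = asyncio.run(llm.generate_async(prompt="What's 2 + 2?"))

# Several requests sharing one event loop.
batch = asyncio.run(generate_many(llm, ["a", "b"]))

print(single["generation"])
print([item["generation"] for item in batch])
```

`asyncio.run` starts a fresh event loop on each call, so it fits scripts like the one above; code that is already running inside an event loop would `await` the coroutine directly instead.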

@@ -69,6 +70,7 @@ Click on :material-plus-circle: symbols in the snippet below to learn more detai
```python
from nemo_skills.inference.model import get_model
from nemo_skills.prompt.utils import get_prompt
import asyncio

llm = get_model( # (1)!
server_type="openai", # NIM models are using OpenAI API
@@ -80,7 +82,7 @@ Click on :material-plus-circle: symbols in the snippet below to learn more detai
prompt = prompt_obj.fill({'question': "What's 2 + 2?"})

print(prompt) # (3)!
output = llm.generate_sync(prompt=prompt)
output = asyncio.run(llm.generate_async(prompt=prompt))
print(output["generation"]) # (4)!
```

2 changes: 1 addition & 1 deletion docs/basics/prompt-format.md
@@ -108,7 +108,7 @@ which outputs

#### Example 2 - Prompt formatted as a string

If you want to use completions API, you can set `++use_completions_api=True`. This will use model's tokenizer to format
If you want to use completions API, you can set `++inference.endpoint_type=text`. This will use model's tokenizer to format
messages as a string (you can specify a custom tokenizer with `++tokenizer=...` argument).

Here is an example of the input to completions api
9 changes: 9 additions & 0 deletions docs/evaluation/code.md
@@ -260,6 +260,13 @@ Due to variance between runs, you can automatically repeat the evaluation and av
--benchmarks=livecodebench:3
```

### livecodebench-cpp

- Benchmark is defined in [`nemo_skills/dataset/livecodebench-cpp/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/livecodebench-cpp/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/nvidia/LiveCodeBench-CPP).
- Data preparation and evaluation: you can prepare the dataset by running `ns prepare_data livecodebench-cpp`. The command will generate two dataset splits: `v5_2408_2501.jsonl` and `v6_2408_2505.jsonl`. When evaluating, make sure to target the C++ benchmark entrypoint (`--benchmarks=livecodebench-cpp`) and set `--split` to either `v5_2408_2501` or `v6_2408_2505`. The remaining flags mirror the livecodebench instructions above.

### livecodebench-pro

- Benchmark is defined in [`nemo_skills/dataset/livecodebench-pro/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/livecodebench-pro/__init__.py)
@@ -317,6 +324,8 @@ ns eval \
--split=test_python \
--data_dir=<DATA_DIR> \
--output_dir=<OUTPUT_DIR> \
--with_sandbox \
--keep_mounts_for_sandbox \
++inference.temperature=0.6 \
++inference.top_p=0.95 \
++inference.tokens_to_generate=32768
4 changes: 2 additions & 2 deletions docs/evaluation/index.md
@@ -9,7 +9,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f
- [**Instruction following**](./instruction-following.md): e.g. [ifbench](./instruction-following.md#ifbench), [ifeval](./instruction-following.md#ifeval)
- [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr)
- [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3)
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox)
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#FLORES-200), [wmt24pp](./multilingual.md#wmt24pp)

See [nemo_skills/dataset](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.

@@ -246,4 +246,4 @@ To create a new benchmark follow this process:
prompt config in `GENERATION_ARGS` and evaluation / metric parameters. But if extra customization is needed for the generation, you can provide
a fully custom generation module. See [scicode](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/scicode/__init__.py) or [swe-bench](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py) for examples of this.
4. Create a new [evaluation class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/evaluator/__init__.py) (if cannot re-use existing one).
5. Create a new [metrics class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) ( if cannot re-use existing one).
5. Create a new [metrics class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) ( if cannot re-use existing one).
2 changes: 1 addition & 1 deletion docs/evaluation/long-context.md
@@ -49,4 +49,4 @@ ns eval \
The results, including per-category scores, are stored in metrics.json. Detailed breakdowns by category and sequence length are also available via
```
ns summarize_results --cluster=<cluster_config> <folder_of_output_json>
```
```
152 changes: 149 additions & 3 deletions docs/evaluation/multilingual.md
@@ -1,6 +1,6 @@
# Multilingual

Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation (to be added).
Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation.

All benchmarks in this category will have an extra `--language` argument with its associated `ns prepare` command, which allows you to choose which language(s) of the benchmark to run.
Once prepared, the `ns eval` command will run on all languages prepared, and the summarized results generated with `ns eval` will include per-language breakdowns.
@@ -9,7 +9,7 @@ Once prepared, the `ns eval` command will run on all languages prepared, and the

### mmlu-prox

- Benchmark is defined in [`nemo_skills/dataset/mmlu-pro/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
- Benchmark is defined in [`nemo_skills/dataset/mmlu-prox/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/li-lab/MMLU-ProX).

Our evaluation template and answer extraction mechanism tries to match the configuration in [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu_prox).
@@ -68,4 +68,150 @@ Some reference numbers for reference and commands for reproduction:
++inference.temperature=0.6 \
++inference.top_k=20 \
++inference.tokens_to_generate=38912
```
```

### FLORES-200

- Benchmark is defined in [`nemo_skills/dataset/flores200/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/flores200/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openlanguagedata/flores_plus).

Some reference numbers for devtest split (xx corresponds to average over 5 languages: de, es, fr, it, ja):

| Model | en->xx | xx->en | xx->xx |
|:-----------------------|------:|------:|------:|
| Nemotron-NanoV2-9B-v2 | 32.5 | 34 | 25.9 |
| Qwen3-8B | 31.5 | 34.6 | 25.7 |
| Qwen3-30B-A3B | 33.3 | 35.5 | 27.1 |
| gpt-oss-20B | 32.4 | 34.1 | 25 |

=== "Nemotron-NanoV2-9B-v2"

```bash
ns eval \
--cluster=[cluster] \
--model=NVIDIA/Nemotron-Nano-9B-v2 \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++system_message='/no_think'
```

=== "Qwen3-8B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-8B \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "Qwen3-30B-A3B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-30B-A3B \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "gpt-oss-20B"

```bash
ns eval \
--cluster=[cluster] \
--model=openai/gpt-oss-20b \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=2048
```

### wmt24pp

- Benchmark is defined in [`nemo_skills/dataset/wmt24pp/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/wmt24pp/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/google/wmt24pp).

Some reference numbers for test split (xx corresponds to average over 5 languages: de, es, fr, it, ja):

| Model | en->de | en->es | en->fr | en->it | en->ja | en->xx |
|:-----------------------|------:|------:|------:|------:|------:|------:|
| Nemotron-NanoV2-9B-v2 | 25.3 | 37.7 | 33.4 | 33.8 | 20.9 | 30.2 |
| Qwen3-8B | 26.2 | 38.5 | 33.1 | 33.1 | 21.7 | 30.5 |
| Qwen3-30B-A3B | 28.5 | 40 | 35.1 | 36 | 23.2 | 32.5 |
| gpt-oss-20B | 27.3 | 42.3 | 32.8 | 34.9 | 25.2 | 32.5 |
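
The `en->xx` column is described above as an average over the five target languages. A minimal sketch of that arithmetic, assuming a simple unweighted mean (the `macro_average` helper is hypothetical and is not taken from the NeMo-Skills metrics code), using the rounded Nemotron-NanoV2-9B-v2 row:

```python
# Per-language en->xx scores copied from the Nemotron-NanoV2-9B-v2 row of the table.
scores = {"de": 25.3, "es": 37.7, "fr": 33.4, "it": 33.8, "ja": 20.9}


def macro_average(per_language: dict[str, float]) -> float:
    # Unweighted mean: every language counts equally (an assumption of this sketch).
    return sum(per_language.values()) / len(per_language)


average = macro_average(scores)
print(round(average, 1))
```

Averaging the rounded table entries will not always reproduce the reported column exactly (the Qwen3-30B-A3B row averages to about 32.56 against a reported 32.5), presumably because the reported value was computed from unrounded scores.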

=== "Nemotron-NanoV2-9B-v2"

```bash
ns eval \
--cluster=[cluster] \
--model=NVIDIA/Nemotron-Nano-9B-v2 \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++system_message='/no_think'
```

=== "Qwen3-8B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-8B \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "Qwen3-30B-A3B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-30B-A3B \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "gpt-oss-20B"

```bash
ns eval \
--cluster=[cluster] \
--model=openai/gpt-oss-20b \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=2048
```
3 changes: 2 additions & 1 deletion docs/index.md
@@ -21,7 +21,8 @@ Here are some of the features we support:
- [**Instruction following**](./evaluation/instruction-following.md): e.g. [ifbench](./evaluation/instruction-following.md#ifbench), [ifeval](./evaluation/instruction-following.md#ifeval)
- [**Long-context**](./evaluation/long-context.md): e.g. [ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr)
- [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3)
- [**Robustness Evaluation**](./evaluation/robustness.md): Evaluate model sensitivity against changes in prompt.
- [**Multilingual capabilities**](./evaluation/multilingual.md): e.g. [mmlu-prox](./evaluation/multilingual.md#mmlu-prox), [flores-200](./evaluation/multilingual.md#FLORES-200), [wmt24pp](./evaluation/multilingual.md#wmt24pp)
- [**Robustness evaluation**](./evaluation/robustness.md): Evaluate model sensitivity against changes in prompt.
- Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
- [Model training](pipelines/training.md): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).
