NVIDIA-NeMo · wasiahmad · Oct 13, 2025 · Oct 3, 2025
diff --git a/.github/workflows/copyright-check.yml b/.github/workflows/copyright-check.yml
@@ -0,0 +1,21 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+name: Copyright check
+
+on:
+  pull_request:
+
+jobs:
+  copyright-check:
+    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_copyright_check.yml@v0.2.0
diff --git a/README.md b/README.md
@@ -17,7 +17,7 @@ Here are some of the features we support:
     - [**Instruction following**](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following): e.g. [ifbench](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifbench), [ifeval](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifeval)
     - [**Long-context**](https://nvidia.github.io/NeMo-Skills/evaluation/long-context): e.g. [ruler](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#ruler), [mrcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#mrcr), [aalcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#aalcr)
     - [**Tool-calling**](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling): e.g. [bfcl_v3](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling/#bfcl_v3)
-    - [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox)
+    - [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox), [FLORES-200](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#flores-200), [wmt24pp](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#wmt24pp)
   - Easily parallelize each evaluation across many slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
 - [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).
 

diff --git a/docs/basics/inference.md b/docs/basics/inference.md
@@ -34,12 +34,13 @@ Click on :material-plus-circle: symbols in the snippet below to learn more detai
     ```python
     from nemo_skills.inference.model import get_model
     from nemo_skills.prompt.utils import get_prompt
+    import asyncio
 
     llm = get_model(model="meta-llama/Llama-3.1-8B-Instruct", server_type="vllm")  # localhost by default
     prompt_obj = get_prompt('generic/default') # (1)!
     prompt = prompt_obj.fill({'question': "What's 2 + 2?"})
     print(prompt) # (2)!
-    output = llm.generate_sync(prompt=prompt)
+    output = asyncio.run(llm.generate_async(prompt=prompt))
     print(output["generation"]) # (3)!
     ```
 
@@ -69,6 +70,7 @@ Click on :material-plus-circle: symbols in the snippet below to learn more detai
     ```python
     from nemo_skills.inference.model import get_model
     from nemo_skills.prompt.utils import get_prompt
+    import asyncio
 
     llm = get_model( # (1)!
         server_type="openai",  # NIM models are using OpenAI API
@@ -80,7 +82,7 @@ Click on :material-plus-circle: symbols in the snippet below to learn more detai
     prompt = prompt_obj.fill({'question': "What's 2 + 2?"})
 
     print(prompt) # (3)!
-    output = llm.generate_sync(prompt=prompt)
+    output = asyncio.run(llm.generate_async(prompt=prompt))
     print(output["generation"]) # (4)!
     ```
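
Review note on the `generate_sync` to `generate_async` change: `asyncio.run` drives a single coroutine to completion, so the updated snippets still issue one request at a time. Below is a minimal sketch of how the same pattern could fan out to several prompts with `asyncio.gather`. `fake_generate_async` and `generate_many` are hypothetical names, not part of NeMo-Skills; the stub assumes only that the real `llm.generate_async` is awaitable and returns a dict with a `"generation"` key, as the snippets above suggest, so the example runs without a server.

```python
import asyncio


async def fake_generate_async(prompt: str) -> dict:
    """Hypothetical stand-in for llm.generate_async; echoes the prompt back."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"generation": f"echo: {prompt}"}


async def generate_many(prompts: list[str]) -> list[dict]:
    # gather schedules all coroutines concurrently and preserves input order
    return await asyncio.gather(*(fake_generate_async(p) for p in prompts))


# asyncio.run creates an event loop, runs the coroutine, and closes the loop
outputs = asyncio.run(generate_many(["What's 2 + 2?", "What's 3 + 3?"]))
generations = [out["generation"] for out in outputs]
print(generations)
```

Note that `asyncio.run` raises a `RuntimeError` when called from a thread that already has a running event loop (for example inside a Jupyter notebook); in that setting, awaiting the coroutine directly is the usual approach.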
 

diff --git a/docs/basics/prompt-format.md b/docs/basics/prompt-format.md
@@ -108,7 +108,7 @@ which outputs
 
 #### Example 2 - Prompt formatted as a string
 
-If you want to use completions API, you can set `++use_completions_api=True`. This will use model's tokenizer to format
+If you want to use the completions API, you can set `++inference.endpoint_type=text`. This will use the model's tokenizer to format
 messages as a string (you can specify a custom tokenizer with `++tokenizer=...` argument).
 
 Here is an example of the input to completions api

diff --git a/docs/evaluation/code.md b/docs/evaluation/code.md
@@ -260,6 +260,13 @@ Due to variance between runs, you can automatically repeat the evaluation and av
 --benchmarks=livecodebench:3
 ```
 
+### livecodebench-cpp
+
+- Benchmark is defined in [`nemo_skills/dataset/livecodebench-cpp/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/livecodebench-cpp/__init__.py)
+- Original benchmark source is [here](https://huggingface.co/datasets/nvidia/LiveCodeBench-CPP).
+- Data preparation and evaluation: you can prepare the dataset by running `ns prepare_data livecodebench-cpp`. The command will generate two dataset splits: `v5_2408_2501.jsonl` and `v6_2408_2505.jsonl`. When evaluating, make sure to target the C++ benchmark entrypoint (`--benchmarks=livecodebench-cpp`) and set `--split` to either `v5_2408_2501` or `v6_2408_2505`. The remaining flags mirror the livecodebench instructions above.
+
+
 ### livecodebench-pro
 
 - Benchmark is defined in [`nemo_skills/dataset/livecodebench-pro/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/livecodebench-pro/__init__.py)
@@ -317,6 +324,8 @@ ns eval \
     --split=test_python \
     --data_dir=<DATA_DIR> \
     --output_dir=<OUTPUT_DIR> \
+    --with_sandbox \
+    --keep_mounts_for_sandbox \
     ++inference.temperature=0.6 \
     ++inference.top_p=0.95 \
     ++inference.tokens_to_generate=32768
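
Review note on the new livecodebench-cpp section: the two split names are easy to mistype, so a small wrapper that validates them before launching may help. This is a minimal sketch, not part of NeMo-Skills; `build_lcb_cpp_eval_args` is a hypothetical helper, and it assembles only the flags named in the section (`--benchmarks`, `--split`) plus `--output_dir`, assuming the remaining flags mirror the livecodebench instructions.

```python
import shlex

# The two split names produced by `ns prepare_data livecodebench-cpp`.
LCB_CPP_SPLITS = ("v5_2408_2501", "v6_2408_2505")


def build_lcb_cpp_eval_args(split: str, output_dir: str) -> list[str]:
    """Hypothetical helper: assemble the benchmark-specific part of an `ns eval` call."""
    if split not in LCB_CPP_SPLITS:
        raise ValueError(f"split must be one of {LCB_CPP_SPLITS}, got {split!r}")
    return [
        "ns",
        "eval",
        "--benchmarks=livecodebench-cpp",
        f"--split={split}",
        f"--output_dir={output_dir}",
    ]


args = build_lcb_cpp_eval_args("v6_2408_2505", "<OUTPUT_DIR>")
# shlex.join quotes arguments so the result can be pasted into a shell
print(shlex.join(args))
```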

diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md
@@ -9,7 +9,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f
 - [**Instruction following**](./instruction-following.md): e.g. [ifbench](./instruction-following.md#ifbench), [ifeval](./instruction-following.md#ifeval)
 - [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr)
 - [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3)
-- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox)
+- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#flores-200), [wmt24pp](./multilingual.md#wmt24pp)
 
 See [nemo_skills/dataset](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.
 
@@ -246,4 +246,4 @@ To create a new benchmark follow this process:
    prompt config in `GENERATION_ARGS` and evaluation / metric parameters. But if extra customization is needed for the generation, you can provide
    a fully custom generation module. See [scicode](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/scicode/__init__.py) or [swe-bench](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py) for examples of this.
 4. Create a new [evaluation class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/evaluator/__init__.py) (if cannot re-use existing one).
-5. Create a new [metrics class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) ( if cannot re-use existing one).
+5. Create a new [metrics class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) ( if cannot re-use existing one).
diff --git a/docs/evaluation/long-context.md b/docs/evaluation/long-context.md
@@ -49,4 +49,4 @@ ns eval \
 The results, including per-category scores, are stored in metrics.json. Detailed breakdowns by category and sequence length are also available via
 ```
 ns summarize_results --cluster=<cluster_config> <folder_of_output_json>
-```
+```
diff --git a/docs/evaluation/multilingual.md b/docs/evaluation/multilingual.md
@@ -1,6 +1,6 @@
 # Multilingual
 
-Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation (to be added).
+Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation.
 
 All benchmarks in this category will have an extra `--language` argument with its associated `ns prepare` command, which allows you to choose which language(s) of the benchmark to run.
 Once prepared, the `ns eval` command will run on all languages prepared, and the summarized results generated with `ns eval` will include per-language breakdowns.
@@ -9,7 +9,7 @@ Once prepared, the `ns eval` command will run on all languages prepared, and the
 
 ### mmlu-prox
 
-- Benchmark is defined in [`nemo_skills/dataset/mmlu-pro/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
+- Benchmark is defined in [`nemo_skills/dataset/mmlu-prox/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
 - Original benchmark source is [here](https://huggingface.co/datasets/li-lab/MMLU-ProX).
 
 Our evaluation template and answer extraction mechanism tries to match the configration in [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu_prox).
@@ -68,4 +68,150 @@ Some reference numbers for reference and commands for reproduction:
         ++inference.temperature=0.6 \
         ++inference.top_k=20 \
         ++inference.tokens_to_generate=38912
-    ```
+    ```
+
+### FLORES-200
+
+- Benchmark is defined in [`nemo_skills/dataset/flores200/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/flores200/__init__.py)
+- Original benchmark source is [here](https://huggingface.co/datasets/openlanguagedata/flores_plus).
+
+Some reference numbers for devtest split (xx corresponds to average over 5 languages: de, es, fr, it, ja):
+
+| Model                  | en->xx | xx->en | xx->xx |
+|:-----------------------|-------:|-------:|-------:|
+| Nemotron-NanoV2-9B-v2  |   32.5 |     34 |   25.9 |
+| Qwen3-8B               |   31.5 |   34.6 |   25.7 |
+| Qwen3-30B-A3B          |   33.3 |   35.5 |   27.1 |
+| gpt-oss-20B            |   32.4 |   34.1 |     25 |
+
+=== "Nemotron-NanoV2-9B-v2"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=NVIDIA/Nemotron-Nano-9B-v2 \
+        --benchmarks flores200 \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=devtest \
+        ++inference.tokens_to_generate=512 \
+        ++system_message='/no_think'
+    ```
+
+=== "Qwen3-8B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=Qwen/Qwen3-8B \
+        --benchmarks flores200 \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=devtest \
+        ++inference.tokens_to_generate=512 \
+        ++prompt_suffix='/no_think'
+    ```
+
+=== "Qwen3-30B-A3B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=Qwen/Qwen3-30B-A3B \
+        --benchmarks flores200 \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=devtest \
+        ++inference.tokens_to_generate=512 \
+        ++prompt_suffix='/no_think'
+    ```
+
+=== "gpt-oss-20B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=openai/gpt-oss-20b \
+        --benchmarks flores200 \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=devtest \
+        ++inference.tokens_to_generate=2048
+    ```
+
+### wmt24pp
+
+- Benchmark is defined in [`nemo_skills/dataset/wmt24pp/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/wmt24pp/__init__.py)
+- Original benchmark source is [here](https://huggingface.co/datasets/google/wmt24pp).
+
+Some reference numbers for test split (xx corresponds to average over 5 languages: de, es, fr, it, ja):
+
+| Model                  | en->de | en->es | en->fr | en->it | en->ja | en->xx |
+|:-----------------------|-------:|-------:|-------:|-------:|-------:|-------:|
+| Nemotron-NanoV2-9B-v2  |   25.3 |   37.7 |   33.4 |   33.8 |   20.9 |   30.2 |
+| Qwen3-8B               |   26.2 |   38.5 |   33.1 |   33.1 |   21.7 |   30.5 |
+| Qwen3-30B-A3B          |   28.5 |     40 |   35.1 |     36 |   23.2 |   32.5 |
+| gpt-oss-20B            |   27.3 |   42.3 |   32.8 |   34.9 |   25.2 |   32.5 |
+
+=== "Nemotron-NanoV2-9B-v2"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=NVIDIA/Nemotron-Nano-9B-v2 \
+        --benchmarks wmt24pp \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=test \
+        ++inference.tokens_to_generate=512 \
+        ++system_message='/no_think'
+    ```
+
+=== "Qwen3-8B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=Qwen/Qwen3-8B \
+        --benchmarks wmt24pp \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=test \
+        ++inference.tokens_to_generate=512 \
+        ++prompt_suffix='/no_think'
+    ```
+
+=== "Qwen3-30B-A3B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=Qwen/Qwen3-30B-A3B \
+        --benchmarks wmt24pp \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=test \
+        ++inference.tokens_to_generate=512 \
+        ++prompt_suffix='/no_think'
+    ```
+
+=== "gpt-oss-20B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=openai/gpt-oss-20b \
+        --benchmarks wmt24pp \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=test \
+        ++inference.tokens_to_generate=2048
+    ```
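
Review note on the wmt24pp reference table: the `en->xx` column is described as an average over the five target languages. The sketch below recomputes it from the per-language entries, assuming an unweighted mean (the diff does not state the weighting). The scores are copied from the table above; nothing else is taken from NeMo-Skills.

```python
# Per-language scores copied from the wmt24pp table (order: de, es, fr, it, ja).
scores = {
    "Nemotron-NanoV2-9B-v2": [25.3, 37.7, 33.4, 33.8, 20.9],
    "Qwen3-8B": [26.2, 38.5, 33.1, 33.1, 21.7],
    "Qwen3-30B-A3B": [28.5, 40.0, 35.1, 36.0, 23.2],
    "gpt-oss-20B": [27.3, 42.3, 32.8, 34.9, 25.2],
}

# en->xx values as reported in the same table.
reported = {
    "Nemotron-NanoV2-9B-v2": 30.2,
    "Qwen3-8B": 30.5,
    "Qwen3-30B-A3B": 32.5,
    "gpt-oss-20B": 32.5,
}

# Unweighted mean over the five languages (an assumption, see the note above).
means = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, mean in means.items():
    print(f"{model}: recomputed {mean:.2f}, reported {reported[model]}")
```

A recomputed value can differ from the reported one in the last digit, presumably because the reported average was taken over unrounded per-language scores rather than the rounded table entries.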
diff --git a/docs/index.md b/docs/index.md
@@ -21,7 +21,8 @@ Here are some of the features we support:
         - [**Instruction following**](./evaluation/instruction-following.md): e.g. [ifbench](./evaluation/instruction-following.md#ifbench), [ifeval](./evaluation/instruction-following.md#ifeval)
         - [**Long-context**](./evaluation/long-context.md): e.g. [ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr)
         - [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3)
-        - [**Robustness Evaluation**](./evaluation/robustness.md): Evaluate model sensitvity against changes in prompt.
+        - [**Multilingual capabilities**](./evaluation/multilingual.md): e.g. [mmlu-prox](./evaluation/multilingual.md#mmlu-prox), [flores-200](./evaluation/multilingual.md#flores-200), [wmt24pp](./evaluation/multilingual.md#wmt24pp)
+        - [**Robustness evaluation**](./evaluation/robustness.md): Evaluate model sensitivity to changes in the prompt.
     - Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
 - [Model training](pipelines/training.md): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).