Commit fc8f391

Kipok and titu1994 authored
Adding ruler in the new interface (#510)
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: smajumdar <titu1994@gmail.com>
1 parent 0cac08c commit fc8f391

27 files changed

Lines changed: 509 additions & 64 deletions

.github/workflows/gpu_tests.yml

Lines changed: 12 additions & 9 deletions
@@ -41,18 +41,19 @@ jobs:
           pip install -r requirements/common-tests.txt
           ns prepare_data gsm8k human-eval mbpp algebra222 mmlu ifeval
       - name: Run GPU tests
-        timeout-minutes: 120
+        timeout-minutes: 180
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
         run: |
           cd ${{ github.run_id }}
           nvidia-smi
           set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
           ./tests/gpu-tests/run_llama.sh
-      - name: Cleanup test directory
+      - name: Cleanup
         if: always()
         run: |
-          docker run --rm -v /tmp:/tmp -v /home:/home igitman/nemo-skills:0.6.0 bash -c 'rm -rf /tmp/nemo-skills-tests/mistral_emb /home/azureuser/.nemo_run/'
+          docker run --rm -v /tmp:/tmp -v /home:/home igitman/nemo-skills:0.6.1 bash -c 'rm -rf /tmp/nemo-skills-tests /home/azureuser/.nemo_run/'
+          docker ps -a -q | xargs -r docker stop
 
   gpu-tests-qwen:
     runs-on: self-hosted-nemo-gpus-1
@@ -79,18 +80,19 @@ jobs:
           pip install -r requirements/common-tests.txt
           ns prepare_data gsm8k human-eval mbpp algebra222 mmlu ifeval
       - name: Run GPU tests
-        timeout-minutes: 120
+        timeout-minutes: 180
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
         run: |
           cd ${{ github.run_id }}
           nvidia-smi
           set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
           ./tests/gpu-tests/run_qwen.sh
-      - name: Cleanup test directory
+      - name: Cleanup
         if: always()
         run: |
-          docker run --rm -v /tmp:/tmp -v /home:/home igitman/nemo-skills:0.6.0 bash -c 'rm -rf /tmp/nemo-skills-tests/mistral_emb /home/azureuser/.nemo_run/'
+          docker run --rm -v /tmp:/tmp -v /home:/home igitman/nemo-skills:0.6.1 bash -c 'rm -rf /tmp/nemo-skills-tests /home/azureuser/.nemo_run/'
+          docker ps -a -q | xargs -r docker stop
 
   gpu-tests-rm:
     runs-on: self-hosted-nemo-gpus-1
@@ -114,15 +116,16 @@ jobs:
           pip install -e .
           pip install -r requirements/common-tests.txt
       - name: Run GPU tests
-        timeout-minutes: 120
+        timeout-minutes: 180
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
         run: |
           cd ${{ github.run_id }}
           nvidia-smi
           set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
           ./tests/gpu-tests/run_rm.sh
-      - name: Cleanup test directory
+      - name: Cleanup
         if: always()
         run: |
-          docker run --rm -v /tmp:/tmp -v /home:/home igitman/nemo-skills:0.6.0 bash -c 'rm -rf /tmp/nemo-skills-tests/mistral_emb /home/azureuser/.nemo_run/'
+          docker run --rm -v /tmp:/tmp -v /home:/home igitman/nemo-skills:0.6.1 bash -c 'rm -rf /tmp/nemo-skills-tests /home/azureuser/.nemo_run/'
+          docker ps -a -q | xargs -r docker stop

.gitignore

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -33,4 +33,6 @@ __pycache__
3333
.ipynb_checkpoints
3434

3535
cluster_configs/*
36-
!cluster_configs/example-*.yaml
36+
!cluster_configs/example-*.yaml
37+
38+
nemo_skills/dataset/ruler/*/

README.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ Here are some of the things we support.
1212
- Coding skills: human-eval, mbpp
1313
- Chat/instruction following: ifeval, arena-hard, mt-bench
1414
- General knowledge: mmlu, mmlu-pro, gpqa
15+
- Long context: RULER
1516
- [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models at speed-of-light using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/).
1617

1718
You can find the full documentation [here](https://nvidia.github.io/NeMo-Skills/).
@@ -24,12 +25,12 @@ commands and their options.
2425
Using our pipelines we created [OpenMathReasoning dataset](https://huggingface.co/datasets/nvidia/OpenMathReasoning).
2526
This dataset contains
2627

27-
* 306K unique mathematical problems sourced from [AoPS forums](https://artofproblemsolving.com/community) with:
28+
* 306K unique mathematical problems sourced from [AoPS forums](https://artofproblemsolving.com/community) with:
2829
* 3.2M long chain-of-thought (CoT) solutions
2930
* 1.7M long tool-integrated reasoning (TIR) solutions
3031
* 566K samples that select the most promising solution out of many candidates (GenSelect)
3132
* Additional 193K problems sourced from AoPS forums (problems only, no solutions)
32-
33+
3334
We used [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) to preprocess problems, and
3435
[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) and [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) to generate solutions.
3536

cluster_configs/example-local.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,9 +21,9 @@ containers:
2121
nemo: igitman/nemo-skills-nemo:0.6.0
2222
megatron: igitman/nemo-skills-megatron:0.6.0
2323
sandbox: igitman/nemo-skills-sandbox:0.6.1
24-
nemo-skills: igitman/nemo-skills:0.6.0
24+
nemo-skills: igitman/nemo-skills:0.6.1
2525
verl: igitman/nemo-skills-verl:0.6.1
26-
nemo-rl: igitman/nemo-skills-nemo-rl:0.6.0
26+
nemo-rl: igitman/nemo-skills-nemo-rl:0.6.1
2727

2828
# add required mounts for models/data here
2929
# the code is mounted automatically inside /nemo_run/code

cluster_configs/example-slurm.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,9 +21,9 @@ containers:
2121
nemo: igitman/nemo-skills-nemo:0.6.0
2222
megatron: igitman/nemo-skills-megatron:0.6.0
2323
sandbox: igitman/nemo-skills-sandbox:0.6.1
24-
nemo-skills: igitman/nemo-skills:0.6.0
24+
nemo-skills: igitman/nemo-skills:0.6.1
2525
verl: igitman/nemo-skills-verl:0.6.1
26-
nemo-rl: igitman/nemo-skills-nemo-rl:0.6.0
26+
nemo-rl: igitman/nemo-skills-nemo-rl:0.6.1
2727

2828
job_name_prefix: "nemo_skills:"
2929

dockerfiles/Dockerfile.nemo-skills

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
FROM python:3.10
22

3-
RUN apt-get update && apt-get -y install curl git
3+
RUN apt-get update && apt-get -y install curl git git-lfs
44

55
# for ifeval benchmark
66
# TODO: can we get just a single dir?

docs/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ Here are some of the things we support.
1616
- Coding skills: human-eval, mbpp
1717
- Chat/instruction following: ifeval, arena-hard, mt-bench
1818
- General knowledge: mmlu, mmlu-pro, gpqa
19+
- Long context: RULER
1920
- [Model training](pipelines/training.md): Train models at speed-of-light using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/).
2021

2122
To get started, follow this [tutorial](basics/index.md), browse available [pipelines](./pipelines/index.md) or run `ns --help` to see all available

docs/pipelines/evaluation.md

Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,7 @@ We support many popular benchmarks and it's easy to add new in the future. E.g.
1414
- Coding skills: human-eval, mbpp
1515
- Chat/instruction following: ifeval, arena-hard, mt-bench
1616
- General knowledge: mmlu, mmlu-pro, gpqa
17+
- Long context: RULER
1718

1819
See [nemo_skills/dataset](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.
1920

@@ -44,6 +45,13 @@ and if you installed from pip, they will be downloaded to wherever the repo is i
4445
python -c "import nemo_skills; print(nemo_skills.__path__)"
4546
```
4647

48+
Some benchmarks (e.g. ruler) require extra parameters to be passed to the prepare_data script. Thus you'd need to explicitly
49+
call `ns prepare_data` for all of them, e.g. for ruler you can use
50+
51+
```bash
52+
ns prepare_data ruler --setup=llama_128k --tokenizer_path=meta-llama/Llama-3.1-8B-Instruct --max_seq_length=131072
53+
```
54+
4755
## Greedy decoding
4856

4957
```bash
@@ -121,6 +129,65 @@ pass@4 | 164 | 78.66 | 72.56
121129

122130
If you want to get both multiple samples and greedy results, use `--add_greedy` parameter.
123131

132+
133+
## Using data on cluster
134+
135+
Some benchmarks (e.g. ruler) have very large input datasets and it's inefficient to prepare them on local machine and
136+
keep uploading on cluster with every evaluation job. Instead, you can prepare them on cluster directly. To do that,
137+
run prepare_data command with `--data_dir` and `--cluster` options, e.g.
138+
139+
```bash
140+
ns prepare_data \
141+
--data_dir=/workspace/ns-data \
142+
--cluster=slurm \
143+
ruler --setup llama_128k --tokenizer_path meta-llama/Llama-3.1-8B-Instruct --max_seq_length 130900
144+
```
145+
146+
Then during evaluation, you'd need to provide the same `data_dir` argument and it will read the data from cluster
147+
directly. You can also use `NEMO_SKILLS_DATA_DIR` environment variable instead of an explicit argument.
148+
149+
Here is an example evaluation command for ruler that uses data_dir parameter
150+
151+
```python
152+
from nemo_skills.pipeline.cli import eval, run_cmd, wrap_arguments
153+
154+
tasks = [
155+
"niah_single_1", "niah_single_2","niah_single_3",
156+
"niah_multikey_1", "niah_multikey_2", "niah_multikey_3",
157+
"niah_multivalue", "niah_multiquery",
158+
"vt", "cwe", "fwe", "qa_1", "qa_2",
159+
]
160+
benchmarks = ",".join([f"ruler.llama_128k.{task}:0" for task in tasks])
161+
162+
eval(
163+
# using a low number of concurrent requests since it's almost entirely prefill stage
164+
ctx=wrap_arguments("++max_concurrent_requests=32"),
165+
cluster="slurm",
166+
model="/hf_models/Meta-Llama-3.1-8B-Instruct",
167+
server_type="sglang",
168+
output_dir="/workspace/eval-ruler",
169+
data_dir="/workspace/ns-data",
170+
benchmarks=benchmarks,
171+
server_gpus=8,
172+
expname="eval-ruler",
173+
)
174+
175+
# running summarize results on the cluster as well to avoid downloading the data
176+
# you can find results in /workspace/eval-ruler/eval-results/metrics.json
177+
# or add --wandb_name parameter to log to W&B
178+
cmd = (
179+
"python -m nemo_skills.pipeline.summarize_results "
180+
" --data_dir /workspace/ns-data /workspace/eval-ruler/eval-results "
181+
)
182+
run_cmd(
183+
ctx=wrap_arguments(cmd),
184+
cluster="slurm",
185+
log_dir="/workspace/eval-ruler/eval-results/summarize_results",
186+
expname="summarize-results",
187+
run_after="eval-ruler",
188+
)
189+
```
190+
124191
## How the benchmarks are defined
125192

126193
Each benchmark exists as a separate folder inside
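
A small note on the evaluation example added in this file: the `benchmarks` argument is a single comma-separated string built from the task names, and the docs say `NEMO_SKILLS_DATA_DIR` can stand in for an explicit `data_dir`. The sketch below only illustrates those two points in isolation. `resolve_data_dir` is a hypothetical helper, and the precedence it uses (explicit argument first, then the environment variable) is an assumption rather than something this commit states.

```python
import os

# A subset of the RULER task names listed in the docs example.
tasks = ["niah_single_1", "vt", "qa_1"]

# Same construction as in the docs: one "ruler.<setup>.<task>:0" entry per task.
benchmarks = ",".join(f"ruler.llama_128k.{task}:0" for task in tasks)
print(benchmarks)


def resolve_data_dir(explicit=None, env=os.environ):
    """Hypothetical helper: prefer the explicit argument, else the env variable.

    The precedence order is an assumption made for this sketch.
    """
    return explicit or env.get("NEMO_SKILLS_DATA_DIR")


# Passing a plain dict as `env` keeps the example independent of the real environment.
print(resolve_data_dir("/workspace/ns-data", {}))
print(resolve_data_dir(None, {"NEMO_SKILLS_DATA_DIR": "/workspace/ns-data"}))
```

Whichever way the directory is supplied, the docs require it to match the `--data_dir` used when preparing the data on the cluster.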

nemo_skills/__init__.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@
2222
'nemo': 'igitman/nemo-skills-nemo:0.6.0',
2323
'megatron': 'igitman/nemo-skills-megatron:0.6.0',
2424
'sandbox': 'igitman/nemo-skills-sandbox:0.6.1',
25-
'nemo-skills': 'igitman/nemo-skills:0.6.0',
25+
'nemo-skills': 'igitman/nemo-skills:0.6.1',
2626
'verl': 'igitman/nemo-skills-verl:0.6.1',
27-
'nemo-rl': 'igitman/nemo-skills-nemo-rl:0.6.0',
27+
'nemo-rl': 'igitman/nemo-skills-nemo-rl:0.6.1',
2828
}

nemo_skills/dataset/prepare.py

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
from nemo_skills.dataset.utils import add_header_to_jsonl_inplace, get_lean4_header
2121

2222

23-
def prepare_datasets(datasets=None, dataset_groups=None, add_lean4_header=False):
23+
def prepare_datasets(datasets=None, dataset_groups=None, add_lean4_header=False, extra_args=""):
2424
if datasets and dataset_groups:
2525
raise ValueError("Cannot specify both datasets and dataset_groups")
2626

@@ -41,7 +41,7 @@ def prepare_datasets(datasets=None, dataset_groups=None, add_lean4_header=False)
4141
for dataset in datasets:
4242
print(f"Preparing {dataset}")
4343
dataset_path = datasets_dir / dataset
44-
subprocess.run(f"{sys.executable} {dataset_path / 'prepare.py'}", shell=True, check=True)
44+
subprocess.run(f"{sys.executable} {dataset_path / 'prepare.py'} {extra_args}", shell=True, check=True)
4545
dataset_module = importlib.import_module(f"nemo_skills.dataset.{dataset}")
4646

4747
if dataset_module.DATASET_GROUP == "math":
@@ -62,12 +62,13 @@ def prepare_datasets(datasets=None, dataset_groups=None, add_lean4_header=False)
6262
'--dataset_groups',
6363
default=[],
6464
nargs="*",
65-
choices=["math", "code", "chat", "multichoice"],
65+
choices=["math", "code", "chat", "multichoice", "long-context"],
6666
help='Can specify a dataset groups here',
6767
)
6868
parser.add_argument(
6969
'--add_lean4_header', action='store_true', help='Add Lean4 header to JSONL files during preparation'
7070
)
71-
args = parser.parse_args()
71+
args, unknown = parser.parse_known_args()
72+
extra_args = " ".join(unknown)
7273

73-
prepare_datasets(args.datasets, args.dataset_groups, args.add_lean4_header)
74+
prepare_datasets(args.datasets, args.dataset_groups, args.add_lean4_header, extra_args=extra_args)
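
The change to `prepare.py` forwards every argument the top-level parser does not recognize to each dataset's own `prepare.py`. A minimal, self-contained sketch of that pattern is below. The two parser options mirror ones visible in the diff, while `split_known_and_extra` is a hypothetical helper name. The `shlex.join` line is a suggestion, not something this commit does: joining with a plain space, as the diff does, discards shell quoting, which could matter with `shell=True` if a forwarded value ever contains spaces.

```python
import argparse
import shlex


def split_known_and_extra(argv):
    """Hypothetical helper: parse the known options and return the rest untouched."""
    parser = argparse.ArgumentParser()
    parser.add_argument("datasets", nargs="*", default=[])
    parser.add_argument("--add_lean4_header", action="store_true")
    # parse_known_args returns (namespace, list_of_unrecognized_tokens)
    # instead of exiting with an error on unknown options.
    return parser.parse_known_args(argv)


args, unknown = split_known_and_extra(
    ["ruler", "--setup=llama_128k", "--max_seq_length=131072"]
)
print(args.datasets)  # the positional dataset names
print(unknown)        # everything the parser did not recognize

# What the diff does: a plain space join, fine for values without spaces.
extra_args = " ".join(unknown)

# Possible alternative (an assumption of this sketch): quote each token so the
# string survives being re-parsed by a shell.
quoted = shlex.join(["--note=hello world"])
print(quoted)
```

The `--option=value` form is used here because a single token starting with `--` is unambiguously treated as an option, so it cannot be mistaken for another dataset name.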
