Commit 1098cc6

Refactoring of the generate pipeline (#513)
Signed-off-by: Igor Gitman <igitman@nvidia.com>
1 parent 8845cb6 commit 1098cc6

56 files changed

Lines changed: 1218 additions & 1466 deletions

.github/workflows/gpu_tests.yml

Lines changed: 2 additions & 5 deletions
@@ -39,15 +39,14 @@ jobs:
 pip uninstall -y nemo-skills nemo_run
 pip install -e .
 pip install -r requirements/common-tests.txt
-ns prepare_data gsm8k human-eval mbpp algebra222 mmlu ifeval
+ns prepare_data gsm8k human-eval mbpp algebra222 mmlu ifeval math-500 amc23 aime24
 - name: Run GPU tests
 timeout-minutes: 180
 env:
 HF_TOKEN: ${{ secrets.HF_TOKEN }}
 run: |
 cd ${{ github.run_id }}
 nvidia-smi
-export DOCKER_CLIENT_TIMEOUT=120
 set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
 ./tests/gpu-tests/run_llama.sh
 - name: Cleanup
@@ -79,15 +78,14 @@ jobs:
 pip uninstall -y nemo-skills nemo_run
 pip install -e .
 pip install -r requirements/common-tests.txt
-ns prepare_data gsm8k human-eval mbpp algebra222 mmlu ifeval
+ns prepare_data gsm8k human-eval mbpp algebra222 mmlu ifeval math-500 amc23 aime24
 - name: Run GPU tests
 timeout-minutes: 180
 env:
 HF_TOKEN: ${{ secrets.HF_TOKEN }}
 run: |
 cd ${{ github.run_id }}
 nvidia-smi
-export DOCKER_CLIENT_TIMEOUT=120
 set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
 ./tests/gpu-tests/run_qwen.sh
 - name: Cleanup
@@ -124,7 +122,6 @@ jobs:
 run: |
 cd ${{ github.run_id }}
 nvidia-smi
-export DOCKER_CLIENT_TIMEOUT=120
 set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
 ./tests/gpu-tests/run_rm.sh
 - name: Cleanup

cluster_configs/example-local.yaml

Lines changed: 2 additions & 2 deletions
@@ -18,8 +18,8 @@ containers:
 trtllm: igitman/nemo-skills-trtllm:0.6.1
 vllm: vllm/vllm-openai:v0.9.0
 sglang: igitman/nemo-skills-sglang:0.6.1
-nemo: igitman/nemo-skills-nemo:0.6.0
-megatron: igitman/nemo-skills-megatron:0.6.0
+nemo: igitman/nemo-skills-nemo:0.6.1
+megatron: igitman/nemo-skills-megatron:0.6.1
 sandbox: igitman/nemo-skills-sandbox:0.6.1
 nemo-skills: igitman/nemo-skills:0.6.1
 verl: igitman/nemo-skills-verl:0.6.1

cluster_configs/example-slurm.yaml

Lines changed: 2 additions & 2 deletions
@@ -18,8 +18,8 @@ containers:
 trtllm: igitman/nemo-skills-trtllm:0.6.1
 vllm: vllm/vllm-openai:v0.9.0
 sglang: igitman/nemo-skills-sglang:0.6.1
-nemo: igitman/nemo-skills-nemo:0.6.0
-megatron: igitman/nemo-skills-megatron:0.6.0
+nemo: igitman/nemo-skills-nemo:0.6.1
+megatron: igitman/nemo-skills-megatron:0.6.1
 sandbox: igitman/nemo-skills-sandbox:0.6.1
 nemo-skills: igitman/nemo-skills:0.6.1
 verl: igitman/nemo-skills-verl:0.6.1

dockerfiles/README.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Some dockerfiles are directly included in this folder and for some others the in
 To build one of the existing dockerfiles use a command like this

 ```
-docker build -t igitman/nemo-skills-nemo:0.6.0 -f dockerfiles/Dockerfile.nemo .
+docker build -t igitman/nemo-skills-nemo:0.6.1 -f dockerfiles/Dockerfile.nemo .
 ```
 It might take a long time for some of the images.

docs/basics/index.md

Lines changed: 5 additions & 5 deletions
@@ -54,7 +54,7 @@ You can either use [OpenAI models](https://platform.openai.com/docs/overview) or
 --model=meta/llama-3.1-8b-instruct \
 --server_address=https://integrate.api.nvidia.com/v1 \
 --output_dir=./generation \
-++input_file=./input.jsonl \
+--input_file=./input.jsonl \
 ++prompt_config=./prompt.yaml
 ```

@@ -67,7 +67,7 @@ You can either use [OpenAI models](https://platform.openai.com/docs/overview) or
 --model=gpt-4o-mini \
 --server_address=https://api.openai.com/v1 \
 --output_dir=./generation \
-++input_file=./input.jsonl \
+--input_file=./input.jsonl \
 ++prompt_config=./prompt.yaml
 ```

@@ -144,7 +144,7 @@ ns generate \
 --model=Qwen/Qwen2.5-1.5B-Instruct \
 --server_gpus=1 \
 --output_dir=/workspace/generation-local \
-++input_file=/workspace/input.jsonl \
+--input_file=/workspace/input.jsonl \
 ++prompt_config=/workspace/prompt.yaml
 ```

@@ -176,7 +176,7 @@ ns generate \
 --model=/workspace/qwen2.5-1.5b-instruct-trtllm \
 --server_gpus=1 \
 --output_dir=/workspace/generation-local-trtllm \
-++input_file=/workspace/input.jsonl \
+--input_file=/workspace/input.jsonl \
 ++prompt_config=/workspace/prompt.yaml \
 ++prompt_template=qwen-instruct # (3)!
 ```
@@ -215,7 +215,7 @@ ns generate \
 --server_type=vllm \
 --model=Qwen/Qwen2.5-1.5B-Instruct \
 --server_gpus=1 \
-++input_file=/nemo_run/code/input.jsonl \
+--input_file=/nemo_run/code/input.jsonl \
 ++prompt_config=/nemo_run/code/prompt.yaml \
 --output_dir=/workspace/generation # (2)!
 ```

docs/openmathinstruct2/dataset.md

Lines changed: 10 additions & 9 deletions
@@ -31,7 +31,7 @@ ns generate \
 --num_random_seeds=512 \
 --output_dir=/workspace/solution-augmentation/math \
 --eval_args="++eval_type=math" \
-++input_file=/nemo_run/code/nemo_skills/dataset/math/train.jsonl \
+--input_file=/nemo_run/code/nemo_skills/dataset/math/train.jsonl \
 ++prompt_config=generic/math-base \
 ++examples_type=math_text_detailed \
 ++prompt_template=llama3-base
@@ -49,7 +49,7 @@ ns generate \
 --num_random_seeds=64 \
 --output_dir=/workspace/solution-augmentation/gsm8k \
 --eval_args="++eval_type=math" \
-++input_file=/nemo_run/code/nemo_skills/dataset/gsm8k/train.jsonl \
+--input_file=/nemo_run/code/nemo_skills/dataset/gsm8k/train.jsonl \
 ++prompt_config=generic/math-base \
 ++examples_type=gsm8k_text_detailed \
 ++prompt_template=llama3-base
@@ -69,7 +69,7 @@ ns generate \
 --server_nodes=2 \
 --num_random_seeds=80 \
 --output_dir=/workspace/problem-augmentation/math \
-++input_file=/nemo_run/code/nemo_skills/dataset/math/train.jsonl \
+--input_file=/nemo_run/code/nemo_skills/dataset/math/train.jsonl \
 ++prompt_config=generic/problem-augmentation \
 ++examples_type=math_problem_augmentation \
 ++prompt_template=llama3-instruct \
@@ -87,7 +87,7 @@ ns generate \
 --server_nodes=2 \
 --num_random_seeds=10 \
 --output_dir=/workspace/problem-augmentation/gsm8k \
-++input_file=/nemo_run/code/nemo_skills/dataset/gsm8k/train.jsonl \
+--input_file=/nemo_run/code/nemo_skills/dataset/gsm8k/train.jsonl \
 ++prompt_config=generic/problem-augmentation-similar \
 ++examples_type=gsm8k_problem_augmentation \
 ++prompt_template=llama3-instruct \
@@ -117,8 +117,8 @@ for i in range(80):
 server_nodes=2,
 num_random_seeds=32,
 output_dir=f"/workspace/new-problems-solution-augmentation/math/problem-set{i}",
+input_file=f"/workspace/solution-augmentation/math/generation/output-rs{i}",
 ctx=wrap_arguments(
-f"++input_file=/workspace/solution-augmentation/math/generation/output-rs{i} "
 f"++prompt_config=generic/math-base "
 f"++examples_type=math_text_detailed "
 f"++prompt_template=llama3-base "
@@ -142,8 +142,8 @@ for i in range(10):
 server_nodes=2,
 num_random_seeds=32,
 output_dir=f"/workspace/new-problems-solution-augmentation/gsm8k/problem-set{i}",
+input_file=f"/workspace/solution-augmentation/gsm8k/generation/output-rs{i}",
 ctx=wrap_arguments(
-f"++input_file=/workspace/solution-augmentation/gsm8k/generation/output-rs{i} "
 f"++prompt_config=generic/math-base "
 f"++examples_type=gsm8k_text_detailed "
 f"++prompt_template=llama3-base "
@@ -231,10 +231,11 @@ Next, you need to run LLM inference to check those closest found problems from t
 We use the Llama3.1-405B-Instruct model for this, and here's one way of doing it via Nvidia API catalog.

 ```bash
-ns check_contamination \
+ns generate \
 --cluster=slurm \
+--generation_type=check_contamination \
 --input_file=/workspace/new-problems-solution-augmentation/contamination-retrieved.jsonl \
---output_file=/workspace/new-problems-solution-augmentation/contamination-llm.jsonl \
+--output_dir=/workspace/new-problems-solution-augmentation/contamination-llm \
 --server_type=openai \
 --model=meta/llama-3.1-405b-instruct \
 --server_address=https://integrate.api.nvidia.com/v1 \
@@ -267,7 +268,7 @@ python -m nemo_skills.training.prepare_data \
 ++hf_model_name="meta-llama/Meta-Llama-3.1-8B" \
 ++max_solution_length=1024 \
 ++filters.remove_contaminated=true \
-++contamination_file=/workspace/new-problems-solution-augmentation/contamination-llm.jsonl
+++contamination_file=/workspace/new-problems-solution-augmentation/contamination-llm/output.jsonl
 ```

 ## Dataset contamination explorer

docs/openmathinstruct2/evaluation.md

Lines changed: 3 additions & 4 deletions
@@ -73,7 +73,7 @@ for dataset in aime24 amc23 math gsm8k omni-math; do
 --server_type=openai \
 --server_address=https://api.openai.com/v1 \
 --output_dir=/workspace/openmath2-llama3.1-8b-eval-judged/eval-results/${dataset} \
-++input_dir=/workspace/openmath2-llama3.1-8b-eval/eval-results/${dataset}
+--input_dir=/workspace/openmath2-llama3.1-8b-eval/eval-results/${dataset}
 done
 ```

@@ -155,14 +155,13 @@ for dataset in aime24 amc23 math gsm8k omni-math; do
 --server_type=openai \
 --server_address=https://api.openai.com/v1 \
 --output_dir=/workspace/openmath2-llama3.1-8b-eval-judged/eval-results-majority/${dataset} \
-++input_file=/workspace/openmath2-llama3.1-8b-eval/eval-results-majority/${dataset}/output-agg.jsonl \
-++output_file=/workspace/openmath2-llama3.1-8b-eval/eval-results-majority/${dataset}/output-rs0.jsonl
+--input_file=/workspace/openmath2-llama3.1-8b-eval/eval-results-majority/${dataset}/output-agg.jsonl
 done
 ```

 ```bash
 ns summarize_results /workspace/openmath2-llama3.1-8b-eval-judged/eval-results-majority --cluster local
 ```

-This will print majority results (they will be labeled as `majority@1` since we fused them into a single file).
+This will print majority results (they will be labeled as `greedy` since we fused them into a single file).
 You can also ignore the symbolic score as it's not accurate anymore after we filled majority answers.

docs/openmathreasoning1/evaluation.md

Lines changed: 4 additions & 5 deletions
@@ -141,7 +141,7 @@ ns generate \
 --server_type=trtllm \
 --server_gpus=4 \
 --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results-judged/hle \
-++input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results/hle
+--input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results/hle
 ```

 Alternatively, you can use an API model like gpt-4o, but the results might be different.
@@ -155,7 +155,7 @@ ns generate \
 --server_type=openai \
 --server_address=https://api.openai.com/v1 \
 --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results-judged/hle \
-++input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results/hle
+--input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results/hle
 ```

 To print the metrics run
@@ -227,9 +227,8 @@ All other commands are the same as in the [CoT part](#run-cot-evaluations).
 Here is a sample command to run GenSelect evaluation:

 ```bash
-ns generate \
---generation_type=genselect \
---genselect_args="++input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results-judged/hle" \
+ns genselect \
+--preprocess_args="++input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results-judged/hle" \
 --model=/trt_models/openmath-nemotron-1.5b \
 ++prompt_template=qwen-instruct \
 --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot/self_genselect_hle \

docs/pipelines/decontamination.md

Lines changed: 14 additions & 11 deletions
@@ -2,7 +2,7 @@

 !!! info

-This pipeline starting script is [nemo_skills/pipeline/check_contamination.py](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/pipeline/check_contamination.py)
+This pipeline starting script is [nemo_skills/pipeline/generate.py](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/pipeline/generate.py)

 All extra parameters are passed to [nemo_skills/inference/check_contamination.py](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/inference/check_contamination.py)

@@ -27,7 +27,7 @@ you have `/workspace` defined in your [cluster config](../basics/cluster-configs
 you can do it in the following way

 ```python
-from nemo_skills.pipeline.cli import wrap_arguments, run_cmd
+from nemo_skills.pipeline.cli import wrap_arguments, run_cmd, generate


 test_sets = ['math', 'amc23', 'aime24']
@@ -52,14 +52,16 @@ run_cmd(
 Next, you need to run LLM inference to check those closest found questions from the output file. Here is an example
 using Llama-405B from Nvidia API catalog, but you can replace it with OpenAI models or self-hosted models.

-```
-ns check_contamination \
---cluster=local \
---input_file=/workspace/math-contamination-retrieved.jsonl \
---output_file=/workspace/math-contamination-results.jsonl \
---server_type=openai \
---model=meta/llama-3.1-405b-instruct \
---server_address=https://integrate.api.nvidia.com/v1
+```python
+generate(
+cluster="local",
+generation_type="check_contamination",
+input_file="/workspace/math-contamination-retrieved.jsonl",
+output_dir="/workspace/math-contamination-results",
+model="meta/llama-3.1-405b-instruct",
+server_type="openai",
+server_address="https://integrate.api.nvidia.com/v1",
+)
 ```

 This script will print an output that looks like this
@@ -74,7 +76,8 @@ If you want instead to clean your training data from contaminated examples all t
 you need to swap values for the `retrieve_from` and `compare_to` arguments in the `retrieve_similar` step
 since we now want to make a check for each training set example and find closest test set problems.

-After you get `/workspace/math-contamination-results.jsonl`, you can pass it into [prepare_data command](training.md#preparing-the-data)
+After you get `/workspace/math-contamination-results/output.jsonl`,
+you can pass it into [prepare_data command](training.md#preparing-the-data)
 with `++contamination_file=...` option.

 See a more detailed example in [OpenMathInstruct-2 dataset construction pipeline](../openmathinstruct2/dataset.md#decontamination).
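
Note on the path change in this file: the contamination check no longer takes a single `output_file`. It takes an `output_dir`, and the updated text reads the result from `output.jsonl` inside that directory. The sketch below shows that mapping for callers migrating old paths. It assumes the result file is always named `output.jsonl`, which is what both examples in this commit show but which the commit does not state as a rule. `migrate_output_path` is a hypothetical helper and is not part of NeMo-Skills.

```python
from pathlib import PurePosixPath


def migrate_output_path(old_output_file: str) -> tuple[str, str]:
    """Map an old `output_file` path to the new (output_dir, result file) pair.

    Hypothetical helper: assumes the pipeline writes `output.jsonl` into
    `output_dir`, as in the documentation examples changed by this commit.
    """
    old = PurePosixPath(old_output_file)
    # Drop the `.jsonl` suffix to get the directory the docs now pass as --output_dir.
    output_dir = old.with_suffix("")
    return str(output_dir), str(output_dir / "output.jsonl")


new_dir, new_file = migrate_output_path("/workspace/math-contamination-results.jsonl")
print(new_dir)
print(new_file)
```

The second value is the one to pass as `++contamination_file=...` in the `prepare_data` step.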

docs/pipelines/evaluation.md

Lines changed: 15 additions & 15 deletions
@@ -77,14 +77,14 @@ ns summarize_results --cluster local /workspace/test-eval
 Which should print the following

 ```
-------------------------- gsm8k -------------------------
-evaluation_mode | num_entries | symbolic_correct | no_answer
-greedy | 1319 | 82.34 | 0.91
+--------------------------------- gsm8k ---------------------------------
+evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
+greedy | 1319 | 169 | 83.40% | 1.97%


------------------------------- human-eval -----------------------------
-evaluation_mode | num_entries | passing_base_tests | passing_plus_tests
-greedy | 164 | 67.68 | 62.20
+------------------------------------ human-eval ------------------------------------
+evaluation_mode | num_entries | avg_tokens | passing_base_tests | passing_plus_tests
+greedy | 164 | 228 | 70.12% | 62.80%
 ```

 The [summarize_results](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/pipeline/summarize_results.py) script
@@ -116,19 +116,19 @@ ns eval \
 you will see the following output after summarizing results

 ```
--------------------------- gsm8k ---------------------------
-evaluation_mode | num_entries | symbolic_correct | no_answer
-majority@4 | 1319 | 87.95 | 0.00
-pass@4 | 1319 | 93.78 | 0.00
+--------------------------------- gsm8k ---------------------------------
+evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
+pass@1[4] | 1319 | 161 | 78.96% | 6.01%
+majority@4 | 1319 | 161 | 88.10% | 0.08%
+pass@4 | 1319 | 161 | 93.25% | 0.08%


------------------------------- human-eval -----------------------------
-evaluation_mode | num_entries | passing_base_tests | passing_plus_tests
-pass@4 | 164 | 78.66 | 72.56
+------------------------------------ human-eval ------------------------------------
+evaluation_mode | num_entries | avg_tokens | passing_base_tests | passing_plus_tests
+pass@1[4] | 164 | 251 | 64.18% | 59.30%
+pass@4 | 164 | 251 | 82.32% | 78.05%
 ```

-If you want to get both multiple samples and greedy results, use `--add_greedy` parameter.
-

 ## Using data on cluster
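
Note on the new table rows: the updated output adds a `pass@1[4]` row next to `majority@4` and `pass@4`, but the commit does not define these labels. The sketch below shows one common reading, stated here as an assumption rather than as the library's definition: `pass@1[4]` is per-sample accuracy averaged over the 4 samples, `pass@4` counts a problem as solved if any sample is correct, and `majority@4` scores the most frequent answer. All function names and the toy data are hypothetical.

```python
from collections import Counter


def pass_at_1_avg(correct):
    """Mean per-sample accuracy, averaged over problems (assumed meaning of pass@1[k])."""
    return sum(sum(row) / len(row) for row in correct) / len(correct)


def pass_at_k(correct):
    """Fraction of problems with at least one correct sample (assumed meaning of pass@k)."""
    return sum(any(row) for row in correct) / len(correct)


def majority_at_k(answers, references):
    """Fraction of problems whose most frequent answer matches the reference."""
    hits = 0
    for preds, ref in zip(answers, references):
        top_answer, _ = Counter(preds).most_common(1)[0]
        hits += top_answer == ref
    return hits / len(answers)


# Toy data: 2 problems, 4 sampled answers each (not taken from the commit).
answers = [["4", "4", "5", "4"], ["7", "8", "9", "9"]]
references = ["4", "7"]
correct = [[a == ref for a in preds] for preds, ref in zip(answers, references)]

print(pass_at_1_avg(correct))              # averaged single-sample accuracy
print(majority_at_k(answers, references))  # majority-vote accuracy
print(pass_at_k(correct))                  # any-sample accuracy
```

Under this reading, the averaged single-sample score can sit below a greedy run, since it includes the weaker samples as well as the best one.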
