fix: Revise slurm scripts and update for NSS github (#85)
# Summary
Consolidate slurm submission into a single `submit_slurm_jobs.sh` that
supports individual dataset experiments, multiple custom datasets, or
using the named short and long groups. Also updates instructions and
scripts to operate from the github repo version of NSS.
- Consolidate `submit_single_dataset.sh` into `submit_slurm_jobs.sh` via
new `--dataset-urls` flag, eliminating duplicated submission logic
- Replace complex per-config index arithmetic with packed
comma-separated arrays (`PACKED_DATASETS`, `PACKED_CONFIGS`) that map
1:1 to `SLURM_ARRAY_TASK_ID`
- Simplify `slurm_srun.sh` to a thin `srun` pass-through, relying on
`--export=ALL` instead of manually forwarding each variable
- Rename `NMP_DIR` to `NSS_DIR` and update all paths to reflect the
standalone Safe-Synthesizer repo structure
- Add `--time-limit`, `--train-time-limit`, `--generate-time-limit`,
`--max-concurrent-slurm-jobs`, `--dry-run`, and `--dataset-urls` CLI
flags to `submit_slurm_jobs.sh`
- Remove per-config time-limit associative arrays
(`CONFIG_TIME_LIMITS_SHORT/LONG`) in favor of explicit CLI flags
- Add upfront environment variable validation in `slurm_nss_matrix.sh`
with actionable error messages
## Pre-Review Checklist
<!-- These checks should be completed before a PR is reviewed, -->
<!-- but you can submit a draft early to indicate that the issue is
being worked on. -->
Ensure that the following pass:
- [x] `make format && make lint` or via prek validation.
- [ ] `make test` passes locally
- [ ] `make test-e2e` passes locally
- [ ] `make test-ci-container` passes locally (recommended)
## Pre-Merge Checklist
<!-- These checks need to be completed before a PR is merged, -->
<!-- but as PRs often change significantly during review, -->
<!-- it's OK for them to be incomplete when review is first requested.
-->
- [ ] New or updated tests for any fix or new behavior
- [X] Updated documentation for new features and behaviors, including
docstrings for API docs.
## Other Notes
<!-- Please add the issue number that should be closed when this PR is
merged. -->
- Tested with a few end_to_end and two_stage mode slurm runs
- Copied from internal MR that was not merged before the fork to this
repo
- Hand written
- Reviewed by cursor agent and hand fixed findings
- PR summary generated by cursor agent
---------
Signed-off-by: Kendrick Boyd <kendrickb@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
File: script/slurm/README.md (92 additions & 86 deletions)
@@ -3,14 +3,16 @@
3
3
4
4
### NeMo Safe Synthesizer Slurm Jobs
5
5
6
-
This directory contains scripts to launch matrix Slurm jobs for NeMo Safe Synthesizer. Jobs are submitted via `submit_slurm_jobs.sh`, which launches a containerized `srun` (`slurm_srun.sh`) that executes the matrix runner (`slurm_nss_matrix.sh`). All paths and defaults are configured in one place: `env_variables.sh`.
6
+
This directory contains scripts to launch Slurm jobs for NeMo Safe Synthesizer experimentation.
7
+
The contents of this directory are often specific to internal NVIDIA Slurm clusters, but are shared here as inspiration for others who might be using Slurm to run hyperparameter experiments with NeMo Safe Synthesizer.
8
+
9
+
Jobs are submitted via `submit_slurm_jobs.sh`, which launches a containerized `srun` (`slurm_srun.sh`) that executes the matrix runner (`slurm_nss_matrix.sh`). All paths and defaults are configured in one place: `env_variables.sh`.
7
10
8
11
### Files
9
-
-`env_variables.sh`: single source of truth for user, paths, configs, and time limits.
10
-
-`submit_slurm_jobs.sh`: submits Slurm array jobs for each config and dataset group. Supports two-stage TRAIN→GEN pipeline.
11
-
-`submit_single_dataset.sh`: submits jobs for one dataset/path. Supports two-stage TRAIN→GEN pipeline.
12
-
-`slurm_srun.sh`: wraps `srun` with container image and mounts.
13
-
-`slurm_nss_matrix.sh`: picks dataset/index/run and launches the python entrypoint inside the container. Honors `PHASE=train|generate`.
12
+
- `env_variables.sh`: Single source of truth for user and paths.
13
+
- `submit_slurm_jobs.sh`: Submits Slurm array jobs for each config and dataset. Supports two-stage TRAIN→GEN pipeline.
14
+
- `slurm_nss_matrix.sh`: Picks dataset and config and launches the python entrypoint inside the container. Honors `NSS_PHASE=train|generate|end_to_end`.
15
+
- `slurm_srun.sh`: Wraps `srun` with container image and mounts. It is mostly a pass-through; the primary logic is in `submit_slurm_jobs.sh` and `slurm_nss_matrix.sh`.
14
16
15
17
Pipeline entrypoints (invoked by Slurm scripts) via uv:
16
18
- `uv run safe-synthesizer run --run-path <path>` (full end-to-end pipeline)
@@ -21,12 +23,21 @@ Pipeline entrypoints (invoked by Slurm scripts) via uv:
21
23
22
24
- Slurm Cluster Access: Ensure you have access to the Slurm clusters. You can verify this by running `ssh cs-oci-ord-login-01.nvidia.com` in your terminal (VPN connection required). For an introduction to Slurm, see [these onboarding resources](https://confluence.nvidia.com/display/HWINFCSSUP/Onboarding+to+Clusters).
23
25
- NIM API Key: You will need a `NIM_API_KEY` to run column classification. If you do not have one, you can generate it at [build.nvidia.com](https://build.nvidia.com) using your `nvidian` organization account.
24
-
- Enroot Credentials Follow https://confluence.nvidia.com/display/HWINFCSSUP/Using+Containers#UsingContainers-SettingupEnrootCredentials. You should add the lines for all 3 of `nvcr.io`, `authn.nvidia.com`, and `gitlab-master.nvidia.com`.
26
+
- Enroot Credentials: Follow https://confluence.nvidia.com/display/HWINFCSSUP/Using+Containers#UsingContainers-SettingupEnrootCredentials. You should add the lines for all 3 of `nvcr.io`, `authn.nvidia.com`, and `gitlab-master.nvidia.com`.
27
+
- Clone Safe-Synthesizer
28
+
```bash
29
+
export USER_NAME="$USER"  # Or hardcode username in slurm
- This is a strongly recommended setup, but is not be the only way to get things working.
36
+
- DO NOT FOLLOW the general CONTRIBUTING.md or README.md instructions for installation and setup, unless you understand exactly what's being installed where and how that interacts with the distributed nature of a slurm cluster.
37
+
- The following setup is strongly recommended, but is not the only way to get things working.
27
38
- The key issues about working in slurm we need to address
28
-
- /home/$USER is quite small (10 GB) and not recommended for accessing data, easily filled up by uv cache
29
-
- Slurm jobs may run in containers with different $HOME (and different users/uids)
39
+
- /home/$USER is quite small (10 GB) and not recommended for accessing data (easily filled up by uv cache)
40
+
- Slurm jobs may run in containers with different $HOME locations (and different users/uids)
30
41
- Thus we put uv and python in your user directory in /lustre and not in /home/$USER
31
42
```bash
32
43
export USER_NAME="$USER"  # Or hardcode username in slurm
-`NSS_SHARED_DIR`: location of shared files such as benchmark data and container images, see section below for details.
97
-
- Time limits: `CONFIG_TIME_LIMITS_SHORT` and `CONFIG_TIME_LIMITS_LONG` associative maps. Keys are matched by pattern (`unsloth`, `dp`), falling back to `max`.
98
113
99
114
NSS CLI Environment Variables (used by `safe-synthesizer` CLI via pydantic-settings):
100
115
-`NSS_ARTIFACTS_PATH`: Base directory for artifacts (aliased from `ADAPTER_PATH`).
@@ -104,98 +119,88 @@ NSS CLI Environment Variables (used by `safe-synthesizer` CLI via pydantic-setti
104
119
-`NSS_LOG_FILE`: Path to log file.
105
120
106
121
Note: Associative arrays/arrays aren't exported to child processes, so only `submit_slurm_jobs.sh` uses them directly.
122
+
When needed, arrays are converted to a comma delimited value in an environment variable to pass through to `slurm_nss_matrix.sh`.
123
+
This is used for `PACKED_DATASETS` and `PACKED_CONFIGS` which contain the information for all jobs within the array.
124
+
In `slurm_nss_matrix.sh`, each job extracts the dataset and config that it should run based on the `SLURM_ARRAY_TASK_ID` environment variable.
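The packing and unpacking described above can be sketched as follows. This is a minimal illustration, not the actual contents of `submit_slurm_jobs.sh` or `slurm_nss_matrix.sh`: the array names `dataset_list` and `config_list`, the dataset `my_data.csv`, and the fallback task id are assumptions made for the example. It also assumes that no dataset path or URL contains a comma.

```bash
#!/usr/bin/env bash
set -euo pipefail

# --- Submit side (hypothetical): one entry per array task, same length ---
dataset_list=(adult adult my_data.csv my_data.csv)
config_list=(unsloth dp unsloth dp)

# Bash arrays are not exported to child processes, so join them with commas
# into plain strings that can travel through the environment.
PACKED_DATASETS="$(IFS=,; echo "${dataset_list[*]}")"
PACKED_CONFIGS="$(IFS=,; echo "${config_list[*]}")"
export PACKED_DATASETS PACKED_CONFIGS

# --- Job side (hypothetical): split the strings back into arrays ---
IFS=',' read -r -a datasets <<< "$PACKED_DATASETS"
IFS=',' read -r -a configs <<< "$PACKED_CONFIGS"

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; fall back to 2 here
# so the sketch can be run outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-2}"

if (( ${#datasets[@]} != ${#configs[@]} )); then
  echo "ERROR: PACKED_DATASETS and PACKED_CONFIGS differ in length" >&2
  exit 1
fi
if (( task_id < 0 || task_id >= ${#datasets[@]} )); then
  echo "ERROR: task id ${task_id} is outside 0..$(( ${#datasets[@]} - 1 ))" >&2
  exit 1
fi

# The task id indexes both arrays 1:1.
DATASET="${datasets[$task_id]}"
CONFIG="${configs[$task_id]}"
echo "task ${task_id}: dataset=${DATASET} config=${CONFIG}"
```

Because both packed strings have one entry per task, no per-config index arithmetic is needed: the task id is the index.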
125
+
126
+
### Submit jobs
127
+
128
+
129
+
Run the submit script (flags are order-independent) from this directory:
107
130
108
-
### Submit jobs (matrix across dataset groups)
109
-
Run the matrix submitter (flags are order-independent) from this directory:
# Example: Adult data (defined in NVIDIA internal dataset_registry.yaml), three configs, 5 runs each on polar4, use different wandb project from the exp name
141
+
bash submit_slurm_jobs.sh \
142
+
--dataset-urls adult \
143
+
--configs unsloth,dp,dp_usg_guidance \
144
+
--runs 5 \
145
+
--partition polar4 \
146
+
--exp-name regex_adult \
147
+
--pipeline-mode two_stage \
148
+
--wandb-project other_adult
149
+
150
+
# Example: arbitrary path/url (not a named dataset from the dataset_registry.yaml), 1 config, 10 runs, with max 3 jobs running at a time
- CONFIGS source: By default, configs come from `CONFIGS=(...)` in `env_variables.sh`. Override with `--configs c1,c2` (base names without `.yaml`).
124
-
- RUNS: Number of runs per dataset-config pair.
125
-
- PARTITION: Slurm partition to use. See partition info in your cluster docs.
126
-
-`EXP_NAME`: Experiment namespace for logs/outputs.
127
-
-`DATASET_GROUP`: `short` or `long` (selects built-in dataset sets and time limits).
128
-
-`SLEEP_SEC`: Pause between submissions to reduce image import contention.
129
-
-`PIPELINE_MODE`: `two_stage` (TRAIN→GEN with dependency) or `end_to_end` (single job).
130
-
-`SUBMIT_MODE`: `array` (submit arrays) or `sequential` (submit jobs with dependencies per dataset/run).
131
-
-`WANDB_PROJECT`: Name of the Weights & Biases project to track experiments. Defaults to the experiment name if not specified.
132
-
133
-
How many jobs will run concurrently?
134
-
135
-
For the built-in `short` group there are currently 17 datasets (see `slurm_nss_matrix.sh`).
136
-
- In `two_stage` mode with arrays, the submitter launches one TRAIN array and one GEN array. GEN tasks are linked to corresponding TRAIN tasks via `aftercorr`. Effective max concurrency is cluster/partition limited, but GEN tasks won’t start until their matching TRAIN tasks succeed.
137
-
- In `end_to_end` mode, a single array is submitted of size `num_datasets * RUNS * NUM_CONFIGS`.
138
-
139
-
How long will my jobs take?
162
+
- `--runs`: Number of runs per dataset-config pair.
163
+
- `--partition`: Slurm partition(s) to use. See partition info in your cluster docs.
164
+
- `--exp-name`: Experiment namespace for logs/outputs.
165
+
- `--dataset-group`: `short` or `long` (selects built-in dataset sets).
166
+
Mutually exclusive with `--dataset-urls`.
167
+
- `--dataset-urls`: Comma-separated list of datasets, each given as a named dataset from the registry, a file path, or a URL.
168
+
Mutually exclusive with `--dataset-group`.
169
+
- `--pipeline-mode`: `two_stage` (TRAIN→GEN with dependency) or `end_to_end` (single job).
170
+
- `--wandb-project`: Name of the Weights & Biases project to track experiments.
171
+
Defaults to `--exp-name` if not specified.
172
+
173
+
174
+
### How many jobs will run concurrently?
175
+
176
+
In general, concurrent jobs will depend on the cluster GPU availability and the Fair Share for the PPP.
177
+
178
+
- In `two_stage` mode, the submitter launches one TRAIN array and one GENERATE array. GENERATE tasks are linked to corresponding TRAIN tasks via `aftercorr`. Effective max concurrency is cluster/partition limited, but GEN tasks won’t start until their matching TRAIN tasks succeed.
179
+
- In `end_to_end` mode, a single array is submitted of size `# datasets * runs * # configs`.
180
+
181
+
The `--max-concurrent-slurm-jobs N` param can be used to further restrict concurrent jobs.
182
+
The limit applies per array, so in `end_to_end` mode at most N jobs run simultaneously.
183
+
In two_stage mode, up to 2*N jobs might run, N each from TRAIN arrays and GENERATE arrays.
184
+
Using `--max-concurrent-slurm-jobs` is recommended for large experiments to reduce bursting and be friendlier to other users.
185
+
Consider using a max of 2-3x the current allocation for llmservice_sdg_research PPP in the cluster to avoid bursting and rapidly dropping our Fair Share for everyone.
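As a rough illustration of the arithmetic above, the sketch below computes the `end_to_end` array size and builds a throttled array spec. The `%N` suffix on `--array` and the `aftercorr` dependency type are standard `sbatch` features; whether `submit_slurm_jobs.sh` builds its command lines exactly this way is an assumption, and the variable names, `job.sh`, and `<train_job_id>` are placeholders. The commands are only printed, not submitted.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical experiment: 2 datasets, 3 configs, 5 runs each.
num_datasets=2
num_configs=3
runs=5
max_concurrent=3   # value that would be passed to --max-concurrent-slurm-jobs

# end_to_end mode: one array with one task per (dataset, run, config).
array_size=$(( num_datasets * runs * num_configs ))

# Slurm array spec "0-(size-1)%N" caps simultaneously running tasks at N.
array_spec="0-$(( array_size - 1 ))%${max_concurrent}"

# two_stage mode: the cap applies to each array separately, so the
# TRAIN and GENERATE arrays together may run up to 2*N tasks.
worst_case_two_stage=$(( 2 * max_concurrent ))

echo "array size:            ${array_size}"
echo "end_to_end submission: sbatch --array=${array_spec} job.sh"
echo "two_stage TRAIN:       sbatch --parsable --array=${array_spec} job.sh"
echo "two_stage GENERATE:    sbatch --array=${array_spec} --dependency=aftercorr:<train_job_id> job.sh"
echo "two_stage worst case:  ${worst_case_two_stage} concurrent tasks"
```

With `aftercorr`, each GENERATE task waits only for the TRAIN task with the same array index, so generation can start for finished runs while other training tasks are still in progress.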
186
+
187
+
### How long will my jobs take?
188
+
140
189
With `num_input_records_to_sample=25000`
141
190
- For the baseline config, the longest job typically finishes within 80 minutes. Total wall time estimate: `60 * RUNS` minutes.
142
191
- For the `dp` config, the longest job typically finishes within 120 minutes. Total wall time estimate: `120 * RUNS` minutes.
- W&B logging: set the `WANDB_MODE` to `online` to additionally log experiment configs and metrics to W&B. Make sure to export your `WANDB_API_KEY` (request an account [here](https://confluence.nvidia.com/display/AIALGO/Weights+and+Biases+%28WandB%29+Enterprise+Account)) in `${LUSTRE_DIR}/.api_tokens.sh`. There is an optional flag `--wandb-project` to specify a W&B project name if you don't want to use the experiment name.
152
201
153
202
- When running in `two_stage` mode, avoid submitting multiple bash commands that run simultaneously, because the scripts cannot guarantee a unique adapter path for each run. As a result, two runs might be logged as one on W&B.
154
203
155
-
### One-off single dataset runs
156
-
For quick testing of a specific CSV with selected configs and N runs, run from this directory using `submit_single_dataset.sh`.
- The script honors time limits from `env_variables.sh` based on config name patterns (`unsloth`, `dp`, fallback `max`).
193
-
- Set `DATASET_GROUP` to `long` to use the long time limits.
194
-
- The dataset path is passed via `DATASET_URLS` and will be used directly by the runner.
195
-
- In `two_stage` mode, the TRAIN job creates a workdir at `--run-path` containing the adapter and config. The GEN job resumes from the same workdir and writes uniquely-timestamped output files (e.g., `synthetic_data_20260114T123456.csv`) allowing multiple generation runs from the same trained adapter.
196
-
197
-
198
-
199
204
### Monitoring and cancellation
200
205
```bash
201
206
squeue -u ${USER_NAME}
@@ -207,6 +212,7 @@ scancel <jobid>
207
212
Use W&B by setting `WANDB_MODE=online` in `env_variables.sh` and add your W&B token to `.api_tokens.sh`.
208
213
209
214
### Troubleshooting
215
+
210
216
- "USER_NAME is not set": run `export USER_NAME=...` and retry.
211
217
- Missing token file/key: create `${LUSTRE_DIR}/.api_tokens.sh` with `NIM_API_KEY` and `chmod 600`.
212
218
- Missing config files: verify `CONFIGS` in `env_variables.sh` and files in `CONFIG_DIR`.
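For the missing token file case, a minimal sketch of creating the file with restrictive permissions is shown below. It writes to a temporary directory as a stand-in for `${LUSTRE_DIR}` so it can be run safely anywhere; the `export` line format inside the file and the placeholder key value are assumptions, not something the scripts are confirmed to require.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for ${LUSTRE_DIR}; on the cluster, use your real Lustre directory.
demo_dir="$(mktemp -d)"
token_file="${demo_dir}/.api_tokens.sh"

# Create the file with owner-only permissions from the start.
umask 077
cat > "$token_file" <<'EOF'
export NIM_API_KEY="replace-with-your-key"
EOF
chmod 600 "$token_file"

# Sanity checks: the file is readable only by the owner and defines the key.
perms="$(ls -l "$token_file" | cut -c1-10)"
source "$token_file"
echo "permissions: ${perms}"
echo "NIM_API_KEY is ${NIM_API_KEY:+set}"

rm -rf "$demo_dir"
```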