NVIDIA-NeMo
diff --git a/‎script/slurm/README.md‎
Lines changed: 92 additions & 86 deletions b/‎script/slurm/README.md‎
Lines changed: 92 additions & 86 deletions
diff --git a/‎script/slurm/env_variables.sh‎
Lines changed: 3 additions & 15 deletions b/‎script/slurm/env_variables.sh‎
Lines changed: 3 additions & 15 deletions
@@ -3,14 +3,16 @@
 
 ### NeMo Safe Synthesizer Slurm Jobs
 
-This directory contains scripts to launch matrix Slurm jobs for NeMo Safe Synthesizer. Jobs are submitted via `submit_slurm_jobs.sh`, which launches a containerized `srun` (`slurm_srun.sh`) that executes the matrix runner (`slurm_nss_matrix.sh`). All paths and defaults are configured in one place: `env_variables.sh`.
+This directory contains scripts to launch Slurm jobs for NeMo Safe Synthesizer experimentation.
+The contents of this directly are often specific to internal NVIDIA slurm clusters, but shared here as inspiration for others that might be using slurm to do hyperparameter experiments with NeMo Safe Synthesizer.
+
+Jobs are submitted via `submit_slurm_jobs.sh`, which launches a containerized `srun` (`slurm_srun.sh`) that executes the matrix runner (`slurm_nss_matrix.sh`). All paths and defaults are configured in one place: `env_variables.sh`.
 
 ### Files
-- `env_variables.sh`: single source of truth for user, paths, configs, and time limits.
-- `submit_slurm_jobs.sh`: submits Slurm array jobs for each config and dataset group. Supports two-stage TRAIN→GEN pipeline.
-- `submit_single_dataset.sh`: submits jobs for one dataset/path. Supports two-stage TRAIN→GEN pipeline.
-- `slurm_srun.sh`: wraps `srun` with container image and mounts.
-- `slurm_nss_matrix.sh`: picks dataset/index/run and launches the python entrypoint inside the container. Honors `PHASE=train|generate`.
+- `env_variables.sh`: Single source of truth for user, paths.
+- `submit_slurm_jobs.sh`: Submits Slurm array jobs for each config and dataset. Supports two-stage TRAIN→GEN pipeline.
+- `slurm_nss_matrix.sh`: Picks dataset and config and launches the python entrypoint inside the container. Honors `NSS_PHASE=train|generate|end_to_end`.
+- `slurm_srun.sh`: Wraps `srun` with container image and mounts, mostly just a pass through, primary logic is in `submit_slurm_jobs.sh` and `slurm_nss_matrix.sh`.
 
 Pipeline entrypoints (invoked by Slurm scripts) via uv:
 - `uv run safe-synthesizer run --run-path <path>` (full end-to-end pipeline)
@@ -21,12 +23,21 @@ Pipeline entrypoints (invoked by Slurm scripts) via uv:
 
 - Slurm Cluster Access: Ensure you have access to the Slurm clusters. You can verify this by running `ssh cs-oci-ord-login-01.nvidia.com` in your terminal (VPN connection required). For an introduction to Slurm, see [these onboarding resources](https://confluence.nvidia.com/display/HWINFCSSUP/Onboarding+to+Clusters).
 - NIM API Key: You will need a `NIM_API_KEY` to run column classification. If you do not have one, you can generate it at [build.nvidia.com](https://build.nvidia.com) using your `nvidian` organization account.
-- Enroot Credentials Follow https://confluence.nvidia.com/display/HWINFCSSUP/Using+Containers#UsingContainers-SettingupEnrootCredentials. You should add the lines for all 3 of `nvcr.io`, `authn.nvidia.com`, and `gitlab-master.nvidia.com`.
+- Enroot Credentials: Follow https://confluence.nvidia.com/display/HWINFCSSUP/Using+Containers#UsingContainers-SettingupEnrootCredentials. You should add the lines for all 3 of `nvcr.io`, `authn.nvidia.com`, and `gitlab-master.nvidia.com`.
+- Clone Safe-Synthesizer
+```bash
+export USER_NAME="$USER" # Or hardcode username in slurm
+export LUSTRE_DIR="/lustre/fsw/portfolios/llmservice/users/${USER_NAME}"
+cd $LUSTRE_DIR
+git clone git@github.com:NVIDIA-NeMo/Safe-Synthesizer.git
+cd Safe-Synthesizer
+```
 - uv and python install in the slurm cluster
-  - This is a strongly recommended setup, but is not be the only way to get things working.
+  - DO NOT FOLLOW the general CONTRIBUTING.md or README.md instructions for installation and setup, unless you understand exactly what's being installed where and how that interacts with the distributed nature of a slurm cluster.
+  - The following setup is strongly recommended, but is not the only way to get things working.
   - The key issues about working in slurm we need to address
-    - /home/$USER is quite small (10 GB) and not recommended for accessing data, easily filled up by uv cache
-    - Slurm jobs may run in containers with different $HOME (and different users/uids)
+    - /home/$USER is quite small (10 GB) and not recommended for accessing data (easily filled up by uv cache)
+    - Slurm jobs may run in containers with different $HOME locations (and different users/uids)
   - Thus we put uv and python in your user directory in /lustre and not in /home/$USER
 ```bash
 export USER_NAME="$USER" # Or hardcode username in slurm
@@ -38,8 +49,13 @@ export UV_CACHE_DIR="${LUSTRE_DIR}/.cache/uv"
 export UV_PYTHON_INSTALL_DIR="${LUSTRE_DIR}/.local/share/uv/python"
 export UV_PYTHON_BIN_DIR="${LUSTRE_DIR}/.local/bin"
 export UV_TOOL_DIR="${LUSTRE_DIR}/.local/share/uv/tools"
-# Install python 3.11 (as required by NSS) in a location `uv` is aware of
-uv python install 3.11
+# With the above env vars, the usual make command should work.
+# Note this may be quite slow the first time due to very slow network
+# connectivity on slurm to download from pypi, but subsequent executions
+# (such as startup for your jobs) should be much faster since uv will 
+# pull cached wheels from UV_CACHE_DIR.
+# (Be sure to run from the root of the Safe-Synthesizer repo)
+make bootstrap-nss cu128
 ```
 
 #### Nice to have
@@ -87,14 +103,13 @@ chmod 600 /lustre/fsw/portfolios/llmservice/users/${USER_NAME}/.api_tokens.sh
 
 ### Configure
 Edit `env_variables.sh` to match your environment. Key items:
-- `CONFIGS=(...)`: base names of YAML configs to run (without `.yaml`).
+- `CONFIGS=(...)`: base names of YAML configs to run (without `.yaml`), or provide via --config argument to `submit_slurm_jobs.sh`.
 - `CONFIG_DIR`: directory where config files live.
 - `BASE_LOG_DIR`: where Slurm logs will be written.
-- `NMP_DIR`: path to this repository.
+- `NSS_DIR`: path to this repository.
 - `ADAPTER_PATH`: base path for workdirs (each run creates a subdirectory with adapter, logs, and outputs).
 - `VLLM_CACHE_ROOT`, `UV_CACHE_DIR`, `UV_PYTHON_INSTALL_DIR`, `UV_PYTHON_BIN_DIR`, `UV_TOOL_DIR`, `HF_HOME`: cache locations to avoid stressing login nodes.
 - `NSS_SHARED_DIR`: location of shared files such as benchmark data and container images, see section below for details.
-- Time limits: `CONFIG_TIME_LIMITS_SHORT` and `CONFIG_TIME_LIMITS_LONG` associative maps. Keys are matched by pattern (`unsloth`, `dp`), falling back to `max`.
 
 NSS CLI Environment Variables (used by `safe-synthesizer` CLI via pydantic-settings):
 - `NSS_ARTIFACTS_PATH`: Base directory for artifacts (aliased from `ADAPTER_PATH`).
@@ -104,98 +119,88 @@ NSS CLI Environment Variables (used by `safe-synthesizer` CLI via pydantic-setti
 - `NSS_LOG_FILE`: Path to log file.
 
 Note: Associative arrays/arrays aren't exported to child processes, so only `submit_slurm_jobs.sh` uses them directly.
+When needed, arrays are converted to a comma delimited value in an environment variable to pass through to `slurm_nss_matrix.sh`.
+This is used for `PACKED_DATASETS` and `PACKED_CONFIGS` which contain the information for all jobs within the array.
+In `slurm_nss_matrix.sh`, each job extracts the dataset and config that it should run based on the `SLURM_ARRAY_TASK_ID` environment variable.
+
+### Submit jobs
+
+
+Run the submit script (flags are order-independent) from this directory:
 
-### Submit jobs (matrix across dataset groups)
-Run the matrix submitter (flags are order-independent) from this directory:
 ```bash
-bash submit_slurm_jobs.sh [--configs c1,c2] [--runs N] [--partition P] [--exp-name NAME] [--dataset-group short|long] [--sleep-sec S] [--pipeline-mode two_stage|end_to_end] [--submit-mode array|sequential] [--wandb-project PROJECT]
+bash submit_slurm_jobs.sh [--configs c1,c2] [--dataset-urls name1,url1,path1] [--dataset-group short|long] [--runs N] [--exp-name NAME] [--pipeline-mode two_stage|end_to_end] [--partition P] [--wandb-project PROJECT] [--max-concurrent-slurm-jobs N] [--time-limit TIME] [--train-time-limit TIME] [--generate-time-limit TIME] [--dry-run]
 
-# Example: two-stage (TRAIN→GEN) across "short" datasets (array mode; default)
-bash submit_slurm_jobs.sh --exp-name matrix_exp --dataset-group short --runs 1 --partition polar4 --pipeline-mode two_stage
+# Example: end_to_end with 2 hour time limit across "short" datasets
+bash submit_slurm_jobs.sh --exp-name short_end_to_end --dataset-group short --runs 1 --partition polar4 --pipeline-mode end_to_end --time-limit 2:00:00
 
-# Example: two-stage (TRAIN→GEN) sequential per dataset/run (GEN after TRAIN)
-bash submit_slurm_jobs.sh --exp-name matrix_seq --dataset-group short --runs 1 --partition polar4 --pipeline-mode two_stage --submit-mode sequential
+# Example: two-stage (TRAIN→GEN) across "short" datasets with 1 hour train time limit and 30 minute generate time limit
+bash submit_slurm_jobs.sh --exp-name short_two_stage --dataset-group short --runs 1 --partition polar4 --pipeline-mode two_stage  --train-time-limit 1:00:00 --generate-time-limit 0:30:00
 
-# Example: end-to-end (single job per run) sequential per dataset/run
-bash submit_slurm_jobs.sh --exp-name matrix_e2e_seq --dataset-group short --runs 1 --partition polar4 --pipeline-mode end_to_end --submit-mode sequential
+# Example: Adult data (defined in NVIDIA internal dataset_registry.yaml), three configs, 5 runs each on polar4, use different wandb project from the exp name
+bash submit_slurm_jobs.sh \
+  --dataset-urls adult \
+  --configs unsloth,dp,dp_usg_guidance \
+  --runs 5 \
+  --partition polar4 \
+  --exp-name regex_adult \
+  --pipeline-mode two_stage \
+  --wandb-project other_adult
+
+# Example: arbitrary path/url (not a named dataset from the dataset_registry.yaml), 1 config, 10 runs, with max 3 jobs running at a time
+bash submit_slurm_jobs.sh \
+  --dataset-urls "https://raw.githubusercontent.com/gretelai/gretel-blueprints/refs/heads/main/sample_data/financial_transactions.csv" \
+  --configs unsloth \
+  --runs 10 \
+  --partition polar,polar3,polar4 \
+  --exp-name financial_repeats \
+  --pipeline-mode end_to_end \
+  --max-concurrent-slurm-jobs 3
 ```
 
 - CONFIGS source: By default, configs come from `CONFIGS=(...)` in `env_variables.sh`. Override with `--configs c1,c2` (base names without `.yaml`).
-- RUNS: Number of runs per dataset-config pair.
-- PARTITION: Slurm partition to use. See partition info in your cluster docs.
-- `EXP_NAME`: Experiment namespace for logs/outputs.
-- `DATASET_GROUP`: `short` or `long` (selects built-in dataset sets and time limits).
-- `SLEEP_SEC`: Pause between submissions to reduce image import contention.
-- `PIPELINE_MODE`: `two_stage` (TRAIN→GEN with dependency) or `end_to_end` (single job).
-- `SUBMIT_MODE`: `array` (submit arrays) or `sequential` (submit jobs with dependencies per dataset/run).
-- `WANDB_PROJECT`: Name of the Weights & Biases project to track experiments. Defaults to the experiment name if not specified.
-
-How many jobs will run concurrently?
-
-For the built-in `short` group there are currently 17 datasets (see `slurm_nss_matrix.sh`).
-- In `two_stage` mode with arrays, the submitter launches one TRAIN array and one GEN array. GEN tasks are linked to corresponding TRAIN tasks via `aftercorr`. Effective max concurrency is cluster/partition limited, but GEN tasks won’t start until their matching TRAIN tasks succeed.
-- In `end_to_end` mode, a single array is submitted of size `num_datasets * RUNS * NUM_CONFIGS`.
-
-How long will my jobs take?
+- `--runs`: Number of runs per dataset-config pair.
+- `--partition`: Slurm partition(s) to use. See partition info in your cluster docs.
+- `--exp-name`: Experiment namespace for logs/outputs.
+- `--dataset-group`: `short` or `long` (selects built-in dataset sets).
+  Mutually exclusive with `--dataset-urls`.
+- `--dataset-urls`: comma separated value of named datasets from registry, file path, or url
+  Mutually exclusive with `--dataset-group`.
+- `--pipeline-mode`: `two_stage` (TRAIN→GEN with dependency) or `end_to_end` (single job).
+- `--wandb-project`: Name of the Weights & Biases project to track experiments.
+  Defaults to `--exp-name` if not specified.
+
+
+### How many jobs will run concurrently?
+
+In general, concurrent jobs will depend on the cluster GPU availability and the Fair Share for the PPP.
+
+- In `two_stage` mode, the submitter launches one TRAIN array and one GENERATE array. GENERATE tasks are linked to corresponding TRAIN tasks via `aftercorr`. Effective max concurrency is cluster/partition limited, but GEN tasks won’t start until their matching TRAIN tasks succeed.
+- In `end_to_end` mode, a single array is submitted of size `# datasets * runs * # configs`.
+
+The `--max-concurrent-slurm-jobs N` param can be used to further restrict concurrent jobs.
+This only restricts within an array, so with end_to_end mode, this will restrict to precisely N simultaneously running jobs.
+In two_stage mode, up to 2*N jobs might run, N each from TRAIN arrays and GENERATE arrays.
+Using `--max-concurrent-slurm-jobs` is recommended for large experiments to reduce bursting and be friendlier to other users.
+Consider using a max of 2-3x the current allocation for llmservice_sdg_research PPP in the cluster to avoid bursting and rapidly dropping our Fair Share for everyone.
+
+### How long will my jobs take?
+
 With `num_input_records_to_sample=25000`
 - For the baseline config, the longest job typically finishes within 80 minutes. Total wall time estimate: `60 * RUNS` minutes.
 - For the `dp` config, the longest job typically finishes within 120 minutes. Total wall time estimate: `120 * RUNS` minutes.
 
 
 ### Logs and outputs
-- Slurm logs: `${BASE_LOG_DIR}/${EXP_NAME}/short|long/<config>/slurm_%A_%a.{out,err}`
+- Slurm logs: `${BASE_LOG_DIR}/${EXP_NAME}/slurm_%A_%a.{out,err}`
 - You can tail logs while jobs run:
 ```bash
-tail -f ${BASE_LOG_DIR}/${EXP_NAME}/short/<config>/slurm_*.out
+tail -f ${BASE_LOG_DIR}/${EXP_NAME}/slurm_*.out
 ```
 - W&B logging: set the `WANDB_MODE` to `online` to additionally log experiment configs and metrics to W&B. Make sure to export your `WANDB_API_KEY` (request an account [here](https://confluence.nvidia.com/display/AIALGO/Weights+and+Biases+%28WandB%29+Enterprise+Account)) in `${LUSTRE_DIR}/.api_tokens.sh`. There is an optional flag `--wandb-project` to specify a W&B project name if you don't want to use the experiment name.
 
   - When running in `two_stage` mode, be mindful not to submit multiple bash commands that run simutaneously because we aren't able to guarantee unique adapter path for each single run. As a result, two runs might be logged as one on W&B.
 
-### One-off single dataset runs
-For quick testing of a specific CSV with selected configs and N runs, run from this directory using `submit_single_dataset.sh`.
-
-Usage:
-```bash
-bash submit_single_dataset.sh --dataset-urls PATH_OR_URL [--configs c1,c2] [--runs N] [--partition P] [--exp-name NAME] [--dataset-group short|long] [--sleep-sec S] [--submit-mode array|sequential] [--pipeline-mode two_stage|end_to_end] [--wandb-project PROJECT]
-
-# Example: Adobe-2k, three configs, 5 runs each on polar4 (array-based two-stage)
-bash submit_single_dataset.sh \
-  --dataset-urls /lustre/fsw/portfolios/llmservice/users/${USER_NAME}/safe-synthetics/cleaned/Adobe-2k.csv \
-  --configs unsloth,dp,dp_usg_guidance \
-  --runs 5 \
-  --partition polar4 \
-  --exp-name regex_adobe2k \
-  --dataset-group short \
-  --sleep-sec 5 \
-  --submit-mode array \
-  --pipeline-mode two_stage \
-  --wandb-project regex_adobe2k
-
-# Example: sequential per-run two-stage (GEN depends on its TRAIN)
-bash submit_single_dataset.sh \
-  --dataset-urls /lustre/fsw/portfolios/llmservice/users/${USER_NAME}/safe-synthetics/cleaned/Adobe-2k.csv \
-  --configs unsloth,dp \
-  --runs 3 \
-  --partition polar4 \
-  --exp-name demo_exp \
-  --dataset-group short \
-  --sleep-sec 3 \
-  --submit-mode sequential \
-  --pipeline-mode two_stage
-
-# Tail logs
-tail -f /lustre/fsw/portfolios/llmservice/users/${USER_NAME}/nss_results/regex_adobe2k/short/*/slurm_*.out
-```
-
-Notes:
-- The script honors time limits from `env_variables.sh` based on config name patterns (`unsloth`, `dp`, fallback `max`).
-- Set `DATASET_GROUP` to `long` to use the long time limits.
-- The dataset path is passed via `DATASET_URLS` and will be used directly by the runner.
-- In `two_stage` mode, the TRAIN job creates a workdir at `--run-path` containing the adapter and config. The GEN job resumes from the same workdir and writes uniquely-timestamped output files (e.g., `synthetic_data_20260114T123456.csv`) allowing multiple generation runs from the same trained adapter.
-
-
-
 ### Monitoring and cancellation
 ```bash
 squeue -u ${USER_NAME}
@@ -207,6 +212,7 @@ scancel <jobid>
 Use W&B by setting `WANDB_MODE=online` in `env_variables.sh` and add your W&B token to `.api_tokens.sh`.
 
 ### Troubleshooting
+
 - "USER_NAME is not set": run `export USER_NAME=...` and retry.
 - Missing token file/key: create `${LUSTRE_DIR}/.api_tokens.sh` with `NIM_API_KEY` and `chmod 600`.
 - Missing config files: verify `CONFIGS` in `env_variables.sh` and files in `CONFIG_DIR`.
 
@@ -12,11 +12,11 @@ export NSS_SHARED_DIR="/lustre/fsw/portfolios/llmservice/users/kendrickb/shared_
 
 ## change the followings if you want them to be different
 CONFIGS=(unsloth dp dp_usg_guidance) # the jobs will run all datasets with these configs
-export NMP_DIR="/lustre/fsw/portfolios/llmservice/users/${USER_NAME}/nmp" # where the nmp repo is located
-export NSS_SLURM_DIR="${NMP_DIR}/packages/nemo_safe_synthesizer/script/slurm" # slurm scripts location (inside repo)
+export NSS_DIR="/lustre/fsw/portfolios/llmservice/users/${USER_NAME}/Safe-Synthesizer" # where the nss repo is located
+export NSS_SLURM_DIR="${NSS_DIR}/script/slurm" # slurm scripts location (inside repo)
 export CONFIG_DIR="${NSS_SLURM_DIR}" # where the config files are located
 export BASE_LOG_DIR="${LUSTRE_DIR}/nss_results" # where you want the slurm logs to be saved, each job will have err and out files
-export ADAPTER_PATH="${LUSTRE_DIR}/nmp/exp/adapters" # base path for run directories (each run creates a subdirectory via --run-path)
+export ADAPTER_PATH="${LUSTRE_DIR}/nss_results/adapters" # base path for run directories (each run creates a subdirectory via --run-path)
 export VLLM_CACHE_ROOT="${LUSTRE_DIR}/.cache/vllm/" # where the vllm cache is saved, this is to prevent the login node from blowing up
 export UV_CACHE_DIR="${LUSTRE_DIR}/.cache/uv"
 export UV_PYTHON_INSTALL_DIR="${LUSTRE_DIR}/.local/share/uv/python"
@@ -33,15 +33,3 @@ export WANDB_MODE="disabled" # "online", "offline" or "disabled"
 #   NSS_LOG_FORMAT - Log format ("json" or "plain")
 #   NSS_LOG_FILE - Path to log file
 export NSS_ARTIFACTS_PATH="${ADAPTER_PATH}"
-
-# time limits for the short and long jobs
-declare -A CONFIG_TIME_LIMITS_SHORT
-declare -A CONFIG_TIME_LIMITS_LONG
-
-CONFIG_TIME_LIMITS_SHORT[unsloth]="00:40:00"
-CONFIG_TIME_LIMITS_SHORT[dp]="02:00:00"
-CONFIG_TIME_LIMITS_SHORT[max]="04:00:00" #fallback if config names do not include unsloth or dp
-
-CONFIG_TIME_LIMITS_LONG[unsloth]="01:20:00"
-CONFIG_TIME_LIMITS_LONG[dp]="02:00:00"
-CONFIG_TIME_LIMITS_LONG[max]="04:00:00" #fallback if config names do not include unsloth or dp