Commit 4e616ab

chore: set wandb mode to online by default (#355)
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# Summary

Set the default `WANDB_MODE` to `online`. This is for internal usage only and, as far as I know, all internal users run wandb in online mode, so it makes sense to switch the default.

## Pre-Merge Checklist

- [ ] New or updated tests for any fix or new behavior
- [x] Updated documentation for new features and behaviors, including docstrings for API docs.

## Other Notes

- Closes #<issue>

---------

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
1 parent a3224bd commit 4e616ab

3 files changed

Lines changed: 24 additions & 7 deletions

script/slurm/README.md

Lines changed: 9 additions & 4 deletions
@@ -24,6 +24,7 @@ Pipeline entrypoints (invoked by Slurm scripts) via uv:

 - Slurm Cluster Access: Ensure you have access to the Slurm clusters. You can verify this by running `ssh cs-oci-ord-login-01.nvidia.com` in your terminal (VPN connection required). For an introduction to Slurm, see [these onboarding resources](https://confluence.nvidia.com/display/HWINFCSSUP/Onboarding+to+Clusters).
 - An LLM inference endpoint and the API Key: You will need a `NSS_INFERENCE_KEY` to run column classification, if using the default `NSS_INFERENCE_ENDPOINT`. If you do not have one, you can generate it at [build.nvidia.com](https://build.nvidia.com).
+- Weights & Biases API Key: W&B logging is enabled by default (`WANDB_MODE=online`). You will need a `WANDB_API_KEY` — request an account [here](https://confluence.nvidia.com/display/AIALGO/Weights+and+Biases+%28WandB%29+Enterprise+Account). Set `WANDB_MODE=disabled` in `env_variables.sh` to skip W&B.
 - Enroot Credentials: Follow https://confluence.nvidia.com/display/HWINFCSSUP/Using+Containers#UsingContainers-SettingupEnrootCredentials. You should add the lines for all 3 of `nvcr.io`, `authn.nvidia.com`, and `gitlab-master.nvidia.com`.
 - Clone Safe-Synthesizer
 ```bash
@@ -86,9 +87,13 @@ export HF_HOME="${LUSTRE_DIR}/.cache/huggingface"
 export USER_NAME=your_lustre_username
 ```

-2) Create your API token file with `NSS_INFERENCE_KEY` and restrict permissions, recommended to inclue `HF_TOKEN` to avoid throttling by HF Hub and, if you're using W&B, `WANDB_API_KEY`:
+2) Create your API token file and restrict permissions. `NSS_INFERENCE_KEY` and `WANDB_API_KEY` are required by default. `HF_TOKEN` is recommended to avoid throttling by HF Hub:
 ```bash
-echo 'export NSS_INFERENCE_KEY="<your_api_key>"' > /lustre/fsw/portfolios/llmservice/users/${USER_NAME}/.api_tokens.sh
+cat > /lustre/fsw/portfolios/llmservice/users/${USER_NAME}/.api_tokens.sh << 'TOKENS'
+export NSS_INFERENCE_KEY="<your_inference_api_key>"
+export WANDB_API_KEY="<your_wandb_api_key>"
+export HF_TOKEN="<your_hf_token>"
+TOKENS
 chmod 600 /lustre/fsw/portfolios/llmservice/users/${USER_NAME}/.api_tokens.sh
 ```

@@ -191,7 +196,7 @@ Consider using a max of 2-3x the current allocation for llmservice_sdg_research
 ```bash
 tail -f ${BASE_LOG_DIR}/${EXP_NAME}/slurm_*.out
 ```
-- W&B logging: set the `WANDB_MODE` to `online` to additionally log experiment configs and metrics to W&B. Make sure to export your `WANDB_API_KEY` (request an account [here](https://confluence.nvidia.com/display/AIALGO/Weights+and+Biases+%28WandB%29+Enterprise+Account)) in `${LUSTRE_DIR}/.api_tokens.sh`. There is an optional flag `--wandb-project` to specify a W&B project name if you don't want to use the experiment name.
+- W&B logging: `WANDB_MODE` is set to `online` by default to additionally log experiment configs and metrics to W&B. Make sure to export your `WANDB_API_KEY` (request an account [here](https://confluence.nvidia.com/display/AIALGO/Weights+and+Biases+%28WandB%29+Enterprise+Account)) in `${LUSTRE_DIR}/.api_tokens.sh`. There is an optional flag `--wandb-project` to specify a W&B project name if you don't want to use the experiment name.

 - When running in `two_stage` mode, be mindful not to submit multiple bash commands that run simutaneously because we aren't able to guarantee unique adapter path for each single run. As a result, two runs might be logged as one on W&B.

@@ -234,7 +239,7 @@ Log directory resolution order (first match wins):

 ### Collect results

-Use W&B by setting `WANDB_MODE=online` in `env_variables.sh` and add your W&B token to `.api_tokens.sh`.
+W&B is enabled by default with `WANDB_MODE=online` in `env_variables.sh`. Make sure to add your W&B token to `.api_tokens.sh`. Set `WANDB_MODE=disabled` otherwise.

 ### Troubleshooting

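The README now treats `NSS_INFERENCE_KEY` and `WANDB_API_KEY` as required by default and asks for the token file to be restricted with `chmod 600`. A minimal sketch of how that could be verified before submitting a job follows. `check_tokens_file` is a hypothetical helper, not part of the repository, and it assumes the file uses the `export KEY="..."` layout shown in the README.

```bash
# Hypothetical helper (not in the repository).
# Usage: check_tokens_file PATH
# Succeeds only if PATH exists, has mode 600, and exports both keys
# that the README lists as required by default.
check_tokens_file() {
    local path="$1" key

    if [[ ! -f "$path" ]]; then
        echo "missing token file: $path" >&2
        return 1
    fi

    # POSIX find: -perm 600 matches only an exact permission mode of 600.
    if [[ -z "$(find "$path" -prune -perm 600 -print)" ]]; then
        echo "permissions are not 600: $path" >&2
        return 1
    fi

    for key in NSS_INFERENCE_KEY WANDB_API_KEY; do
        # Assumes one 'export KEY=...' statement per line, as in the README.
        if ! grep -q "^export ${key}=" "$path"; then
            echo "no export for ${key} in $path" >&2
            return 1
        fi
    done
}
```

It could be called as `check_tokens_file "${LUSTRE_DIR}/.api_tokens.sh"`. The check only looks for the `export` lines and does not validate the key values themselves.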
script/slurm/env_variables.sh

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ export UV_PYTHON_INSTALL_DIR="${LUSTRE_DIR}/.local/share/uv/python"
 export UV_PYTHON_BIN_DIR="${LUSTRE_DIR}/.local/bin"
 export UV_TOOL_DIR="${LUSTRE_DIR}/.local/share/uv/tools"
 export HF_HOME="${LUSTRE_DIR}/.cache/huggingface"
-export WANDB_MODE="disabled" # "online", "offline" or "disabled"
+export WANDB_MODE="online" # "online", "offline" or "disabled"

 # NSS CLI environment variables (used by safe-synthesizer CLI via pydantic-settings)
 # These are picked up automatically by CLISettings in the CLI:
script/slurm/slurm_nss_matrix.sh
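The inline comment in `env_variables.sh` names three accepted values for `WANDB_MODE`. As a rough sketch, a mistyped value could be caught early with a small guard. `is_valid_wandb_mode` is a hypothetical function, not part of the repository, and it deliberately accepts only the three values listed in that comment.

```bash
# Hypothetical helper (not in the repository).
# Returns 0 if the argument is one of the values named in the
# env_variables.sh comment, and 1 for anything else (including empty).
is_valid_wandb_mode() {
    case "${1:-}" in
        online|offline|disabled) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller might run `is_valid_wandb_mode "${WANDB_MODE:-}"` right after sourcing `env_variables.sh` and abort with an error message when it fails.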

Lines changed: 14 additions & 2 deletions
@@ -42,6 +42,20 @@ if [ -z "${LUSTRE_DIR:-}" ]; then
 exit 1
 fi

+if [[ ! -f "${LUSTRE_DIR}/.api_tokens.sh" ]]; then
+echo "ERROR: ${LUSTRE_DIR}/.api_tokens.sh not found." >&2
+echo "Create it with at least NSS_INFERENCE_KEY (and WANDB_API_KEY if WANDB_MODE=online)." >&2
+exit 1
+fi
+source "${LUSTRE_DIR}/.api_tokens.sh"
+
+if [[ "${WANDB_MODE:-disabled}" == "online" && -z "${WANDB_API_KEY:-}" ]]; then
+echo "ERROR: WANDB_MODE is 'online' but WANDB_API_KEY is not set." >&2
+echo "Add 'export WANDB_API_KEY=\"<your_key>\"' to ${LUSTRE_DIR}/.api_tokens.sh" >&2
+echo "Or set WANDB_MODE=disabled in env_variables.sh to skip W&B logging." >&2
+exit 1
+fi
+
 if [ -z "${NSS_SHARED_DIR:-}" ]; then
 echo "NSS_SHARED_DIR must be set" >&2
 echo "Run script through submit_slurm_jobs.sh, or if running manually source env_variables.sh before running" >&2
@@ -95,8 +109,6 @@ uv sync --frozen --extra cu128 --extra engine --group dev
 # for column classification
 export NSS_INFERENCE_ENDPOINT=https://integrate.api.nvidia.com/v1
 export NIM_MODEL_ID=qwen/qwen2.5-coder-32b-instruct
-source "${LUSTRE_DIR}/.api_tokens.sh"
-

 # Extract dataset name for path construction (handles both full paths and simple names)
 # e.g., "/path/to/adult.csv" -> "adult", "/path/to/data.parquet" -> "data", "adult" -> "adult"

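The change to `slurm_nss_matrix.sh` sources `.api_tokens.sh` earlier in the script, so that `WANDB_API_KEY` is already in the environment when the new W&B check runs. The check treats an unset `WANDB_MODE` as `disabled` through the `${WANDB_MODE:-disabled}` expansion. The sketch below restates that condition as a function so its behavior is easier to see in isolation. `wandb_preflight` is a hypothetical name, and it returns a status instead of calling `exit` as the script does.

```bash
# Hypothetical restatement of the W&B check from the diff (not in the repository).
# Returns 1 only when W&B is online and no API key is available.
wandb_preflight() {
    # Unset or empty WANDB_MODE falls back to "disabled", as in the diff.
    local mode="${WANDB_MODE:-disabled}"

    if [[ "$mode" == "online" && -z "${WANDB_API_KEY:-}" ]]; then
        echo "ERROR: WANDB_MODE is 'online' but WANDB_API_KEY is not set." >&2
        return 1
    fi
    return 0
}
```

Under this condition, `offline` and `disabled` pass without a key, and `online` passes only when `WANDB_API_KEY` is non-empty.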