wandb: raise init_timeout, add retry, fix shared-mode init for cross-region clusters #1027

Open
DavidBellamy wants to merge 1 commit into radixark:main from LLM360:fix/wandb-shared-mode-online-timeout
Conversation

DavidBellamy commented Apr 21, 2026

Context

Fixes an online-mode boot failure in shared mode when primary + secondary writers are on a cross-region cluster with concurrent actor bursts. The default init_timeout=90.0 is too tight for the cross-region HTTPS round-trip wandb needs for login + run attach when the cluster is many hops from wandb cloud.

Observed failure

miles/utils/wandb_utils.py calls wandb.init(settings=Settings(mode="shared", x_primary=True)) on the primary and wandb.init(..., x_primary=False) on the secondary. On cross-region clusters, the secondary's init exceeds 90s and the whole run dies with a silent handshake abort, making it impractical to run online. The workaround I've been using is WANDB_MODE=offline with an out-of-band sync loop; this PR removes the need for it.

Changes

  • init_timeout=300.0 on both primary and secondary wandb.Settings (configurable via WANDB_INIT_TIMEOUT_SECS env var)
  • New _wandb_init_with_retry helper: bounded exponential-backoff retry on wandb.errors.CommError/UsageError (3 attempts, 5→10→20s; env-tunable)
  • x_label per-rank tagging per the shared-mode docs: primary gets rank_<rank>_primary, secondaries get rank_<rank>_secondary
  • Drop reinit=True from secondary init_kwargs (not needed for shared mode, triggered stale-state warnings)
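For illustration, the per-writer settings described in the list above could be assembled roughly as follows. This is a minimal sketch, not the code in this PR: the helper name `shared_settings_kwargs` is hypothetical, and it returns a plain dict so it can be inspected without wandb installed. It assumes the installed wandb version accepts `mode`, `x_primary`, `x_label`, and `init_timeout` as `Settings` fields.

```python
import os

DEFAULT_INIT_TIMEOUT_SECS = 300.0  # default proposed in this PR


def shared_settings_kwargs(rank: int, primary: bool) -> dict:
    """Build keyword arguments for wandb.Settings in shared mode.

    Hypothetical helper: the name and the dict-based shape are assumptions.
    """
    # The env var overrides the default timeout; float() raises on bad input.
    timeout = float(
        os.environ.get("WANDB_INIT_TIMEOUT_SECS", DEFAULT_INIT_TIMEOUT_SECS)
    )
    role = "primary" if primary else "secondary"
    return {
        "mode": "shared",
        "x_primary": primary,
        "x_label": f"rank_{rank}_{role}",  # e.g. rank_0_primary
        "init_timeout": timeout,
    }
```

A caller would then pass the result along the lines of `wandb.Settings(**shared_settings_kwargs(rank=0, primary=True))`.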

Why shared mode is the right abstraction here

Per wandb/wandb#6882's feature description, shared mode spawns independent wandb-cores per writer and aggregates server-side by run_id. There's no local socket handshake between primary and secondary. The observed failure is pure HTTPS latency plus the 90s init_timeout default.

Testing

Validated on a 20-node pilot with WANDB_MODE=online. As expected, boot completed in well under 5 min, all ranks attached to the same run, and dashboards updated in near real time.

Rollback

If the defaults misbehave in any environment, WANDB_INIT_TIMEOUT_SECS=90 and WANDB_INIT_RETRY_ATTEMPTS=1 restore the prior behavior in-place via env vars.
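One way the env-var overrides might be resolved is sketched below. The function name `resolve_init_config` and its fallback-on-invalid-input behavior are assumptions for illustration; the PR's actual parsing code is not shown in this conversation. The variable names and defaults (300s, 3 attempts, 5s base backoff) are the ones stated in the PR.

```python
import os
from typing import Mapping


def resolve_init_config(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve timeout and retry settings from env vars (hypothetical helper)."""

    def read(name, cast, default):
        # Fall back to the default when the variable is unset or unparsable.
        try:
            return cast(env[name])
        except (KeyError, ValueError):
            return default

    return {
        "init_timeout": read("WANDB_INIT_TIMEOUT_SECS", float, 300.0),
        # Clamp to at least one attempt so the init is always tried.
        "attempts": max(1, read("WANDB_INIT_RETRY_ATTEMPTS", int, 3)),
        "backoff_secs": read("WANDB_INIT_RETRY_BACKOFF_SECS", float, 5.0),
    }
```

With the rollback values above, `resolve_init_config({"WANDB_INIT_TIMEOUT_SECS": "90", "WANDB_INIT_RETRY_ATTEMPTS": "1"})` yields a 90s timeout and a single attempt, leaving the backoff at its default.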

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a retry mechanism with exponential backoff for Weights & Biases initialization to improve reliability in high-latency or cross-region environments. It also increases the default initialization timeout and adds unique labels for primary and secondary actors to facilitate better auditing in the W&B UI. Feedback was provided to optimize the retry loop by removing redundant delays after the final attempt, ensuring robustness against invalid configuration values, and simplifying exception handling.

Comment on lines +49 to +68
last_exc: BaseException | None = None
for attempt in range(1, WANDB_INIT_RETRY_ATTEMPTS + 1):
    try:
        return wandb.init(**init_kwargs)
    except wandb.errors.CommError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except wandb.errors.UsageError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except Exception as exc:  # unexpected; re-raise immediately
        logger.error("wandb.init (%s) failed with non-retryable %s: %s", role, type(exc).__name__, exc)
        raise
    wait = WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1))
    logger.warning(
        "wandb.init (%s) attempt %d/%d failed: %s. Retrying in %.1fs.",
        role, attempt, WANDB_INIT_RETRY_ATTEMPTS, last_exc, wait,
    )
    time.sleep(wait)
logger.error("wandb.init (%s) exhausted %d retries; giving up", role, WANDB_INIT_RETRY_ATTEMPTS)
assert last_exc is not None
raise last_exc
medium

The retry logic in _wandb_init_with_retry has a few areas for improvement:

  1. Redundant Sleep: The loop currently sleeps after the final attempt fails, which adds unnecessary delay before the error is raised.
  2. Exception Grouping: CommError and UsageError can be caught in a single block to simplify the code.
  3. Robustness: If WANDB_INIT_RETRY_ATTEMPTS is set to 0 or less via environment variables, the loop is skipped and the code crashes with an AssertionError at line 67. Using max(1, ...) ensures at least one attempt is made.
  4. Type Hinting: last_exc is typed as BaseException, but it only ever holds Exception subclasses in this context. Exception is more idiomatic.

I suggest refactoring the loop to address these points.

Suggested change
last_exc: Exception | None = None
attempts = max(1, WANDB_INIT_RETRY_ATTEMPTS)
for attempt in range(1, attempts + 1):
    try:
        return wandb.init(**init_kwargs)
    except (wandb.errors.CommError, wandb.errors.UsageError) as exc:  # type: ignore[attr-defined]
        last_exc = exc
        if attempt == attempts:
            break
        wait = WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1))
        logger.warning(
            "wandb.init (%s) attempt %d/%d failed: %s. Retrying in %.1fs.",
            role, attempt, attempts, last_exc, wait,
        )
        time.sleep(wait)
    except Exception as exc:  # unexpected; re-raise immediately
        logger.error("wandb.init (%s) failed with non-retryable %s: %s", role, type(exc).__name__, exc)
        raise
logger.error("wandb.init (%s) exhausted %d attempts; giving up", role, attempts)
raise last_exc  # last_exc is guaranteed to be set if we reached here
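
The behavior of the suggested loop can be checked in isolation with a self-contained sketch that does not import wandb. The function `retry_with_backoff` and its parameters are hypothetical stand-ins for `_wandb_init_with_retry`; the injectable `sleep` argument exists only so the waits can be recorded instead of actually slept.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retryable: Tuple[Type[Exception], ...],
    attempts: int = 3,
    base_wait: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exception types with exponential backoff."""
    attempts = max(1, attempts)  # always try at least once
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == attempts:
                raise  # final failure: propagate without sleeping
            sleep(base_wait * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Under these assumptions, three attempts with a 5s base produce only two waits (5s, then 10s), because no sleep follows the final failure; a 20s wait would only occur with four or more attempts.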

…r cross-region clusters

In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary`
make HTTPS round-trips to wandb cloud (login + run create/attach). On
cross-region clusters with concurrent actor bursts, a single round-trip
can exceed the wandb SDK's 90s default `init_timeout` — tearing down the
whole run with a silent handshake abort.

Shared mode itself does not use a local primary-to-secondary socket
handshake. Per wandb/wandb#6882, each writer spawns an independent
wandb-core that talks to the cloud directly; aggregation is server-side
by run_id. The observed failure is pure HTTPS latency against the 90s
default, not a local race.

Changes
-------

- Bump `init_timeout` to 300s for primary and secondary Settings.
  Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning.
- Wrap both init paths in a bounded exponential-backoff retry
  (`_wandb_init_with_retry`) that re-attempts on wandb.errors.CommError
  and wandb.errors.UsageError. 3 attempts with 5→10→20s backoff by
  default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` /
  `WANDB_INIT_RETRY_BACKOFF_SECS`.
- Add `x_label` tagging per wandb distributed-training docs: primary
  gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`.
  Enables per-rank console-log filtering in the wandb UI.
- Drop `reinit=True` from secondary init_kwargs. Shared mode natively
  supports concurrent writers on a single run; `reinit=True` triggered
  stale-state warnings on secondary actors without functional benefit.

Compat
------

- Prior behavior remains available to users unaffected by cross-region
  latency: `WANDB_INIT_TIMEOUT_SECS=90` and `WANDB_INIT_RETRY_ATTEMPTS=1`
  restore it in-place via env vars.
- Retry wrapper only triggers on `wandb.errors.CommError` and
  `wandb.errors.UsageError`; unrelated exceptions are still raised
  immediately.
@DavidBellamy DavidBellamy force-pushed the fix/wandb-shared-mode-online-timeout branch from 51fb03f to 28cb49c Compare April 21, 2026 19:33