wandb: raise init_timeout, add retry, fix shared-mode init for cross-region clusters #1027
Open · DavidBellamy wants to merge 1 commit into radixark:main
Conversation
Contributor
Code Review
This pull request introduces a retry mechanism with exponential backoff for Weights & Biases initialization to improve reliability in high-latency or cross-region environments. It also increases the default initialization timeout and adds unique labels for primary and secondary actors to facilitate better auditing in the W&B UI. Feedback was provided to optimize the retry loop by removing redundant delays after the final attempt, ensuring robustness against invalid configuration values, and simplifying exception handling.
Comment on lines +49 to +68
```python
last_exc: BaseException | None = None
for attempt in range(1, WANDB_INIT_RETRY_ATTEMPTS + 1):
    try:
        return wandb.init(**init_kwargs)
    except wandb.errors.CommError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except wandb.errors.UsageError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except Exception as exc:  # unexpected; re-raise immediately
        logger.error("wandb.init (%s) failed with non-retryable %s: %s", role, type(exc).__name__, exc)
        raise
    wait = WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1))
    logger.warning(
        "wandb.init (%s) attempt %d/%d failed: %s. Retrying in %.1fs.",
        role, attempt, WANDB_INIT_RETRY_ATTEMPTS, last_exc, wait,
    )
    time.sleep(wait)
logger.error("wandb.init (%s) exhausted %d retries; giving up", role, WANDB_INIT_RETRY_ATTEMPTS)
assert last_exc is not None
raise last_exc
```
Contributor
The retry logic in `_wandb_init_with_retry` has a few areas for improvement:

- **Redundant sleep:** the loop currently sleeps after the final attempt fails, which adds unnecessary delay before the error is raised.
- **Exception grouping:** `CommError` and `UsageError` can be caught in a single block to simplify the code.
- **Robustness:** if `WANDB_INIT_RETRY_ATTEMPTS` is set to 0 or less via environment variables, the loop is skipped and the code crashes with an `AssertionError` at line 67. Using `max(1, ...)` ensures at least one attempt is made.
- **Type hinting:** `last_exc` is typed as `BaseException`, but it only ever holds `Exception` subclasses in this context; `Exception` is more idiomatic.

I suggest refactoring the loop to address these points.
Suggested change
Before:

```python
last_exc: BaseException | None = None
for attempt in range(1, WANDB_INIT_RETRY_ATTEMPTS + 1):
    try:
        return wandb.init(**init_kwargs)
    except wandb.errors.CommError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except wandb.errors.UsageError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except Exception as exc:  # unexpected; re-raise immediately
        logger.error("wandb.init (%s) failed with non-retryable %s: %s", role, type(exc).__name__, exc)
        raise
    wait = WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1))
    logger.warning(
        "wandb.init (%s) attempt %d/%d failed: %s. Retrying in %.1fs.",
        role, attempt, WANDB_INIT_RETRY_ATTEMPTS, last_exc, wait,
    )
    time.sleep(wait)
logger.error("wandb.init (%s) exhausted %d retries; giving up", role, WANDB_INIT_RETRY_ATTEMPTS)
assert last_exc is not None
raise last_exc
```

After:

```python
last_exc: Exception | None = None
attempts = max(1, WANDB_INIT_RETRY_ATTEMPTS)
for attempt in range(1, attempts + 1):
    try:
        return wandb.init(**init_kwargs)
    except (wandb.errors.CommError, wandb.errors.UsageError) as exc:  # type: ignore[attr-defined]
        last_exc = exc
        if attempt == attempts:
            break
        wait = WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1))
        logger.warning(
            "wandb.init (%s) attempt %d/%d failed: %s. Retrying in %.1fs.",
            role, attempt, attempts, last_exc, wait,
        )
        time.sleep(wait)
    except Exception as exc:  # unexpected; re-raise immediately
        logger.error("wandb.init (%s) failed with non-retryable %s: %s", role, type(exc).__name__, exc)
        raise
logger.error("wandb.init (%s) exhausted %d attempts; giving up", role, attempts)
raise last_exc  # last_exc is guaranteed to be set if we reached here
```
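As a quick sanity check of the suggested pattern, here is a self-contained sketch; `FakeCommError`, `init_with_retry`, and the stub init function are stand-ins I made up for the wandb pieces, not code from the patch:

```python
import time

WANDB_INIT_RETRY_ATTEMPTS = 3
WANDB_INIT_RETRY_BACKOFF_SECS = 0.01  # tiny value so the demo runs fast

class FakeCommError(Exception):
    """Stand-in for wandb.errors.CommError in this sketch."""

def init_with_retry(init_fn):
    last_exc = None
    attempts = max(1, WANDB_INIT_RETRY_ATTEMPTS)  # guard against <= 0 env overrides
    for attempt in range(1, attempts + 1):
        try:
            return init_fn()
        except FakeCommError as exc:
            last_exc = exc
            if attempt == attempts:
                break  # no sleep after the final attempt
            time.sleep(WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1)))
    raise last_exc

calls = []
def flaky():
    # Fails twice, then succeeds -- exercises the retry path.
    calls.append(1)
    if len(calls) < 3:
        raise FakeCommError("transient")
    return "run"

assert init_with_retry(flaky) == "run" and len(calls) == 3
```

Note that the success case returns on the third attempt without any post-failure sleep, which is exactly the redundant-delay fix the review asks for.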
…r cross-region clusters

In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary` make HTTPS round-trips to wandb cloud (login + run create/attach). On cross-region clusters with concurrent actor bursts, a single round-trip can exceed the wandb SDK's 90s default `init_timeout`, tearing down the whole run with a silent handshake abort.

Shared mode itself does not use a local primary-to-secondary socket handshake. Per wandb/wandb#6882, each writer spawns an independent wandb-core that talks to the cloud directly; aggregation is server-side by run_id. The observed failure is pure HTTPS latency against the 90s default, not a local race.

Changes
-------
- Bump `init_timeout` to 300s for primary and secondary Settings. Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning.
- Wrap both init paths in a bounded exponential-backoff retry (`_wandb_init_with_retry`) that re-attempts on wandb.errors.CommError and wandb.errors.UsageError. 3 attempts with 5→10→20s backoff by default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` / `WANDB_INIT_RETRY_BACKOFF_SECS`.
- Add `x_label` tagging per wandb distributed-training docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`. Enables per-rank console-log filtering in the wandb UI.
- Drop `reinit=True` from secondary init_kwargs. Shared mode natively supports concurrent writers on a single run; `reinit=True` triggered stale-state warnings on secondary actors without functional benefit.

Compat
------
- Default behavior preserved for users unaffected by cross-region latency: `WANDB_INIT_TIMEOUT_SECS=90` and `WANDB_INIT_RETRY_ATTEMPTS=1` restore the prior behavior in-place via env vars.
- Retry wrapper only triggers on terminal wandb transport errors; unrelated exceptions are still raised immediately.
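For reference, the 5→10→20s schedule in the commit message follows directly from the `base * 2**(attempt-1)` backoff formula used in the patch. A one-liner to compute the waits (the function name is mine, not the PR's):

```python
def backoff_waits(base_secs=5.0, attempts=3):
    # Wait before retry k is base * 2**(k-1): 5.0, 10.0, 20.0 for the defaults.
    return [base_secs * (2 ** (k - 1)) for k in range(1, attempts + 1)]

print(backoff_waits())  # [5.0, 10.0, 20.0]
```

With the defaults, the total backoff if every attempt fails is 35s, which stays comfortably inside the new 300s `init_timeout`.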
Force-pushed from 51fb03f to 28cb49c
Context
Fixes an online-mode boot failure in `shared` mode when primary + secondary writers are on a cross-region cluster with concurrent actor bursts. The default `init_timeout=90.0` is too tight for the cross-region HTTPS round-trip wandb needs for login + run attach when the cluster is many hops from wandb cloud.

Observed failure
`miles/utils/wandb_utils.py` calls `wandb.init(settings=Settings(mode="shared", x_primary=True))` on the primary and `wandb.init(..., x_primary=False)` on the secondary. On cross-region clusters, the secondary's init exceeds 90s and aborts the whole run with a silent handshake abort, making it impractical to run online. The workaround I've been using is `WANDB_MODE=offline` with an out-of-band sync loop; this PR removes the need for that workaround.

Changes
- `init_timeout=300.0` on both primary and secondary `wandb.Settings` (configurable via `WANDB_INIT_TIMEOUT_SECS` env var)
- `_wandb_init_with_retry` helper: bounded exponential-backoff retry on `wandb.errors.CommError` / `UsageError` (3 attempts, 5→10→20s; env-tunable)
- `x_label` per-rank tagging per the shared-mode docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`
- Dropped `reinit=True` from secondary init_kwargs (not needed for shared mode, triggered stale-state warnings)

Why shared mode is the right abstraction here
Per wandb/wandb#6882's feature description, shared mode spawns independent wandb-cores per writer and aggregates server-side by `run_id`. There's no local socket handshake between primary and secondary. The observed failure is pure HTTPS latency plus the 90s `init_timeout` default.

Testing
Validated against a 20-node pilot with `WANDB_MODE=online`. Expected behavior: boot completes well under 5 min, all ranks attach to the same run, near-realtime dashboards. Confirmed.

Rollback
If the defaults misbehave in any environment, `WANDB_INIT_TIMEOUT_SECS=90` and `WANDB_INIT_RETRY_ATTEMPTS=1` restore the prior behavior in-place via env vars.
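A minimal sketch of how env-var overrides like these can be parsed safely (the helper names, defaults, and clamping are illustrative assumptions, not the PR's actual code):

```python
import os

def env_float(name, default):
    # Fall back to the default on missing or malformed values.
    try:
        return float(os.environ.get(name, default))
    except ValueError:
        return default

def env_int(name, default, minimum=1):
    try:
        value = int(os.environ.get(name, default))
    except ValueError:
        value = default
    return max(minimum, value)  # e.g. never allow zero retry attempts

# Simulate the rollback settings described above.
os.environ["WANDB_INIT_TIMEOUT_SECS"] = "90"
os.environ["WANDB_INIT_RETRY_ATTEMPTS"] = "1"
print(env_float("WANDB_INIT_TIMEOUT_SECS", 300.0))   # 90.0
print(env_int("WANDB_INIT_RETRY_ATTEMPTS", 3))       # 1
```

The `max(minimum, value)` clamp mirrors the reviewer's `max(1, ...)` suggestion, so a stray `WANDB_INIT_RETRY_ATTEMPTS=0` degrades to a single attempt instead of a crash.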