wandb: raise init_timeout, add retry, fix shared-mode init for cross-region clusters #1027
Open · DavidBellamy wants to merge 1 commit into radixark:main
Conversation
Contributor
Code Review
This pull request introduces a retry mechanism with exponential backoff for Weights & Biases initialization to improve reliability in high-latency or cross-region environments. It also increases the default initialization timeout and adds unique labels for primary and secondary actors to facilitate better auditing in the W&B UI. Feedback was provided to optimize the retry loop by removing redundant delays after the final attempt, ensuring robustness against invalid configuration values, and simplifying exception handling.
Comment on lines +49 to +68
```python
last_exc: BaseException | None = None
for attempt in range(1, WANDB_INIT_RETRY_ATTEMPTS + 1):
    try:
        return wandb.init(**init_kwargs)
    except wandb.errors.CommError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except wandb.errors.UsageError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except Exception as exc:  # unexpected; re-raise immediately
        logger.error("wandb.init (%s) failed with non-retryable %s: %s", role, type(exc).__name__, exc)
        raise
    wait = WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1))
    logger.warning(
        "wandb.init (%s) attempt %d/%d failed: %s. Retrying in %.1fs.",
        role, attempt, WANDB_INIT_RETRY_ATTEMPTS, last_exc, wait,
    )
    time.sleep(wait)
logger.error("wandb.init (%s) exhausted %d retries; giving up", role, WANDB_INIT_RETRY_ATTEMPTS)
assert last_exc is not None
raise last_exc
```
Contributor
The retry logic in `_wandb_init_with_retry` has a few areas for improvement:

- **Redundant sleep:** the loop currently sleeps after the final attempt fails, which adds unnecessary delay before the error is raised.
- **Exception grouping:** `CommError` and `UsageError` can be caught in a single block to simplify the code.
- **Robustness:** if `WANDB_INIT_RETRY_ATTEMPTS` is set to 0 or less via environment variables, the loop is skipped and the code crashes with an `AssertionError` at line 67. Using `max(1, ...)` ensures at least one attempt is made.
- **Type hinting:** `last_exc` is typed as `BaseException`, but it only ever holds `Exception` subclasses in this context; `Exception` is more idiomatic.

I suggest refactoring the loop to address these points.
Suggested change
Before:

```python
last_exc: BaseException | None = None
for attempt in range(1, WANDB_INIT_RETRY_ATTEMPTS + 1):
    try:
        return wandb.init(**init_kwargs)
    except wandb.errors.CommError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except wandb.errors.UsageError as exc:  # type: ignore[attr-defined]
        last_exc = exc
    except Exception as exc:  # unexpected; re-raise immediately
        logger.error("wandb.init (%s) failed with non-retryable %s: %s", role, type(exc).__name__, exc)
        raise
    wait = WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1))
    logger.warning(
        "wandb.init (%s) attempt %d/%d failed: %s. Retrying in %.1fs.",
        role, attempt, WANDB_INIT_RETRY_ATTEMPTS, last_exc, wait,
    )
    time.sleep(wait)
logger.error("wandb.init (%s) exhausted %d retries; giving up", role, WANDB_INIT_RETRY_ATTEMPTS)
assert last_exc is not None
raise last_exc
```

After:

```python
last_exc: Exception | None = None
attempts = max(1, WANDB_INIT_RETRY_ATTEMPTS)
for attempt in range(1, attempts + 1):
    try:
        return wandb.init(**init_kwargs)
    except (wandb.errors.CommError, wandb.errors.UsageError) as exc:  # type: ignore[attr-defined]
        last_exc = exc
        if attempt == attempts:
            break
        wait = WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1))
        logger.warning(
            "wandb.init (%s) attempt %d/%d failed: %s. Retrying in %.1fs.",
            role, attempt, attempts, last_exc, wait,
        )
        time.sleep(wait)
    except Exception as exc:  # unexpected; re-raise immediately
        logger.error("wandb.init (%s) failed with non-retryable %s: %s", role, type(exc).__name__, exc)
        raise
logger.error("wandb.init (%s) exhausted %d attempts; giving up", role, attempts)
raise last_exc  # last_exc is guaranteed to be set if we reached here
```
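As a quick sanity check of the suggested pattern, here is a self-contained sketch; `FakeCommError`, `init_with_retry`, and the stub init function are stand-ins I made up for the wandb pieces, not code from the patch:

```python
import time

WANDB_INIT_RETRY_ATTEMPTS = 3
WANDB_INIT_RETRY_BACKOFF_SECS = 0.01  # tiny value so the demo runs fast

class FakeCommError(Exception):
    """Stand-in for wandb.errors.CommError in this sketch."""

def init_with_retry(init_fn):
    last_exc = None
    attempts = max(1, WANDB_INIT_RETRY_ATTEMPTS)  # guard against <= 0 env overrides
    for attempt in range(1, attempts + 1):
        try:
            return init_fn()
        except FakeCommError as exc:
            last_exc = exc
            if attempt == attempts:
                break  # no sleep after the final attempt
            time.sleep(WANDB_INIT_RETRY_BACKOFF_SECS * (2 ** (attempt - 1)))
    raise last_exc

calls = []
def flaky():
    # Fails twice, then succeeds -- exercises the retry path.
    calls.append(1)
    if len(calls) < 3:
        raise FakeCommError("transient")
    return "run"

assert init_with_retry(flaky) == "run" and len(calls) == 3
```

Note that the success case returns on the third attempt without any post-failure sleep, which is exactly the redundant-delay fix the review asks for.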
…r cross-region clusters

In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary` make HTTPS round-trips to wandb cloud (login + run create/attach). On cross-region clusters with concurrent actor bursts, a single round-trip can exceed the wandb SDK's 90s default `init_timeout`, tearing down the whole run with a silent handshake abort.

Shared mode itself does not use a local primary-to-secondary socket handshake. Per wandb/wandb#6882, each writer spawns an independent wandb-core that talks to the cloud directly; aggregation is server-side by run_id. The observed failure is pure HTTPS latency against the 90s default, not a local race.

Changes
-------
- Bump `init_timeout` to 300s for primary and secondary Settings. Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning.
- Wrap both init paths in a bounded exponential-backoff retry (`_wandb_init_with_retry`) that re-attempts on wandb.errors.CommError and wandb.errors.UsageError. 3 attempts with 5→10→20s backoff by default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` / `WANDB_INIT_RETRY_BACKOFF_SECS`.
- Add `x_label` tagging per wandb distributed-training docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`. Enables per-rank console-log filtering in the wandb UI.
- Drop `reinit=True` from secondary init_kwargs. Shared mode natively supports concurrent writers on a single run; `reinit=True` triggered stale-state warnings on secondary actors without functional benefit.

Compat
------
- Default behavior preserved for users unaffected by cross-region latency: `WANDB_INIT_TIMEOUT_SECS=90` and `WANDB_INIT_RETRY_ATTEMPTS=1` restore the prior behavior in-place via env vars.
- Retry wrapper only triggers on terminal wandb transport errors; unrelated exceptions are still raised immediately.
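For reference, the 5→10→20s schedule in the commit message follows directly from the `base * 2**(attempt-1)` backoff formula used in the patch. A one-liner to compute the waits (the function name is mine, not the PR's):

```python
def backoff_waits(base_secs=5.0, attempts=3):
    # Wait before retry k is base * 2**(k-1): 5.0, 10.0, 20.0 for the defaults.
    return [base_secs * (2 ** (k - 1)) for k in range(1, attempts + 1)]

print(backoff_waits())  # [5.0, 10.0, 20.0]
```

With the defaults, the total backoff if every attempt fails is 35s, which stays comfortably inside the new 300s `init_timeout`.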
Force-pushed from 51fb03f to 28cb49c
Context
Fixes an online-mode boot failure in `shared` mode when primary + secondary writers are on a cross-region cluster with concurrent actor bursts. The default `init_timeout=90.0` is too tight for the cross-region HTTPS round-trip wandb needs for login + run attach when the cluster is many hops from wandb cloud.

Observed failure
`miles/utils/wandb_utils.py` calls `wandb.init(settings=Settings(mode="shared", x_primary=True))` on the primary and `wandb.init(..., x_primary=False)` on the secondary. On cross-region clusters, the secondary's init exceeds 90s and aborts the whole run with a silent handshake abort, making it impractical to run online. The workaround I've been using is `WANDB_MODE=offline` with an out-of-band sync loop; this PR removes the need for that workaround.

Changes
- `init_timeout=300.0` on both primary and secondary `wandb.Settings` (configurable via `WANDB_INIT_TIMEOUT_SECS` env var)
- `_wandb_init_with_retry` helper: bounded exponential-backoff retry on `wandb.errors.CommError` / `UsageError` (3 attempts, 5→10→20s; env-tunable)
- `x_label` per-rank tagging per the shared-mode docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`
- Dropped `reinit=True` from secondary init_kwargs (not needed for shared mode, triggered stale-state warnings)

Why shared mode is the right abstraction here
Per wandb/wandb#6882's feature description, shared mode spawns independent wandb-cores per writer and aggregates server-side by `run_id`. There's no local socket handshake between primary and secondary. The observed failure is pure HTTPS latency plus the 90s `init_timeout` default.

Testing
Validated against a 20-node pilot with `WANDB_MODE=online`. Expected behavior: boot completes well under 5 min, all ranks attach to the same run, near-realtime dashboards. Confirmed.

Rollback
If the defaults misbehave in any environment, `WANDB_INIT_TIMEOUT_SECS=90` and `WANDB_INIT_RETRY_ATTEMPTS=1` restore the prior behavior in-place via env vars.
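A minimal sketch of how env-var overrides like these can be parsed safely (the helper names, defaults, and clamping are illustrative assumptions, not the PR's actual code):

```python
import os

def env_float(name, default):
    # Fall back to the default on missing or malformed values.
    try:
        return float(os.environ.get(name, default))
    except ValueError:
        return default

def env_int(name, default, minimum=1):
    try:
        value = int(os.environ.get(name, default))
    except ValueError:
        value = default
    return max(minimum, value)  # e.g. never allow zero retry attempts

# Simulate the rollback settings described above.
os.environ["WANDB_INIT_TIMEOUT_SECS"] = "90"
os.environ["WANDB_INIT_RETRY_ATTEMPTS"] = "1"
print(env_float("WANDB_INIT_TIMEOUT_SECS", 300.0))   # 90.0
print(env_int("WANDB_INIT_RETRY_ATTEMPTS", 3))       # 1
```

The `max(minimum, value)` clamp mirrors the reviewer's `max(1, ...)` suggestion, so a stray `WANDB_INIT_RETRY_ATTEMPTS=0` degrades to a single attempt instead of a crash.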