[None][fix] Always sync local ranks after prefetch in HfWeightLoader#13556
lancelly wants to merge 1 commit into NVIDIA:main from
Conversation
`enable_prefetch` depends on `psutil.virtual_memory().available`, a per-rank volatile value, so different local ranks may take different branches. Gating `local_mpi_barrier()` on `enable_prefetch` could deadlock between ranks that prefetched and ranks that skipped. Move the barrier out of the conditional so all local ranks synchronize unconditionally; ranks that didn't prefetch reach the barrier immediately. Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
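The failure mode described above can be reproduced in miniature. This is an illustrative sketch, not the actual TensorRT-LLM code: `threading.Barrier` stands in for `local_mpi_barrier()`, four threads stand in for four local ranks, and a timeout is used so the demo fails instead of hanging forever.

```python
import threading

# Four "ranks", a barrier sized for all of them, but only the ranks whose
# (volatile) enable_prefetch flag is True ever call it -- the buggy pattern.
barrier = threading.Barrier(4)
errors = []

def buggy_load(enable_prefetch):
    if enable_prefetch:
        # prefetch would happen here
        try:
            barrier.wait(timeout=0.5)  # real MPI code would hang forever
        except threading.BrokenBarrierError:
            errors.append("deadlock")

threads = [threading.Thread(target=buggy_load, args=(flag,))
           for flag in (True, True, False, False)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(errors)  # the two prefetching ranks give up waiting for the rest
```

With mixed decisions, only two of the four participants ever reach the collective, so the waiting ranks block until the timeout breaks the barrier; with a real MPI barrier there is no timeout, hence the hard deadlock.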
/bot run --disable-fail-fast

PR_Github #45919 [ run ] triggered by Bot. Commit:
PR_Github #45919 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #46018 [ run ] triggered by Bot. Commit:
PR_Github #46018 [ run ] completed with state
Summary

Move `local_mpi_barrier()` in `HfWeightLoader.load_weights` out of the `if enable_prefetch:` branch so all local ranks invoke the collective unconditionally. The previous code gated a collective on a per-rank volatile value, which caused a 4-rank deadlock during DeepSeek-V4 Pro (NVFP4) loading with MTP=1 + DEP4.

Root cause
`enable_prefetch` is computed by comparing `prefetch_size` against 0.9x the available host memory, where `psutil.virtual_memory().available` is an OS-level instantaneous value. Even though `_get_local_available_host_memory()` does an `MPI.MIN` allreduce, the snapshot itself is taken at slightly different wall-clock moments per rank, and per-rank CPU memory peaks (model meta init, `model.to("cuda")`, GC, page-cache churn) drift.

Observed on a single node: for DSV4 Pro (`prefetch_size` = 805 GB), the 0.9x threshold (~894 GB) landed in the middle of the per-rank distribution, so two ranks took `enable_prefetch = True` and two took `False`. The two `True` ranks entered `local_mpi_barrier()` while the other two had already moved on, producing a hard deadlock.

Why this was not discovered before
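The straddling effect can be shown with a toy calculation. Only `prefetch_size` (805 GB) comes from the PR description; the per-rank `available` snapshots below are invented for illustration.

```python
# Made-up per-rank snapshots of psutil's "available" memory, in GB, taken
# at slightly different moments on the same node.
prefetch_size = 805  # GB, from the PR description
available = {0: 931, 1: 902, 2: 887, 3: 879}  # hypothetical values

# Sketch of the decision: prefetch only if the files fit in 90% of the
# locally available memory (the 0.9x threshold named in the PR text).
enable_prefetch = {rank: prefetch_size < 0.9 * avail
                   for rank, avail in available.items()}
print(enable_prefetch)  # {0: True, 1: True, 2: False, 3: False}
```

Because 805 / 0.9 ≈ 894 GB sits inside the spread of snapshots, ranks 0 and 1 decide to prefetch while ranks 2 and 3 do not, which is exactly the split that gated the barrier.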
Introduced in PR #6486 (2025-08-01, DeepSeek R1 FP8 on Blackwell). The deadlock requires two conditions: (1) `prefetch_size` ~ 0.45-0.55 x system_mem, so the threshold cuts through the per-rank `available` distribution; (2) enough per-rank skew in the `available` snapshots that the spread straddles the threshold.

Every prior model (Llama 70B / 405B FP8, Mixtral 8x22B, DeepSeek-V3 / R1 FP8) had `prefetch_size` well below the threshold on typical nodes, so all ranks unanimously chose `True`. DSV4 Pro NVFP4 (805 GB) is the first model whose footprint sits near system_mem / 2. MTP=1 with DEP4 amplifies condition (2): the extra `DeepseekV4MTP` draft layer with attention-DP shifts per-rank `model.to("cuda")` completion by hundreds of milliseconds, widening the per-rank `available` spread enough to straddle the threshold.

Fix
Move the barrier out of the conditional. Ranks that did not prefetch reach the barrier immediately, so cost is negligible and no decision semantics change.
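The fixed control flow can be sketched the same way as the failure: a hypothetical stand-in where `threading.Barrier` plays the role of `local_mpi_barrier()` and the barrier call sits outside the conditional, so ranks with mixed decisions still all arrive.

```python
import threading

# Sketch of the fixed pattern: the barrier is unconditional, so ranks
# that skip prefetching reach it immediately. Names are illustrative,
# not the actual TensorRT-LLM code.
barrier = threading.Barrier(4)
prefetched = []

def fixed_load(enable_prefetch):
    if enable_prefetch:
        prefetched.append(True)  # prefetch would happen here
    barrier.wait(timeout=5)      # every rank arrives, prefetch or not

threads = [threading.Thread(target=fixed_load, args=(flag,))
           for flag in (True, True, False, False)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(prefetched))  # 2: decisions still differ, yet no rank is stuck
```

The branch still controls only whether the prefetch work happens; the collective's participant set is now fixed, which is the invariant MPI barriers require.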