[None][fix] Always sync local ranks after prefetch in HfWeightLoader#13556
lancelly wants to merge 1 commit into NVIDIA:main from
Conversation
`enable_prefetch` depends on `psutil.virtual_memory().available`, a per-rank volatile value, so different local ranks may take different branches. Gating `local_mpi_barrier()` on `enable_prefetch` could deadlock between ranks that prefetched and ranks that skipped. Move the barrier out of the conditional so all local ranks synchronize unconditionally; ranks that didn't prefetch reach the barrier immediately. Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
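The failure mode described above can be reproduced in miniature. This is an illustrative sketch, not the actual TensorRT-LLM code: `threading.Barrier` stands in for `local_mpi_barrier()`, four threads stand in for four local ranks, and a timeout is used so the demo fails instead of hanging forever.

```python
import threading

# Four "ranks", a barrier sized for all of them, but only the ranks whose
# (volatile) enable_prefetch flag is True ever call it -- the buggy pattern.
barrier = threading.Barrier(4)
errors = []

def buggy_load(enable_prefetch):
    if enable_prefetch:
        # prefetch would happen here
        try:
            barrier.wait(timeout=0.5)  # real MPI code would hang forever
        except threading.BrokenBarrierError:
            errors.append("deadlock")

threads = [threading.Thread(target=buggy_load, args=(flag,))
           for flag in (True, True, False, False)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(errors)  # the two prefetching ranks give up waiting for the rest
```

With mixed decisions, only two of the four participants ever reach the collective, so the waiting ranks block until the timeout breaks the barrier; with a real MPI barrier there is no timeout, hence the hard deadlock.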
/bot run --disable-fail-fast

PR_Github #45919 [ run ] triggered by Bot. Commit:
PR_Github #45919 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #46018 [ run ] triggered by Bot. Commit:
PR_Github #46018 [ run ] completed with state
Summary

Move `local_mpi_barrier()` in `HfWeightLoader.load_weights` out of the `if enable_prefetch:` branch so all local ranks invoke the collective unconditionally. The previous code gated a collective on a per-rank volatile value, which caused a 4-rank deadlock during DeepSeek-V4 Pro (NVFP4) loading with MTP=1 + DEP4.

Root cause
`enable_prefetch` is computed by comparing `prefetch_size` against 0.9x the available host memory, where `psutil.virtual_memory().available` is an OS-level instantaneous value. Even though `_get_local_available_host_memory()` does an `MPI.MIN` allreduce, the snapshot itself is taken at slightly different wall-clock moments per rank, and per-rank CPU memory peaks (model meta init, `model.to("cuda")`, GC, page-cache churn) drift.

Observed on a single node: for DSV4 Pro (`prefetch_size` = 805 GB), the 0.9x threshold (~894 GB) landed in the middle of the per-rank distribution, so two ranks took `enable_prefetch = True` and two took `False`. The two `True` ranks entered `local_mpi_barrier()` while the other two had already moved on, producing a hard deadlock.

Why this was not discovered before
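The straddling effect can be shown with a toy calculation. Only `prefetch_size` (805 GB) comes from the PR description; the per-rank `available` snapshots below are invented for illustration.

```python
# Made-up per-rank snapshots of psutil's "available" memory, in GB, taken
# at slightly different moments on the same node.
prefetch_size = 805  # GB, from the PR description
available = {0: 931, 1: 902, 2: 887, 3: 879}  # hypothetical values

# Sketch of the decision: prefetch only if the files fit in 90% of the
# locally available memory (the 0.9x threshold named in the PR text).
enable_prefetch = {rank: prefetch_size < 0.9 * avail
                   for rank, avail in available.items()}
print(enable_prefetch)  # {0: True, 1: True, 2: False, 3: False}
```

Because 805 / 0.9 ≈ 894 GB sits inside the spread of snapshots, ranks 0 and 1 decide to prefetch while ranks 2 and 3 do not, which is exactly the split that gated the barrier.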
Introduced in PR #6486 (2025-08-01, DeepSeek R1 FP8 on Blackwell). The deadlock requires two conditions: (1) `prefetch_size` ~ 0.45-0.55 x system_mem, so the threshold cuts through the per-rank `available` distribution; (2) enough per-rank skew in the `available` snapshots that the spread straddles the threshold.

Every prior model (Llama 70B / 405B FP8, Mixtral 8x22B, DeepSeek-V3 / R1 FP8) had `prefetch_size` well below the threshold on typical nodes, so all ranks unanimously chose `True`. DSV4 Pro NVFP4 (805 GB) is the first model whose footprint sits near system_mem / 2. MTP=1 with DEP4 amplifies condition (2): the extra `DeepseekV4MTP` draft layer with attention-DP shifts per-rank `model.to("cuda")` completion by hundreds of milliseconds, widening the per-rank `available` spread enough to straddle the threshold.

Fix
Move the barrier out of the conditional. Ranks that did not prefetch reach the barrier immediately, so cost is negligible and no decision semantics change.
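The fixed control flow can be sketched the same way as the failure: a hypothetical stand-in where `threading.Barrier` plays the role of `local_mpi_barrier()` and the barrier call sits outside the conditional, so ranks with mixed decisions still all arrive.

```python
import threading

# Sketch of the fixed pattern: the barrier is unconditional, so ranks
# that skip prefetching reach it immediately. Names are illustrative,
# not the actual TensorRT-LLM code.
barrier = threading.Barrier(4)
prefetched = []

def fixed_load(enable_prefetch):
    if enable_prefetch:
        prefetched.append(True)  # prefetch would happen here
    barrier.wait(timeout=5)      # every rank arrives, prefetch or not

threads = [threading.Thread(target=fixed_load, args=(flag,))
           for flag in (True, True, False, False)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(prefetched))  # 2: decisions still differ, yet no rank is stuck
```

The branch still controls only whether the prefetch work happens; the collective's participant set is now fixed, which is the invariant MPI barriers require.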