feat(torch): add GMS weight-loading prototype #1
Draft
galletas1712 wants to merge 2 commits into dynamo/main from
Conversation
Force-pushed from a44b410 to d04b750
FYI — the Kimi failover doesn't exercise the unprotected sites (SIGTERM-kill + cold-boot, no …). #7 proposes moving the gate into `load_weight_shard`.
Introduce the GPU Memory Service (GMS) weight-loading prototype for Dynamo's cross-engine zero-copy weight sharing:

- `tensorrt_llm/_torch/memory/gpu_memory_backend.py` — `GMSBackend` wrapper around `gpu_memory_service.client.torch.allocator` providing publish/materialize primitives (`materialize_module`, `defer_finalize_write`, `move_untracked_params`) and `mem_pool_scope` for directing `torch.empty`/`torch.zeros` allocations into GMS-backed virtual memory.
- `tensorrt_llm/_torch/pyexecutor/model_loader.py` — extend `LoadFormat` with `GMS` and branch on `gms_backend.is_rw` to publish weights (RW) or materialize an existing layout zero-copy (RO).
- `tensorrt_llm/llmapi/llm_args.py` — new `gms_mode`, `gms_tag`, `gms_socket_path` fields; API stability YAML + test coverage.
- `tensorrt_llm/_torch/modules/linear.py` — mark `Linear` modules that materialized zero-copy from a committed GMS layout with `_weights_presharded = True`; gate re-sharding in `load_weight_shard` itself via an optional `module=` kwarg so every call site (not just the three unquantized helpers) is protected.

The `_weights_presharded` gate lives in `load_weight_shard` to avoid scattering the same ternary across ~90 callers (quantized scales, MoE expert loaders, fused/triton linear). Callers opt in by passing `module=module`; a presharded module forces `tensor_parallel_size=1`, `tensor_parallel_rank=0` and returns the tensor unchanged. Existing callers without `module=` retain legacy behavior.

Tests: `tests/unittest/_torch/memory/test_gms_backend.py`, `tests/unittest/_torch/modules/test_load_weight_shard.py`, `tests/unittest/llmapi/test_gms_args.py`.
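As an illustration of that gate, here is a minimal sketch in the spirit of `load_weight_shard`; the signature and shard math are simplified assumptions, and the real helper in `tensorrt_llm/_torch/modules/linear.py` takes more parameters and supports more sharding modes:

```python
# Hypothetical, simplified sketch of the _weights_presharded gate; the
# real load_weight_shard in tensorrt_llm/_torch/modules/linear.py has
# more parameters and handles more sharding modes.
from typing import Optional

import torch


def load_weight_shard(
    weight: torch.Tensor,
    tensor_parallel_size: int = 1,
    tensor_parallel_rank: int = 0,
    dim: int = 0,
    module: Optional[torch.nn.Module] = None,
) -> torch.Tensor:
    # Opt-in gate: a module materialized zero-copy from a committed GMS
    # layout already holds its final shard, so slicing it again would
    # corrupt the weights. Force a no-op shard instead.
    if module is not None and getattr(module, "_weights_presharded", False):
        tensor_parallel_size = 1
        tensor_parallel_rank = 0

    if tensor_parallel_size <= 1:
        # No tensor parallelism (or presharded): return unchanged.
        return weight

    shard_size = weight.shape[dim] // tensor_parallel_size
    return weight.narrow(dim, tensor_parallel_rank * shard_size, shard_size)
```

Call sites that pass `module=module` get the protection; callers that never pass `module` keep the legacy slicing behavior, which is what keeps the change local instead of a ~90-site ternary.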
Force-pushed from d04b750 to 80a4218
Summary
- Explicit `register_module_tensors(...)` / `finalize_gms_write(...)` write path
- `LoadFormat.GMS` weight-loading path plus focused GMS-only tests

Scope in this PR
Included:
- `tensorrt_llm/_torch/memory/gpu_memory_backend.py`
- `tensorrt_llm/_torch/memory/__init__.py`
- `LoadFormat.GMS` and GMS args in `tensorrt_llm/llmapi/llm_args.py`
- `tensorrt_llm/_torch/pyexecutor/model_loader.py`
- `Linear._weights_presharded` support for RO materialization

Intentionally excluded:
API target
This branch targets the new explicit Dynamo/GMS split API. On the TRT-LLM side the write path now does:
1. `register_module_tensors(...)`
2. `torch.cuda.synchronize()`
3. `finalize_gms_write(...)`
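A hedged sketch of what that RW publish sequence could look like from the loader's side; the method names come from this PR's text, but whether they hang off the `GMSBackend` instance (and their exact signatures) is an assumption:

```python
# Sketch of the RW publish path, assuming a GMSBackend-like object; the
# names register_module_tensors/finalize_gms_write/mem_pool_scope are
# from the PR text, their signatures are assumptions.
import torch


def publish_weights(gms_backend) -> None:
    # Allocations made inside mem_pool_scope are directed into
    # GMS-backed virtual memory instead of the default CUDA allocator.
    with gms_backend.mem_pool_scope():
        model = torch.nn.Linear(4096, 4096, device="cuda")

    # 1. Register the module's tensors with the memory service.
    gms_backend.register_module_tensors(model)
    # 2. Make sure all in-flight device copies have landed.
    torch.cuda.synchronize()
    # 3. Commit the write so RO peers can materialize the same
    #    layout zero-copy.
    gms_backend.finalize_gms_write()
```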
Deferred parity items

Still deferred from the broader single-node/failover line:
- `SleepConfig.restore_modes` pickling fix

Notes
This PR targets `dynamo/main` because that is the local TensorRT-LLM baseline carrying the Dynamo-oriented filelock hardening used for this extraction.