feat(torch): add GMS weight-loading prototype #1
Draft
galletas1712 wants to merge 2 commits into dynamo/main from
Conversation
Force-pushed from a44b410 to d04b750
FYI — the Kimi failover doesn't exercise the unprotected sites (SIGTERM-kill + cold-boot, no …). #7 proposes moving the gate into `load_weight_shard`.
Introduce the GPU Memory Service (GMS) weight-loading prototype for Dynamo's cross-engine zero-copy weight sharing:

- `tensorrt_llm/_torch/memory/gpu_memory_backend.py` — `GMSBackend` wrapper around `gpu_memory_service.client.torch.allocator` providing publish/materialize primitives (`materialize_module`, `defer_finalize_write`, `move_untracked_params`) and `mem_pool_scope` for directing `torch.empty`/`torch.zeros` allocations into GMS-backed virtual memory.
- `tensorrt_llm/_torch/pyexecutor/model_loader.py` — extend `LoadFormat` with `GMS` and branch on `gms_backend.is_rw` to publish weights (RW) or materialize an existing layout zero-copy (RO).
- `tensorrt_llm/llmapi/llm_args.py` — new `gms_mode`, `gms_tag`, `gms_socket_path` fields; API stability YAML + test coverage.
- `tensorrt_llm/_torch/modules/linear.py` — mark `Linear` modules that materialized zero-copy from a committed GMS layout with `_weights_presharded = True`; gate re-sharding in `load_weight_shard` itself via an optional `module=` kwarg so every call site (not just the three unquantized helpers) is protected.

The `_weights_presharded` gate lives in `load_weight_shard` to avoid scattering the same ternary across ~90 callers (quantized scales, MoE expert loaders, fused/triton linear). Callers opt in by passing `module=module`; a presharded module forces `tensor_parallel_size=1`, `tensor_parallel_rank=0` and returns the tensor unchanged. Existing callers without `module=` retain legacy behavior.

Tests: `tests/unittest/_torch/memory/test_gms_backend.py`, `tests/unittest/_torch/modules/test_load_weight_shard.py`, `tests/unittest/llmapi/test_gms_args.py`.
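As an illustration of that gate, here is a minimal sketch in the spirit of `load_weight_shard`; the signature and shard math are simplified assumptions, and the real helper in `tensorrt_llm/_torch/modules/linear.py` takes more parameters and supports more sharding modes:

```python
# Hypothetical, simplified sketch of the _weights_presharded gate; the
# real load_weight_shard in tensorrt_llm/_torch/modules/linear.py has
# more parameters and handles more sharding modes.
from typing import Optional

import torch


def load_weight_shard(
    weight: torch.Tensor,
    tensor_parallel_size: int = 1,
    tensor_parallel_rank: int = 0,
    dim: int = 0,
    module: Optional[torch.nn.Module] = None,
) -> torch.Tensor:
    # Opt-in gate: a module materialized zero-copy from a committed GMS
    # layout already holds its final shard, so slicing it again would
    # corrupt the weights. Force a no-op shard instead.
    if module is not None and getattr(module, "_weights_presharded", False):
        tensor_parallel_size = 1
        tensor_parallel_rank = 0

    if tensor_parallel_size <= 1:
        # No tensor parallelism (or presharded): return unchanged.
        return weight

    shard_size = weight.shape[dim] // tensor_parallel_size
    return weight.narrow(dim, tensor_parallel_rank * shard_size, shard_size)
```

Call sites that pass `module=module` get the protection; callers that never pass `module` keep the legacy slicing behavior, which is what keeps the change local instead of a ~90-site ternary.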
Force-pushed from d04b750 to 80a4218
Summary
- Explicit `register_module_tensors(...)` / `finalize_gms_write(...)` write path
- `LoadFormat.GMS` weight-loading path plus focused GMS-only tests

Scope in this PR
Included:
- `tensorrt_llm/_torch/memory/gpu_memory_backend.py`
- `tensorrt_llm/_torch/memory/__init__.py`
- `LoadFormat.GMS` and GMS args in `tensorrt_llm/llmapi/llm_args.py`
- `tensorrt_llm/_torch/pyexecutor/model_loader.py`
- `Linear._weights_presharded` support for RO materialization

Intentionally excluded:
API target
This branch targets the new explicit Dynamo/GMS split API. On the TRT-LLM side the write path now does:
1. `register_module_tensors(...)`
2. `torch.cuda.synchronize()`
3. `finalize_gms_write(...)`
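A hedged sketch of what that RW publish sequence could look like from the loader's side; the method names come from this PR's text, but whether they hang off the `GMSBackend` instance (and their exact signatures) is an assumption:

```python
# Sketch of the RW publish path, assuming a GMSBackend-like object; the
# names register_module_tensors/finalize_gms_write/mem_pool_scope are
# from the PR text, their signatures are assumptions.
import torch


def publish_weights(gms_backend) -> None:
    # Allocations made inside mem_pool_scope are directed into
    # GMS-backed virtual memory instead of the default CUDA allocator.
    with gms_backend.mem_pool_scope():
        model = torch.nn.Linear(4096, 4096, device="cuda")

    # 1. Register the module's tensors with the memory service.
    gms_backend.register_module_tensors(model)
    # 2. Make sure all in-flight device copies have landed.
    torch.cuda.synchronize()
    # 3. Commit the write so RO peers can materialize the same
    #    layout zero-copy.
    gms_backend.finalize_gms_write()
```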
Deferred parity items

Still deferred from the broader single-node/failover line:
- `SleepConfig.restore_modes` pickling fix

Notes
This PR targets `dynamo/main` because that is the local TensorRT-LLM baseline carrying the Dynamo-oriented filelock hardening used for this extraction.