
feat(torch): add GMS weight-loading prototype #1

Draft

galletas1712 wants to merge 2 commits into dynamo/main from schwinns/gms-only-upstream-20260422

Conversation

@galletas1712 (Owner)

Summary

Scope in this PR

Included:

  • tensorrt_llm/_torch/memory/gpu_memory_backend.py
  • tensorrt_llm/_torch/memory/__init__.py
  • LoadFormat.GMS and GMS args in tensorrt_llm/llmapi/llm_args.py
  • GMS RW/RO load path in tensorrt_llm/_torch/pyexecutor/model_loader.py
  • minimal Linear._weights_presharded support for RO materialization
  • GMS-only unit coverage and API stability snapshot updates

Intentionally excluded: the single-node/failover parity items listed under "Deferred parity items" below.

API target

This branch targets the new explicit Dynamo/GMS split API. On the TRT-LLM side the write path now does the following (see the sketch after this list):

  1. register_module_tensors(...)
  2. torch.cuda.synchronize()
  3. a finalize-only finalize_gms_write(...)
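
A minimal sketch of that write path, assuming a `gms_backend` handle exposing the two calls named above. Only the call names come from this PR; the argument shapes and the helper itself are assumptions:

```python
import torch

def publish_weights(gms_backend, model: torch.nn.Module) -> None:
    # Sketch only: register every weight tensor of the module with the
    # GPU Memory Service (argument shape assumed).
    gms_backend.register_module_tensors(model)
    # Ensure all pending device work (init kernels, H2D copies) has
    # completed before the layout is committed.
    torch.cuda.synchronize()
    # Finalize-only commit: the layout is immutable after this call.
    gms_backend.finalize_gms_write()
```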

Deferred parity items

Still deferred from the broader single-node/failover line:

  • SleepConfig.restore_modes pickling fix
  • TP>1 GMS autotuner disable/guard
  • spec-dec / shared-draft RO alias-prebind parity

Notes

  • This draft PR is intended for review of the GMS-only extraction shape first.
  • The branch is currently based on the fork's dynamo/main line because that is the local TensorRT-LLM baseline carrying the Dynamo-oriented filelock hardening used for this extraction.

@galletas1712 (Owner, Author)

FYI — the _weights_presharded handling here only protects three of the ~90 load_weight_shard(..., module.tp_size, module.tp_rank, ...) call sites across linear.py, fused_moe/quantization.py, fused_moe/fused_moe_triton.py, and triton_linear.py. The quantized and MoE paths still re-shard unconditionally.

Kimi failover doesn't exercise the unprotected sites (SIGTERM-kill + cold-boot, no reload), but the design is brittle.

#7 proposes moving the gate into load_weight_shard itself via an optional module= keyword. It's stacked on the integrated tip rather than rewritten in place on this PR so that it doesn't orphan the downstream PRs (#4, #5) or leave orphaned commits behind. If/when we rebase the whole stack, #7 can be squashed into this PR.

Introduce the GPU Memory Service (GMS) weight-loading prototype for
Dynamo's cross-engine zero-copy weight sharing:

- `tensorrt_llm/_torch/memory/gpu_memory_backend.py` — `GMSBackend`
  wrapper around `gpu_memory_service.client.torch.allocator` providing
  publish/materialize primitives (`materialize_module`,
  `defer_finalize_write`, `move_untracked_params`) and `mem_pool_scope`
  for directing `torch.empty`/`torch.zeros` allocations into GMS-backed
  virtual memory.
- `tensorrt_llm/_torch/pyexecutor/model_loader.py` — extend
  `LoadFormat` with `GMS` and branch on `gms_backend.is_rw` to publish
  weights (RW) or materialize an existing layout zero-copy (RO); see
  the RW/RO sketch after this list.
- `tensorrt_llm/llmapi/llm_args.py` — new `gms_mode`, `gms_tag`,
  `gms_socket_path` fields; API stability YAML + test coverage.
- `tensorrt_llm/_torch/modules/linear.py` — mark `Linear` modules that
  materialized zero-copy from a committed GMS layout with
  `_weights_presharded = True`; gate re-sharding in `load_weight_shard`
  itself via an optional `module=` kwarg so every call site (not just
  the three unquantized helpers) is protected.
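
A minimal sketch of that RW/RO branch, assuming `GMSBackend` exposes `is_rw`, `mem_pool_scope`, and `materialize_module` as listed above; `build_model` is a hypothetical stand-in for the real loader code:

```python
import torch

def load_with_gms(gms_backend, build_model) -> torch.nn.Module:
    if gms_backend.is_rw:
        # RW engine: route torch.empty/torch.zeros into GMS-backed
        # virtual memory while the model is constructed, then publish
        # via the three-step write path from the API-target section.
        with gms_backend.mem_pool_scope():
            model = build_model()  # hypothetical builder
        gms_backend.register_module_tensors(model)
        torch.cuda.synchronize()
        gms_backend.finalize_gms_write()
    else:
        # RO engine: map the committed layout zero-copy instead of
        # re-allocating; Linear modules come back with
        # _weights_presharded = True.
        model = build_model()
        gms_backend.materialize_module(model)
    return model
```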

The `_weights_presharded` gate lives in `load_weight_shard` to avoid
scattering the same ternary across ~90 callers (quantized scales, MoE
expert loaders, fused/triton linear). Callers opt in by passing
`module=module`; when the module is presharded, `load_weight_shard`
forces `tensor_parallel_size=1`, `tensor_parallel_rank=0` and returns
the tensor unchanged. Existing callers without `module=` retain legacy
behavior.
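
A minimal sketch of that gate, assuming the positional `(weight, tensor_parallel_size, tensor_parallel_rank)` shape visible at the call sites quoted above; the slicing body is simplified, not the exact upstream code:

```python
from typing import Optional

import torch

def load_weight_shard(weight: torch.Tensor,
                      tensor_parallel_size: int = 1,
                      tensor_parallel_rank: int = 0,
                      *,
                      module: Optional[torch.nn.Module] = None) -> torch.Tensor:
    # Opt-in gate: a module materialized zero-copy from a committed GMS
    # layout is already sharded for this rank, so skip re-sharding.
    if module is not None and getattr(module, "_weights_presharded", False):
        tensor_parallel_size, tensor_parallel_rank = 1, 0
    if tensor_parallel_size <= 1:
        return weight  # single shard: return the tensor unchanged
    # Simplified legacy behavior: slice this rank's shard along dim 0.
    return weight.chunk(tensor_parallel_size, dim=0)[tensor_parallel_rank]
```

Call sites opt in with `load_weight_shard(w, module.tp_size, module.tp_rank, module=module)`; callers that never pass `module=` keep the legacy re-sharding path unchanged.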

Tests: `tests/unittest/_torch/memory/test_gms_backend.py`,
`tests/unittest/_torch/modules/test_load_weight_shard.py`,
`tests/unittest/llmapi/test_gms_args.py`.
@galletas1712 force-pushed the schwinns/gms-only-upstream-20260422 branch from d04b750 to 80a4218 on April 23, 2026 at 08:52.
