System Info
- CPU architecture: N/A (code inspection issue)
- GPU: N/A (issue identified through source analysis)
- TensorRT-LLM branch: main
- TensorRT-LLM commit: current main branch at time of investigation
- OS: N/A
Additional information:
- This issue was identified through source-code inspection and review of the MLA capability gating logic.
- No specific hardware was required to observe the behavior.
- The report concerns the SM allowlists used by the MLA block-reuse and chunked-prefill feature gates.
Who can help?
@kaiyux
Reproduction
Summary
While reviewing the MLA capability gating logic in tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, I noticed that both MLA KV-cache reuse and MLA chunked prefill are gated by the same SM allowlist: [90, 100, 103, 120].
SM121 is excluded from both checks.
Relevant code:
if kv_cache_config.enable_block_reuse and sm_version not in [
        90, 100, 103, 120
]:
    ...
if enable_chunked_context and sm_version not in [
        90, 100, 103, 120
]:
    ...
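To make the effect of the membership test concrete, here is a minimal standalone sketch of the check. It is not the TensorRT-LLM code: `MLA_SM_ALLOWLIST` and `mla_feature_allowed` are hypothetical names introduced for illustration, and only the list contents are taken from the snippet above.

```python
# Hypothetical stand-in for the gate in py_executor_creator.py.
# Only the allowlist values come from the source; the names are illustrative.
MLA_SM_ALLOWLIST = [90, 100, 103, 120]


def mla_feature_allowed(sm_version: int) -> bool:
    """Return True when the SM version passes the current allowlist check."""
    return sm_version in MLA_SM_ALLOWLIST


if __name__ == "__main__":
    for sm in (90, 100, 103, 120, 121):
        state = "allowed" if mla_feature_allowed(sm) else "falls back"
        print(f"SM{sm}: {state}")
```

Under this check SM120 passes and SM121 does not, which is the asymmetry this report asks about.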
I could not find any code, comments, tests, documentation, or commit history indicating that SM121 is intentionally unsupported for MLA block reuse or MLA chunked prefill.
At the same time, multiple other locations in the repository treat SM120 and SM121 as the same Blackwell family.
Examples include:
fused_moe_cute_dsl_b12x.py
deep_ep_low_latency.py
eagle3_dynamic_tree.py
- several integration tests using
(120, 121) checks
Additionally, the MLA XQA JIT path contains:
// SM121 uses the same cubin target as SM120 (sm_120f) for compatibility.
in:
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/src/nvrtcWrapper.cpp
Steps to reproduce the behavior
- Review the MLA feature gating logic in tensorrt_llm/_torch/pyexecutor/py_executor_creator.py.
- Observe that SM121 is excluded from both the MLA block-reuse and chunked-prefill allowlists.
- Compare this behavior against other SM120/SM121 checks throughout the repository and the MLA XQA JIT kernel support path.
Minimal example
The existing unit test pattern in:
tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py
can be adapted with:
kv_cache_reuse, runtime_cache_reuse = _run_create_py_executor(
    monkeypatch,
    sm_version=121,
    kv_cache_quant_algo=QuantAlgo.NO_QUANT,
)
Under the current implementation, SM121 follows the unsupported-SM fallback path and MLA cache reuse is disabled.
Expected behavior
If SM121 is intended to be supported similarly to SM120 for MLA execution, I would expect SM121 to be included in the MLA capability allowlists.
Alternatively, if SM121 is intentionally unsupported, it would be helpful to document the architectural limitation or rationale for the exclusion.
In either case, I would expect the behavior to be explicitly documented.
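If the maintainers confirm that SM121 should be treated like SM120, one possible shape for the change is a single shared constant used by both gates. This is only a sketch under that assumption: the constant and the two helper functions below are hypothetical and do not exist in the repository under these names.

```python
# Sketch of a possible fix, assuming SM121 is confirmed to be supported.
# All names here are hypothetical; only the existing values 90, 100, 103, 120
# come from the current allowlists.
MLA_SUPPORTED_SM_VERSIONS = frozenset({90, 100, 103, 120, 121})


def mla_block_reuse_supported(sm_version: int) -> bool:
    """Gate for MLA KV-cache block reuse."""
    return sm_version in MLA_SUPPORTED_SM_VERSIONS


def mla_chunked_prefill_supported(sm_version: int) -> bool:
    """Gate for MLA chunked prefill; shares the same allowlist on purpose."""
    return sm_version in MLA_SUPPORTED_SM_VERSIONS
```

Sharing one constant would keep the two gates from drifting apart, and a comment next to it would be a natural place to record the rationale if SM121 has to stay excluded instead.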
actual behavior
SM121 falls through the unsupported-SM path and MLA KV-cache reuse / MLA chunked prefill are disabled.
Specifically:
- kv_cache_config.enable_block_reuse is forced to False
- attn_runtime_features.cache_reuse is forced to False
- MLA chunked prefill is disabled when requested
The runtime emits warnings indicating that these features are unsupported on SM121.
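The fallback described above can be modelled with a small self-contained sketch. The dataclasses, the `chunked_prefill` field, the function name, and the use of Python's `warnings` module are all assumptions made for illustration; the real code operates on TensorRT-LLM's own config objects and logger, and its warning text may differ.

```python
import warnings
from dataclasses import dataclass

# Allowlist values as quoted in this report.
MLA_SM_ALLOWLIST = (90, 100, 103, 120)


@dataclass
class KvCacheConfig:
    """Hypothetical stand-in for the real KV-cache config object."""
    enable_block_reuse: bool = True


@dataclass
class AttnRuntimeFeatures:
    """Hypothetical stand-in; `chunked_prefill` is an assumed field name."""
    cache_reuse: bool = True
    chunked_prefill: bool = True


def apply_mla_gates(sm_version: int, kv_cache_config: KvCacheConfig,
                    features: AttnRuntimeFeatures) -> None:
    """Disable requested MLA features in place when the SM is not allowlisted."""
    if kv_cache_config.enable_block_reuse and sm_version not in MLA_SM_ALLOWLIST:
        warnings.warn(f"MLA KV-cache reuse is not supported on SM{sm_version}; disabling.")
        kv_cache_config.enable_block_reuse = False
        features.cache_reuse = False
    if features.chunked_prefill and sm_version not in MLA_SM_ALLOWLIST:
        warnings.warn(f"MLA chunked prefill is not supported on SM{sm_version}; disabling.")
        features.chunked_prefill = False


if __name__ == "__main__":
    cfg, feats = KvCacheConfig(), AttnRuntimeFeatures()
    apply_mla_gates(121, cfg, feats)
    print(cfg, feats)
```

Running the same function with `sm_version=120` leaves every flag untouched, so the two SM versions diverge only because of the allowlist contents.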
additional notes
This report is primarily a request for clarification.
I investigated whether the exclusion of SM121 was intentional and was unable to find:
- comments indicating MLA is unsupported on SM121
- tests expecting SM121 to be disabled
- documentation describing an SM121 limitation
- commit history explicitly excluding SM121
Because SM121 appears to share the same sm_120f MLA kernel target as SM120, I wanted to confirm whether the current allowlists are intentional or whether SM121 was unintentionally omitted when MLA support was expanded to additional architectures.
If the current behavior is intentional, I would appreciate any context on the limitation. If not, I would be happy to help with a follow-up fix and regression test.