[Bug]: [pyexecutor] SM121 appears to be unintentionally excluded from MLA block-reuse and chunked-prefill allowlists #15344

@DhineshPonnarasan


System Info

  • CPU architecture: N/A (code inspection issue)

  • GPU: N/A (issue identified through source analysis)

  • TensorRT-LLM branch: main

  • TensorRT-LLM commit: current main branch at time of investigation

  • OS: N/A

  • Additional information:

    • This issue was identified through source-code inspection and review of the MLA capability gating logic.
    • No specific hardware was required to observe the behavior.
    • The report concerns the SM allowlists used by the MLA block-reuse and chunked-prefill feature gates.

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Summary

While reviewing the MLA capability gating logic in tensorrt_llm/_torch/pyexecutor/py_executor_creator.py, I noticed that both MLA KV-cache reuse and MLA chunked prefill are gated by the following SM allowlist:

[90, 100, 103, 120]

SM121 is excluded from both checks.

Relevant code:

if kv_cache_config.enable_block_reuse and sm_version not in [
    90, 100, 103, 120
]:
    ...
if enable_chunked_context and sm_version not in [
    90, 100, 103, 120
]:
    ...

I could not find any code, comments, tests, documentation, or commit history indicating that SM121 is intentionally unsupported for MLA block reuse or MLA chunked prefill.

At the same time, multiple other locations in the repository treat SM120 and SM121 as the same Blackwell family.

Examples include:

  • fused_moe_cute_dsl_b12x.py
  • deep_ep_low_latency.py
  • eagle3_dynamic_tree.py
  • several integration tests using (120, 121) checks

Additionally, the MLA XQA JIT path contains:

// SM121 uses the same cubin target as SM120 (sm_120f) for compatibility.

in:

cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/src/nvrtcWrapper.cpp

Steps to reproduce the behavior

  1. Review the MLA feature gating logic in:
    tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  2. Observe that SM121 is excluded from both MLA block-reuse and chunked-prefill allowlists.
  3. Compare this behavior against other SM120/SM121 checks throughout the repository and the MLA XQA JIT kernel support path.
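
The review steps above can be partly automated. The following is a rough heuristic sketch, not an existing repository tool: it parses Python source with the standard `ast` module and reports integer-only list, tuple, or set literals that contain 120 but not 121. The function name `find_sm120_only_allowlists` is hypothetical, and the approach assumes the allowlists are written as inline literals, as in the snippet quoted in this issue.

```python
import ast


def find_sm120_only_allowlists(source: str) -> list[tuple[int, list[int]]]:
    """Return (line, values) for int-only literals that list 120 but not 121.

    Heuristic only: it misses allowlists built from named constants and
    comparisons such as `sm_version == 120`.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.List, ast.Tuple, ast.Set)):
            continue
        values = [
            elt.value for elt in node.elts
            if isinstance(elt, ast.Constant) and type(elt.value) is int
        ]
        # Skip empty literals and literals with any non-integer element.
        if not values or len(values) != len(node.elts):
            continue
        if 120 in values and 121 not in values:
            hits.append((node.lineno, values))
    return sorted(hits)


# Sample input modeled on the checks quoted in this issue.
sample = (
    "if enable_block_reuse and sm_version not in [\n"
    "    90, 100, 103, 120\n"
    "]:\n"
    "    pass\n"
    "if sm_version in (120, 121):\n"
    "    pass\n"
)
print(find_sm120_only_allowlists(sample))
```

Running this over the contents of `py_executor_creator.py` (for example, read with `pathlib.Path.read_text`) should flag both MLA checks, assuming they still use inline list literals.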

Minimal example

The existing unit test pattern in:

tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py

can be adapted with:

kv_cache_reuse, runtime_cache_reuse = _run_create_py_executor(
    monkeypatch,
    sm_version=121,
    kv_cache_quant_algo=QuantAlgo.NO_QUANT,
)

Under the current implementation, SM121 follows the unsupported-SM fallback path and MLA cache reuse is disabled.

Expected behavior

If SM121 is intended to be supported similarly to SM120 for MLA execution, I would expect SM121 to be included in the MLA capability allowlists.

Alternatively, if SM121 is intentionally unsupported, it would be helpful to document the architectural limitation or rationale for the exclusion.

In either case, I would expect the behavior to be explicitly documented.
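
If the omission turns out to be unintentional, one possible shape for the fix is a single shared allowlist used by both checks, so the two gates cannot drift apart. This is only a sketch: the names `MLA_FEATURE_SM_ALLOWLIST` and `mla_feature_supported` are hypothetical, and adding 121 assumes that SM121 really is equivalent to SM120 for these MLA features, which is exactly what this issue asks maintainers to confirm.

```python
# Hypothetical shared constant; its name and location are assumptions.
# 121 is included on the unconfirmed assumption that SM121 matches SM120
# for MLA block reuse and chunked prefill.
MLA_FEATURE_SM_ALLOWLIST = frozenset({90, 100, 103, 120, 121})


def mla_feature_supported(sm_version: int) -> bool:
    """Return True if MLA block reuse / chunked prefill may stay enabled."""
    return sm_version in MLA_FEATURE_SM_ALLOWLIST


for sm in (90, 120, 121, 89):
    print(sm, mla_feature_supported(sm))
```

If SM121 is instead intentionally unsupported, the same constant would be a natural place for a comment recording the reason.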

Actual behavior

SM121 falls through to the unsupported-SM path, so both MLA KV-cache reuse and MLA chunked prefill are disabled.

Specifically:

  • kv_cache_config.enable_block_reuse is forced to False
  • attn_runtime_features.cache_reuse is forced to False
  • MLA chunked prefill is disabled when requested

The runtime emits warnings indicating that these features are unsupported on SM121.
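
To make the observed behavior concrete, here is a simplified, self-contained model of the two gates. It is a stand-in written for this report, not the code in `py_executor_creator.py`: the `GateResult` container, the `apply_mla_gates` function, and the warning text are all hypothetical. Only the allowlist values and the three affected flags come from the source.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("mla_gate_sketch")

# Allowlist values copied from the two checks quoted in this issue.
MLA_ALLOWED_SMS = [90, 100, 103, 120]


@dataclass
class GateResult:
    """Hypothetical container for the three flags affected by the gates."""
    enable_block_reuse: bool
    cache_reuse: bool
    chunked_prefill: bool


def apply_mla_gates(sm_version: int, enable_block_reuse: bool,
                    enable_chunked_context: bool) -> GateResult:
    """Simplified model of the gating described above (not the real code)."""
    block_reuse = enable_block_reuse
    chunked = enable_chunked_context
    if block_reuse and sm_version not in MLA_ALLOWED_SMS:
        # Illustrative message; the real warning text may differ.
        logger.warning("MLA block reuse unsupported on SM%d", sm_version)
        block_reuse = False
    if chunked and sm_version not in MLA_ALLOWED_SMS:
        logger.warning("MLA chunked prefill unsupported on SM%d", sm_version)
        chunked = False
    # Assumption: the runtime cache_reuse flag mirrors enable_block_reuse.
    return GateResult(block_reuse, block_reuse, chunked)


print(apply_mla_gates(120, True, True))
print(apply_mla_gates(121, True, True))
```

In this model SM120 keeps all three features enabled, while SM121 loses all of them, which matches the behavior reported here.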

Additional notes

This report is primarily a request for clarification.

I investigated whether the exclusion of SM121 was intentional and was unable to find:

  • comments indicating MLA is unsupported on SM121
  • tests expecting SM121 to be disabled
  • documentation describing an SM121 limitation
  • commit history explicitly excluding SM121

Because SM121 appears to share the same sm_120f MLA kernel target as SM120, I wanted to confirm whether the current allowlists are intentional or whether SM121 was unintentionally omitted when MLA support was expanded to additional architectures.

If the current behavior is intentional, I would appreciate any context on the limitation. If not, I would be happy to help with a follow-up fix and regression test.


    Labels

    Inference runtime<NV>, bug
