74 changes: 74 additions & 0 deletions .claude/skills/serve-config-guide/SKILL.md
@@ -0,0 +1,74 @@
---
name: serve-config-guide
description: Generate a source-backed starting `trtllm-serve --config` YAML for
basic aggregate single-node PyTorch serving, aligned with checked-in TensorRT-LLM
configs and deployment docs. Preserves explicit latency / balanced / throughput
objectives. Excludes disaggregated, multi-node, and non-MTP speculative configs.
---

# Serve Config Guide

**Scope:** aggregate/IFB (in-flight batching) colocated prefill+decode, single node, PyTorch backend, non-speculative by default; DeepSeek-R1 MTP is the standard mode (all checked-in DeepSeek-R1 configs include it).

**Input:** model, GPU, ISL (input sequence length), OSL (output sequence length), concurrency, TP, performance objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified).
**Output:** repo-grounded starting YAML for `trtllm-serve --config`.

If the request is adjacent but out of scope, provide a best-effort answer using the nearest in-scope config as a starting point, clearly label inferred vs. verified fields, and point to the relevant feature doc in `docs/source/features/` (e.g., speculative-decoding, disagg-serving, parallel-strategy) or `examples/llm-api/`.
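
A minimal sketch of the expected output shape, with hypothetical placeholder values (a real answer must copy every value from a checked-in source per the constraints below):

```python
# Hypothetical starting config expressed as a Python dict and dumped to YAML.
# Field names are the scenario-dependent ones from Step 4; every value here
# is a placeholder, not a recommendation.
import yaml

starting_config = {
    "max_batch_size": 256,
    "max_num_tokens": 8192,
    "max_seq_len": 9216,  # ISL + OSL + chat template overhead
    "kv_cache_config": {"free_gpu_memory_fraction": 0.85},
    "cuda_graph_config": {"max_batch_size": 256},
}
print(yaml.safe_dump(starting_config, sort_keys=False))
# Then serve with: trtllm-serve --config <file>.yaml (see the model's
# deployment guide for the full launch command).
```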

## Constraints

1. **Speculative exclusion:** Exclude configs containing `speculative_config` by default. Exception: exact checked-in DeepSeek-R1 MTP configs (models with `decoding_type: MTP` in `examples/configs/`). When including MTP, copy the full `speculative_config` block verbatim — never interpolate speculative fields.

2. **Objective preservation:** Preserve the user's stated objective through config selection. Use `database.py` profile labels (`Min Latency`, `Balanced`, `Max Throughput`; plus `Low Latency`/`High Throughput` in smaller sets) as selection aids. If a config is unlabeled, treat it as a default starting point — do not claim it matches a specific objective. If the only match conflicts with the stated objective, call out the mismatch. (Constraints 1–2 are sketched in code after this list.)

3. **Source preference:** Prefer checked-in configs over interpolation. When docs and configs disagree, prefer the config for the exact scenario and note the mismatch. Mark any interpolation as unverified.
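
A sketch of how Constraints 1 and 2 might be applied in code (the `speculative_config`, `decoding_type`, and `profile` key names are assumptions about the recipe schema; read the real layout via `database.py`):

```python
# Hypothetical helpers for Constraints 1-2; treat the key names as assumptions.

def is_in_scope(recipe: dict) -> bool:
    """Constraint 1: drop speculative configs unless they are checked-in MTP."""
    spec = recipe.get("speculative_config")
    if spec is None:
        return True
    # Exception: exact checked-in DeepSeek-R1 MTP configs, copied verbatim.
    return spec.get("decoding_type") == "MTP"


def matches_objective(recipe: dict, objective: str | None) -> bool:
    """Constraint 2: only claim a match when the recipe carries a profile label."""
    if objective is None:
        return True  # unspecified objective: any in-scope recipe may start
    return recipe.get("profile") == objective  # unlabeled recipes never "match"
```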

## Response Format

For **exact matches**: `Config` → `Source` → `Launch command`

For **interpolated configs**: `Config` → `Source used as starting point` → `What to benchmark` (single list of knobs worth sweeping, not per-field unverified tags)

## Step 0: Lock Objective and Decode Mode

Identify the user's objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified) and decode mode (non-speculative or DeepSeek-R1 MTP per **Constraint 1**). Preserve both through the remaining steps.

## Step 1: Exact Database Match

Search `examples/configs/database/lookup.yaml` for an exact `(model, gpu, isl, osl, concurrency, num_gpus)` match. Use `database.py` as a loader/helper.

- Apply **speculative exclusion**.
- When multiple recipes exist at different concurrency points, use profile labels to match the user's objective per **objective preservation**.
- Prefer an exact match that also matches the stated objective over manual tuning.
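
A sketch of the Step 1 lookup, reusing the helpers from the Constraints section (the flat-list schema and key names are assumptions; load the real file through `database.py`):

```python
import yaml

def exact_match(lookup_path: str, model: str, gpu: str, isl: int, osl: int,
                concurrency: int, num_gpus: int,
                objective: str | None) -> list[dict]:
    """Return in-scope recipes for the exact scenario key, objective first."""
    with open(lookup_path) as f:
        recipes = yaml.safe_load(f)  # assumed: a flat list of recipe dicts
    key = {"model": model, "gpu": gpu, "isl": isl, "osl": osl,
           "concurrency": concurrency, "num_gpus": num_gpus}
    hits = [r for r in recipes
            if all(r.get(k) == v for k, v in key.items()) and is_in_scope(r)]
    labeled = [r for r in hits if matches_objective(r, objective)]
    # Fall back to unlabeled hits, but call out the objective mismatch.
    return labeled or hits
```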

## Step 2: Nearest Checked-In Config

If no exact match, widen the search to also include `examples/configs/curated/lookup.yaml`.

Apply the same constraints as Step 1. Additionally:
- A partial match from `database/` is preferred over a partial match from `curated/` for the same model (database configs are benchmark-tuned).
- Exclude disaggregated-only or prefill-only entries (e.g., `qwen3-disagg-prefill.yaml`).
- For curated configs, only treat intent as explicit when the repo labels it (e.g., `*-latency.yaml`, `*-throughput.yaml`, or guide text).
- If no in-scope config matches the stated objective, pick the nearest same-model starting point and call out the mismatch.
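
One way to order partial matches under these rules (a sketch; the `source` and `file` fields are assumptions about candidate metadata):

```python
def rank_candidates(candidates: list[dict], objective: str | None) -> list[dict]:
    """Step 2 ordering sketch: exclude disagg/prefill-only entries, then
    prefer objective-labeled matches, then database/ over curated/."""
    eligible = [r for r in candidates
                if "disagg" not in r.get("file", "")
                and "prefill" not in r.get("file", "")]
    return sorted(eligible, key=lambda r: (
        0 if matches_objective(r, objective) else 1,  # stated objective first
        0 if r.get("source") == "database" else 1,    # benchmark-tuned wins
    ))
```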

## Step 3: Read Model Docs

Search `docs/source/deployment-guide/` and `examples/models/core/` for the model's deployment guide and README. Read both before adjusting knobs.

**Excluded sources:** Do NOT use `docs/source/legacy/` tuning values or benchmark numbers — those were measured on the TensorRT engine-building backend and do not transfer to PyTorch backend serving.

**DeepSeek-V3 caveat:** For DeepSeek-V3/V3.2-Exp, use `examples/models/core/deepseek_v3/README.md`, not the R1 deployment guide.

## Step 4: Adjust Source-Backed Fields

Commonly scenario-dependent fields (adjust only these, guided by the checked-in source):

`max_batch_size`, `max_num_tokens`, `max_seq_len`, `enable_attention_dp`, `attention_dp_config.*`, `kv_cache_config.free_gpu_memory_fraction`, `moe_expert_parallel_size` (MoE), `moe_config.backend` (when guide specifies), `stream_interval`, `num_postprocess_workers`, `cuda_graph_config.max_batch_size`/`batch_sizes`, and MTP-specific fields when using DeepSeek-R1 MTP configs.

Do not assume other fields are constant across models/GPUs. For tuning notes, read `references/knob-heuristics.md`.
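
A sketch of constraining edits to that allowlist (top-level key names from the list above; the shallow merge is a simplification, since only specific nested fields such as `kv_cache_config.free_gpu_memory_fraction` are actually in scope):

```python
# Only the Step 4 fields may differ from the checked-in source config.
ADJUSTABLE = {
    "max_batch_size", "max_num_tokens", "max_seq_len",
    "enable_attention_dp", "attention_dp_config", "kv_cache_config",
    "moe_expert_parallel_size", "moe_config", "stream_interval",
    "num_postprocess_workers", "cuda_graph_config",
}

def adjust(source_config: dict, overrides: dict) -> dict:
    unknown = set(overrides) - ADJUSTABLE
    if unknown:
        raise ValueError(f"Keep source values for non-scenario fields: {unknown}")
    return {**source_config, **overrides}  # shallow: nested dicts replaced whole
```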

## Validation Checklist

- [ ] `trust_remote_code: true` called out as trust boundary when present
- [ ] `max_num_tokens` >= ISL + chat template overhead (requests rejected if violated)
- [ ] If interpolated: single "What to benchmark" section listing knobs to sweep, not per-field unverified tags
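
The first two checks, as a hedged executable sketch (the default overhead value is an assumption; checked-in configs show 20–200 tokens of chat template overhead):

```python
def validate(config: dict, isl: int, chat_template_overhead: int = 200) -> list[str]:
    """Return findings for the checklist above (first two items only)."""
    findings = []
    if config.get("trust_remote_code"):
        findings.append("trust_remote_code: true is a trust boundary; call it out")
    if config.get("max_num_tokens", 0) < isl + chat_template_overhead:
        findings.append("max_num_tokens below ISL + template overhead; "
                        "requests would be rejected")
    return findings
```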
53 changes: 53 additions & 0 deletions .claude/skills/serve-config-guide/references/knob-heuristics.md
@@ -0,0 +1,53 @@
# Source-Backed Tuning Notes

Read an exact or nearby checked-in config and the model's deployment guide **before** using these notes. These are not universal thresholds.

## Commonly Tuned Fields

| Field | Guidance |
|---|---|
| `max_batch_size` | Scheduler ceiling, not a memory reservation and NOT proportional to concurrency — actual batch size adapts at runtime. Copy from the nearest checked-in source config; do not invent a value from concurrency. Prefer keeping the source value unless OOM occurs. MoE models generally cap lower than dense. |
| `max_num_tokens` | Scheduler token budget. When chunked prefill is **disabled** (default): must exceed ISL plus chat template overhead; sweet spot is ISL to 2× ISL. When chunked prefill is **enabled**: acts as the chunk size — see `enable_chunked_prefill` section below. General default is 8192. Tune together with `max_batch_size`. |
| `max_seq_len` | Global hard cap on total tokens per request (prompt + output). Set to `ISL + OSL + chat_template_overhead`. Chat templates and benchmarking preambles add tokens beyond raw ISL — overhead varies by model (checked-in configs show 20–200 tokens). Setting too tight rejects or truncates requests; setting too loose wastes KV cache per request. Copy from nearest checked-in config when available. |
| `enable_attention_dp` | High-throughput knob. MoE+GQA models benefit at lower concurrency thresholds than MoE+MLA or Dense+GQA. Memory overhead: small for MLA (compressed attention), substantial for GQA (full replication). Can trigger OOM when combined with aggressive KV cache fraction. Follow the exact model guide/config. |
| `kv_cache_config.free_gpu_memory_fraction` | OOM lever. MLA models (compressed KV) tolerate higher fractions; GQA models need more headroom. Lower when ADP enabled to account for replicated attention overhead. Large MoE models with ADP may need notably conservative fractions. Guides often adjust `max_batch_size` or `max_seq_len` first. |
| `moe_expert_parallel_size` / `moe_config.backend` | MoE only. Copy both from checked-in source — EP does not necessarily equal TP. If no backend source exists, mark as unverified; benchmark CUTLASS vs TRTLLM. |
| `cuda_graph_config.max_batch_size` / `batch_sizes` | Caps which decode batch sizes get CUDA graphs captured; batches above this fall back to eager execution (no error, just slower). **Default to `max_batch_size`** (safe, covers all batch sizes). Only lower when memory is tight — e.g., DeepSeek-R1 conc=1 uses `cuda_graph_config.max_batch_size: 1` with server `max_batch_size: 512` to avoid wasting graph memory on unreachable sizes. Also capped by `max_num_tokens / (1 + max_total_draft_tokens)` at runtime. |
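
A worked example of the runtime cap from the last table row (illustrative numbers only):

```python
# Illustrative arithmetic for the runtime CUDA graph batch-size cap.
max_num_tokens = 8192
max_total_draft_tokens = 3    # e.g. an MTP config; 0 when non-speculative
requested = 512               # cuda_graph_config.max_batch_size

runtime_cap = max_num_tokens // (1 + max_total_draft_tokens)  # 2048
effective_graph_max = min(requested, runtime_cap)             # 512
```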

## KV Cache Estimation

Use these formulas to sanity-check whether a concurrency target fits in GPU memory. Read the required values from the model's HuggingFace config (`config.json`).

**Per-token KV cache size:**

- **GQA (standard grouped-query attention):**
`kv_per_token = 2 × num_attention_layers × (num_key_value_heads / TP) × head_dim × dtype_bytes`
When `enable_attention_dp` is enabled, KV cache is fully replicated per rank (not TP-sharded); use divisor 1 instead of TP.
- **MLA (multi-latent attention, e.g. DeepSeek-V2/V3):**
`kv_per_token = num_attention_layers × (kv_lora_rank + qk_rope_head_dim) × dtype_bytes`

Where `dtype_bytes` is 2 for BF16/FP16, 1 for FP8/INT8.

**Approximate max concurrent requests (upper bound):**

```
max_requests ≈ floor((GPU_HBM × 0.90 − model_weights_bytes / TP) / (kv_per_token × (ISL + OSL)))
```

The 0.90 factor reserves ~10% of HBM for CUDA context, driver, and runtime overhead. Result is per-GPU.

**HF config fields to read:** `num_attention_layers` (equals `num_hidden_layers` for standard transformers; differs for hybrid models like Nemotron-H), `num_key_value_heads`, `head_dim` (or `hidden_size / num_attention_heads`), `kv_lora_rank`, `qk_rope_head_dim`.

**Caveats:** This estimate ignores activation memory, CUDA graph workspace, MoE expert workspace, and attention data parallelism (ADP) overhead. Always prefer checked-in config values over formula-derived estimates. Mark any formula-derived number as unverified.
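
The formulas above as a runnable sketch (every number it produces is unverified per the caveats; the example values are from a Llama-3.1-8B-like `config.json` and are assumptions):

```python
import math

def kv_per_token_gqa(num_attention_layers: int, num_key_value_heads: int,
                     head_dim: int, dtype_bytes: int, tp: int,
                     attention_dp: bool = False) -> int:
    # With attention DP the KV cache is replicated per rank, not TP-sharded.
    shard = 1 if attention_dp else tp  # assumes heads divisible by TP
    return (2 * num_attention_layers * (num_key_value_heads // shard)
            * head_dim * dtype_bytes)

def kv_per_token_mla(num_attention_layers: int, kv_lora_rank: int,
                     qk_rope_head_dim: int, dtype_bytes: int) -> int:
    return num_attention_layers * (kv_lora_rank + qk_rope_head_dim) * dtype_bytes

def max_concurrent_requests(gpu_hbm_bytes: int, model_weights_bytes: int,
                            tp: int, kv_per_token: int, isl: int, osl: int) -> int:
    """Optimistic per-GPU upper bound; the 0.90 reserves runtime overhead."""
    usable = gpu_hbm_bytes * 0.90 - model_weights_bytes / tp
    return math.floor(usable / (kv_per_token * (isl + osl)))

# Example: BF16 GQA model, 32 layers, 8 KV heads, head_dim 128, TP=1.
kv = kv_per_token_gqa(32, 8, 128, dtype_bytes=2, tp=1)  # 131072 bytes/token
print(max_concurrent_requests(80 * 1024**3, 16 * 1024**3,
                              tp=1, kv_per_token=kv, isl=1024, osl=1024))  # 224
```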

## Chunked Prefill

Chunked prefill (`enable_chunked_prefill: true`) splits long prefill sequences into chunks so that decode batches sharing the same iteration are not starved. It is **disabled by default** and should be treated as an advanced latency optimization, not a default recommendation. See the `max_num_tokens` table entry above for how it changes token budget semantics.

**MLA models (DeepSeek-V2/V3/R1, Kimi-K2):**
- Chunked prefill IS supported for MLA — dedicated CUDA kernels exist with multi-round attention and softmax merging.
- **Hardware constraint:** only available on SM90 (Hopper) and SM100/SM103/SM120 (Blackwell+). The runtime automatically disables it with a warning on older GPUs.
- **Trade-off:** *"primarily designed to reduce TPOT [...] will also decrease overall throughput."*
- **Recommendation:** do not enable by default for MLA models. Consider it only for latency-sensitive workloads on Hopper or Blackwell GPUs where TPOT reduction outweighs the throughput cost.

**Non-MLA models (GQA):** more broadly supported across GPU generations. Still disabled by default; enable when long prefill sequences cause decode latency spikes.
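
A sketch of the override pair when enabling it (hypothetical values; recall from the table above that `max_num_tokens` becomes the chunk size once chunked prefill is on):

```python
# Hypothetical latency-oriented override for a GQA model on Hopper/Blackwell.
chunked_prefill_overrides = {
    "enable_chunked_prefill": True,
    "max_num_tokens": 4096,  # now the prefill chunk size, not the ISL ceiling
}
# Benchmark TPOT vs. throughput before and after; keep disabled by default.
```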
3 changes: 0 additions & 3 deletions examples/configs/curated/kimi-k2-thinking.yaml
@@ -8,7 +8,4 @@ print_iter_log: true
 kv_cache_config:
   free_gpu_memory_fraction: 0.75
   dtype: fp8
-cache_transceiver_config:
-  backend: UCX
-  max_tokens_in_buffer: 8448
 trust_remote_code: true
2 changes: 1 addition & 1 deletion requirements-dev.txt
@@ -1,7 +1,7 @@
 -r requirements.txt
 einops
 graphviz
-mypy
+mypy==1.19.1
 mako
 oyaml
 parameterized
3 changes: 2 additions & 1 deletion tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
@@ -362,7 +362,8 @@ def get_tuning_config(cls, ep_size: int) -> TuningConfig:
         constraint_specs = cls.get_constraint_specs()

         tuning_config = TuningConfig(dynamic_tensor_specs=dynamic_tensor_specs,
-                                     constraint_specs=constraint_specs)
+                                     constraint_specs=constraint_specs,
+                                     tune_max_num_tokens=8192)

         return tuning_config
11 changes: 6 additions & 5 deletions tensorrt_llm/_torch/pyexecutor/sampler.py
@@ -3284,12 +3284,12 @@ def update_requests(
         state: SampleStateTorch,
         resource_manager: Optional[ResourceManager] = None,
     ) -> None:
-        if not state.requests:
-            return
-
         if state.sampler_event:
             state.sampler_event.synchronize()

+        if not state.requests:
+            return
+
         assert state.host is not None
         new_tokens = state.host.new_tokens
         finish_reasons = state.host.finish_reasons_list()
@@ -4696,12 +4696,13 @@ def update_requests(
     ) -> None:
         # resource_manager will not be used in this function, just for interface consistency.
         assert isinstance(state, SampleStateTRTLLM)
-        if not state.requests:
-            return

         if state.sampler_event:
             state.sampler_event.synchronize()

+        if not state.requests:
+            return
+
         beam_width = self.beam_width(state.requests)

         if beam_width == 1 and self.MAX_DECODING_TOKENS == 1:
21 changes: 13 additions & 8 deletions tensorrt_llm/runtime/kv_cache_manager_v2/_cuda_virt_mem.py
@@ -18,22 +18,27 @@
 import cuda.bindings.driver as drv

 from ._common import MemAddress
+from ._exceptions import CuError
 from ._utils import ItemHolderWithSharedPool, PooledFactoryBase, _unwrap, div_up


 def _is_prop_supported(prop: drv.CUmemAllocationProp) -> bool:
     err, handle = drv.cuMemCreate(2 << 20, prop, 0)
-    if (
-        err == drv.CUresult.CUDA_ERROR_NOT_PERMITTED
-        or err == drv.CUresult.CUDA_ERROR_NOT_SUPPORTED
-        or err == drv.CUresult.CUDA_ERROR_INVALID_DEVICE
-    ):
-        return False
-    elif err == drv.CUresult.CUDA_SUCCESS:
+    err_int = int(err)
+    if err_int == int(drv.CUresult.CUDA_SUCCESS):
         _unwrap(drv.cuMemRelease(handle))
         return True
+    # Note: OOM is intentionally not caught here — OOM on a 2 MiB probe
+    # indicates a fundamental resource problem, not an unsupported property.
+    elif err_int in (
+        int(drv.CUresult.CUDA_ERROR_NOT_PERMITTED),
+        int(drv.CUresult.CUDA_ERROR_NOT_SUPPORTED),
+        int(drv.CUresult.CUDA_ERROR_INVALID_DEVICE),
+        int(drv.CUresult.CUDA_ERROR_INVALID_VALUE),
+    ):
+        return False
     else:
-        raise ValueError(f"Unexpected error: {err}")
+        raise CuError(err)


 # Physical memory
4 changes: 0 additions & 4 deletions tests/integration/test_lists/waives.txt
@@ -367,14 +367,10 @@ perf/test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_qwen3-235b-fp4_8
 unittest/_torch/modules/moe/test_moe_module.py::test_configurable_moe_multi_gpu[parallel=DEP-comm=DEEPEP-e60_k4_h2048_i1408-seq=8-dtype=torch.bfloat16-backend=TRTLLM-quant=NVFP4-routing=Renormalize] SKIP (https://nvbugs/6007285)
 disaggregated/test_disaggregated.py::test_disaggregated_gpt_oss_120b_harmony[gpt_oss/gpt-oss-120b] SKIP (https://nvbugs/6011317)
 accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype_with_helix[fifo-cudagraph:with_padding-pp1tp2cp2] SKIP (https://nvbugs/6011320)
-accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=True-enable_chunked_prefill=False-v2_kv_cache=True] SKIP (https://nvbugs/6013692)
-accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False-enable_chunked_prefill=False-v2_kv_cache=True] SKIP (https://nvbugs/6013692)
-accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=True-enable_chunked_prefill=True-v2_kv_cache=True] SKIP (https://nvbugs/6013692)
 accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_ctx_pp_gen_tp_asymmetric[GSM8K-gen_tp=1-ctx_pp=4] SKIP (https://nvbugs/6007967)
 accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_ctx_pp_gen_tp_asymmetric[MMLU-gen_tp=2-ctx_pp=4] SKIP (https://nvbugs/6007967)
 accuracy/test_llm_api_pytorch.py::TestNemotronV3Super::test_bf16_4gpu_mtp_ar SKIP (https://nvbugs/5959992)
 accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_ctx_pp_gen_tp_asymmetric[GSM8K-gen_tp=2-ctx_pp=4] SKIP (https://nvbugs/6007967)
-accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False-enable_chunked_prefill=True-v2_kv_cache=True] SKIP (https://nvbugs/6013692)
 accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_eagle3_vswa_reuse_4gpus[two_model] SKIP (https://nvbugs/6013562)
 accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix[fifo_v2-cudagraph:with_padding-pp1dp2cp2] SKIP (https://nvbugs/6011320)
 accuracy/test_llm_api_pytorch.py::TestQwen3_8B::test_bf16[latency] SKIP (https://nvbugs/6012526)