Releases · NVIDIA/TensorRT-LLM

12 Mar 19:43

pcastonguay

v1.2.0

51f5ef3

v1.2.0 Latest

Latest

Highlights

Model Support
- Added beta support for K-EXAONE, Nemotron Nano V3, Qwen3-Next and Qwen3-VL.
- Improved GPT-OSS, Nemotron, EXAONE, GLM, Starcoder2, Qwen3, KimiK2, DeepSeek v3.2 and Mistral Large 3 support and validation.
- Expanded Blackwell/Hopper/Ampere enablement including B300/GB200/GB300 and SM120/SM121/SM103 paths.
- Broadened low-precision and MoE capabilities (FP8/NVFP4/MXFP4/INT4-AWQ), including routing and kernels.
Features
- Speculative Decoding:
  - Enabled MTP>1 support for DeepSeek v3.2
- Disaggregated Serving:
  - Added service discovery mechanism for dynamic scaling
  - Added support for cancelling requests
  - Added NIXL-LibFabric support
  - Added support for Mooncake transfer engine as a cache transceiver backend
- Sampling:
  - Implemented batched sampling using FlashInfer sampling
  - Added support for returning logprobs incrementally with streaming mode in PyTorch backend
  - Added Beam Search support to TorchSampler
- Performance:
  - Improved TorchSampler performance
  - Enabled PDL by default and added PDL support for indexer TopK and additional kernels.
  - Improved trtllm-gen kernels
  - Enabled early exit with overlap scheduler
  - Added NUMA-aware CPU affinity automatic configuration
- Expert Parallelism:
  - Enabled EPLB for trtllm-gen and cutlass backend
  - Enabled CuteDSL MoE with large EP
  - Added CUDA graph support for DeepEP
  - Multiple performance improvements
- Hardware:
  - DGX Spark Support (Beta)
- Others:
  - Helix parallelism support
  - New Ray orchestrator type
Documentation
- Deployment Guides:
  - Added comprehensive deployment guides for KimiK2, Qwen3 and Qwen3-Next.
  - Added new guide on CPU Affinity configuration.
  - Updated GPT-OSS guide.
- Developer Guides:
  - Added developer guide about KV Cache Transmission.
  - New section on MoE Expert Load Balance Analysis (Perfect Router) in Performance Analysis guide.
  - New section on API Change Principles in LLM API Change guide.
- Feature Documentation:
  - Created new guides for Additional Outputs, Helix Parallelism, KV Cache Connector, Ray Orchestrator, Sparse Attention and Torch Compile & Piecewise CUDA Graph.
  - Also updated the Feature Combination Matrix and Paged Attention, IFB, and Request Scheduling guide.
- Tech Blogs: Published blogs on:
  - "Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)"
  - "Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs".
- Examples:
  - Added new section on disaggregated serving service discovery method.
  - Added examples for K-EXAONE, Nemotron Nano V2 VL and Nemotron Nano V3.
  - Added RocketKV usage documentation.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.12-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.12-py3.
- The dependent public PyTorch version is updated to 2.9.1.
- The dependent transformers version is updated to 4.57.3.
- The dependent triton version is updated to 3.5.1.
- The dependent NIXL version is updated to 0.8.0.
API Changes
- Breaking Changes:
  - FlashInfer sampling now used by default with PyTorch backend.
  - Changes to sampling strategy in some previously undefined cases.
- OpenAI API:
  - Enabled n > 1 with PyTorch backend
  - Added support for GET/DELETE v1/responses
Fixed multiple Issues
Known Issues
- DGX Spark: DGX Spark support is in beta. Only single-node configurations and the models listed above have been validated in this release.
- Disaggregated Serving: A hang may occur in disaggregated serving with context pipeline parallelism and generation tensor parallelism configurations.

Assets 2

10 Mar 20:30

pcastonguay

v1.3.0rc7

69de4a6

v1.3.0rc7 Pre-release

Pre-release

Highlights

Model Support
- Support tensor parallelism of TRTLLM MoE backend for Nemotron-H model (#11470)
- Add Kimi-K2.5 text model support (NVFP4) (#11777)
- Add Helix CP support for DSV3.2 (#11507)
- Support mix quantization between shared experts and routed experts for DSV3 (#11215)
- Support Cohere Command A model (#11505)
- Extract embeddings as .safetensors and support float8-quantized models (#11180)
API
- Add --served-model-name option to serve command (#11711)
- Add flag to trtllm serve to override KV cache dtype (#11487)
- Use string stop/bad words in gRPC proto instead of pre-tokenized TokenSequence (#11888)
- Support multimodal image input in gRPC server (#11800)
- Expose use_python_scheduler in SchedulerConfig and add associated tests (#11884)
- Add max_gpu_total_bytes to control KVCacheManagerV2 capacity (#11907)
Feature
- Support PARD (Parallel Draft Model) in one-model speculative decoding (#11438)
- Enable autotuner for VisualGen and compilation config support (#11660)
- Add globaltimer-based timing backend for autotuner profiling (#11657)
- Support heterogeneous tokens_per_block (#11751)
- Refactor KVCacheManagerV2 to simplify new model support (#11749)
- Support Helix CP with GQA (#11570)
- Add option to skip KV cache memory estimation (#11714)
- Implement suffix automaton on device for speculative decoding and one-model support (#11434)
- Separate radix search tree implementation (#10862)
- Add support for expert_number (\le 2048) and K (\le 32) (#11510)
- Add support for bidirectional sliding window attention mask to fmha_v2 (#11212)
- Avoid duplicated computation with ADP + Helix CP in GQA (#11891)
- Add explicit video encode format support (#11830)
- Refactor video encoding to use ffmpeg CLI or pure Python fallback (#11672)
- Integrate CuTe DSL top-k kernel for Blackwell (#11900)
- Integrate suffix automaton with EAGLE3 and PARD (#11878)
- Add 5D A2A for fused Ulysses (#11787)
- Add SiLU to trtllm-gen MoE (#11663)
- Optimize by fusing nvfp4_quant into layernorm_gated for mamba2_mixer (#11473)
- Wire KVCacheBlock to UnifiedBlockTree using lookup-node pointers (#11919)
- Run extra general warmup to warm up memory pool (#10340)
Fix
- Add async worker to MTP/EAGLE3 sampler (#11573)
- Fix disaggregated cancellation (#11730)
- Use prefer_pinned() in pard.py (#11762)
- Release KVCacheManagerV2 memory immediately on shutdown (#11746)
- Remove duplicated MoE computation with Helix CP+DP (#11167)
- Register add+norm fallback pass for torch.compile in multi-GPU mode (#11739)
- Propagate logprobs from prefill to decode in disaggregated serving (#11727)
- Propagate logits from prefill to decode in disaggregated serving (#11767)
- Enable separate draft KV cache pool for aggregated mode and KVBM (#11689)
- Fix warnings when building moe_kernels.cu (#11703)
- Fix available_blocks typo in scheduler (#11801)
- Clean up memory in rollout process (#11658)
- Warm up maybe_compiled_cat in forward_context_with_chunked_prefill (#11743)
- Fix DeepEPLowLatency with CuTe DSL MoE backend (#11769)
- Fix FP8 per-tensor torch.compile graph break in dynamic quantization (#11759)
- Fix streaming generation logits and speed up logits testcase (#10637)
- Fix overly aggressive capacity scheduler (#11731)
- Use proper tokens when exclude_input_in_output is true (#9453)
- Move launch_dependent_grids after tmem free to fix race (#11812)
- Fix E/PD disaggregated chunked prefill bug (#11805)
- Fix SM120 issue for rms_norm with nvfp4_quant_fusion (#11774)
- Remove dead code (#11813)
- Fix KVCacheManagerV2 OOM and dummy request allocation in chunked prefill / pipeline parallel (#11710)
- Fix AttributeError when DSA indexer accesses non-DSA KVCacheManager (#11858)
- Override mMaxAttentionWindow with actual largest window size (#11842)
- Update check_is_moe to support mlp_layer_types after config.json update (#11477)
- Fix incorrect GPU timing in time breakdown under overlap scheduler (#11860)
- Fix OOM hang with NCCL_SYMMETRIC fallback during long-context inference (#11870)
- Fix position IDs input for Qwen3.5 text-only usage (#11877)
- Disable preload for Llama4 Scout (#11873)
- Fix formatting issue in tensorrt_llm/serve/openai_server.py (#11920)
- Prevent RuntimeError from dict mutation during iteration in EXAONE MoE weight mapper (#11862)
- Fix Nemotron MTP crash on SM90 (#11807)
- Fix Mistral Large3 + EAGLE bug (#11942, #11885)
- Fix TeaCache broken caching for FLUX.1 and FLUX.2 (#11868)
- Fix FLUX.1 TeaCache polynomial coefficients and defaults (#12007)
- Implement workaround for ClientPayloadError (#12018)
- Fix duplicate model entry in model list (#12029)
- Fix Python string truthiness bug in FMHA cubin selection (#11909)
Documentation
- Fix typos, grammar, and accuracy across documentation (#11766)
- Add sparse attention tech blog (#11644)
- Add known issue for disaggregated serving hang with asymmetric PP/TP (#11789)
- Fix documentation links (#11912)
- Replace “TensorRT-LLM” with “TensorRT LLM” (#11914)
- Add CI trigger and test-failure retrieval instructions to AGENTS.md (#11803)
Benchmark
- Vectorize quantize_fp8_blockwise with CUDA kernel (#11724)
- Use F.rms_norm for per-head QK normalization in VisualGen (#11798)
- Short-sequence MHA optimization for DSA MLA prefill (#11677)
- Parallel VAE harness and implementation for WAN (#11875)
- Add Triton FP8 blockwise quant kernel and autotuner bucket-skip for VisualGen (#11854)
- Optimize _prepare_inputs host time (#11704)
- Improve are_stop_words performance (#11196)
- Add DeepSeek RCCA performance test case (#11736)
- Add VisualGen benchmarking script (#11651)
Test & Infra
- Add tests for all database configs (#11653)
- Move B200 test stage to AIHub (#11692)
- Support local wheel installation and add GB300 demo cases (#11742)
- Remove submodule pulls from TRT-LLM git checkouts (#11693)
- Add back WAN VBench test in CI (#11804)
- Add E2E test for cancelled disaggregated generation requests with overlap scheduler (#11795)
- Pass Nsight options to ray_executor and trigger profiling through collective_rpc (#11493)
- Add B200 multi-node tests DB (#11783)
- Add sanity tests for release 1.2 version (#11738)
- Add QA test case for trust-remote-code on multi-node failure (#11905)
- Fix model_name Starcoder 15B allowed-models issue (#11981)
- Upgrade xgrammar from 0.1.25 to 0.1.32 (#12016)
- Limit TileIRAS to CUDA 13.1 (#12042)
- Remove VisualGen benchmark test from YAML (#12027)

What's Changed

[None][feat] Support tensor parallelism for nemotron-h model by @Wanli-Jiang in #11470
[None][test] Add tests for all database configs. by @fsaady in #11653
[https://nvbugs/5911143][fix] add async worker to MTP/Eagle3 sampler,… by @dhansen-nvidia in #11573
[TRTLLM-10886][feat] Support PARD(Parallel Draft Model) in one-model spec dec by @ziyixiong-nv in #11438
[None][fix] Fix disagg cancellation by @Tabrizian in #11730
[None][fix] Use prefer_pinned() in pard.py by @mikeiovine in #11762
[None][fix] Make KVCacheManagerV2 release mem immediately on shutdown by @lowsfer in #11746
[TRTLLM-11115][feat] enable autotuner for visual gen + Compilation Config by @NVShreyas in #11660
[None][chore] Minor fix in w4a8 mxfp4 mxfp8 test. by @Tracin in #11745
[None][infra] Move B200 test stage to AIHub by @yuanjingx87 in #11692
[None][infra] Waive failed cases for main on 02/27 by @EmmaQiaoCh in #11770
[TRTLLM-11064][fix] Remove duplicated MoE Computation with Helix CP+DP by @brb-nv in #11167
[TRTLLM-10386][fix] torch.compile: register add+norm fallback pass in multi-GPU mode by @luyiyun1021 in #11739
[None][feat] Support heterogeneous tokens_per_block by @lowsfer in #11751
[None][chore] Remove closed bugs by @xinhe-nv in #11527
[None][test] local wheel installation support and add gb300 cases demo by @fredricz-20070104 in #11742
[None][feat] Refactor cache manager v2 to simplify new model support by @jiaganc in #11749
[https://nvbugs/5879614][fix] Waive test_guided_decoding_with_eagle3 xgrammar in disaggregated serving by @ziyixiong-nv in #11773
[https://nvbugs/5911788][test] Waive test_llm_partial_update_weights[Qwen3/Qwen3-8B] by @liji-nv in #11785
[None][feat] add globaltimer-based timing backend for autotuner profi… by @dhansen-nvidia in #11657
[https://nvbugs/5926823][fix] Propagate logprobs from prefill to decode in disagg by @brb-nv in #11727
[TRTLLMINF-9][chore] Remove submodule pulls from TRT-LLM git checkouts by @dpitman-nvda in #11693
[https://nvbugs/5685010][fix] Delete test_eagle3_output_repetition_4gpus flaky assertions. by @zheyuf in https://github.com/NVI...

Contributors

davidmlw, karljang, and 91 other contributors

Assets 2

06 Mar 19:07

pcastonguay

v1.3.0rc5.post1

fdeaaa9

v1.3.0rc5.post1 Pre-release

Pre-release

What's Changed

[None][chore] bump version to 1.3.0rc5.post1 by @tburt-nv in #11788
[None][fix] Cherry pick cancel fix by @pcastonguay in #11790
[https://nvbugs/5926823][fix] Cherry-pick: Propagate logprobs from prefill to decode in disagg (#11727) by @pcastonguay in #11792
[https://nvbugs/5934461][fix] Cherry-picks 11767 (logits support in disagg) by @pcastonguay in #11832
[https://nvbugs/5935104][fix] Cherry-pick Fix overly aggressive capacity scheduler by @pcastonguay in #11834
[https://nvbugs/5938603][fix] Cherry-pick Fix E/PD disagg chunked prefill bug (#11805) by @pcastonguay in #11847
[https://nvbugs/5930934][fix] Cherry-pick fix NCCL OOM hang by @pcastonguay in #11916

Full Changelog: v1.3.0rc5...v1.3.0rc5.post1

Contributors

pcastonguay and tburt-nv

Assets 2

03 Mar 19:08

pcastonguay

v1.3.0rc6

617440d

v1.3.0rc6 Pre-release

Pre-release

Highlights

Model Support
- Add FLUX.1 and FLUX.2 text-to-image pipeline support (#11556)
- Add GatedDeltaNet sharding from config (#11599)
- Add B300 (sm103) support on VLMs (#11274)
- Fix Nemotron H FP4 and MTP support (#11601)
- Add quantized Eagle3 support by quantizing self.fc (#11699)
API
- Add skip_pre_hopper flag for NVILA and Nano V2 VLMs (#11275)
- Align LlmArgs with Pydantic best practices (#11158)
- Restructure KV cache memory ratio parameters in curated YAML config files (#11511)
Feature
- Refactor time breakdown tool (visualization, generation breakdown, etc.) (#11340)
- Improve TorchSampler performance by reducing host overhead (#11315)
- Use UE8M0 FP8 quant kernel for DeepGemm blockwise GEMM (#11607)
- Implement dynamic quota resize for KVCacheManager v2 (#11503)
- Add KVCache v2 MTP support (#11346)
- Enhance performance dashboard (#11506)
- Add E2E Python KV transceiver for current KV manager (step 5) (#11136)
- Refactor KV connector (#11078)
- Add GPU energy monitoring to trtllm-bench (#11397)
- Support PEFT-saved safetensors file loading (#11339)
- Improve FP8 (per-tensor) quant kernel with vectorized load/store (#11662)
- Remove non-flash-attention-style fmha_v2 kernel for Hopper (#11381)
Fix
- Fix missing sync before cuMemUnmap (#11641)
- Fix message truncation in Helix CP cache transmission (#11252)
- Fix GPT-OSS with non-paged_context_fmha (#11309)
- Fix multi-node trust_remote_code hang in disaggregated serving (#11383)
- Fix kwargs name (#11496)
- Accept **kwargs in DynamicYamlWithDeepMergeSettingsSource (#11621)
- Fix FP8 + skip-softmax attention accuracy issue on fmha_v2 (#11448)
- Handle None priority in KVCacheEventSerializer._event_diff_to_json (#11576)
- Fix WideEP gen-only benchmark hang in disaggregated serving (#11521)
- Fix cancelled disaggregated requests getting stuck in gen server (#11695)
- Fix DeepEP low-latency with DeepGEMM (#11700)
- Recover from CUTLASS MoE doActivation perf regression for MXFP4/NVFP4 dtype (#11165)
- Work around F.linear perf regression for GPTOSS (#11668)
- Fix illegal memory access when max_seq_len > max_position_embeddings (#11598)
- Prevent drift accumulation on kv_lens_cuda (#11696)
Documentation
- Resolve conflicts in markdown documentation (#11255)
- Move kimi-k2-thinking deployment guide configs into config files (#11645)
- Rename svd-nvfp4 to trtllm-nvfp4 in visual generation examples (#11664)
- Fix 60+ broken links across docs, blogs, and examples (#11676)
- Update Qwen3-Next README server argument docs (#11682)
- Update speculative decoding docs (#11604)
- Update PR template (#11735)
- Add Qwen3.5 cookbook (#11728)
Test & Infra
- Enable Nemotron NVFP4 tests (#11172)
- Prepare for NumPy v2 (#11389)
- Add Python builds tests to CI pre-merge pipeline (#9943)
- Disable warmup steps for some WAN unit tests (#11616)
- Use the correct config for GPTOSS perf test (#11046)
- Disable release Spark stage during Spark cloud migration (#11402)
- Re-enable release Spark stage after Spark cloud migration (#11408)
- Fix test prefix generation for per-SM waives (#11519)
- Fix GPU memory requirement in stress test (#11404)
- Do not create timeout XML if the stage is aborted (#9777)
- Fix TritonMoE test for Qwen3_30B_A3B (#11495)
- Refactor MoE unit tests with unified ConfigurableMoE framework (#11648)
- Add comparison operators for perf regression triage (#11675)
- Add WideEP DS-R1 NVFP4 test with attn_dp and kv_cache_reuse (#11670)
- Add concurrency override and fix for 128k/8k cases (#11669)
- Support short test case matcher in disaggregated test (#11707)
- Fix multi-GPU tests (#11615)
- Export HF_TOKEN in tests (#9382)
- Automatically generate attributions file (#11323)
- Update TRTLLM PLC pipeline (#11684)
- Add timeout 14400 for SeedOSS (#11269)
- Remove A100 test cases from QA perf scope (#11712)

What's Changed

[None][chore] Enable Nemotron Super nvfp4 tests by @tcherckez-nvidia in #11172
[#11529][perf] Replace Python-traced FP8 quantization with optimized CUDA op in AD MoE by @MrGeva in #11626
[TRTLLM-10514][feat] Refactor time breakdown tool (visualization, generation breakdown, etc.) by @luyiyun1021 in #11340
[None][infra] Waive failed cases for main branch on 2/23 by @EmmaQiaoCh in #11635
[#11529][perf] AD NemotronH topk router to use the model default dtype by @MrGeva in #11623
[None][fix] numpy v2 preparations by @Funatiq in #11389
[#9907][infra] Add Python builds tests to CI pre-merge pipeline by @jieli-matrix in #9943
[https://nvbugs/5921273][fix] Fix an issue where sync is missing before cuMemUnmap by @lowsfer in #11641
[#11398][feat] AutoDeploy: flashinfer rope for GLM4.7-Flash by @taylor-yb-lee in #11524
[None][infra] Waive failed cases for main for post-merge 2550 by @EmmaQiaoCh in #11650
[TRTLLM-11567][feat] Added GatedDeltaNet sharding from config by @greg-kwasniewski1 in #11599
[None][fix] Nemotron H fp4 and MTP by @NVShreyas in #11601
[https://nvbugs/5919025][fix] Disable warmup steps for some WAN unit tests by @chang-l in #11616
[TRTLLM-10616][feat] Add FLUX.1 and FLUX.2 text-to-image pipeline support by @karljang in #11556
[#10243][chore] switched the default AD attention backend to trtllm by @MrGeva in #11627
[None][chroe] Mass integration of release/1.2 - 5th by @dominicshanshan in #11636
[None][chore] Align LlmArgs with some Pydantic best practices by @anish-shanbhag in #11158
[None][perf] Use UE8M0 FP8 quant kernel for DeepGemm blockwise GEMM by @chang-l in #11607
[None][infra] Waive failed cases for main on 02/24 by @EmmaQiaoCh in #11665
[https://nvbugs/5846489][perf] Apply TE's FP8 per-tensor quantization by @yumin066 in #11057
[None][fix] Fix test prefix generation for per-sm waives by @tburt-nv in #11519
[None][chore] Weekly mass integration of release/1.2 by @mikeiovine in #11572
[TRTLLM-9781][infra] Don't create timeout xml if the stage is aborted by @yiqingy0 in #9777
[None][fix] Accept **kwargs in DynamicYamlWithDeepMergeSettingsSource… by @tcherckez-nvidia in #11621
[https://nvbugs/5606178][fix] unwaive mamba2 two tests by @JadoTu in #11479
[TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework by @xxi-nv in #11648
[None][fix] Add comparison operators for perf regression triage by @chenfeiz0326 in #11675
[None][test] Add wideep DS-R1 nvfp4 test with attn_dp and kv_cache_reuse by @StanleySun639 in #11670
[None][chore] Moving kimi-k2-thinking deployment guide configs to config files. by @fsaady in #11645
[TRTINFRA-7367][infra] Automatically generate attributions file by @tburt-nv in #11323
[None][fix] rename svd-nvfp4 to trtllm-nvfp4 in visual gen examples by @karljang in #11664
[None] [fix] Restructure kv cache memory ratio parameters in curated .yaml config files by @xd-nv in #11511
[None][chore] Bump version to 1.3.0rc6 by @yuanjingx87 in #11688
[None][fix] Fix FP8 + Skip Softmax Attention accuracy issue on fmha_v2. by @bobboli in #11448
[TRTLLM-7836][feat] Implement dynamic quota resize for KVCacheManager v2 by @lowsfer in #11503
[#4666][fix] Handle None priority in KVCacheEventSerializer._event_diff_to_json by @wojciech-wais in #11576
[None][test] add concurrency override and fix for 128k8k cases by @ruodil in #11669
[TRTLLM-9904][feat] KVCache V2 MTP support by @liji-nv in #11346
[None][test] support short test case matcher in disagg test by @ruodil in #11707
[TRTLLM-11614][feat] Fixing multigpu tests by @greg-kwasniewski1 in #11615
[None][docs] Fix 60+ broken links across docs, blogs, and examples by @kaiyux in #11676
[TRTLLM-8828][infra] export HF_TOKEN in tests by @niukuo in #9382
[None][chore] Add feature for enhance perf dashboard by @fredricz-20070104 in #11506
[TRTLLM-11106][chore] Abstract ADPRouter interface and RankState by @lancelly in https://github.com/N...

Contributors

karljang, chienchunhung, and 49 other contributors

Assets 2

24 Feb 19:35

pcastonguay

v1.3.0rc5

630fccb

v1.3.0rc5 Pre-release

Pre-release

Highlights

Model Support
- Add support for Qwen3.5 with AutoDeploy (#11394)
- Read mamba_ssm_cache_dtype from HF config when set to auto (#11582)
- Add NVFP4 dynamic quantization support for visual_gen models (#11563)
API
- Use new index API; add block scale support; fix max sequence length estimation; add flash MLA support (#11334)
- Add dynamic LLMAPI defaults system (#11035)
- Use smg-grpc-proto package for gRPC proto definitions (#11578)
- Move SaveHiddenStates spec-dec mode to one model (#11241)
Feature
- Add cache transfer setup for Mamba states (#10934)
- Optimize MoE export by tracing with reduced experts and expanding graph (#11504)
- Add new Helix kernels for MNNVL-based codepath (#11433)
- Add line_profiler tool for host overhead analysis (#11232)
- Enable multi-stream MoE; add multi-stream MLA attention (#11520)
- Add MoE all-to-all paradigm (#10985)
- Add support for multi instances in Triton backend with PyTorch backend (#11153)
- Add KV cache metrics to MetricsCollector for more Prometheus metrics (#11243)
- Account for reusable KV cache blocks in capacity calculation (#11490)
- Add CUDA graphs, torch compile, NVTX, and warmup for Visual Gen (#11554)
- Make preprocessing async (#11459)
- Split up TorchSampler.Store (#11566)
Fix
- Fix multimodal placeholder counts (#11461)
- Add cacheSaltID property to BlockKey serialization (#11457)
- Fix cache transceiver (#11409)
- Declare the variable in the correct scope (#11066)
- Fix spec-dec mode flag and related C++ requirements (#10996)
- Fix Qwen3-VL-Dense/MoE accuracy drop (#11134)
- Complete WAR for popen in QA env (#11214)
- Improve error message for mismatched MPI world size (#11294)
- Use the torch_dtype set by ModelOpt (#11525)
- Fix silent MPI failures on models with custom tokenizers (#11399)
- Fix Nemotron issues (#11425)
- Fix pipeline parallelism + disaggregated serving (#11509)
- Fix broken LLMAPI config (#11571)
- Fix illegal memory access with Helix CP=64 (#11593)
- Validate requests outside sampling loop (#11584)
- Correct chunked prefill handling in TorchSampler (#11544)
- Fix SpecDec sampling seed (#11081)
- Prevent NIXL agent name collision in containerized disaggregated serving (#11552)
Documentation
- Add doc for TRTLLM AIGV initial release (#11489)
- Update hardware support (#10719)
- Add documentation on configuring CPU affinity in TRT-LLM (#10678)
- Add warning about 2-model MTP deprecation (#11043)
- Update media file paths in Skip Softmax blog (#11540)
- Update TAVA architecture diagrams for visual gen flow and auto deploy flow (#11523)
- Add Qwen3.5 and GLM 4.7 Flash to support matrix (#11594)
Benchmark
- Add ctx-only and gen-only disaggregated perf tests (#11361)
Test & Infra
- Add CUTEDSL MoE backend for DeepSeek R1 NVFP4 checkpoint in stress test (#10920)
- Update MIG tests (#11014)
- Fix Slurm job name (#11265)
- Ensure TorchSampler does not sync (#11508)
- Revert MoE unit tests refactor: add unified ConfigurableMoE test framework (#11532)
- Re-upgrade GHA for blossom-ci workflow (#11483)
- Stop using remotes in the Conan install build step (#11516)
- Update PLC pipeline (#11547, #11597)
- Fix testdb file for l0_b200_multi_gpus_perf_sanity (#11603)
- Add visual_gen CODEOWNERS paths (#11606)

What's Changed

[None][chore] Adjust waive to avoid sm parsing by @tburt-nv in #11518
[None][chore] Optimize MOE export by tracing with reduced experts and expanding graph by @suyoggupta in #11504
[#11170][fix] Fix for mm placeholder counts by @2ez4bz in #11461
[None][feat] Add new helix kernels for MNNVL-based codepath by @brb-nv in #11433
[TRTLLM-11016][fix] Add cacheSaltID property to BlockKey serialization code by @thorjohnsen in #11457
[https://nvbugs/5880261][fix] fix cacheTransceiver by @chuangz0 in #11409
[None][doc] Add doc for TRTLLM AIGV initial release by @chang-l in #11489
[TRTLLM-10851][feat] Add line_profiler tool for host overhead analysis. by @hyukn in #11232
[None][chroe] Mass integration of release/1.2 - 4th by @dominicshanshan in #11500
[None][feat] Use new index api, add block scale support, fix max_seq_len esitmation, add flash mla support by @yizhang-nv in #11334
[#11455][bug] Use the torch_dtype set by ModelOpt by @tcherckez-nvidia in #11525
[#10345][perf] Enable multi-stream MOE for super. Also adds multi-stream MLA attn by @suyoggupta in #11520
[TRTLLM-10030][test] ensure that TorchSampler does not sync by @ixlmar in #11508
[None][revert] - Revert "[TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework" by @chzblych in #11532
[None][fix] Better error message for mismatched MPI world size by @jthomson04 in #11294
[#11109][feat] AutoDeploy: GLM 4.7 Flash Improvements by @bmarimuthu-nv in #11414
[None][doc] Update media files path in Skip Softmax blog. by @bobboli in #11540
[#11318][infra] AutoDeploy: Add fused rope kernel - triton_rope_on_interleaved_qk_inputs by @bmarimuthu-nv in #11327
[None][chore] Waive failing pre-merge test by @brb-nv in #11551
[None][chore] Waive moe fp4 test by @brb-nv in #11558
[None][chore] Bump version to 1.3.0rc5 by @yuanjingx87 in #11557
[TRTLLM-10845][feat] Add dynamic llmapi defaults system by @venkywonka in #11035
[https://nvbugs/5888464][fix] Stop using remotes in the Conan install build step by @tburt-nv in #11516
[None][chore] TAVA architecture diagram updates for visual gen flow and auto deploy flow by @yibinl-nvidia in #11523
[TRTLLM-10064][feat] MoE all-to-all paradigm by @greg-kwasniewski1 in #10985
[TRTLLM-8263][feat] Add ctx-only and gen-only Disagg Perf Tests by @chenfeiz0326 in #11361
[TRTLLM-10037][chore] Re-upgrade GHA for blossom-ci workflow by @dpitman-nvda in #11483
[None][feat] Add support for multi instances in Triton backend with pytorch backend by @achartier in #11153
[None][fix] Fix silent MPI failures on models with custom tokenizers by @jthomson04 in #11399
[None][infra] PLC pipeline update by @yuanjingx87 in #11547
[TRTLLM-10827][feat] Add KV Cache metrics to MetricsCollector for more Prometheus metrics by @yijingl-nvidia in #11243
[https://nvbugs/5880313][fix] Fix pp + disagg by @Tabrizian in #11509
[None][infra] Waive unittest that consistently timed out by @yuanjingx87 in #11580
[TRTLLM-1543][feat] Account for reusable KV cache blocks in capacity … by @SimengLiu-nv in #11490
[None][feat] Visual Gen: add cuda graphs; torch compile; nvtx; warmup by @NVShreyas in #11554
[TRTLLM-9040][perf] Make preprocessing async by @2ez4bz in #11459
[#11440] [feat] AutoDeploy : Support Qwen3.5 by @bmarimuthu-nv in #11394
[#11292][feat] use smg-grpc-proto package for gRPC proto definitions by @CatherineSue in #11578
[None][doc] Add Qwen3.5, GLM 4.7 Flash to support matrix by @bmarimuthu-nv in #11594
[None][feat] AutoDeploy: Add nemotron v2 acc test by @nvchenghaoz in #11429
[#11569][fix] Fix broken LLMAPI config by @2ez4bz in #11571
[None][chore] split up TorchSampler.Store by @ixlmar in #11566
[None][fix] Read mamba_ssm_cache_dtype from HF config when set to auto by @tomeras91 in #11582
[https://nvbugs/5914959][fix] Fix illegal memory access with Helix CP=64 by @brb-nv in #11593
[#10243][feat] Add TRT-LLM attention backend to AutoDeploy by @MrGeva in #11430
[TRTLLM-10857][chore] Move SaveHiddenStates spec dec mode to 1 model by @mikeiovine in #11241
[TRTLLM-10197][feat] Cache Transfer Setup for Mamba States by @NVShreyas in #10934
[TRTLLM-11069][fix] validate requests outside sampling loop by @ixlmar in #11584
[None][fix] correct chunked prefill handling in TorchSampler by @ixlmar in #11544
...

Contributors

achartier, mikeiovine, and 33 other contributors

Assets 2

17 Feb 21:04

pcastonguay

v1.3.0rc4

26901e4

v1.3.0rc4 Pre-release

Pre-release

Highlights

Model Support
- Add EPD disagg support for Qwen3 VL MoE (#10962)
- MLA revisited and GLM 4.7 Flash support (#11324)
- Initial support of AIGV models in TRTLLM (#11462)
- Fix weight loading for Nemotron 3 models on DGX Spark (#11405)
API
- Add user-provided UUID support for multimodal KV cache identification (#11075)
Feature
- Support GB200 and increase disagg test timeout (#11019)
- Avoid syncs in beam search and other improvements (#11349)
- Implement disaggregated harmony chat (#11336)
- Support different KV cache layout for one-model spec dec (#10502)
- Reduce attention module repeated warnings (#11335)
- Make update_weights compatible with CUDA Graph (#11267)
- Fully non-blocking pipeline parallelism executor loop (#10349)
- Move MambaCacheManager from Python to C++ (#10540)
- Pin host memory and batch sampler setup in beam search (#11390)
- Initial PR for trtllm-gen attention backend (#10784)
- Remove the hard code for activation type definition in TRTLLM Moe Backend (#11164)
- Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation (#11406)
- Introduce an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations (#11330)
- Add BOLT compatible build flags for further experimental usage (#11297)
- Multi-image support for EPD disagg (#11264)
- Optimize NemotronH model with elementwise and nvfp4 fusion (#11273)
- TorchSampler general host time optimization (#11141)
Fix
- Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
- Replace etcd3 with etcd-sdk-python (#10886)
- Fix offset calculation in _are_stop_words when using speculative decoding (#10854)
- Fix hang issue by avoid exposing UB buf… (#10842)
- WAR for popen in QA env (#10989)
- Fix Eagle3 draft model weight loading for throughput checkpoint (#11010)
- Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError (#11261)
- Avoid reserved filename on Windows (#11382)
- Fix tinygemm accuracy (#11411)
- Disable cutedsl argmax kernel to fix perf regression (#11403)
- Fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 (#11266)
- Gracefully terminate disagg serving servers to prevent leftover subprocess warnings (#11395)
- Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI (#11176)
- Remove overlap scheduler adjustment for max sequence length in create_py_executor function (#9229)
- Fix out-of-bounds array access in kernel factory Get() methods (#11373)
- Fix a bug in PR11336 (#11439)
- Fix GLM engine build dtype (#11246)
- Enable warmup for Helix CP (#11460)
- Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC (#11326)
- Make NVML work with older CUDA driver versions (#11465)
- Fallback to triton_ssm for nvfp4 quantization (#11456, #11455)
- Fix CUDA OOM error (#11219)
Documentation
- Add CLAUDE.md and AGENTS.md (#11358)
- Add multiple-instances section in disaggregated serving doc (#11412)
- Update Skip Softmax attention blog (#11443)
- Add SECURITY.md file to TensorRT-LLM GitHub (#11484)
- Enable Deepwiki docs (#11492)
Benchmark
- Add microbench for MoE Comm methods (#10317)
- Enhance multi-GPU tests for IFB stats (#11239)
- Add DGX-Spark multinode perf cases including eagle3 (#11184)
- Add gpt-oss-120b-Eagle3-throughput case on DGX-Spark (#11419)
Test & Infra
- Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch (#11168)
- Fix missing test cases (#10881)
- Update test constraint (#11054)
- Add CODEOWNERS coverage for serve/ and commands/ directories (#11359)
- Update model list (#11364)
- Unit test for disagg gen cancellation (#11108)
- Disable spark stages due to migration of spark cloud (#11401)
- Enable sparck ci since spark cloud migration is done (#11407)
- Upload unittest sub results in slurm (#10834)
- Remove obsolete code (#11388)
- Fix the testcase name in timeout xml (#9781)
- Use frontend dgx-h100 and b200 slurm platforms (#11251)
- Update allowlist 2026-02-10 (#11426)
- Lock FI version to 0.6.3 (#11371)
- Pin the torchao version (#11444)
- Refactor finish reasons tests (#11445)
- Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles (#11178)
- Refactor MoE unit tests: add unified ConfigurableMoE test framework (#11437)
- Use weakref in atexit handler (#11476)
- Improve assert in sampler (#11475)
- Update allowlist 2026-02-13 (#11512)

What's Changed

[None][infra] Waive failed case for main branch on 02/09 by @EmmaQiaoCh in #11369
[None][chore] Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch by @yihwang-nv in #11168
[None][chore] Add microbench for MoE Comm methods. by @bobboli in #10317
[https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver by @Funatiq in #11354
[None][test] Enhance multi-GPU tests for IFB stats by @Funatiq in #11239
[https://nvbugs/5834212][chore] unwaive test_disaggregated_mixed by @reasonsolo in #11372
[#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE by @galagam in #11322
[TRTLLM-10030][perf] avoid syncs in beam search + other improvements by @ixlmar in #11349
[None][chroe] Mass integration of release/1.2 - 3rd by @dominicshanshan in #11308
[None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError by @hnover-nv in #11261
[TRTLLM-10866][feat] implement disaggregated harmony chat by @reasonsolo in #11336
[None][infra] AutoDeploy: Dump graph IR after every transform by @bmarimuthu-nv in #11045
[None][chore] update model list by @tcherckez-nvidia in #11364
[None][chore] Unit test for disagg gen cancellation by @pcastonguay in #11108
[https://nvbugs/5853997][chore] Unwaive gpt-oss test by @mikeiovine in #11287
[TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec by @ziyixiong-nv in #10502
[https://nvbugs/5855540][fix] AutoDeploy: thread cleanup of eagle test by @lucaslie in #11289
[None][chore] Reduce attention module repeated warnings. by @yuxianq in #11335
[https://nvbugs/5843112][chore] Unwaive ngram test by @mikeiovine in #11320
[None][test] Add DGX-Spark multinode perf cases by @JennyLiu-nv in #11184
[None][fix] Avoid reserved filename on Windows by @tongyuantongyu in #11382
[None][infra] Disable spark stages due to migration of spark cloud by @EmmaQiaoCh in #11401
[TRTC-265][chore] Add CODEOWNERS coverage for serve/ and commands/ directories by @venkywonka in #11359
[#11032][feat] MLA revisited and GLM 4.7 Flash support by @lucaslie in #11324
[TRTC-264][doc] Add CLAUDE.md and AGENTS.md by @venkywonka in #11358
[None][chore] Mass merge commits from release/1.2.0rc6.post1 branch by @longlee0622 in #11384
[TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph by @shuyixiong in #11267
[None][infra] Enable sparck ci since spark cloud migration is done by @EmmaQiaoCh in #11407
[None][doc] add multiple-instances section in disaggregated serving doc by @reasonsolo in #11412
[None][feat] Fully non-blocking pipeline parallelism executor loop. by @yuxianq in #10349
[None][infra] Waive failed cases for main branch on 02/10 by @EmmaQiaoCh in #11413
[None][chore] Unwaive tests after last MI by @dominicshanshan in #11400
[TRTLLM-10331][infra] Upload unittest sub results in slurm by @yiqingy0 in #10834
[https://nvbugs/5791242][chore] remove obsolete code by @ixlmar in #11388
[None][fix] fix tinygemm accuracy by @bo-nv in #11411
[https://nvbugs/5853720][fix] Disable cutedsl argmax kernel to fix perf regression by @chenfeiz0326 in #11403
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #11363
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #11396
[TRTLLM-9711][infra] Fix the testcase name in timeout xml by @yiqingy0 in #9781
[https://nvbugs/5848377][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 by @leslie-fang25 in #11266
[TRTLLM-10273][feat] Move MambaCa...

Contributors

reasonsolo, longlee0622, and 47 other contributors

Assets 2

12 Feb 19:48

pcastonguay

v1.3.0rc3

b464c75

v1.3.0rc3 Pre-release

Pre-release

Highlights:

Model Support
- Support LoRa BF16 checkpoints with Llama 3.3-70B FP8 (#9808)
- Add Eagle3 support for Nemotron H (#11131)
- Enhance support for complex models (#11254)

API
- Allow overriding quantization configs (#11062)
- Set continuous_usage_stats default to False to follow OpenAI protocol (#10644)
- Set max_num_tokens_in_buffer default based on max_seq_len/max_input_len (#11082)

Feature
- Export ONNX for DriveOS LLM (#10117)
- Add L2 norm pattern matcher and fusion transform (#10767)
- Add PDL support for moeAlltoAllKernels (#10591)
- Integrate KVCacheManager V2 into TRTLLM runtime (#10659)
- Integrate cuda.tile RMS norm kernels (#9725)
- Refactor request fetching logic for better separation of concerns (#10988)
- Implement gen-first disagg_service (#11020)
- Support disagg SLURM job rescheduling (#11218)
- Improve layer classification for sharding (#10718)
- Add priority-based KV cache offload filtering (#10751)
- Optimize beam search performance (remove GPU sync, fix batching, refactor) (#11276)
- Avoid sync in PyTorchModelEngine when using beam search (#11341)
- Adjust DeepGEMM tuning buckets for larger num_tokens scope (#11259)
- Add CuteDSL FP8 GEMM for Blackwell (#10130)
- Reduce host memory usage during model loading (#11119)
- Perfect routing for Deepseek models (#11127)
- Modularize transceiver for KV manager v2 (step 4) (#11225)

Fix
- Fix AttributeError with return_perf_metrics on TensorRT backend (#10662)
- Prevent routing context and generation requests to the same worker; document unique disagg ID (#11095)
- Prevent out-of-bounds read (#10868)
- Add __syncthreads() to TinyGEMM to resolve intermittent accuracy issues (#10873)
- Fix PD disaggregation for VLMs that use mrope (#10865)
- Always reset drafting states for GuidedDecoder (#10899)
- Use NCCL as fallback to avoid crash due to insufficient memory (#10928)
- Fix llama sm120 spec decoding (#10765)
- Fix MTP one-model sampler (#10369)
- Align kv_scales with ModelOpt HF checkpoint (#10745)
- Fix selective_state_update perf regression for T=1 decode path (#11194)
- Make health_generate work with beam search (#11097)
- Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192)
- Fix CuteDSL argmax on sm120 (#11181)
- Fix amax to avoid NaN issue in fp8_blockscale_gemm_kernel (#11256)
- Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
- Fix partial reuse disabled for disagg (#11247)
- Retake ownership of mrope tensors in prefill worker (#11217)
- Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292)
- Fix accuracy drop in VSWA with KV cache block reuse (#10875)

Documentation
- Add Glm4MoeForCausalLM to model support matrix (#11156)
- Fix GLM4-MoE Eagle support documentation (#11198)
- Add CUDA Graph + LoRA to feature combination matrix (#11187)
- Fix comments for KV cache manager v2 (#11207)
- Skip Softmax Attention blog and docs (#10592)
- Add sparse attention docs to index (#11342)

Test & Infra
- Update GB200 test configs to use frontend SLURM platforms (#11085)
- Fix jaraco-context and wheel vulnerability (#10901)
- Add --high-priority in bot help message (#11133)
- Print memory usage before/after accuracy test in CI (#11155)
- Fix mocking of HuggingFace downloads in with_mocked_hf_download (#11200)
- Set rerun report stage UNSTABLE and pipeline SUCCESS when rerun tests pass (#11210)
- Move 6x H100 test stage to AIHub platform (#11039)
- Add disagg perf tests (#10912)
- Provide uniform test framework to test all MoE backends (#11128)
- Move disagg scripts env configs from bash to submit.py (#10223)
- Use free port for serve test (#10878)
- Fix test_auto_scaling for 2 GPUs (#10866)
- Update test list (#10883)
- Fix an invalid test name (#11195)
- Refine QA test list for SM120 (#11248)
- Fix multimodal serve test (#11296)
- Pass without_comm to Cutlass and DeepGEMM (#11229)
- Promote SampleState to TypeVar and fix typing (#11281)
- Fix bench script test (#10483)

What's Changed

[None][feat] Export ONNX for DriveOS LLM by @nvyocox in #10117
[#9525][feat] add L2 norm pattern matcher and fusion transform by @karthikvetrivel in #10767
[TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms by @mlefeb01 in #11085
[None][doc] Add Glm4MoeForCausalLM to model support matrix by @venkywonka in #11156
[None][feat] Perfect routing for Deepseek models by @brb-nv in #11127
[TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super by @nv-guomingz in #10791
[#8242][feat] Add int4 GPTQ support for AutoDeploy by @Fridah-nv in #8248
[https://nvbugs/5804683][infra] unwaive Mistral Large3 test by @byshiue in #10680
[TRTLLM-9771][feat] Allow overriding quantization configs by @shuyixiong in #11062
[None][ci] Waive a flaky test on A10 by @chzblych in #11163
[None][infra] Waive failed cases for main on 1/30 by @EmmaQiaoCh in #11142
[None][fix] AttributeError with return_perf_metrics on tensorrt backend by @riZZZhik in #10662
[https://nvbugs/5834212][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID by @reasonsolo in #11095
[TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns by @lancelly in #10988
[https://nvbugs/5823284][fix] Unwaive no repro hang issue by @liji-nv in #11138
[None] [feat] Add PDL support for moeAlltoAllKernels by @kaiyux in #10591
[None][infra] Waive failed cases and disable a stage on 02/02 by @EmmaQiaoCh in #11177
[TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime by @yizhang-nv in #10659
[None][chroe] Mass integration of release/1.2 - 2nd by @dominicshanshan in #11088
[None][feat] Integrate cuda.tile RMS norm kernels by @lirundong in #9725
[None][test] Fix an invalid test name by @chzblych in #11195
[None][feat] Nemotron H: Eagle3 support by @IzzyPutterman in #11131
[#10826][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation by @govind-ramnarayan in #11073
[None][doc] Fix GLM4-MoE Eagle support documentation by @venkywonka in #11198
[TRTLLM-10561][infra] Fix jaraco-context and wheel vulnerability by @yiqingy0 in #10901
[TRTLLM-10307][infra] Add --high-priority in bot help message by @mzweilz in #11133
[None][chore] Print memory usage before/after accuracy test in CI by @taylor-yb-lee in #11155
[TRTLLM-10803][fix] Fix mocking of HuggingFace downloads in with_mocked_hf_download by @anish-shanbhag in #11200
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #11193
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #11202
[TRTLLM-10839][infra] Set rerun report stage UNSTABLE and pipeline SUCCESS in post-merge when there are passed rerun tests by @yiqingy0 in #11210
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #11216
[None][fix] Align kv_scales with modelopt HF checkpoint by @cjluo-nv in #10745
[https://nvbugs/5739981][fix] unwaive tests using opt-125M by @ixlmar in #11100
[TRTLLM-10019][infra] Move 6 h100 test stage to aihub platform by @yuanjingx87 in #11039
[TRTLLM-8921][feat] implement gen-first disagg_service by @reasonsolo in #11020
[#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU by @taylor-yb-lee in #11059
[None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol by @riZZZhik in #10644
[None][chore] bump version to 1.3.0rc3 by @tburt-nv in #11238
[TRTLLM-8263][feat] Add Disagg Perf Tests by @chenfeiz0326 in #10912
[None][fix] Fix selective_state_update perf regression for T=1 decode path by @galagam in #11194
[TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends by @xxi-nv in #11128
[None][fix] make health_generate work with beam search by @ixlmar in https://github.com/NVIDIA/TensorRT...

Contributors

Superjomn, reasonsolo, and 55 other contributors

Assets 2

05 Feb 02:36

chzblych

v1.2.0rc6.post3

7c6df0e

v1.2.0rc6.post3 Pre-release

Pre-release

What's Changed

[https://nvbugs/5850094][fix] Fix MoE cost estimation for auto multi-stream scheduling by @yizhang-nv in #11160
[None][feat] update TRT-LLM Gen DS FP8 MoE cubins and optimize finalize kernel by @nekorobov in #11104
[None][chore] Bump version to 1.2.0rc6.post3 by @yiqingy0 in #11224
[None][fix] Fallback to NCCL instead of NCCL symmetric by @Tabrizian in #11174
[None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE by @nekorobov in #11143

Full Changelog: v1.2.0rc6.post2...v1.2.0rc6.post3

Contributors

Tabrizian, nekorobov, and 2 other contributors

Assets 2

05 Feb 02:25

chzblych

v1.2.0rc2.post2

910c070

v1.2.0rc2.post2 Pre-release

Pre-release

What's Changed

[None][fix] fix TinyGemm accuracy issue. cherry-pick #10619 and #10873 by @bo-nv in #10990
[None][chore] Bump version to 1.2.0rc2.post2 by @chzblych in #11012
[None][chore] Upgrade starlette and FastAPI (#9319) by @chzblych in #11027
[None][fix] fix accuracy issue(cherry-pick #11157 and #9530) by @bo-nv in #11222

Full Changelog: v1.2.0rc2.post1...v1.2.0rc2.post2

Contributors

chzblych and bo-nv

Assets 2

03 Feb 19:31

pcastonguay

v1.3.0rc2

f42a6cb

v1.3.0rc2 Pre-release

Pre-release

Highlights:

Known Issues
- On RTX6000D, one might encounter Instruction 'redux.f32' not supported error. This issue will be resolved in the next release.
Model Support
- Enable MTP for Nemotron Super (#10754)
- Make TRTLLM MoE the default for GPTOSS on Blackwell (#11074)
- Add missing absolute position embeddings in Qwen3-VL vision encoder (#11065)
API
- Change context params and disagg params (#10495)
- Add KVCacheManagerV2 APIs for Transceiver (#11003)
Feature
- Add Skip Softmax MLA kernels for Blackwell and fix NVFP4 KV accuracy bug (#10813)
- Fuse AllGather for expert statistics required by EPLB (#10885)
- Add first-iteration streaming for GPT-OSS in trtllm-serve (#10808)
- Integrate CuteDSL argmax kernel (#10476)
- Update Mamba decode kernel to FlashInfer (#10757)
- Improve effective memory bandwidth with TMA.RED (#10987)
- Reorganize AutoTuner cache file for distributed tuning (#10956)
- Support attention DP + Helix CP (#10477)
- Improve performance of _write_finish_reasons in TorchSampler (#10459)
- Add gRPC server for high-performance external router integration (#11037)
- Prepare for future KVCacheV2 MTP support (#11029)
Fix
- Fix CuteDSL MoE unit test (#10983)
- Fix overlap scheduler pause() timing (#10943)
- Fix Pydantic deepcopy bug (#11004)
- Restore IPv6 support in serve.py (#10929)
- Fix conditional compilation for sm10x cubins (#10839)
- Add graceful fallbacks for NCCL symmetric mode (#11042)
- Fix enable_alltoall passed to CutlassFusedMoE (#11016)
- Fix kvCacheManager isLeaf() assertion failure (#10922)
- Add null pointer check to parseNpyHeader (#10944)
- Fix attention DP scheduling sort order to prioritize non-relaxed requests (#11106)
Documentation
- Update Qwen2/3-VL models in supported_models.md (#10797)
Benchmark
- Add performance alignment to layer-wise benchmarks (#11018)
- Clean up layer-wise benchmarks code (#11092)
- Add DGX-Spark VLM gemm3-12b bfp16/fp4/fp8 accuracy and perf cases (#11096)
Test & Infra
- Add 250K-token NVFP4 MoE + PDL regression tests (#10911)
- Add timeout for SeedOSS test (#8683)
- Add Fake Ops for one-sided AlltoAll (#11002)
- Refactor setup for RNN cache transceiver (#10957)
- Change SLURM config access to use resolvePlatform (#11006)
- Update CI allowList (#11040)
- Add Mamba and MLA layers to sharding tests (#10364)
- Remove pybind11 bindings and references (#10550, #11026)
- Add multi-acc and Lyris GB200 test support (#11024)
- Package triton-kernels as a dependency (#10471)
- Fix Qwen3 Eagle test (#11030)
- Dump thread stacks for hanging tests before timeout (#10708)
- Remove -ccache from build_wheel.py args (#11064)
- Fix trtllm-serve guided decoding test (#11101)
- Remove invalid account for Blossom CI (#11126)
- Add source code pulse scan to PLC nightly pipeline (#10961)

What's Changed

[None][fix] Fix CuteDSL MoE unittest by @syuoni in #10983
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #10974
[https://nvbugs/5661741][feat] Add 250K-token NVFP4 MoE + PDL regression tests by @yingguo-trt in #10911
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #10976
[None][infra] Waive failed case for main branch on 01/26 by @EmmaQiaoCh in #10994
[None][feat] Add Skip Softmax MLA kernels for Blackwell and Fix an accuracy bug of NVFP4 KV by @Tom-Zheng in #10813
[TRTLLM-10048][feat] Fuse the AllGather for expert statistics required by the EPLB. by @bobboli in #10885
[https://nvbugs/5794796][fix] Cherry-pick #10855: Unwaive Llama 3.3 related multi GPU tests by @pengbowang-nv in #10942
[#10614][fix] gpt_oss first iteration streaming in trtllm-serve by @LinPoly in #10808
[None][chore] Removing pybind11 bindings and references by @Linda-Stadter in #10550
[#8982][feat] AutoDeploy attention dp support by @lucaslie in #10728
[None][chore] update AD model list by @tcherckez-nvidia in #10981
[TRTLLM-10062][feat] Enable MTP for Nemotron Super by @sunnyqgg in #10754
[TRTLLM-10276][feat] Integrate cutedsl argmax kernel by @ameynaik-hub in #10476
[TRTLLM-10453][feat] Update mamba decode kernel to flashinfer by @Wanli-Jiang in #10757
[TRTLLM-10560][fix] Fix the time of pause() for overlap scheduler by @yuantailing in #10943
[https://nvbugs/5612438][fix] Add timeout for SeedOSS test by @zhhuang-nv in #8683
[None][infra] Waive failed cases for main on 01/27 by @EmmaQiaoCh in #11017
[None][chore] Bump version to 1.3.0rc2 by @yiqingy0 in #11021
[None][chore] Remove closed bugs by @xinhe-nv in #10982
[#10889][fix] fix pydantic deepcopy bug by @reasonsolo in #11004
[TRTLLM-9390][chore] Add Fake OPs for One-Sided AlltoAll. by @bobboli in #11002
[TRTLLM-9831][perf] Use TMA.RED to improve effective memory bandwidth by @sherry-1001 in #10987
[TRTLLM-9527][feat] change context params and disagg params (step3) by @chuangz0 in #10495
[TRTLLM-10308][feat] AutoTuner Cache: reorganize cache file for distributed tuning by @hyukn in #10956
[None][chore] Add failed cases into waives.txt by @xinhe-nv in #10993
[https://nvbugs/5843316][chore] waive overlap_scheduler test by @galagam in #11025
[#10013][feat] AutoDeploy: native cache manager integration by @lucaslie in #10635
[https://nvbugs/5721661][chore] Unwaive fixed bug. by @SimengLiu-nv in #11009
[#10877][fix] restore ipv6 support in serve.py by @Evgueni-Petrov-aka-espetrov in #10929
[TRTLLM-10197][chore] Refactor to setup for RNN cache transceiver by @NVShreyas in #10957
[TRTINFRA-7379][infra] Change SLURM config access to use resolvePlatform by @mlefeb01 in #11006
[None][fix] Proper conditional compilation of sm10x cubins by @tongyuantongyu in #10839
[https://nvbugs/5756804][fix] Re-enable passing test by @dongfengy in #10986
[None][fix] unwaive tests by @xinhe-nv in #11047
[https://nvbugs/5779536][fix] Cherry-pick #10902: Unwaive DeepSeekR1 nvfp4 pp4 mtp test case (#10902) by @pengbowang-nv in #11000
[None][infra] Update CI allowList by @yuanjingx87 in #11040
[TRTLLM-10362][feat] Added Mamba and MLA layers to the sharding tests by @greg-kwasniewski1 in #10364
[None][chore] Removing cpp/tensorrt_llm/pybind by @Linda-Stadter in #11026
[None][feat] support multi_acc and Lyris GB200 test by @yingguo-trt in #11024
[None][infra] Waive failed cases for main on 1/28 by @EmmaQiaoCh in #11053
[None][chore] AutoDeploy: Eagle One-Model [1/n]: PyTorch impl for Eagle3 checkpoint by @govind-ramnarayan in #10674
[#10245][feat] AutoDeploy: Add Minimax M2 support by @bmarimuthu-nv in #10525
[None][fix] nccl symmetric with graceful fallbacks by @nv-lschneider in #11042
[None][fix] fix Qwen2/3 export for AutoDeploy by @Fridah-nv in #11007
[None][fix] No need to remove the original waive list by @yiqingy0 in #11060
[https://nvbugs/5761391][fix] Include triton-kernels as a packaged dependency by @anish-shanbhag in #10471
[None][fix] Fix enable_alltoall passed to CutlassFusedMoE by @syuoni in #11016
[None][feat] Add performance alignment to layer-wise benchmarks by @yuantailing in #11018
[https://nvbugs/5813452][fix] Fix "Assertion failed: isLeaf() in kvCacheManager.cpp:465" by @Boreas618 in #10922
[None][infra] Waived flaky tests by @ZhanruiSunCh in #11091
[TRTLLM-10264][feat] Support attention DP + Helix CP by @brb-nv in #10477
[TRTLLM-10415][feat] Dump thread stacks for hanging tests before time… by @WeiHaocheng in #10708
[TRTLLM-10312][perf] Improve performance of _write_finish_reasons in TorchSampler by @stnie in https:/...