
[None][doc] Blogpost for Helix Parallelism #13547

Open
brb-nv wants to merge 1 commit into NVIDIA:main from brb-nv:user/brb/helix-blog-post

Conversation

Collaborator

@brb-nv brb-nv commented Apr 28, 2026

Description

This PR adds a blog post for Helix Parallelism.

Test Coverage

N/A

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • Documentation
    • Added a new blog post detailing Helix Parallelism techniques for scaling multi-million-token decoding with KV cache sharding, including performance optimization strategies, distributed architecture insights, and practical implementation considerations with performance benchmarks.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv requested a review from a team as a code owner April 28, 2026 06:10
@brb-nv brb-nv requested review from QiJune and Shixiaowei02 April 28, 2026 06:10
Contributor

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

A new blog post is added documenting Helix Parallelism for multi-million-token decoding with KV cache sharding. The post covers decoding bottlenecks, temporal disaggregation of parallelism between attention and FFN phases, distributed KV cache partitioning strategies, and TensorRT-LLM integration points with performance results.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation - Blog Post: `docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md` | New technical blog post (269 lines) detailing Helix Parallelism architecture, KV cache sharding techniques, distributed attention reconstruction via log-sum-exp rescaling, KV cache partitioning policies, TensorRT-LLM integration configuration and custom CUDA collectives, with DeepSeek-R1 performance analysis. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies that this PR adds documentation (blog post) for Helix Parallelism, which matches the changeset that introduces a new blog post file. |
| Description check | ✅ Passed | The PR description includes a brief explanation of the changes (blog post for Helix Parallelism), marks test coverage as N/A (appropriate for documentation), and includes the required checklist with confirmation, matching the template structure. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md (1)

33-33: Small wording polish for concision.

“On top of that” reads a bit wordy here; a shorter transition improves flow.

Suggested wording tweak
-...preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.
+...preserving long-range context is essential for relevance and coherence. Users also expect fast, interactive responses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 90cb031a-248f-4049-9cef-8923899b0990

📥 Commits

Reviewing files that changed from the base of the PR and between f3270f9 and 1dba74b.

⛔ Files ignored due to path filters (4)
  • docs/source/blogs/media/tech_blog21_attention_sharding.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_dsr1_fp4_pareto.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_helix_execution_flow.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_roofline_analysis.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md

Comment on lines +47 to +49
When handling multi-million-token contexts, each GPU must read a massive history of past tokens (KV cache) from DRAM per sample at every decoding step. This constant streaming can saturate DRAM bandwidth, increase TTL, and quickly become a major bottleneck as context length grows.

Consider Tensor Parallelism (TP) as an example: increasing TP can distribute weight loading across more GPUs, but in attention schemes like Grouped Query Attention (GQA) or Multi-Latent Attention (MLA), multiple query heads share a limited number of KV heads K. When TP exceeds K, the system must **duplicate** the multi-million-token KV cache across GPUs so each shard can serve its assigned query heads. Beyond this point, increasing TP neither shrinks per-GPU KV cache size nor speeds up KV reads, imposing a hard ceiling on attention time and forcing smaller batch sizes under tight TTL constraints. In the case of MLA (used by DeepSeek models), K is effectively 1, so any TP > 1 triggers full KV duplication.
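
To make the duplication point above concrete, here is a rough, hypothetical sizing sketch (GQA-style layout; all numbers are assumptions for illustration, not figures from the post):

```python
# Hypothetical sizing sketch (illustrative numbers only): per-GPU KV-cache bytes as TP grows.
def kv_bytes_per_gpu(seq_len, kv_heads, head_dim, n_layers, bytes_per_elem, tp):
    # GQA-style layout: each of the kv_heads stores a K and a V vector per token per layer.
    total = 2 * seq_len * kv_heads * head_dim * n_layers * bytes_per_elem
    # TP can shard the cache across at most kv_heads GPUs; beyond that every extra
    # GPU holds a duplicate, so the per-GPU footprint stops shrinking.
    return total / min(tp, kv_heads)

S, K, H, L, FP8 = 1_000_000, 1, 512, 61, 1   # assumed values; K=1 mirrors the MLA case
for tp in (1, 2, 4, 8):
    print(f"TP={tp}: ~{kv_bytes_per_gpu(S, K, H, L, FP8, tp) / 2**30:.1f} GiB of KV cache per GPU")
# With K=1, every TP value prints the same footprint: duplication, not sharding.
```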
Contributor

⚠️ Potential issue | 🟡 Minor

Clarify latency acronym (TTL) to avoid ambiguity.

At Line 47 and Line 49, TTL is unclear in this context. If this refers to decode responsiveness, use an explicit term like TTFT (time-to-first-token) or “latency” to prevent misinterpretation.

Suggested wording tweak
-This constant streaming can saturate DRAM bandwidth, increase TTL, and quickly become a major bottleneck as context length grows.
+This constant streaming can saturate DRAM bandwidth, increase latency (for example, TTFT), and quickly become a major bottleneck as context length grows.

-...forcing smaller batch sizes under tight TTL constraints.
+...forcing smaller batch sizes under tight latency constraints.

Collaborator

@pcastonguay pcastonguay left a comment

Overall looks good.

Collaborator

In this chart, the baseline curve drop is unexpected. Do we know why it drops like that? Might be worth explaining in the post.

Collaborator

+1

Collaborator

With these Pareto curves, I sometimes find it useful to add labels to the data points on the curve to highlight the fact that configurations change along the curve. Also, it might be useful for users to know what TP/PP/CP configurations are used.


+1


- **(Left)** Increasing TP beyond K yields diminishing returns for attention because KV cache must be duplicated, while FFN weight reads continue to benefit linearly from more TP. This divergence motivates using different parallelism widths for the two stages.
- **(Middle)** Self-attention DRAM read time scales linearly with KV sequence length S at fixed TP. At 1M+ tokens, attention reads dominate total layer latency.
- **(Right)** KV Parallelism (KVP) distributes KV cache reads across GPUs, achieving linear scaling of attention time with KVP width. The same GPUs are then re-provisioned for TP in FFNs to reduce weight-read latency.
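
Read as simple DRAM-read rooflines, the three panels correspond roughly to the following (a sketch; $BW$ is per-GPU DRAM bandwidth, $B$ the batch size, and $\mathrm{KV}(S)$ the per-sample KV-cache bytes at sequence length $S$):

$$
t_{\text{attn}}^{\mathrm{TP}} \approx \frac{B \cdot \mathrm{KV}(S)}{\min(\mathrm{TP}, K)\cdot BW}, \qquad
t_{\text{FFN}} \approx \frac{W_{\text{FFN}}}{\mathrm{TP}\cdot BW}, \qquad
t_{\text{attn}}^{\mathrm{KVP}} \approx \frac{B \cdot \mathrm{KV}(S)}{\mathrm{KVP}\cdot BW}.
$$

The first term flattens once TP exceeds K, the second keeps shrinking as TP grows, and the third scales linearly with S and inversely with the KVP width.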

You could expand this to EP and/or TP instead of limiting it to TP only, to make it more relevant to current models.


Modern AI applications increasingly rely on models that combine large parameter counts with multi-million-token context windows. Whether it is AI agents following months of conversation, legal assistants reasoning through gigabytes of case law, or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.

The growing demand to decode at this scale underscores the importance of FP4 compute and the high-bandwidth NVLink domain provided by NVIDIA Blackwell systems. **Helix Parallelism**, co-designed with Blackwell, addresses the two dominant bottlenecks in long-context decoding: KV cache streaming and FFN weight loading. It does so by temporally disaggregating the parallelism strategies used for attention and FFN within each transformer layer, allowing the same pool of GPUs to operate in the configuration best suited to each stage's bottleneck.

IMO, using "temporally disaggregating" is a little confusing and can be mistaken for an AFD-like approach.

I would state it more simply as:
It does so by using different parallelism strategies for attention and FFN ....


### High-Level Execution Flow

Helix is a hybrid sharding strategy that temporally disaggregates the parallelism strategies of attention and FFN within a single transformer layer. Instead of using one fixed parallelism configuration for the entire layer, Helix switches between two configurations using the same pool of N GPUs:
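
For intuition, here is a minimal sketch of how a GPU pool might be assigned to the two configurations (`helix_layouts` is a made-up helper, not a TensorRT-LLM API; the sizing rule is an assumption based on the description above):

```python
# Hypothetical layout helper (illustrative only): pick the two sharding layouts
# a pool of N GPUs alternates between within each transformer layer.
def helix_layouts(n_gpus: int, kv_heads: int) -> dict:
    tp_attn = min(n_gpus, kv_heads)   # attention TP is capped at K to avoid KV duplication
    kvp = n_gpus // tp_attn           # remaining GPUs shard the KV cache along the sequence
    return {
        "attention": {"tp": tp_attn, "kvp": kvp},  # configuration 1: KV-cache sharding
        "ffn":       {"tp": n_gpus},               # configuration 2: full TP over FFN weights
    }

print(helix_layouts(n_gpus=8, kv_heads=1))
# {'attention': {'tp': 1, 'kvp': 8}, 'ffn': {'tp': 8}}  (MLA-style, K effectively 1)
```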

see note in prev comment about using "temporally disaggregates"

2. After the All-to-All, each GPU has all partial outputs {O_i} and scalars {LSE_i}
3. The global result is reconstructed as: O = Σ_i (exp(LSE_i - LSE_global) · O_i), where LSE_global = log(Σ_i exp(LSE_i))

This is the same online softmax correction used in [Flash-Decoding](https://crfm.stanford.edu/2023/10/12/flashdecoding.html), extended to a distributed setting. The result is mathematically identical to single-GPU attention.
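
For concreteness, here is a small, self-contained NumPy sketch of the reconstruction (shapes, scaling, and helper names are illustrative assumptions, not the TensorRT-LLM kernels):

```python
import numpy as np

def local_attention(q, k, v):
    """Attention for a single query over one KV shard; returns the partial output and its log-sum-exp."""
    scores = (k @ q) / np.sqrt(q.shape[0])   # [s_i]
    lse = np.log(np.sum(np.exp(scores)))     # scalar LSE_i over this shard only
    o = np.exp(scores - lse) @ v             # [d], softmax restricted to the local shard
    return o, lse

rng = np.random.default_rng(0)
d, s, shards = 64, 1024, 4
q = rng.standard_normal(d)
k = rng.standard_normal((s, d))
v = rng.standard_normal((s, d))

# Each "GPU" attends over its own slice of the KV history.
parts = [local_attention(q, k_i, v_i)
         for k_i, v_i in zip(np.split(k, shards), np.split(v, shards))]
outs = np.stack([o for o, _ in parts])       # [shards, d]
lses = np.array([lse for _, lse in parts])   # [shards]

# Global reconstruction: O = sum_i exp(LSE_i - LSE_global) * O_i, LSE_global = log(sum_i exp(LSE_i))
lse_global = np.log(np.sum(np.exp(lses)))
o_combined = (np.exp(lses - lse_global)[:, None] * outs).sum(axis=0)

# Reference: the same attention over the full, unsharded KV cache.
o_ref, _ = local_attention(q, k, v)
assert np.allclose(o_combined, o_ref)        # matches single-GPU attention up to float tolerance
```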

Might want to add that each GPU does its own flash-decode across SMs, so in essence this is a hierarchical flash-decode technique: the first level within the GPU and the second distributed across GPUs.

Collaborator

+1


Modern AI applications increasingly rely on models that combine large parameter counts with multi-million-token context windows. Whether it is AI agents following months of conversation, legal assistants reasoning through gigabytes of case law, or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.

The growing demand to decode at this scale underscores the importance of FP4 compute and the high-bandwidth NVLink domain provided by NVIDIA Blackwell systems. **Helix Parallelism**, co-designed with Blackwell, addresses the two dominant bottlenecks in long-context decoding: KV cache streaming and FFN weight loading. It does so by temporally disaggregating the parallelism strategies used for attention and FFN within each transformer layer, allowing the same pool of GPUs to operate in the configuration best suited to each stage's bottleneck.
Collaborator

Suggested change
-The growing demand to decode at this scale underscores the importance of FP4 compute and the high-bandwidth NVLink domain provided by NVIDIA Blackwell systems. **Helix Parallelism**, co-designed with Blackwell, addresses the two dominant bottlenecks in long-context decoding: KV cache streaming and FFN weight loading. It does so by temporally disaggregating the parallelism strategies used for attention and FFN within each transformer layer, allowing the same pool of GPUs to operate in the configuration best suited to each stage's bottleneck.
+**Helix Parallelism**, co-designed with Blackwell, addresses the two dominant bottlenecks in long-context decoding: KV cache streaming and FFN weight loading. It does so by using different parallelism strategies for attention and FFN within each transformer layer, allowing the same pool of GPUs to operate in the configuration best suited to each stage's bottleneck. In this blog, we focus on sequence parallelism for attention and tensor parallelism for FFN operations.

Collaborator

Suggest leaving out the first line - keep this more technical and less about promoting specific NVIDIA tech


### FFN Weight Loading

During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. With small batch sizes typical of latency-sensitive workloads, this cost cannot be amortized, making FFN weight reads a dominant source of latency. More TP helps here by splitting weights across GPUs, but the KV cache duplication ceiling from above limits how far TP can scale - if TP is capped at K shards, only K GPUs can be used to shard the FFN, which accounts for roughly two-thirds of model parameters.
Collaborator

Suggested change
-During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. With small batch sizes typical of latency-sensitive workloads, this cost cannot be amortized, making FFN weight reads a dominant source of latency. More TP helps here by splitting weights across GPUs, but the KV cache duplication ceiling from above limits how far TP can scale - if TP is capped at K shards, only K GPUs can be used to shard the FFN, which accounts for roughly two-thirds of model parameters.
+During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. With small batch sizes typical of latency-sensitive workloads, this cost cannot be amortized, making FFN weight reads a dominant source of latency. More TP helps here by splitting weights across GPUs, but the KV cache duplication ceiling from above limits how far TP can scale - if TP is capped at K shards, only K GPUs can be used to shard the FFN, which typically accounts for the vast majority of model parameters.
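
As a rough model of the amortization point in the quoted paragraph (a sketch; $BW$ is per-GPU DRAM bandwidth, $b$ the bytes per weight, $B$ the decode batch size):

$$
t_{\text{FFN-read}} \approx \frac{P_{\text{FFN}} \cdot b}{\mathrm{TP}\cdot BW} \text{ per decoding step}
\quad\Longrightarrow\quad
\text{per generated token} \approx \frac{P_{\text{FFN}} \cdot b}{\mathrm{TP}\cdot BW \cdot B},
$$

which stays large when B is small and TP is capped at K.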


### Roofline Motivation

The following roofline analysis illustrates why decoupling attention and FFN sharding is essential. We model a dense LLM with batch B=8, MLA attention, query heads Q=128, KV heads K=1, QK head size Hsz=576, V head size Hsz=512, and FFN dimension F=65536 running on GB200 NVLink72. Both weights and KV cache are stored and fetched in FP8. Communication overhead from TP and KVP is not included; these plots show only the change in GPU DRAM-read latency as TP width and KVP width vary.
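
For readers who want to reproduce the qualitative shape of these plots, here is a simplified DRAM-read roofline sketch using the stated parameters (the hidden size and bandwidth are assumptions; like the plots, it ignores TP/KVP communication):

```python
# Simplified per-layer DRAM-read roofline for the configuration above (FP8 weights and KV cache).
B, K, HSZ_QK, HSZ_V, F = 8, 1, 576, 512, 65536   # from the text above
D_MODEL, BW, FP8 = 7168, 6e12, 1                  # assumptions: hidden size, ~6 TB/s per GPU, 1 byte/elem
SEQ_LEN = 1_000_000

def attn_read_tp_only(tp):
    # TP beyond K duplicates the KV cache, so attention reads stop improving.
    kv_bytes = B * SEQ_LEN * K * (HSZ_QK + HSZ_V) * FP8
    return kv_bytes / (min(tp, K) * BW)

def attn_read_kvp(kvp):
    # KV Parallelism shards the sequence dimension, so reads scale linearly with KVP width.
    kv_bytes = B * SEQ_LEN * K * (HSZ_QK + HSZ_V) * FP8
    return kv_bytes / (kvp * BW)

def ffn_read(tp):
    # FFN weight reads (two projection matrices assumed) keep scaling with TP.
    return 2 * D_MODEL * F * FP8 / (tp * BW)

for n in (1, 8, 32, 64):
    print(f"N={n:2d}: attn(TP-only)={attn_read_tp_only(n)*1e3:6.2f} ms, "
          f"attn(KVP)={attn_read_kvp(n)*1e3:6.2f} ms, ffn(TP)={ffn_read(n)*1e3:5.3f} ms per layer")
```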
Collaborator

Why these particular numbers? Are they representative of, say, DSR1 at TP=8?
