
[None][doc] Blogpost for Helix Parallelism #13547

Open
brb-nv wants to merge 1 commit into NVIDIA:main from brb-nv:user/brb/helix-blog-post

Conversation

Collaborator

@brb-nv brb-nv commented Apr 28, 2026

Description

This PR adds a blog post for Helix Parallelism.

Test Coverage

N/A

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • Documentation
    • Added a new blog post detailing Helix Parallelism techniques for scaling multi-million-token decoding with KV cache sharding, including performance optimization strategies, distributed architecture insights, and practical implementation considerations with performance benchmarks.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv requested a review from a team as a code owner April 28, 2026 06:10
@brb-nv brb-nv requested review from QiJune and Shixiaowei02 April 28, 2026 06:10
Contributor

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

A new blog post is added documenting Helix Parallelism for multi-million-token decoding with KV cache sharding. The post covers decoding bottlenecks, temporal disaggregation of parallelism between attention and FFN phases, distributed KV cache partitioning strategies, and TensorRT-LLM integration points with performance results.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation - Blog Post: `docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md` | New technical blog post (269 lines) detailing Helix Parallelism architecture, KV cache sharding techniques, distributed attention reconstruction via log-sum-exp rescaling, KV cache partitioning policies, TensorRT-LLM integration configuration and custom CUDA collectives, with DeepSeek-R1 performance analysis. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies that this PR adds documentation (blog post) for Helix Parallelism, which matches the changeset that introduces a new blog post file. |
| Description check | ✅ Passed | The PR description includes a brief explanation of the changes (blog post for Helix Parallelism), marks test coverage as N/A (appropriate for documentation), and includes the required checklist with confirmation, matching the template structure. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md (1)

33-33: Small wording polish for concision.

“On top of that” reads a bit wordy here; a shorter transition improves flow.

Suggested wording tweak
-...preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.
+...preserving long-range context is essential for relevance and coherence. Users also expect fast, interactive responses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 90cb031a-248f-4049-9cef-8923899b0990

📥 Commits

Reviewing files that changed from the base of the PR and between f3270f9 and 1dba74b.

⛔ Files ignored due to path filters (4)
  • docs/source/blogs/media/tech_blog21_attention_sharding.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_dsr1_fp4_pareto.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_helix_execution_flow.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog21_roofline_analysis.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md

Comment on lines +47 to +49
When handling multi-million-token contexts, each GPU must read a massive history of past tokens (KV cache) from DRAM per sample at every decoding step. This constant streaming can saturate DRAM bandwidth, increase TTL, and quickly become a major bottleneck as context length grows.

Consider Tensor Parallelism (TP) as an example: increasing TP can distribute weight loading across more GPUs, but in attention schemes like Grouped Query Attention (GQA) or Multi-Latent Attention (MLA), multiple query heads share a limited number of KV heads K. When TP exceeds K, the system must **duplicate** the multi-million-token KV cache across GPUs so each shard can serve its assigned query heads. Beyond this point, increasing TP neither shrinks per-GPU KV cache size nor speeds up KV reads, imposing a hard ceiling on attention time and forcing smaller batch sizes under tight TTL constraints. In the case of MLA (used by DeepSeek models), K is effectively 1, so any TP > 1 triggers full KV duplication.
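
To make the duplication point above concrete, here is a rough, hypothetical sizing sketch (GQA-style layout; all numbers are assumptions for illustration, not figures from the post):

```python
# Hypothetical sizing sketch (illustrative numbers only): per-GPU KV-cache bytes as TP grows.
def kv_bytes_per_gpu(seq_len, kv_heads, head_dim, n_layers, bytes_per_elem, tp):
    # GQA-style layout: each of the kv_heads stores a K and a V vector per token per layer.
    total = 2 * seq_len * kv_heads * head_dim * n_layers * bytes_per_elem
    # TP can shard the cache across at most kv_heads GPUs; beyond that every extra
    # GPU holds a duplicate, so the per-GPU footprint stops shrinking.
    return total / min(tp, kv_heads)

S, K, H, L, FP8 = 1_000_000, 1, 512, 61, 1   # assumed values; K=1 mirrors the MLA case
for tp in (1, 2, 4, 8):
    print(f"TP={tp}: ~{kv_bytes_per_gpu(S, K, H, L, FP8, tp) / 2**30:.1f} GiB of KV cache per GPU")
# With K=1, every TP value prints the same footprint: duplication, not sharding.
```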
Contributor

⚠️ Potential issue | 🟡 Minor

Clarify latency acronym (TTL) to avoid ambiguity.

At Line 47 and Line 49, TTL is unclear in this context. If this refers to decode responsiveness, use an explicit term like TTFT (time-to-first-token) or “latency” to prevent misinterpretation.

Suggested wording tweak
-This constant streaming can saturate DRAM bandwidth, increase TTL, and quickly become a major bottleneck as context length grows.
+This constant streaming can saturate DRAM bandwidth, increase latency (for example, TTFT), and quickly become a major bottleneck as context length grows.

-...forcing smaller batch sizes under tight TTL constraints.
+...forcing smaller batch sizes under tight latency constraints.

Collaborator

@pcastonguay pcastonguay left a comment

Overall looks good.

Collaborator

In this chart, the baseline curve drop is unexpected. Do we know why it drops like that? Might be worth explaining in the post.

Collaborator

+1

Collaborator

With these Pareto curves, I sometimes find it useful to add labels to the data points on the curve to highlight the fact that configurations change along the curve. Also, it might be useful for users to know what TP/PP/CP configurations are used.


+1


- **(Left)** Increasing TP beyond K yields diminishing returns for attention because KV cache must be duplicated, while FFN weight reads continue to benefit linearly from more TP. This divergence motivates using different parallelism widths for the two stages.
- **(Middle)** Self-attention DRAM read time scales linearly with KV sequence length S at fixed TP. At 1M+ tokens, attention reads dominate total layer latency.
- **(Right)** KV Parallelism (KVP) distributes KV cache reads across GPUs, achieving linear scaling of attention time with KVP width. The same GPUs are then re-provisioned for TP in FFNs to reduce weight-read latency.
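
Read as simple DRAM-read rooflines, the three panels correspond roughly to the following (a sketch; $BW$ is per-GPU DRAM bandwidth, $B$ the batch size, and $\mathrm{KV}(S)$ the per-sample KV-cache bytes at sequence length $S$):

$$
t_{\text{attn}}^{\mathrm{TP}} \approx \frac{B \cdot \mathrm{KV}(S)}{\min(\mathrm{TP}, K)\cdot BW}, \qquad
t_{\text{FFN}} \approx \frac{W_{\text{FFN}}}{\mathrm{TP}\cdot BW}, \qquad
t_{\text{attn}}^{\mathrm{KVP}} \approx \frac{B \cdot \mathrm{KV}(S)}{\mathrm{KVP}\cdot BW}.
$$

The first term flattens once TP exceeds K, the second keeps shrinking as TP grows, and the third scales linearly with S and inversely with the KVP width.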

You could expand this to EP and/or TP instead of limiting it to TP only, to make it more relevant to current models.


Modern AI applications increasingly rely on models that combine large parameter counts with multi-million-token context windows. Whether it is AI agents following months of conversation, legal assistants reasoning through gigabytes of case law, or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.

The growing demand to decode at this scale underscores the importance of FP4 compute and the high-bandwidth NVLink domain provided by NVIDIA Blackwell systems. **Helix Parallelism**, co-designed with Blackwell, addresses the two dominant bottlenecks in long-context decoding: KV cache streaming and FFN weight loading. It does so by temporally disaggregating the parallelism strategies used for attention and FFN within each transformer layer, allowing the same pool of GPUs to operate in the configuration best suited to each stage's bottleneck.

IMO, using "temporally disaggregating" is a little confusing and can be mistaken for an AFD-like approach.

I would state it more simply as:
It does so by using different parallelism strategies for attention and FFN ....


### High-Level Execution Flow

Helix is a hybrid sharding strategy that temporally disaggregates the parallelism strategies of attention and FFN within a single transformer layer. Instead of using one fixed parallelism configuration for the entire layer, Helix switches between two configurations using the same pool of N GPUs:
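
For intuition, here is a minimal sketch of how a GPU pool might be assigned to the two configurations (`helix_layouts` is a made-up helper, not a TensorRT-LLM API; the sizing rule is an assumption based on the description above):

```python
# Hypothetical layout helper (illustrative only): pick the two sharding layouts
# a pool of N GPUs alternates between within each transformer layer.
def helix_layouts(n_gpus: int, kv_heads: int) -> dict:
    tp_attn = min(n_gpus, kv_heads)   # attention TP is capped at K to avoid KV duplication
    kvp = n_gpus // tp_attn           # remaining GPUs shard the KV cache along the sequence
    return {
        "attention": {"tp": tp_attn, "kvp": kvp},  # configuration 1: KV-cache sharding
        "ffn":       {"tp": n_gpus},               # configuration 2: full TP over FFN weights
    }

print(helix_layouts(n_gpus=8, kv_heads=1))
# {'attention': {'tp': 1, 'kvp': 8}, 'ffn': {'tp': 8}}  (MLA-style, K effectively 1)
```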

see note in prev comment about using "temporally disaggregates"

2. After the All-to-All, each GPU has all partial outputs {O_i} and scalars {LSE_i}
3. The global result is reconstructed as: O = Σ_i (exp(LSE_i - LSE_global) · O_i), where LSE_global = log(Σ_i exp(LSE_i))

This is the same online softmax correction used in [Flash-Decoding](https://crfm.stanford.edu/2023/10/12/flashdecoding.html), extended to a distributed setting. The result is mathematically identical to single-GPU attention.
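
For concreteness, here is a small, self-contained NumPy sketch of the reconstruction (shapes, scaling, and helper names are illustrative assumptions, not the TensorRT-LLM kernels):

```python
import numpy as np

def local_attention(q, k, v):
    """Attention for a single query over one KV shard; returns the partial output and its log-sum-exp."""
    scores = (k @ q) / np.sqrt(q.shape[0])   # [s_i]
    lse = np.log(np.sum(np.exp(scores)))     # scalar LSE_i over this shard only
    o = np.exp(scores - lse) @ v             # [d], softmax restricted to the local shard
    return o, lse

rng = np.random.default_rng(0)
d, s, shards = 64, 1024, 4
q = rng.standard_normal(d)
k = rng.standard_normal((s, d))
v = rng.standard_normal((s, d))

# Each "GPU" attends over its own slice of the KV history.
parts = [local_attention(q, k_i, v_i)
         for k_i, v_i in zip(np.split(k, shards), np.split(v, shards))]
outs = np.stack([o for o, _ in parts])       # [shards, d]
lses = np.array([lse for _, lse in parts])   # [shards]

# Global reconstruction: O = sum_i exp(LSE_i - LSE_global) * O_i, LSE_global = log(sum_i exp(LSE_i))
lse_global = np.log(np.sum(np.exp(lses)))
o_combined = (np.exp(lses - lse_global)[:, None] * outs).sum(axis=0)

# Reference: the same attention over the full, unsharded KV cache.
o_ref, _ = local_attention(q, k, v)
assert np.allclose(o_combined, o_ref)        # matches single-GPU attention up to float tolerance
```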

Might want to add that each GPU does its own flash-decode across SMs, so in essence this is a hierarchical flash-decode technique: the first level within the GPU and the second distributed across GPUs.

Collaborator

+1


Modern AI applications increasingly rely on models that combine large parameter counts with multi-million-token context windows. Whether it is AI agents following months of conversation, legal assistants reasoning through gigabytes of case law, or coding copilots navigating sprawling repositories, preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.

The growing demand to decode at this scale underscores the importance of FP4 compute and the high-bandwidth NVLink domain provided by NVIDIA Blackwell systems. **Helix Parallelism**, co-designed with Blackwell, addresses the two dominant bottlenecks in long-context decoding: KV cache streaming and FFN weight loading. It does so by temporally disaggregating the parallelism strategies used for attention and FFN within each transformer layer, allowing the same pool of GPUs to operate in the configuration best suited to each stage's bottleneck.
Collaborator

Suggested change
-The growing demand to decode at this scale underscores the importance of FP4 compute and the high-bandwidth NVLink domain provided by NVIDIA Blackwell systems. **Helix Parallelism**, co-designed with Blackwell, addresses the two dominant bottlenecks in long-context decoding: KV cache streaming and FFN weight loading. It does so by temporally disaggregating the parallelism strategies used for attention and FFN within each transformer layer, allowing the same pool of GPUs to operate in the configuration best suited to each stage's bottleneck.
+**Helix Parallelism**, co-designed with Blackwell, addresses the two dominant bottlenecks in long-context decoding: KV cache streaming and FFN weight loading. It does so by using different parallelism strategies for attention and FFN within each transformer layer, allowing the same pool of GPUs to operate in the configuration best suited to each stage's bottleneck. In this blog, we focus on sequence parallelism for attention and tensor parallelism for FFN operations.

Collaborator

Suggest leaving out the first line - keep this more technical and less about promoting specific NVIDIA tech


### FFN Weight Loading

During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. With small batch sizes typical of latency-sensitive workloads, this cost cannot be amortized, making FFN weight reads a dominant source of latency. More TP helps here by splitting weights across GPUs, but the KV cache duplication ceiling from above limits how far TP can scale - if TP is capped at K shards, only K GPUs can be used to shard the FFN, which accounts for roughly two-thirds of model parameters.
Collaborator

Suggested change
-During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. With small batch sizes typical of latency-sensitive workloads, this cost cannot be amortized, making FFN weight reads a dominant source of latency. More TP helps here by splitting weights across GPUs, but the KV cache duplication ceiling from above limits how far TP can scale - if TP is capped at K shards, only K GPUs can be used to shard the FFN, which accounts for roughly two-thirds of model parameters.
+During autoregressive decoding, generating every new token requires loading large FFN weights from DRAM. With small batch sizes typical of latency-sensitive workloads, this cost cannot be amortized, making FFN weight reads a dominant source of latency. More TP helps here by splitting weights across GPUs, but the KV cache duplication ceiling from above limits how far TP can scale - if TP is capped at K shards, only K GPUs can be used to shard the FFN, which typically accounts for the vast majority of model parameters.
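
As a rough model of the amortization point in the quoted paragraph (a sketch; $BW$ is per-GPU DRAM bandwidth, $b$ the bytes per weight, $B$ the decode batch size):

$$
t_{\text{FFN-read}} \approx \frac{P_{\text{FFN}} \cdot b}{\mathrm{TP}\cdot BW} \text{ per decoding step}
\quad\Longrightarrow\quad
\text{per generated token} \approx \frac{P_{\text{FFN}} \cdot b}{\mathrm{TP}\cdot BW \cdot B},
$$

which stays large when B is small and TP is capped at K.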


### Roofline Motivation

The following roofline analysis illustrates why decoupling attention and FFN sharding is essential. We model a dense LLM with batch B=8, MLA attention, query heads Q=128, KV heads K=1, QK head size Hsz=576, V head size Hsz=512, and FFN dimension F=65536 running on GB200 NVLink72. Both weights and KV cache are stored and fetched in FP8. Communication overhead from TP and KVP is not included; these plots show only the change in GPU DRAM-read latency as TP width and KVP width vary.
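
For readers who want to reproduce the qualitative shape of these plots, here is a simplified DRAM-read roofline sketch using the stated parameters (the hidden size and bandwidth are assumptions; like the plots, it ignores TP/KVP communication):

```python
# Simplified per-layer DRAM-read roofline for the configuration above (FP8 weights and KV cache).
B, K, HSZ_QK, HSZ_V, F = 8, 1, 576, 512, 65536   # from the text above
D_MODEL, BW, FP8 = 7168, 6e12, 1                  # assumptions: hidden size, ~6 TB/s per GPU, 1 byte/elem
SEQ_LEN = 1_000_000

def attn_read_tp_only(tp):
    # TP beyond K duplicates the KV cache, so attention reads stop improving.
    kv_bytes = B * SEQ_LEN * K * (HSZ_QK + HSZ_V) * FP8
    return kv_bytes / (min(tp, K) * BW)

def attn_read_kvp(kvp):
    # KV Parallelism shards the sequence dimension, so reads scale linearly with KVP width.
    kv_bytes = B * SEQ_LEN * K * (HSZ_QK + HSZ_V) * FP8
    return kv_bytes / (kvp * BW)

def ffn_read(tp):
    # FFN weight reads (two projection matrices assumed) keep scaling with TP.
    return 2 * D_MODEL * F * FP8 / (tp * BW)

for n in (1, 8, 32, 64):
    print(f"N={n:2d}: attn(TP-only)={attn_read_tp_only(n)*1e3:6.2f} ms, "
          f"attn(KVP)={attn_read_kvp(n)*1e3:6.2f} ms, ffn(TP)={ffn_read(n)*1e3:5.3f} ms per layer")
```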
Collaborator

Why these particular numbers? Are they representative of, say, DSR1 at TP=8?
