Summary
Request implementation of Dual Chunk Attention (DCA), the technique used by Qwen 2/2.5 for efficient long-context training and inference on 100K+ token sequences.
Motivation
Training on sequences longer than 100K tokens is essential for document understanding, code generation, and video models. Qwen 2.5 demonstrates that DCA enables 128K context with sub-quadratic memory, while existing approaches fall short:
- Quadratic memory growth - Standard attention scales O(n²), making 128K+ sequences impractical even with FlashAttention
- CP doesn't reduce attention complexity - Context Parallelism distributes the sequence across devices but still computes full attention
- YaRN/ABF are position-encoding only - They extend positional range but don't address attention memory
DCA reduces memory from O(n²) to O(n·c) where c = chunk size, by combining local intra-chunk attention with global inter-chunk attention.
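For illustration, a minimal PyTorch sketch of the intra-chunk + inter-chunk pattern described above is given below. The pooled per-chunk key/value path used for the global branch, the function name, and the tensor layout are assumptions made for brevity; this is not the exact DCA formulation from the Qwen report. With this split, the score tensors cost O(n·c) locally and O(n·n/c) globally instead of O(n²).

```python
# Illustrative sketch only: intra-chunk (local) attention plus a pooled
# inter-chunk (global) path. The pooling trick is an assumption for brevity,
# not the exact Qwen DCA formulation.
import math
import torch
import torch.nn.functional as F


def dual_chunk_attention(q, k, v, chunk_size):
    """q, k, v: [batch, heads, seq, head_dim]; seq must be divisible by chunk_size."""
    b, h, n, d = q.shape
    c = chunk_size
    num_chunks = n // c

    # Intra-chunk: full attention restricted to each chunk -> O(n * c) memory.
    qc = q.view(b, h, num_chunks, c, d)
    kc = k.view(b, h, num_chunks, c, d)
    vc = v.view(b, h, num_chunks, c, d)
    local_scores = torch.einsum("bhncd,bhnkd->bhnck", qc, kc) / math.sqrt(d)
    # Causal mask inside each chunk.
    causal = torch.triu(torch.ones(c, c, dtype=torch.bool, device=q.device), diagonal=1)
    local_scores = local_scores.masked_fill(causal, float("-inf"))

    # Inter-chunk: each query also attends to one pooled key/value per chunk
    # -> O(n * n / c) memory, far below O(n^2) for large chunk sizes.
    k_pool = kc.mean(dim=3)  # [b, h, num_chunks, d]
    v_pool = vc.mean(dim=3)
    global_scores = torch.einsum("bhncd,bhmd->bhncm", qc, k_pool) / math.sqrt(d)
    # Causal mask across chunks: only attend to strictly earlier chunks.
    chunk_ids = torch.arange(num_chunks, device=q.device)
    chunk_mask = chunk_ids[None, :] >= chunk_ids[:, None]
    global_scores = global_scores.masked_fill(chunk_mask[:, None, :], float("-inf"))

    # Joint softmax over local and global keys so the two paths compete.
    scores = torch.cat([local_scores, global_scores], dim=-1)
    probs = F.softmax(scores, dim=-1)
    local_p, global_p = probs.split([c, num_chunks], dim=-1)
    out = torch.einsum("bhnck,bhnkd->bhncd", local_p, vc)
    out = out + torch.einsum("bhncm,bhmd->bhncd", global_p, v_pool)
    return out.reshape(b, h, n, d)
```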
Current State
Megatron has strong long-context parallelism (CP with p2p, a2a, allgather, hierarchical CP, YaRN, FlashAttention) but lacks algorithmic attention optimizations like chunked or sparse attention patterns.
Ask
- New attention module - DualChunkAttention with intra-chunk (local) + inter-chunk (global) attention (see the interface sketch below)
- Integration with Context Parallelism, FlashAttention, and GQA/MQA
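A possible interface for such a module might look like the following. This is a purely hypothetical sketch: none of the class, argument, or projection names exist in Megatron-Core today, and it reuses the `dual_chunk_attention` helper from the sketch above.

```python
# Hypothetical interface sketch for the requested module. Every name and
# argument is a placeholder, not an existing Megatron-Core API.
import torch
from torch import nn


class DualChunkAttention(nn.Module):
    """Self-attention block combining intra-chunk (local) and inter-chunk (global) paths."""

    def __init__(self, hidden_size: int, num_attention_heads: int,
                 num_query_groups: int, chunk_size: int):
        super().__init__()
        assert hidden_size % num_attention_heads == 0
        assert num_attention_heads % num_query_groups == 0
        self.chunk_size = chunk_size
        self.num_heads = num_attention_heads
        self.num_query_groups = num_query_groups
        self.head_dim = hidden_size // num_attention_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        # GQA/MQA: key/value projections use fewer heads than queries.
        self.kv_proj = nn.Linear(hidden_size, 2 * num_query_groups * self.head_dim)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, n, _ = hidden_states.shape
        q = self.q_proj(hidden_states).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv_proj(hidden_states).view(b, n, 2, self.num_query_groups, self.head_dim)
        k, v = kv.unbind(dim=2)
        # Expand grouped key/value heads to match the query head count.
        repeat = self.num_heads // self.num_query_groups
        k = k.transpose(1, 2).repeat_interleave(repeat, dim=1)
        v = v.transpose(1, 2).repeat_interleave(repeat, dim=1)
        out = dual_chunk_attention(q, k, v, self.chunk_size)  # helper from the sketch above
        return self.out_proj(out.transpose(1, 2).reshape(b, n, -1))
```

In practice the chunked kernel would be backed by FlashAttention per chunk, and chunk boundaries would need to be coordinated with Context Parallelism sharding; both are left out of this sketch.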
References
- Qwen2 Technical Report - DCA for 128K context