Feature Request: Dual Chunk Attention for Long Context #2797

@sbhavani

Description

Summary

Request implementation of Dual Chunk Attention (DCA), the technique used by Qwen 2/2.5 for efficient long-context training and inference on 100K+ token sequences.

Motivation

Training on sequences longer than 100K tokens is essential for document understanding, code generation, and video models. Qwen 2.5 demonstrates that DCA enables 128K context with sub-quadratic memory; the techniques currently available do not:

  • Quadratic memory growth - Standard attention scales O(n²), making 128K+ sequences impractical even with FlashAttention
  • CP doesn't reduce attention complexity - Context Parallelism distributes the sequence across GPUs but still computes full attention
  • YaRN/ABF are position-encoding only - They extend positional range but don't address attention memory

DCA reduces attention memory from O(n²) to O(n·c), where c is the chunk size, by combining local intra-chunk attention with global inter-chunk attention.
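
Roughly, the intra-/inter-chunk split could look like the sketch below. This is plain PyTorch, not the Qwen/DCA kernel: the mean-pooled inter-chunk keys, the `dual_chunk_attention` name, and the `chunk_size` default are illustrative assumptions, and causal masking is omitted for brevity. It only illustrates why per-query key counts stay at O(c) plus one entry per chunk.

```python
# Illustrative sketch only: intra-chunk (local) attention plus a cheap
# inter-chunk (global) path over pooled per-chunk keys/values.
# NOT the actual Qwen/DCA algorithm; no causal masking.
import math
import torch
import torch.nn.functional as F


def dual_chunk_attention(q, k, v, chunk_size=256):
    """q, k, v: [batch, heads, seq_len, head_dim]; seq_len % chunk_size == 0."""
    b, h, n, d = q.shape
    c = chunk_size
    assert n % c == 0
    n_chunks = n // c

    # Reshape into chunks: [b, h, n_chunks, c, d]
    qc = q.reshape(b, h, n_chunks, c, d)
    kc = k.reshape(b, h, n_chunks, c, d)
    vc = v.reshape(b, h, n_chunks, c, d)

    # Intra-chunk (local) attention: each chunk attends only to itself,
    # so the score matrix is [c, c] per chunk -> O(n * c) memory overall.
    local_scores = torch.einsum("bhncd,bhnkd->bhnck", qc, kc) / math.sqrt(d)

    # Inter-chunk (global) attention: each query also attends to one pooled
    # key per chunk (its own included, for simplicity), adding O(n * n/c).
    k_pool = kc.mean(dim=3)                      # [b, h, n_chunks, d]
    v_pool = vc.mean(dim=3)                      # [b, h, n_chunks, d]
    global_scores = torch.einsum("bhncd,bhgd->bhncg", qc, k_pool) / math.sqrt(d)

    # Joint softmax over the concatenated local + global key sets.
    scores = torch.cat([local_scores, global_scores], dim=-1)
    probs = F.softmax(scores, dim=-1)
    p_local, p_global = probs.split([c, n_chunks], dim=-1)

    out = torch.einsum("bhnck,bhnkd->bhncd", p_local, vc)
    out = out + torch.einsum("bhncg,bhgd->bhncd", p_global, v_pool)
    return out.reshape(b, h, n, d)
```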

Current State

Megatron has strong long-context parallelism (CP with p2p, a2a, allgather, hierarchical CP, YaRN, FlashAttention) but lacks algorithmic attention optimizations like chunked or sparse attention patterns.

Ask

  1. New attention module - DualChunkAttention with intra-chunk (local) + inter-chunk (global) attention (a rough interface sketch follows this list)
  2. Integration with Context Parallelism, FlashAttention, and GQA/MQA
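
One possible shape for the requested module, sketched against plain PyTorch rather than Megatron's actual transformer/ModuleSpec APIs (all names here are hypothetical). GQA/MQA is illustrated by repeating K/V heads before the chunked paths; it reuses the `dual_chunk_attention` sketch above.

```python
# Hypothetical interface for request (1); not tied to Megatron internals.
import torch
import torch.nn as nn


class DualChunkAttention(nn.Module):
    def __init__(self, hidden_size, num_heads, num_query_groups, chunk_size=256):
        super().__init__()
        assert num_heads % num_query_groups == 0
        self.h, self.g = num_heads, num_query_groups
        self.d = hidden_size // num_heads
        self.chunk_size = chunk_size
        self.q_proj = nn.Linear(hidden_size, num_heads * self.d)
        self.kv_proj = nn.Linear(hidden_size, 2 * num_query_groups * self.d)
        self.out_proj = nn.Linear(num_heads * self.d, hidden_size)

    def forward(self, x):                          # x: [batch, seq, hidden]
        b, n, _ = x.shape
        q = self.q_proj(x).view(b, n, self.h, self.d).transpose(1, 2)
        kv = self.kv_proj(x).view(b, n, 2, self.g, self.d)
        k, v = kv.unbind(dim=2)
        # GQA: expand the grouped K/V heads to match the query heads.
        k = k.transpose(1, 2).repeat_interleave(self.h // self.g, dim=1)
        v = v.transpose(1, 2).repeat_interleave(self.h // self.g, dim=1)
        out = dual_chunk_attention(q, k, v, self.chunk_size)  # sketch above
        return self.out_proj(out.transpose(1, 2).reshape(b, n, self.h * self.d))
```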

References
