Why is streamingLLM incompatible with GQA while DuoAttention is compatible? #15

@BoxuanYang

Hi there,

I am reading your DuoAttention paper, and one paragraph confuses me:

Despite numerous efforts to overcome the challenges of attention mechanisms in long-context
inference, significant computational and memory issues persist. Architectural modifications, such
as Grouped-Query Attention (GQA) (Ainslie et al., 2023), require model pre-training and fail to
reduce computational costs. Linear Attention methods (Gu & Dao, 2023; Poli et al., 2023), while
less demanding in terms of computation and memory, often underperform in long-context scenarios
compared to Transformer models. Approximative attention methods, such as H2O (Zhang et al.,
2023b), StreamingLLM (Xiao et al., 2023b), TOVA (Oren et al., 2024), and FastGen (Ge et al., 2024),
often compromise accuracy in long-context applications and are incompatible with essential KV cache
optimization techniques like GQA.

Why is StreamingLLM incompatible with GQA while DuoAttention is compatible? It seems to me that each layer contains both retrieval heads and streaming heads, so it may well be the case that the set of attention heads sharing the same KV matrices contains both retrieval and streaming heads. In that case, the full KV matrices would still need to be stored for that attention group.
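To make the concern concrete, here is a minimal sketch of the scenario described above. It assumes a GQA layout where several query heads share one KV head, and toy per-query-head retrieval/streaming labels (the head counts and label pattern are illustrative, not taken from the paper): if any query head in a KV group is a retrieval head, the shared KV cache for that whole group must be kept in full.

```python
# Hypothetical sketch of the question: with GQA, GROUP_SIZE query heads
# share one KV head. If retrieval/streaming labels are assigned per query
# head, a mixed group forces the full KV cache for its shared KV head.
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8                         # GQA: 4 query heads per KV head
GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS

# Toy labels: True = retrieval head (needs full KV),
# False = streaming head (sink + recent tokens suffice).
is_retrieval = [i % 7 == 0 for i in range(NUM_QUERY_HEADS)]

groups_needing_full_kv = 0
for g in range(NUM_KV_HEADS):
    members = is_retrieval[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]
    # One retrieval head in the group is enough to pin the full KV cache.
    if any(members):
        groups_needing_full_kv += 1

print(f"{groups_needing_full_kv} of {NUM_KV_HEADS} KV groups need full KV")
# → 5 of 8 KV groups need full KV
```

With this toy labeling, only the KV groups whose query heads are all streaming heads can drop to a sink-plus-recent cache; any mixed group keeps its full KV, which is exactly the situation the question asks about.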
