Why is streamingLLM incompatible with GQA while DuoAttention is compatible? #15

@BoxuanYang

Hi there,

I am reading your DuoAttention paper, and one paragraph confuses me:

Despite numerous efforts to overcome the challenges of attention mechanisms in long-context
inference, significant computational and memory issues persist. Architectural modifications, such
as Grouped-Query Attention (GQA) (Ainslie et al., 2023), require model pre-training and fail to
reduce computational costs. Linear Attention methods (Gu & Dao, 2023; Poli et al., 2023), while
less demanding in terms of computation and memory, often underperform in long-context scenarios
compared to Transformer models. Approximative attention methods, such as H2O (Zhang et al.,
2023b), StreamingLLM (Xiao et al., 2023b), TOVA (Oren et al., 2024), and FastGen (Ge et al., 2024),
often compromise accuracy in long-context applications and are incompatible with essential KV cache
optimization techniques like GQA.

Why is StreamingLLM incompatible with GQA while DuoAttention is compatible? It seems to me that each layer contains both retrieval heads and streaming heads, so it may well be the case that the set of attention heads sharing the same KV matrices contains both retrieval and streaming heads. In that case, the full KV matrices would still need to be stored for that attention group.
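To make the concern concrete, here is a minimal sketch of the scenario described above. It assumes a GQA layout where several query heads share one KV head, and toy per-query-head retrieval/streaming labels (the head counts and label pattern are illustrative, not taken from the paper): if any query head in a KV group is a retrieval head, the shared KV cache for that whole group must be kept in full.

```python
# Hypothetical sketch of the question: with GQA, GROUP_SIZE query heads
# share one KV head. If retrieval/streaming labels are assigned per query
# head, a mixed group forces the full KV cache for its shared KV head.
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8                         # GQA: 4 query heads per KV head
GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS

# Toy labels: True = retrieval head (needs full KV),
# False = streaming head (sink + recent tokens suffice).
is_retrieval = [i % 7 == 0 for i in range(NUM_QUERY_HEADS)]

groups_needing_full_kv = 0
for g in range(NUM_KV_HEADS):
    members = is_retrieval[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]
    # One retrieval head in the group is enough to pin the full KV cache.
    if any(members):
        groups_needing_full_kv += 1

print(f"{groups_needing_full_kv} of {NUM_KV_HEADS} KV groups need full KV")
# → 5 of 8 KV groups need full KV
```

With this toy labeling, only the KV groups whose query heads are all streaming heads can drop to a sink-plus-recent cache; any mixed group keeps its full KV, which is exactly the situation the question asks about.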
