optimize_a_cc_me_absorb #4

jychen21 · 2024-08-22T09:24:32Z

We found that the following two operations are memory-inefficient:

q_nope = torch.matmul(q_nope, q_absorb)
attn_output = torch.matmul(attn_output, out_absorb.mT)

e.g.
q_nope: [B, M, 1, 128]
q_absorb: [1, M, 128, 512]

With these shapes, we found that the matmul triggers an elementwise data copy B times. We therefore permute q_nope from [B, M, 1, 128] to [1, M, B, 128]. This removes the redundant memory movement and also turns the B separate [1, 128] @ [128, 512] products (GEMV) into a single [B, 128] @ [128, 512] product (GEMM) per head.
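The shape manipulation described above can be checked with a minimal sketch. The sizes B = 4 and M = 3 are illustrative assumptions, not the benchmark configuration; the sketch only verifies that the permuted form gives the same result as the plain matmul, and makes no claim about speed.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions): B = batch, M = heads.
B, M, D, C = 4, 3, 128, 512

q_nope = torch.randn(B, M, 1, D)    # [B, M, 1, 128]
q_absorb = torch.randn(1, M, D, C)  # [1, M, 128, 512]

# Original form: q_absorb is broadcast over the batch dimension.
ref = torch.matmul(q_nope, q_absorb)  # [B, M, 1, C]

# Permuted form: move the batch into the row dimension so each head
# computes one [B, D] @ [D, C] product, then move it back.
out = torch.matmul(q_nope.transpose(0, 2), q_absorb).transpose(0, 2)

print(out.shape, torch.allclose(ref, out, rtol=1e-4, atol=1e-3))
```

Both `transpose` calls return views, so the only extra work is whatever layout handling matmul itself performs on the permuted input.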

Below are the test details:

Accuracy

Performance (A100)
python benchmark.py A_CC_ME 1024 --bsz 32

before:
after:

wx-csy · 2024-08-29T05:51:30Z

Good point! Actually, we found that using einsum instead of matmul achieves similar results. Maybe torch can already infer the optimal computation plan with einsum. Could you please switch to einsum for better readability?
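One way to compare the three variants locally is a small timing harness like the sketch below. `bench` is a hypothetical helper, not part of benchmark.py, and the sizes are assumptions; results will depend on the device and torch version, so no numbers are given here.

```python
import time
import torch

def bench(fn, iters=20):
    """Average wall-clock seconds per call (hypothetical helper)."""
    fn()  # warm-up call, excluded from timing
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
B, M, D, C = 32, 16, 128, 512  # assumed sizes
q_nope = torch.randn(B, M, 1, D, device=device)
q_absorb = torch.randn(M, D, C, device=device)

variants = {
    "matmul": lambda: torch.matmul(q_nope, q_absorb),
    "matmul+transpose": lambda: torch.matmul(
        q_nope.transpose(0, 2), q_absorb
    ).transpose(0, 2),
    "einsum": lambda: torch.einsum("bhqd,hdc->bhqc", q_nope, q_absorb),
}

timings = {name: bench(fn) for name, fn in variants.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```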

wx-csy · 2024-08-29T05:52:30Z

mla/impl/absorbed_cache_compressed_move_elision.py

@@ -174,7 +174,7 @@ def forward(self, hidden_states_q: torch.Tensor, q_position_ids: torch.LongTenso
            attn_weights, dim=-1, dtype=torch.float32
        ).to(q_nope.dtype)
        attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)
-        attn_output = torch.matmul(attn_output, out_absorb.mT) # torch.einsum('bhqc,hdc->bhqd', attn_output, out_absorb)
+        attn_output = attn_output = torch.matmul(attn_output.permute(2, 1, 0, 3), out_absorb.mT).permute(2, 1, 0, 3) # torch.einsum('bhqc,hdc->bhqd', attn_output, out_absorb)


This can be changed to attn_output = torch.einsum('bhqc,hdc->bhqd', attn_output, out_absorb)
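As a sanity check on this suggestion, the sketch below compares the einsum form against both matmul forms from the diff. The tensor sizes are illustrative assumptions, and out_absorb is kept 3-D ([heads, d, c]) to match the 'hdc' subscript.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions): batch, heads, query length, c, d.
B, H, Q, C, Dv = 4, 3, 1, 512, 128

attn_output = torch.randn(B, H, Q, C)  # 'bhqc'
out_absorb = torch.randn(H, Dv, C)     # 'hdc'

# Plain matmul against the transposed weights ([H, C, Dv]).
via_matmul = torch.matmul(attn_output, out_absorb.mT)

# Permuted matmul: swap batch and query dims, multiply, swap back.
via_permute = torch.matmul(
    attn_output.permute(2, 1, 0, 3), out_absorb.mT
).permute(2, 1, 0, 3)

# einsum contracts over c directly, so no explicit .mT or permute is needed.
via_einsum = torch.einsum("bhqc,hdc->bhqd", attn_output, out_absorb)

print(via_einsum.shape)
print(torch.allclose(via_matmul, via_einsum, rtol=1e-4, atol=1e-3))
print(torch.allclose(via_permute, via_einsum, rtol=1e-4, atol=1e-3))
```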

wx-csy · 2024-08-29T05:52:58Z

mla/impl/absorbed_cache_compressed_move_elision.py


        cos, sin = self.rotary_emb(q_pe)
        q_pe = apply_rotary_pos_emb(q_pe, cos, sin, q_position_ids)

-        q_nope = torch.matmul(q_nope, q_absorb) 
+        q_nope = torch.matmul(q_nope.transpose(0, 2), q_absorb).transpose(0, 2)


This can be changed to q_nope = torch.einsum('bhqd,hdc->bhqc', q_nope, q_absorb)

wx-csy · 2024-08-29T05:53:21Z

mla/impl/absorbed_cache_compressed_move_elision.py

-        q_absorb = kv_b_proj[:, :self.qk_nope_head_dim,:]
-        out_absorb = kv_b_proj[:, self.qk_nope_head_dim:, :]
+        q_absorb = kv_b_proj[:, :self.qk_nope_head_dim,:].unsqueeze(0)
+        out_absorb = kv_b_proj[:, self.qk_nope_head_dim:, :].unsqueeze(0)


No need to unsqueeze here if einsum is used in line 164.
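A minimal sketch of why the unsqueeze becomes unnecessary: the einsum subscript 'hdc' addresses the 3-D slices of kv_b_proj directly. The layout [heads, qk_nope_head_dim + v_head_dim, kv_lora_rank] is inferred from the slicing in the diff, and `v_head_dim`, `kv_lora_rank`, and all sizes are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Assumed sizes for illustration only.
B, H = 4, 3
qk_nope_head_dim, v_head_dim, kv_lora_rank = 128, 128, 512

kv_b_proj = torch.randn(H, qk_nope_head_dim + v_head_dim, kv_lora_rank)

# 3-D slices, no unsqueeze(0).
q_absorb = kv_b_proj[:, :qk_nope_head_dim, :]    # [H, 128, 512]
out_absorb = kv_b_proj[:, qk_nope_head_dim:, :]  # [H, 128, 512]

q_nope = torch.randn(B, H, 1, qk_nope_head_dim)

# einsum with the 3-D weights.
q_einsum = torch.einsum("bhqd,hdc->bhqc", q_nope, q_absorb)

# The matmul form from the diff, using the unsqueezed 4-D weights.
q_matmul = torch.matmul(
    q_nope.transpose(0, 2), q_absorb.unsqueeze(0)
).transpose(0, 2)

print(q_einsum.shape, torch.allclose(q_einsum, q_matmul, rtol=1e-4, atol=1e-3))
```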

optimize_a_cc_me_absorb

63fe244

wx-csy reviewed Aug 29, 2024


jukofyork mentioned this pull request Feb 7, 2025

Optimized DeepSeek V2/V3 implementation (MLA) ggml-org/llama.cpp#11446

Closed
