Conversation
Pull request overview
This PR aims to improve TPU inference performance for DeepSeek MLA attention by fusing KV-cache updates into the MLA ragged paged attention kernel (Pallas), enabling overlap between KV writes/prefetch (scalar lane) and attention compute (VPU lane). It also updates sharding configuration to introduce an attn_dp_expert mesh axis and adjusts quantization/random-weight-loading utilities accordingly.
Changes:
- Introduce an additional mesh axis (attn_dp_expert) and propagate it through the sharding strategy and TPU runner mesh construction.
- Add MLA attention-side KV quantization plumbing (scales + key quantization) and adjust output sharding constraints.
- Add a new MLA v1 "baseline" kernel module and update Qwix random-weight-loading scale key construction.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tpu_inference/runner/tpu_runner.py | Imports multihost utils, updates mesh shape to include attn_dp_expert, and tweaks compilation padding buckets for KV packing/alignment. |
| tpu_inference/models/jax/utils/qwix/qwix_utils.py | Changes scale-key derivation for Qwix random weight loading to handle deeper module paths. |
| tpu_inference/models/jax/deepseek_v3.py | Adjusts attention output sharding constraint placement; adds KV quantization for MLA inputs and new sharding knobs. |
| tpu_inference/layers/common/sharding.py | Adds attn_dp_expert axis, extends sharding axis-name groupings, and updates DP size computation/validation. |
| tpu_inference/layers/common/quantization/__init__.py | Makes quantize_kv accept value=None for key-only quantization. |
| tpu_inference/kernels/mla/v1/baseline.py | Adds a new MLA v1 kernel/baseline implementation, including KV update logic and Pallas call scaffolding. |
```python
def _create_single_slice_mesh(self) -> jax.Array:
    sharding_strategy: ShardingConfigManager = self.vllm_config.sharding_config
    mesh_shape = (
        sharding_strategy.model_dp_size,
        sharding_strategy.attn_dp_size,
        sharding_strategy.attn_dp_expert_size,
        sharding_strategy.expert_size,
        sharding_strategy.tp_size,
    )
```
Adding attn_dp_expert_size introduces a 5D mesh shape for single-slice, but _create_multi_slice_mesh() still builds a 4D ici_mesh_shape while the mesh axis names (MESH_AXIS_NAMES) are now 5D. This will likely cause a shape/axis mismatch (or silently incorrect sharding) when NUM_SLICES > 1; update the multi-slice mesh construction to include the new attn_dp_expert axis (and adjust dcn_mesh_shape accordingly).
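The rank mismatch can be checked in plain Python before any mesh is built. This is a minimal sketch with illustrative sizes; `MESH_AXIS_NAMES` and `NUM_SLICES` stand in for the real module-level values, and the actual fix would live inside `_create_multi_slice_mesh()`:

```python
# Hypothetical stand-ins for the module-level constants and config fields.
MESH_AXIS_NAMES = ("model_dp", "attn_dp", "attn_dp_expert", "expert", "tp")
NUM_SLICES = 2
model_dp_size, attn_dp_size, attn_dp_expert_size, expert_size, tp_size = 1, 1, 1, 1, 4

# Per-slice (ICI) mesh: now 5D, matching the single-slice path above.
ici_mesh_shape = (model_dp_size, attn_dp_size, attn_dp_expert_size,
                  expert_size, tp_size)
# The cross-slice (DCN) mesh must have the same rank; replicate every axis
# except the one assumed to span slices.
dcn_mesh_shape = (NUM_SLICES, 1, 1, 1, 1)

# Both meshes must agree in rank with the axis-name tuple, or mesh
# construction will fail (or silently mis-shard).
assert len(ici_mesh_shape) == len(dcn_mesh_shape) == len(MESH_AXIS_NAMES)
```

With both tuples at rank 5, a hybrid mesh helper (e.g. `mesh_utils.create_hybrid_device_mesh`) can consume them consistently with `MESH_AXIS_NAMES`.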
tpu_inference/runner/tpu_runner.py (outdated)
```diff
 additional_sizes = self.vllm_config.additional_config.get("compilation_sizes", [])
 # [16, 32, 64, 128, 256, 512, 1024, 2048]
 cache_dtype = self.cache_config.cache_dtype
 if cache_dtype == "auto":
     cache_dtype = self.dtype
 kv_cache_dtype = to_jax_dtype(cache_dtype)
 kv_packing = common_utils.get_dtype_packing(kv_cache_dtype)
 self.num_tokens_paddings = runner_utils.get_token_paddings(
-    min_token_size=max(16, self.dp_size),
+    min_token_size=max(16, self.dp_size * kv_packing),
     max_token_size=scheduler_config.max_num_batched_tokens *
     self.dp_size,
     padding_gap=vllm_envs.VLLM_TPU_BUCKET_PADDING_GAP)
 self.num_tokens_paddings = sorted(self.num_tokens_paddings + additional_sizes)
 self.num_tokens_paddings_per_dp = [
     padding // self.dp_size for padding in self.num_tokens_paddings
```
additional_sizes are appended directly into num_tokens_paddings, but later num_tokens_paddings_per_dp is computed via padding // self.dp_size. If any additional_sizes entries are not multiples of dp_size (and/or the kv_packing alignment you just introduced), the per-DP padding will be truncated and can create inconsistent shapes between global/per-DP token counts. Consider validating/rounding additional_sizes to the required alignment before merging them into the paddings list.
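One way to enforce the alignment before merging is to round each user-supplied size up to the required multiple. This is a sketch only; the helper names are illustrative, and the required multiple is assumed to be `dp_size * kv_packing`, per the diff above:

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple."""
    return ((x + multiple - 1) // multiple) * multiple

def align_additional_sizes(additional_sizes, dp_size, kv_packing):
    # Every global padding bucket must split evenly across DP ranks and
    # respect the KV-packing alignment, so round up to dp_size * kv_packing
    # and deduplicate before merging into the padding list.
    alignment = dp_size * kv_packing
    return sorted({round_up(s, alignment) for s in additional_sizes})

# e.g. dp_size=4, kv_packing=2 -> alignment of 8
print(align_additional_sizes([16, 30, 100], dp_size=4, kv_packing=2))
# → [16, 32, 104]
```

After this, `padding // self.dp_size` is exact for every entry, so the global and per-DP token counts stay consistent.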
```python
num_queries_per_block=num_queries_per_block,
q_scale=q_scale,
k_scale=k_scale,
v_scale=k_scale)
```
v_scale is computed (and set to self._v_scale for quantized KV) but the MLA kernel call passes v_scale=k_scale. This applies the wrong dequant scale to the attention output; it should pass the value scale (v_scale).
```diff
-v_scale=k_scale)
+v_scale=v_scale)
```
```python
) -> Tuple[jax.Array, jax.Array]:
    """Static quantize key and value tensors."""
    key = static_per_tensor_quantize_tensor(dtype, key, k_scale)
    if value is None:
        return key, None
```
quantize_kv now allows value=None and returns (key, None), but the return type annotation is still Tuple[jax.Array, jax.Array]. Also, k_scale/v_scale are defaulted to None but are passed directly into static_per_tensor_quantize_tensor, which expects a real scale value (will error if a caller relies on the defaults). Consider updating the return type to include None (e.g. jax.Array | None) and either make scales required again or add an explicit check/ValueError when they are None.
```diff
-) -> Tuple[jax.Array, jax.Array]:
-    """Static quantize key and value tensors."""
-    key = static_per_tensor_quantize_tensor(dtype, key, k_scale)
-    if value is None:
-        return key, None
+) -> tuple[jax.Array, jax.Array | None]:
+    """Static quantize key and value tensors."""
+    if k_scale is None:
+        raise ValueError("k_scale must be provided for quantize_kv.")
+    key = static_per_tensor_quantize_tensor(dtype, key, k_scale)
+    if value is None:
+        return key, None
+    if v_scale is None:
+        raise ValueError(
+            "v_scale must be provided for quantize_kv when value is not None."
+        )
```
```python
def update_kv_cache(
    new_kv_c: jax.Array,  # [num_tokens, actual_lkv_dim]
    new_k_pe: jax.Array,  # [num_tokens, actual_r_dim]
    cache_kv: jax.Array,  # [total_num_pages, page_size_per_kv_packing, kv_packing, lkv_dim+r_dim]
    kv_lens: jax.Array,  # i32[max_num_seqs]
    page_indices: jax.Array,  # i32[max_num_seqs * pages_per_seq]
    cu_q_lens: jax.Array,  # i32[max_num_seqs + 1]
    distribution: jax.Array,  # i32[3]
) -> tuple[jax.Array, jax.Array]:
    """Update KV cache with new tokens."""
    actual_r_dim = new_k_pe.shape[-1]
    r_dim = align_to(actual_r_dim, 128)
    if actual_r_dim != r_dim:
        new_k_pe = jnp.pad(new_k_pe, ((0, 0), (0, r_dim - actual_r_dim)),
                           constant_values=0)
    actual_lkv_dim = new_kv_c.shape[-1]
    lkv_dim = align_to(actual_lkv_dim, 128)
    if actual_lkv_dim != lkv_dim:
        new_kv_c = jnp.pad(new_kv_c, ((0, 0), (0, lkv_dim - actual_lkv_dim)),
                           constant_values=0)
    kv_dim = r_dim + lkv_dim
    _, page_size_per_kv_packing, kv_packing, cache_kv_dim = cache_kv.shape
    assert kv_dim == cache_kv_dim
    page_size = page_size_per_kv_packing * kv_packing

    max_num_seqs = kv_lens.shape[0]
    num_page_indices = page_indices.shape[0]
    pages_per_seq = num_page_indices // max_num_seqs

    def seq_loop_body(i, cache_kv):
        q_start, q_end = cu_q_lens[i], cu_q_lens[i + 1]
        q_len = q_end - q_start
        kv_len = kv_lens[i]

        def token_loop_body(j, cache_kv_):
            token_idx_in_seq = kv_len - q_len + j
            page_num_in_seq = token_idx_in_seq // page_size
            page_indices_start = i * pages_per_seq
            page_idx = page_indices[page_indices_start + page_num_in_seq]
            row = (token_idx_in_seq % page_size) // kv_packing
            col = (token_idx_in_seq % page_size) % kv_packing

            cache_kv_ = cache_kv_.at[page_idx, row, col,
                                     ..., :lkv_dim].set(new_kv_c[q_start + j])
            cache_kv_ = cache_kv_.at[page_idx, row, col, ...,
                                     lkv_dim:].set(new_k_pe[q_start + j])
            return cache_kv_

        return lax.fori_loop(0, q_len, token_loop_body, cache_kv)

    cache_kv = lax.fori_loop(0, distribution[-1], seq_loop_body, cache_kv)

    return cache_kv
```
update_kv_cache is annotated as returning tuple[jax.Array, jax.Array], but it actually returns a single cache_kv array. This is inconsistent with its implementation and with how callers use it (as a single array), and will confuse type-checkers/readers; update the return annotation (and docstring if needed) to match the actual return value.
```python
# Fused KV-Cache update: handled inside the Pallas kernel.
# Use JAX-compatible update_kv_cache to update the cache in a jit-friendly way
# Parallelize KV cache update across sequences using vmap
# Pad new_kv_c and new_k_pe to aligned dims before update
actual_lkv_dim = new_kv_c.shape[-1]
actual_r_dim = new_k_pe.shape[-1]
lkv_dim = align_to(actual_lkv_dim, 128)
r_dim = align_to(actual_r_dim, 128)
if actual_lkv_dim != lkv_dim:
    new_kv_c = jnp.pad(new_kv_c, ((0, 0), (0, lkv_dim - actual_lkv_dim)), constant_values=0)
if actual_r_dim != r_dim:
    new_k_pe = jnp.pad(new_k_pe, ((0, 0), (0, r_dim - actual_r_dim)), constant_values=0)

def update_kv_cache_per_seq(seq_idx, cache_kv):
    q_start = cu_q_lens[seq_idx]
    q_end = cu_q_lens[seq_idx + 1]
    q_len = q_end - q_start
    kv_len = kv_lens[seq_idx]
    _, page_size_per_kv_packing, kv_packing, _ = cache_kv.shape
    page_size = page_size_per_kv_packing * kv_packing
    num_page_indices = page_indices.shape[0]
    max_num_seqs = kv_lens.shape[0]
    pages_per_seq = num_page_indices // max_num_seqs

    def update_token(j, cache_kv_):
        token_idx_in_seq = kv_len - q_len + j
        page_num_in_seq = token_idx_in_seq // page_size
        page_indices_start = seq_idx * pages_per_seq
        page_idx = page_indices[page_indices_start + page_num_in_seq]
        row = (token_idx_in_seq % page_size) // kv_packing
        col = (token_idx_in_seq % page_size) % kv_packing
        cache_kv_ = cache_kv_.at[page_idx, row, col, ..., :lkv_dim].set(new_kv_c[q_start + j])
        cache_kv_ = cache_kv_.at[page_idx, row, col, ..., lkv_dim:].set(new_k_pe[q_start + j])
        return cache_kv_

    cache_kv = jax.lax.fori_loop(0, q_len, update_token, cache_kv)
    return cache_kv

seq_indices = jnp.arange(kv_lens.shape[0])
cache_kv = jax.lax.fori_loop(0, seq_indices.shape[0], update_kv_cache_per_seq, cache_kv)
```
This block claims the KV-cache update is "handled inside the Pallas kernel", but the code performs a full KV update pass in Python/JAX (fori_loop over sequences and tokens) before launching the Pallas attention kernel. That contradicts the intended fusion/concurrency described in the PR and will reintroduce the extra HBM pass (and likely compile a large scatter loop). Either move the KV update into _mla_ragged_paged_attention_kernel (scalar lane) or update the comments/PR description and keep this as an explicit unfused baseline path.
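Whichever path is kept, both versions share the same packed-cache index math, which can be sanity-checked in plain Python. A sketch with illustrative sizes (the helper name is hypothetical; it mirrors the `page_num_in_seq`/`row`/`col` arithmetic above):

```python
def kv_slot(token_idx_in_seq: int, page_size_per_kv_packing: int,
            kv_packing: int) -> tuple[int, int, int]:
    """Map a token's position within a sequence to (page_num_in_seq, row, col)
    in the packed cache layout used above."""
    page_size = page_size_per_kv_packing * kv_packing
    page_num_in_seq = token_idx_in_seq // page_size
    offset = token_idx_in_seq % page_size
    # Consecutive tokens fill the packing dimension (col) first, then rows.
    return page_num_in_seq, offset // kv_packing, offset % kv_packing

# page_size_per_kv_packing=8, kv_packing=2 -> 16 tokens per page
print(kv_slot(0, 8, 2))   # → (0, 0, 0)
print(kv_slot(3, 8, 2))   # → (0, 1, 1)
print(kv_slot(17, 8, 2))  # → (1, 0, 1)
```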
```python
# Currently tensor_parallelism is also used for other things like determining number of Ray workers.
pc_tensor_parallelism = parallel_config.tensor_parallel_size
ss_tensor_parallelsim = sharding_strategy.get("tensor_parallelism", 1)
data_parallelism = parallel_config.data_parallel_size
expert_parallelism = sharding_strategy.get("expert_parallelism", 1)
sequence_parallelism = sharding_strategy.get("sequence_parallelism", 1)
device_indexes = sharding_strategy.get("device_indexes", None)

enable_dp_attention = sharding_strategy.get("enable_dp_attention",
                                            False)
if pc_tensor_parallelism != ss_tensor_parallelsim and ss_tensor_parallelsim > 1:
    # The user has explicitly set the tensor parallelism in the sharding config.
    tensor_parallelism = ss_tensor_parallelsim
else:
    tensor_parallelism = pc_tensor_parallelism
```
ss_tensor_parallelsim is misspelled (should be ss_tensor_parallelism). Beyond readability, this makes it easier to accidentally introduce a second similarly-named variable later; consider renaming for clarity/consistency.
```python
sharding_strategy = vllm_config.additional_config.get(
    "sharding", {}).get("sharding_strategy", {})
parallel_config = vllm_config.parallel_config
tensor_parallelism = parallel_config.tensor_parallel_size
# Currently tensor_parallelism is also used for other things like determining number of Ray workers.
pc_tensor_parallelism = parallel_config.tensor_parallel_size
ss_tensor_parallelsim = sharding_strategy.get("tensor_parallelism", 1)
data_parallelism = parallel_config.data_parallel_size
expert_parallelism = sharding_strategy.get("expert_parallelism", 1)
sequence_parallelism = sharding_strategy.get("sequence_parallelism", 1)
device_indexes = sharding_strategy.get("device_indexes", None)

enable_dp_attention = sharding_strategy.get("enable_dp_attention",
                                            False)
if pc_tensor_parallelism != ss_tensor_parallelsim and ss_tensor_parallelsim > 1:
    # The user has explicitly set the tensor parallelism in the sharding config.
    tensor_parallelism = ss_tensor_parallelsim
else:
    tensor_parallelism = pc_tensor_parallelism
```
The logic for overriding tensor parallelism only applies when ss_tensor_parallelsim > 1. If a user explicitly sets tensor_parallelism to 1 in the sharding strategy (to override a larger parallel_config.tensor_parallel_size), that configuration will be ignored despite being explicitly provided. Consider checking for presence of the key (e.g. 'tensor_parallelism' in sharding_strategy) rather than > 1 so explicit overrides to 1 are honored.
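A sketch of the suggested presence check (names follow the snippet above; the corrected spelling `ss_tensor_parallelism` is used here):

```python
def resolve_tensor_parallelism(sharding_strategy: dict,
                               pc_tensor_parallelism: int) -> int:
    # Honor an explicit setting -- including an explicit 1 -- by checking
    # for the key's presence rather than comparing against a magic default.
    if "tensor_parallelism" in sharding_strategy:
        ss_tensor_parallelism = sharding_strategy["tensor_parallelism"]
        return ss_tensor_parallelism
    return pc_tensor_parallelism

print(resolve_tensor_parallelism({"tensor_parallelism": 1}, 8))  # → 1
print(resolve_tensor_parallelism({}, 8))                         # → 8
```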
There's already a kernel.py in main. Can you base the diff on that file instead of submitting a whole new file?

Yes, these changes are specifically for MLA optimization. I initially added new files to avoid breaking the base version during testing, but I'll move the optimized fused logic into the existing kernel.py now and remove the redundant files to keep the diff clean.

We need to add test coverage before submission imo (see: tests/kernels/mla_v1_test.py). Please work with Jaehong to expand test coverage and get those committed.

Agreed on the test coverage. I'll add cases to mla_v1_test.py that specifically exercise the new fused version and the vectorization paths, and include those in the next push.
There are so many conflicts in this PR; please update the branch to main's head first.
…pdated some shardings in DSV3.
Force-pushed from 474def3 to 0f41147.
Is this kernel connected anywhere? I don't see a way for e2e to trigger this code path.
…, license header)
This PR fuses the KV Cache update into the MLA attention kernel using Pallas.
Baseline — no fusion
Two separate passes over HBM:
1. Update KV cache — scatter new tokens into cache_kv in HBM (one read + write per token)
2. Run attention — DMA-fetch KV blocks from HBM, run Flash Attention on VPU
The cache update and attention are sequential. The VPU sits idle while the cache writes finish.
Fused — KV Cache update inside Pallas attention
A single Pallas kernel does both in one pass. Pallas on TPU exposes two independent execution lanes:
1. Scalar unit — runs DMA commands (prefetch next KV block + write new tokens into cache)
2. Vector unit (VPU) — runs Flash Attention on the current KV block
These run concurrently. While the VPU computes attention on block N, the scalar unit simultaneously writes new tokens into the cache and prefetches block N+1. The KV write latency is fully hidden behind VPU compute.
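The intended two-lane overlap can be illustrated with a toy schedule in pure Python (no Pallas; block counts and event strings are illustrative, not the kernel's actual API):

```python
def fused_schedule(num_blocks: int) -> list[str]:
    """Simulate the per-block event order: while the VPU computes attention
    on block n, the scalar unit writes new tokens and prefetches block n+1."""
    events = ["scalar: prefetch block 0"]
    for n in range(num_blocks):
        # On real hardware the next two events run concurrently: the KV write
        # and next-block prefetch are hidden behind the attention compute.
        events.append(f"scalar: write new tokens + prefetch block {n + 1}"
                      if n + 1 < num_blocks else
                      "scalar: write new tokens")
        events.append(f"vpu: attention on block {n}")
    return events

for e in fused_schedule(3):
    print(e)
```

The point of the simulation is the interleaving: no step exists where the VPU waits on a cache write, which is exactly the idle time the baseline pays for.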
Checklist
- I have performed a self-review of my code.
- I have added necessary comments in my code, particularly in hard-to-understand areas.
- I have made or will make corresponding changes to any relevant documentation.