feat(mem_cache): page-major (layer-major within a page) KV/state layout #29533
Open
ch-wan wants to merge 1 commit into
Opt-in physical layout (--enable-page-major-kv-layout) that makes the page the outermost axis: each page is laid out layer-major in one contiguous byte buffer for the Mamba state, full-KV, and SWA-KV caches instead of per-layer tensors. At page_size=1 this is a token-granularity envelope; independent of any shared/virtual-slot allocator.

- mem_cache/layout/page_major.py: strided-view builders + byte geometry.
- PageMajorMHATokenToKVPool subclass (kv_cache_layout=page_major_layer_major) via _store_kv_layer / _move_kv_impl template hooks on MHATokenToKVPool; layout-incompatible methods raise instead of silently mis-indexing. MambaPool envelope branch for the conv/temporal state.
- Triton decode/extend + store_cache_4d kernels: page-aware strides, a byte-identical no-op at page_size=1 (PAGE_SIZE constexpr).
- GDN prefill gather/scatter in gdn_backend.forward_extend so the strided envelope state is persisted correctly to the pool (the prefill conv / chunk_gated_delta_rule kernels assume a contiguous slot layout).
- server_args flag + Triton-backend validator; model_runner_kv_cache_mixin routes the layout into the plain-MHA, SWA-hybrid, and Mamba-hybrid pools.
- Removed the dead enable_kvcache_transpose param.
- Tests: store_cache_4d / decode+extend parity, CPU view/move, and e2e page-major accuracy (gpt-oss, qwen) in the label-gated extra suite. Docs.

Co-Authored-By: lch1475369 <lch1475369@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
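For readers unfamiliar with the template-hook pattern the commit message mentions, here is a minimal pure-Python sketch. It is not the PR's code: BasePool, PageMajorPool, set_kv, get_kv, and contiguous_buffers are hypothetical names, and only the hook name _store_kv_layer mirrors the one named above. The point is that the public entry point stays in the base class, the subclass overrides the layout-specific hook, and a method that cannot work under the new layout raises rather than returning mis-indexed data.

```python
class BasePool:
    """Hypothetical stand-in for a layer-major pool (one buffer per layer)."""

    def __init__(self, num_layers, num_slots):
        self.buffers = [[None] * num_slots for _ in range(num_layers)]

    def set_kv(self, layer, slot, value):
        # Public entry point is shared; layout-specific work lives in the hook.
        self._store_kv_layer(layer, slot, value)

    def _store_kv_layer(self, layer, slot, value):
        self.buffers[layer][slot] = value

    def get_kv(self, layer, slot):
        return self.buffers[layer][slot]

    def contiguous_buffers(self):
        # Only meaningful when each layer owns its own contiguous buffer.
        return self.buffers


class PageMajorPool(BasePool):
    """Hypothetical subclass: slot (page_size=1) is the outermost axis."""

    def __init__(self, num_layers, num_slots):
        self.num_layers = num_layers
        self.flat = [None] * (num_slots * num_layers)

    def _store_kv_layer(self, layer, slot, value):
        self.flat[slot * self.num_layers + layer] = value

    def get_kv(self, layer, slot):
        return self.flat[slot * self.num_layers + layer]

    def contiguous_buffers(self):
        # Fail loudly instead of handing out a wrongly indexed view.
        raise NotImplementedError("no per-layer buffers under page-major layout")
```

Callers keep using set_kv unchanged; only code that depends on per-layer contiguous buffers has to handle the NotImplementedError.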
Motivation
SGLang stores the KV cache (and Mamba conv/SSM state) as per-layer tensors: layer 0's slots, then layer 1's slots, and so on (a layer-major layout). Each token's K/V for a given layer is contiguous, but a single token/page is scattered across num_layers separate allocations.

This PR adds an opt-in physical layout, --enable-page-major-kv-layout, that flips the outermost axis to the page: each page's whole depth, meaning all layers' K/V (and all Mamba conv/temporal state), lives in one contiguous byte buffer, laid out layer-major within the page. At page_size=1 this is a per-token envelope. Co-locating a page's whole depth is a building block for page-granular KV operations (movement, transfer, offload, allocation) and should improve locality for those paths (not yet benchmarked).

The layout is off by default and behavior-preserving when off: the hot paths are byte-identical (the page-aware Triton kernels constexpr-fold to the legacy addressing at page_size=1).

Modifications
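As a rough illustration of the two layouts described under Motivation, the byte-offset arithmetic can be sketched as below. This is a simplified model, not the code in mem_cache/layout/page_major.py: Geometry, layer_major_offset, page_major_offset, and page_span are hypothetical names, the legacy per-layer allocations are modeled as one flat buffer, and K and V are folded into a single token_bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Geometry:
    num_layers: int
    num_pages: int
    page_size: int    # tokens per page
    token_bytes: int  # bytes stored per token per layer (assumed uniform)


def layer_major_offset(g, layer, page, tok):
    # Legacy layout, modeled as one flat buffer: layer is the outermost axis.
    slot = page * g.page_size + tok
    return (layer * g.num_pages * g.page_size + slot) * g.token_bytes


def page_major_offset(g, layer, page, tok):
    # Page is the outermost axis; layers sit back to back inside each page.
    page_bytes = g.num_layers * g.page_size * g.token_bytes
    return page * page_bytes + (layer * g.page_size + tok) * g.token_bytes


def page_span(offset_fn, g, page):
    # Distance in bytes from the first to the last byte one page touches.
    offs = [offset_fn(g, layer, page, tok)
            for layer in range(g.num_layers)
            for tok in range(g.page_size)]
    return max(offs) + g.token_bytes - min(offs)


g = Geometry(num_layers=4, num_pages=8, page_size=2, token_bytes=16)
page_bytes = g.num_layers * g.page_size * g.token_bytes
print(page_span(page_major_offset, g, 3) == page_bytes)  # one contiguous run
print(page_span(layer_major_offset, g, 3) > page_bytes)  # spread across layers
```

Under the page-major formula a page's span equals its payload size, so moving or offloading a page is a single contiguous copy; under the layer-major formula the same page touches one small region per layer.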
- mem_cache/layout/page_major.py: standalone strided-view builders (build_page_major_mha_views, build_page_major_mamba_views) plus byte geometry; they hold no allocator state.
- PageMajorMHATokenToKVPool: a subclass of MHATokenToKVPool (not in-class branching) selected via new _store_kv_layer / _move_kv_cache_impl template hooks. Layout-incompatible inherited methods (contiguous-buf-infos, CPU offload, prefix-commit) raise NotImplementedError instead of silently mis-indexing the 4-D strided views.
- MambaPool gains an envelope branch for conv/temporal state.
- Triton decode/extend + store_cache_4d kernels: page-aware strides behind a PAGE_SIZE constexpr; at page_size=1 the page math is dead-code-eliminated, so the SASS is identical to today.
- GDN prefill gather/scatter (gdn_backend.forward_extend): the prefill conv (causal_conv1d_fwd) and chunk_gated_delta_rule kernels write state back assuming a contiguous slot layout; under the strided envelope they silently dropped the write. The hybrid prefill now runs on contiguous per-sequence copies and scatters the updated state back. (TODO(ch-wan) left to make those kernels stride-aware and drop the copies.)
- server_args: --enable-page-major-kv-layout flag plus a validator requiring the Triton attention / linear-attn / Mamba backends; model_runner_kv_cache_mixin routes the layout into the plain-MHA, SWA-hybrid, and Mamba-hybrid pools.
- Removed the dead enable_kvcache_transpose parameter (was always False).
- Tests: parity tests (store_cache_4d, decode/extend), CPU view/move tests, two e2e accuracy tests (gpt-oss, qwen) in the label-gated extra suite, and the server-arg doc.

Accuracy Tests
GSM8K (5-shot/300, Triton backend), page-major vs baseline:
A GDN-prefill state-persistence bug (page-major dropped Qwen3.5 to ~0.61) was found and fixed; the table reflects the fix. A dedicated review confirmed the disabled path is behavior-preserving with no measurable overhead.
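The shape of that state-persistence bug, and of the gather/scatter fix, can be shown with a small pure-Python stand-in. This is an assumption-laden sketch rather than the code in gdn_backend.forward_extend: the pool is a flat list instead of a strided tensor, page_size is 1, and slot_offsets, gather, scatter, and fake_state_update are hypothetical helpers. List slicing copies, which plays the role of a .contiguous() copy of a strided view.

```python
def slot_offsets(slots, layer, num_layers, state_len):
    # Page-major at page_size=1: each slot owns an envelope holding all layers.
    envelope_len = num_layers * state_len
    return [s * envelope_len + layer * state_len for s in slots]


def gather(pool, offsets, state_len):
    # Copy each slot's state for one layer into its own contiguous buffer.
    return [pool[o:o + state_len] for o in offsets]


def scatter(pool, offsets, states):
    # Write the updated copies back into the envelope positions.
    for o, state in zip(offsets, states):
        pool[o:o + len(state)] = state


def fake_state_update(states):
    # Stand-in for a kernel that writes the new state into its input buffers.
    for state in states:
        for i in range(len(state)):
            state[i] += 1.0


num_layers, state_len, num_slots = 3, 4, 5
pool = [0.0] * (num_slots * num_layers * state_len)
offsets = slot_offsets([1, 3], layer=2, num_layers=num_layers, state_len=state_len)

copies = gather(pool, offsets, state_len)
fake_state_update(copies)
dropped = sum(pool)    # pool untouched: the update only hit the copies
scatter(pool, offsets, copies)
persisted = sum(pool)  # pool now holds the updated state
```

Without the final scatter the pool never sees the update, which matches the silent drop described above; the extra copies are the cost the TODO aims to remove by making the kernels stride-aware.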
Speed Tests and Profiling
Not yet benchmarked: this PR is a layout/correctness foundation, off by default. The page_size=1 path is a verified no-op (constexpr-folded), so no regression is expected when disabled. Throughput/locality benchmarking of the enabled path is a follow-up.

Notes / follow-ups
- --enable-page-major-kv-layout is not yet supported with fp4 KV cache (asserted) or the speculative-decode target-verify path.
- GDN prefill currently uses a .contiguous() gather/scatter under the envelope (TODO(ch-wan)); making the conv / chunk_gated_delta_rule kernels stride-aware would remove it.

Checklist
- Docs (server_arguments.mdx)

Co-Authored-By: lch1475369 <lch1475369@gmail.com>
CI States
Latest PR Test (Base): ✅ Run #28309127030
Latest PR Test (Extra): ✅ Run #28309127002