feat(mem_cache): Hybrid Memory Pool system (Step 2: HybridReqToTokenPool)#1033
Closed
Rodrian7 wants to merge 7 commits into
Conversation
Migrated from epic/support_kimi_linear with DP support added.
Pure buffer pool for linear recurrent layers (KDA/Mamba/GDN).
Key changes vs epic:
- max_num_reqs → size (align with upstream sglang MambaPool)
- dp_size param with slot dim sharded on P("data", ...)
- total_slots = ceil_to(size+1, dp_size) for DP divisibility
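The rounding rule above can be sketched as follows (`ceil_to` is the helper name used in the commit text; its body here is an assumption based on that name):

```python
import math

def ceil_to(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return math.ceil(value / multiple) * multiple

# total_slots = ceil_to(size + 1, dp_size): the +1 accounts for the dummy
# slot, and rounding up keeps the sharded first dim divisible by dp_size.
size, dp_size = 10, 4
total_slots = ceil_to(size + 1, dp_size)  # -> 12
```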
…alloc

Coordinates KV slot + recurrent state slot allocation atomically for hybrid linear-recurrent models (Kimi-Linear, Mamba). Companion to RecurrentStatePool from sgl-project#1031.

Design:
- Host-side allocator state (recurrent_free_slots, mapping) lives on this class so RecurrentStatePool stays a pure pytree leaf safe for JIT donate. The class is intentionally NOT registered as a pytree node (it would otherwise reset on tree_unflatten).
- Reuse semantics align with the parent ReqToTokenPool contract: a req holding recurrent_pool_idx (e.g. the next chunk of a chunked prefill) keeps its slot; no sticky-flag bookkeeping needed.
- Atomic alloc: pre-check recurrent capacity, then super().alloc, then batch-slice the new recurrent slots with a single clear-on-alloc call.
- Slot 0 of RecurrentStatePool is the dummy slot; mapping defaults to 0, so an unallocated req lands on the dummy slot.

Also adds the Req.recurrent_pool_idx field as part of the contract.

Tests cover fresh / reuse / atomic-on-miss / clear-on-alloc / jit-donate cycle / dp-capacity boundaries. Base ReqToTokenPool tests unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
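The atomic-alloc pattern described in that commit can be sketched like this. All names (`HybridReqToTokenPoolSketch`, `kv_free`, `recurrent_free`) are illustrative stand-ins, not the actual sglang API; the KV pop stands in for `super().alloc()`:

```python
from typing import List, Optional

class HybridReqToTokenPoolSketch:
    """Illustrative sketch: atomic KV + recurrent slot allocation."""

    def __init__(self, size: int, recurrent_slots: int):
        self.kv_free = list(range(size))
        # Slot 0 is the dummy slot; allocatable slots start at 1.
        self.recurrent_free = list(range(1, recurrent_slots))
        self.mapping = {}  # req id -> recurrent slot; absent means dummy (0)

    def alloc(self, req_ids: List[str]) -> Optional[List[int]]:
        # Reqs that already hold a recurrent slot (e.g. the next chunk of a
        # chunked prefill) keep it; only newcomers need a fresh slot.
        need = [r for r in req_ids if r not in self.mapping]
        # 1) Pre-check both capacities so we never half-allocate.
        if len(need) > len(self.recurrent_free) or len(req_ids) > len(self.kv_free):
            return None
        # 2) KV allocation (stand-in for super().alloc()).
        kv = [self.kv_free.pop() for _ in req_ids]
        # 3) Batch-slice the new recurrent slots; a real implementation would
        #    also clear them with one batched clear-on-alloc call here.
        new_slots = self.recurrent_free[: len(need)]
        del self.recurrent_free[: len(need)]
        for r, s in zip(need, new_slots):
            self.mapping[r] = s
        return kv
```

The pre-check is what makes the operation atomic: a capacity miss returns before any list is mutated, so the KV free list is never touched on failure.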
The first cut paired the DP-sharded RecurrentStatePool from sgl-project#1031 with a single-list slot allocator copied from epic. With dp_size > 1 the buffer's first dim is sharded along the 'data' axis (each rank physically holds a distinct slot range), so a single global free list would hand out slots that cross DP rank boundaries; reads/writes at those slots would land in the wrong rank's local buffer view.

Switch the allocator to per-DP: maintain one free list per rank with LOCAL indices [1..slots_per_rank], and route alloc/free by req.dp_rank. Callers (prepare_for_extend / decode) iterate per-DP, so all reqs in a single alloc() call share the same dp_rank.

Tests updated: dp_size=1 cases are unchanged in semantics but now index into recurrent_free_slots[0]. The DP test class is rewritten with four per-rank cases (init local indexing, alloc routing, capacity-miss isolation, free routing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
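The per-DP free-list scheme from that commit can be sketched as below. Class and method names are hypothetical; only `recurrent_free_slots` and the local-index convention come from the commit text:

```python
class PerDPAllocatorSketch:
    """Illustrative sketch: one recurrent free list per DP rank."""

    def __init__(self, slots_per_rank: int, dp_size: int):
        # Each rank holds LOCAL indices [1..slots_per_rank]; local index 0
        # is reserved for the dummy slot.
        self.recurrent_free_slots = [
            list(range(1, slots_per_rank + 1)) for _ in range(dp_size)
        ]

    def alloc(self, dp_rank: int, n: int):
        # All reqs in one alloc() call share dp_rank, so one routed slice works.
        free = self.recurrent_free_slots[dp_rank]
        if n > len(free):
            # Capacity miss on this rank leaves every other rank untouched.
            return None
        taken, self.recurrent_free_slots[dp_rank] = free[:n], free[n:]
        return taken

    def free(self, dp_rank: int, local_slot: int):
        self.recurrent_free_slots[dp_rank].append(local_slot)
```

Routing by rank keeps every handed-out index inside that rank's physical slot range, which is exactly what the global free list violated.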
Summary
- HybridReqToTokenPool extending ReqToTokenPool to coordinate KV slot + recurrent state slot allocation atomically for hybrid linear-recurrent models (Kimi-Linear, Mamba)
- Companion to RecurrentStatePool (feat(mem_cache): Hybrid Memory Pool system (Step 1: RecurrentStatePool) #1031): host-side allocator state lives here so the buffer pool stays a pure pytree leaf safe for JIT donate
- Adds the Req.recurrent_pool_idx field as part of the contract

Stacking note
This is Step 2 of N for the hybrid memory pool stack (Step 1: #1031). The diff currently contains the #1031 commits as base; will rebase onto main once #1031 merges, leaving only the ~420-line Step 2 diff.
Design
- Reuse semantics follow the ReqToTokenPool.alloc(reqs) contract: a req holding recurrent_pool_idx (e.g. the next chunk of a chunked prefill) keeps its slot. No sticky-flag bookkeeping needed.
- Atomic alloc: pre-check recurrent capacity, then super().alloc, then batch-slice the new recurrent slots with a single clear_slot call (avoids per-slot JIT scatter overhead).
- free(req) is recurrent-first / KV-second (inverse of alloc), so a partially constructed req can never end up with a freed KV slot but a still-held recurrent slot.

Test plan
- Base ReqToTokenPool tests unaffected
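The recurrent-first / KV-second ordering from the Design section can be sketched as follows. The pool fields and req attributes (`rid`, `kv_slot`, `kv_free_slots`) are illustrative, not the actual sglang API:

```python
from types import SimpleNamespace

def free_req(pool, req):
    # Inverse of alloc: release the recurrent slot BEFORE the KV slot, so an
    # interruption between the two steps leaves the req still holding its KV
    # slot; a freed-KV-with-held-recurrent-slot state is never reachable.
    slot = pool.mapping.pop(req.rid, 0)
    if slot != 0:                                # 0 = dummy slot, nothing held
        pool.recurrent_free_slots.append(slot)   # 1) recurrent slot first
    pool.kv_free_slots.append(req.kv_slot)       # 2) KV slot second

pool = SimpleNamespace(mapping={"r1": 3}, recurrent_free_slots=[], kv_free_slots=[])
free_req(pool, SimpleNamespace(rid="r1", kv_slot=7))
```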