
feat(mem_cache): Hybrid Memory Pool system (Step 2: HybridReqToTokenPool)#1033

Closed
Rodrian7 wants to merge 7 commits into sgl-project:main from Rodrian7:feat/hybrid-req-to-token-pool

Conversation


@Rodrian7 Rodrian7 commented May 7, 2026

Summary

  • Add HybridReqToTokenPool extending ReqToTokenPool to coordinate KV slot + recurrent state slot allocation atomically for hybrid linear-recurrent models (Kimi-Linear, Mamba)
  • Companion to RecurrentStatePool (feat(mem_cache): Hybrid Memory Pool system (Step 1: RecurrentStatePool) #1031): host-side allocator state lives here so the buffer pool stays a pure pytree leaf safe for JIT donate
  • Add Req.recurrent_pool_idx field as part of the contract

Stacking note

This is Step 2 of N for the hybrid memory pool stack (Step 1: #1031). The diff currently contains the #1031 commits as base; will rebase onto main once #1031 merges, leaving only the ~420-line Step 2 diff.

Design

  • Reuse semantics align with the parent ReqToTokenPool.alloc(reqs) contract: a req holding recurrent_pool_idx (e.g. the next chunk of a chunked prefill) keeps its slot. No sticky-flag bookkeeping needed.
  • Atomic alloc: pre-check recurrent capacity, then super().alloc, then batch-slice the new recurrent slots with a single clear_slot call (avoids per-slot JIT scatter overhead).
  • free(req) is recurrent-first / KV-second (inverse of alloc), so a partially constructed req can never end up with a freed KV slot but a still-held recurrent slot.
  • Class is intentionally NOT registered as a pytree node — host-side allocator state must not enter a JIT-donated pytree (would otherwise reset on tree_unflatten).
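The alloc/free contract above can be sketched in plain Python. This is a minimal, hypothetical stand-in (all names invented for illustration) — the real pool wraps `ReqToTokenPool` and a device-side `RecurrentStatePool`; here both are modeled as host-side free lists so only the ordering and atomicity logic is shown:

```python
# Minimal sketch of the atomic-alloc / inverse-order-free contract.
# Hypothetical class and field names; not the actual implementation.

class HybridPoolSketch:
    def __init__(self, kv_slots, recurrent_slots):
        self.kv_free = list(range(kv_slots))
        # Slot 0 is reserved as the dummy slot, so usable slots start at 1.
        self.rec_free = list(range(1, recurrent_slots))
        self.req_to_rec = {}  # req id -> held recurrent slot (reuse contract)

    def alloc(self, req_ids):
        # Reuse: a req already holding a recurrent slot keeps it
        # (e.g. the next chunk of a chunked prefill).
        need_rec = [r for r in req_ids if r not in self.req_to_rec]
        # Atomicity: pre-check recurrent capacity BEFORE touching KV slots,
        # so a miss leaves no partially allocated state behind.
        if len(need_rec) > len(self.rec_free) or len(req_ids) > len(self.kv_free):
            return None
        kv = [self.kv_free.pop() for _ in req_ids]
        for r in need_rec:
            self.req_to_rec[r] = self.rec_free.pop()
        # A real pool would clear all newly assigned recurrent slots with
        # ONE batched clear_slot call here, not per-slot JIT scatters.
        return kv

    def free(self, req_id, kv_slot):
        # Recurrent-first, KV-second: the exact inverse of alloc order, so a
        # req can never hold a recurrent slot while its KV slot is freed.
        self.rec_free.append(self.req_to_rec.pop(req_id))
        self.kv_free.append(kv_slot)
```

Note the class carries only plain Python state, matching the design point that it must stay outside any JIT-donated pytree.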

Test plan

  • Unit tests: 11 cases + 7 subTests covering fresh / reuse / atomic-on-miss / clear-on-alloc / jit-donate-cycle / dp-capacity boundaries
  • Base ReqToTokenPool tests unaffected
  • Integration tests to follow once the remaining Steps land

JamesBrianD and others added 6 commits May 6, 2026 18:19
Migrated from epic/support_kimi_linear with DP support added.
Pure buffer pool for linear recurrent layers (KDA/Mamba/GDN).

Key changes vs epic:
- max_num_reqs → size (align with upstream sglang MambaPool)
- dp_size param with slot dim sharded on P("data", ...)
- total_slots = ceil_to(size+1, dp_size) for DP divisibility
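The `total_slots = ceil_to(size+1, dp_size)` line can be illustrated with a one-liner; `ceil_to` here is a sketch of what the commit describes (rounding up to a multiple so the slot dimension shards evenly across DP ranks), not necessarily the repo's helper:

```python
# Sketch of the DP-divisibility rounding described in the commit message.
def ceil_to(x, multiple):
    # Round x up to the nearest multiple (ceiling division, then scale back).
    return -(-x // multiple) * multiple

# e.g. size=13, dp_size=4: total_slots = ceil_to(14, 4) = 16,
# giving 4 slots per rank with the +1 covering the dummy slot.
```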
…alloc

Coordinates KV slot + recurrent state slot allocation atomically for
hybrid linear-recurrent models (Kimi-Linear, Mamba). Companion to
RecurrentStatePool from sgl-project#1031.

Design:
- Host-side allocator state (recurrent_free_slots, mapping) lives on
  this class so RecurrentStatePool stays a pure pytree leaf safe for
  JIT donate. The class is intentionally NOT registered as a pytree
  node (would otherwise reset on tree_unflatten).
- Reuse semantics align with the parent ReqToTokenPool contract: a req
  holding recurrent_pool_idx (e.g. the next chunk of a chunked prefill)
  keeps its slot; no sticky-flag bookkeeping needed.
- Atomic alloc: pre-check recurrent capacity, then super().alloc, then
  batch-slice the new recurrent slots with a single clear-on-alloc call.
- Slot 0 of RecurrentStatePool is the dummy slot; mapping defaults to 0
  so an unallocated req lands on the dummy slot.

Also adds Req.recurrent_pool_idx field as part of the contract.

Tests cover fresh / reuse / atomic-on-miss / clear-on-alloc / jit-donate
cycle / dp-capacity boundaries. Base ReqToTokenPool tests unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Rodrian7 Rodrian7 closed this May 7, 2026
@Rodrian7 Rodrian7 reopened this May 7, 2026
The first cut paired the DP-sharded RecurrentStatePool from sgl-project#1031 with a
single-list slot allocator copied from epic. With dp_size > 1 the buffer's
first dim is sharded along the 'data' axis (each rank physically holds a
distinct slot range), so a single global free list would hand out slots
that cross DP rank boundaries — read/write at those slots would land in
the wrong rank's local buffer view.

Switch the allocator to per-DP: maintain one free list per rank with
LOCAL indices [1..slots_per_rank], and route alloc/free by req.dp_rank.
Callers (prepare_for_extend / decode) iterate per-DP, so all reqs in a
single alloc() call share the same dp_rank.

Tests updated: dp_size=1 cases unchanged in semantics but now index into
recurrent_free_slots[0]. DP test class rewritten with four per-rank cases
(init local indexing, alloc routing, capacity miss isolation, free
routing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
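The per-DP routing fix in this commit can be sketched as follows. All names are hypothetical; the point is one free list of LOCAL indices per rank, with alloc/free routed by `dp_rank` so handed-out slots never cross a rank's sharded buffer range:

```python
# Sketch of the per-DP slot allocator described above (names assumed).

class PerRankAllocator:
    def __init__(self, slots_per_rank, dp_size):
        # LOCAL indices 1..slots_per_rank-1 per rank; local slot 0 is the
        # dummy slot on every rank's shard of the buffer.
        self.free = [list(range(1, slots_per_rank)) for _ in range(dp_size)]

    def alloc(self, dp_rank, n):
        # Callers iterate per-DP, so every req in one alloc() call shares
        # dp_rank; routing here keeps indices valid for that rank's shard.
        if n > len(self.free[dp_rank]):
            return None  # capacity miss is isolated to this rank
        return [self.free[dp_rank].pop() for _ in range(n)]

    def free_slots(self, dp_rank, local_slots):
        self.free[dp_rank].extend(local_slots)
```

With a single global free list, a slot index handed to rank 1 could fall in rank 0's physical range of the `P("data", ...)`-sharded buffer; per-rank local lists make that impossible by construction.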
@Rodrian7 Rodrian7 closed this May 7, 2026