feat(mem_cache): Hybrid Memory Pool system (Step 2: HybridReqToTokenPool)#1033
Closed
Rodrian7 wants to merge 7 commits into
Conversation
Migrated from epic/support_kimi_linear with DP support added.
Pure buffer pool for linear recurrent layers (KDA/Mamba/GDN).
Key changes vs epic:
- max_num_reqs → size (align with upstream sglang MambaPool)
- dp_size param with slot dim sharded on P("data", ...)
- total_slots = ceil_to(size+1, dp_size) for DP divisibility
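The rounding rule above can be sketched as follows (`ceil_to` is the helper name used in the commit text; its body here is an assumption based on that name):

```python
import math

def ceil_to(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return math.ceil(value / multiple) * multiple

# total_slots = ceil_to(size + 1, dp_size): the +1 accounts for the dummy
# slot, and rounding up keeps the sharded first dim divisible by dp_size.
size, dp_size = 10, 4
total_slots = ceil_to(size + 1, dp_size)  # -> 12
```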
…alloc

Coordinates KV slot + recurrent state slot allocation atomically for hybrid linear-recurrent models (Kimi-Linear, Mamba). Companion to RecurrentStatePool from sgl-project#1031.

Design:
- Host-side allocator state (recurrent_free_slots, mapping) lives on this class so RecurrentStatePool stays a pure pytree leaf safe for JIT donate. The class is intentionally NOT registered as a pytree node (it would otherwise reset on tree_unflatten).
- Reuse semantics align with the parent ReqToTokenPool contract: a req holding recurrent_pool_idx (e.g. the next chunk of a chunked prefill) keeps its slot; no sticky-flag bookkeeping needed.
- Atomic alloc: pre-check recurrent capacity, then super().alloc, then batch-slice the new recurrent slots with a single clear-on-alloc call.
- Slot 0 of RecurrentStatePool is the dummy slot; mapping defaults to 0, so an unallocated req lands on the dummy slot.

Also adds the Req.recurrent_pool_idx field as part of the contract.

Tests cover fresh / reuse / atomic-on-miss / clear-on-alloc / jit-donate cycle / dp-capacity boundaries. Base ReqToTokenPool tests unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
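The atomic-alloc pattern described in that commit can be sketched like this. All names (`HybridReqToTokenPoolSketch`, `kv_free`, `recurrent_free`) are illustrative stand-ins, not the actual sglang API; the KV pop stands in for `super().alloc()`:

```python
from typing import List, Optional

class HybridReqToTokenPoolSketch:
    """Illustrative sketch: atomic KV + recurrent slot allocation."""

    def __init__(self, size: int, recurrent_slots: int):
        self.kv_free = list(range(size))
        # Slot 0 is the dummy slot; allocatable slots start at 1.
        self.recurrent_free = list(range(1, recurrent_slots))
        self.mapping = {}  # req id -> recurrent slot; absent means dummy (0)

    def alloc(self, req_ids: List[str]) -> Optional[List[int]]:
        # Reqs that already hold a recurrent slot (e.g. the next chunk of a
        # chunked prefill) keep it; only newcomers need a fresh slot.
        need = [r for r in req_ids if r not in self.mapping]
        # 1) Pre-check both capacities so we never half-allocate.
        if len(need) > len(self.recurrent_free) or len(req_ids) > len(self.kv_free):
            return None
        # 2) KV allocation (stand-in for super().alloc()).
        kv = [self.kv_free.pop() for _ in req_ids]
        # 3) Batch-slice the new recurrent slots; a real implementation would
        #    also clear them with one batched clear-on-alloc call here.
        new_slots = self.recurrent_free[: len(need)]
        del self.recurrent_free[: len(need)]
        for r, s in zip(need, new_slots):
            self.mapping[r] = s
        return kv
```

The pre-check is what makes the operation atomic: a capacity miss returns before any list is mutated, so the KV free list is never touched on failure.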
The first cut paired the DP-sharded RecurrentStatePool from sgl-project#1031 with a single-list slot allocator copied from epic. With dp_size > 1 the buffer's first dim is sharded along the 'data' axis (each rank physically holds a distinct slot range), so a single global free list would hand out slots that cross DP rank boundaries; reads/writes at those slots would land in the wrong rank's local buffer view.

Switch the allocator to per-DP: maintain one free list per rank with LOCAL indices [1..slots_per_rank], and route alloc/free by req.dp_rank. Callers (prepare_for_extend / decode) iterate per-DP, so all reqs in a single alloc() call share the same dp_rank.

Tests updated: dp_size=1 cases are unchanged in semantics but now index into recurrent_free_slots[0]. The DP test class is rewritten with four per-rank cases (init local indexing, alloc routing, capacity-miss isolation, free routing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
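The per-DP free-list scheme from that commit can be sketched as below. Class and method names are hypothetical; only `recurrent_free_slots` and the local-index convention come from the commit text:

```python
class PerDPAllocatorSketch:
    """Illustrative sketch: one recurrent free list per DP rank."""

    def __init__(self, slots_per_rank: int, dp_size: int):
        # Each rank holds LOCAL indices [1..slots_per_rank]; local index 0
        # is reserved for the dummy slot.
        self.recurrent_free_slots = [
            list(range(1, slots_per_rank + 1)) for _ in range(dp_size)
        ]

    def alloc(self, dp_rank: int, n: int):
        # All reqs in one alloc() call share dp_rank, so one routed slice works.
        free = self.recurrent_free_slots[dp_rank]
        if n > len(free):
            # Capacity miss on this rank leaves every other rank untouched.
            return None
        taken, self.recurrent_free_slots[dp_rank] = free[:n], free[n:]
        return taken

    def free(self, dp_rank: int, local_slot: int):
        self.recurrent_free_slots[dp_rank].append(local_slot)
```

Routing by rank keeps every handed-out index inside that rank's physical slot range, which is exactly what the global free list violated.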
Summary
- HybridReqToTokenPool extending ReqToTokenPool to coordinate KV slot + recurrent state slot allocation atomically for hybrid linear-recurrent models (Kimi-Linear, Mamba)
- Companion to RecurrentStatePool (feat(mem_cache): Hybrid Memory Pool system (Step 1: RecurrentStatePool) #1031): host-side allocator state lives here so the buffer pool stays a pure pytree leaf safe for JIT donate
- Adds the Req.recurrent_pool_idx field as part of the contract

Stacking note
This is Step 2 of N for the hybrid memory pool stack (Step 1: #1031). The diff currently contains the #1031 commits as base; will rebase onto main once #1031 merges, leaving only the ~420-line Step 2 diff.
Design
- Reuse semantics follow the ReqToTokenPool.alloc(reqs) contract: a req holding recurrent_pool_idx (e.g. the next chunk of a chunked prefill) keeps its slot. No sticky-flag bookkeeping needed.
- Atomic alloc: pre-check recurrent capacity, then super().alloc, then batch-slice the new recurrent slots with a single clear_slot call (avoids per-slot JIT scatter overhead).
- free(req) is recurrent-first / KV-second (inverse of alloc), so a partially constructed req can never end up with a freed KV slot but a still-held recurrent slot.

Test plan
- Base ReqToTokenPool tests unaffected
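The recurrent-first / KV-second ordering from the Design section can be sketched as follows. The pool fields and req attributes (`rid`, `kv_slot`, `kv_free_slots`) are illustrative, not the actual sglang API:

```python
from types import SimpleNamespace

def free_req(pool, req):
    # Inverse of alloc: release the recurrent slot BEFORE the KV slot, so an
    # interruption between the two steps leaves the req still holding its KV
    # slot; a freed-KV-with-held-recurrent-slot state is never reachable.
    slot = pool.mapping.pop(req.rid, 0)
    if slot != 0:                                # 0 = dummy slot, nothing held
        pool.recurrent_free_slots.append(slot)   # 1) recurrent slot first
    pool.kv_free_slots.append(req.kv_slot)       # 2) KV slot second

pool = SimpleNamespace(mapping={"r1": 3}, recurrent_free_slots=[], kv_free_slots=[])
free_req(pool, SimpleNamespace(rid="r1", kv_slot=7))
```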