I'm trying to deploy Kimi-K2-Instruct on 4 nodes (each with 8×H100) using SGLang v0.5.8, with TP=8 and PP=4.
The server starts successfully, but after a few concurrent requests (e.g., parallel=3, number=15) it crashes with the following error: ValueError: req_to_token_pool memory leak detected! available_size=5, total_size=4
This causes the downstream client to receive incomplete responses (ClientPayloadError). The issue occurs even at low concurrency (parallel=3).
The server runs normally when concurrency is limited to parallel=2 (i.e., up to 2 concurrent requests), but crashes consistently at parallel=3 or higher.
I haven't yet tried upgrading to the latest main branch.
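Since the exact launch command was not captured above, here is a hypothetical sketch of a 4-node TP=8/PP=4 launch using SGLang's standard multi-node CLI flags; the model path, master address, and port are placeholders, and the last flag is the server-side equivalent of the parallel=2 ceiling that stays stable, as a possible stopgap:

```shell
# Hypothetical launch sketch (node-rank 0 shown) -- not the exact command
# from this report; adjust model path, address, and ranks per node.
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --tp-size 8 \
  --pp-size 4 \
  --nnodes 4 \
  --node-rank 0 \
  --dist-init-addr 10.0.0.1:5000 \
  --trust-remote-code \
  --max-running-requests 2
```

If the crash really only triggers at 3+ in-flight requests, capping `--max-running-requests` may keep the server up while the underlying pool bug is investigated.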
Logs
2026-02-25T12:04:45.069466991+08:00 stderr F [2026-02-25 12:04:45 PP0 TP0] Decode batch, #running-req: 1, #token: 44920, token usage: 0.03, cuda graph: True, gen throughput (token/s): 171.04, #queue-req: 0,
2026-02-25T12:04:45.283949727+08:00 stderr F [2026-02-25 12:04:45 PP0 TP6] Scheduler hit an exception: Traceback (most recent call last):
2026-02-25T12:04:45.28397665+08:00 stderr F File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2972, in run_scheduler_process
2026-02-25T12:04:45.28398023+08:00 stderr F scheduler.event_loop_pp()
2026-02-25T12:04:45.28398242+08:00 stderr F File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
2026-02-25T12:04:45.283984576+08:00 stderr F return func(*args, **kwargs)
2026-02-25T12:04:45.283986261+08:00 stderr F ^^^^^^^^^^^^^^^^^^^^^
2026-02-25T12:04:45.283988389+08:00 stderr F File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_pp_mixin.py", line 144, in event_loop_pp
2026-02-25T12:04:45.283992993+08:00 stderr F self.self_check_during_idle()
2026-02-25T12:04:45.283996624+08:00 stderr F File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 332, in self_check_during_idle
2026-02-25T12:04:45.283999938+08:00 stderr F self.check_memory()
2026-02-25T12:04:45.28400641+08:00 stderr F File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 251, in check_memory
2026-02-25T12:04:45.284008826+08:00 stderr F self._check_req_pool()
2026-02-25T12:04:45.284010453+08:00 stderr F File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 227, in _check_req_pool
2026-02-25T12:04:45.284015144+08:00 stderr F raise_error_or_warn(
2026-02-25T12:04:45.284018511+08:00 stderr F File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 3950, in raise_error_or_warn
2026-02-25T12:04:45.284023168+08:00 stderr F raise ValueError(message)
2026-02-25T12:04:45.284027092+08:00 stderr F ValueError: req_to_token_pool memory leak detected!available_size=5, total_size=4
2026-02-25T12:04:45.284030074+08:00 stderr F
2026-02-25T12:04:45.284033501+08:00 stderr F
……
2026-02-25T12:04:45.38174578+08:00 stderr F [2026-02-25 12:04:45] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
2026-02-25T12:04:45.436771108+08:00 stderr F [2026-02-25 12:04:45] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
2026-02-25T12:04:45.491214702+08:00 stderr F [2026-02-25 12:04:45] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
2026-02-25T12:04:45.551562056+08:00 stderr F [2026-02-25 12:04:45] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
2026-02-25T12:04:45.604836207+08:00 stderr F [2026-02-25 12:04:45] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
2026-02-25T12:04:45.653975892+08:00 stderr F [2026-02-25 12:04:45] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
2026-02-25T12:04:45.701907129+08:00 stderr F [2026-02-25 12:04:45] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
2026-02-25T12:04:45.751586072+08:00 stderr F [2026-02-25 12:04:45] ERROR: Traceback (most recent call last):
2026-02-25T12:04:45.751593718+08:00 stderr F File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
2026-02-25T12:04:45.751596438+08:00 stderr F return runner.run(main)
2026-02-25T12:04:45.751598201+08:00 stderr F ^^^^^^^^^^^^^^^^
2026-02-25T12:04:45.751599607+08:00 stderr F File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
2026-02-25T12:04:45.751601454+08:00 stderr F return self._loop.run_until_complete(task)
2026-02-25T12:04:45.751603127+08:00 stderr F ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-02-25T12:04:45.751605364+08:00 stderr F File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
2026-02-25T12:04:45.751608325+08:00 stderr F File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
2026-02-25T12:04:45.751611964+08:00 stderr F File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
2026-02-25T12:04:45.751614248+08:00 stderr F File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
2026-02-25T12:04:45.75161601+08:00 stderr F File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
2026-02-25T12:04:45.751617862+08:00 stderr F File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
2026-02-25T12:04:45.751620578+08:00 stderr F File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
2026-02-25T12:04:45.751623553+08:00 stderr F File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 2361, in running_phase_sigquit_handler
2026-02-25T12:04:45.751626654+08:00 stderr F kill_process_tree(os.getpid())
2026-02-25T12:04:45.751642312+08:00 stderr F File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 1094, in kill_process_tree
2026-02-25T12:04:45.751647253+08:00 stderr F sys.exit(0)
2026-02-25T12:04:45.751650532+08:00 stderr F SystemExit: 0
2026-02-25T12:04:45.751653395+08:00 stderr F
2026-02-25T12:04:45.751656805+08:00 stderr F During handling of the above exception, another exception occurred:
2026-02-25T12:04:45.751659905+08:00 stderr F
2026-02-25T12:04:45.75166308+08:00 stderr F Traceback (most recent call last):
2026-02-25T12:04:45.751667539+08:00 stderr F File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 701, in lifespan
2026-02-25T12:04:45.751670493+08:00 stderr F await receive()
2026-02-25T12:04:45.751674135+08:00 stderr F File "/usr/local/lib/python3.12/dist-packages/uvicorn/lifespan/on.py", line 137, in receive
2026-02-25T12:04:45.751677158+08:00 stderr F return await self.receive_queue.get()
2026-02-25T12:04:45.751680017+08:00 stderr F ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-02-25T12:04:45.751683413+08:00 stderr F File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
2026-02-25T12:04:45.751686422+08:00 stderr F await getter
2026-02-25T12:04:45.751689728+08:00 stderr F asyncio.exceptions.CancelledError
2026-02-25T12:04:45.751692509+08:00 stderr F
2026-02-25T12:04:45.754086061+08:00 stderr F [2026-02-25 12:04:45] ERROR: Exception in ASGI application
2026-02-25T12:04:45.754090594+08:00 stderr F Traceback (most recent call last):
2026-02-25T12:04:45.754092347+08:00 stderr F File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
2026-02-25T12:04:45.75409433+08:00 stderr F return runner.run(main)
2026-02-25T12:04:45.754096095+08:00 stderr F ^^^^^^^^^^^^^^^^
……
2026-02-25T12:04:45.756102603+08:00 stderr F File "/usr/lib/python3.12/asyncio/locks.py", line 212, in wait
2026-02-25T12:04:45.756112467+08:00 stderr F await fut
2026-02-25T12:04:45.756114278+08:00 stderr F asyncio.exceptions.CancelledError
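For what it's worth, the check that fires appears to be an invariant `available_size <= total_size` on the request-slot free list, and `available_size=5` with `total_size=4` is the signature of a slot being released twice. A toy model of that failure mode (purely illustrative, not SGLang's actual `req_to_token_pool` code):

```python
# Toy model of a request-slot pool with a free-list. Hypothetical and
# simplified -- NOT SGLang's actual req_to_token_pool implementation.
class ReqSlotPool:
    def __init__(self, total_size):
        self.total_size = total_size
        self.free_slots = list(range(total_size))  # all slots start free

    def alloc(self):
        return self.free_slots.pop()

    def free(self, slot):
        # Bug on purpose: no guard against releasing the same slot twice,
        # e.g. if an abort path and the normal finish path both free it.
        self.free_slots.append(slot)

    def available_size(self):
        return len(self.free_slots)


pool = ReqSlotPool(total_size=4)
slot = pool.alloc()
pool.free(slot)
pool.free(slot)  # double free: the free-list now holds 5 entries
# available_size=5 > total_size=4 -- the same shape as the error above
```

That would be consistent with the crash appearing only at parallel=3+: a race between two release paths needs enough concurrent requests to trigger.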
Evalscope logs
2026-02-25 13:56:27 - evalscope - ERROR: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/aiohttp/client_proto.py", line 144, in connection_lost
uncompleted = self._parser.feed_eof()
^^^^^^^^^^^^^^^^^^^^^^^
File "aiohttp/_http_parser.pyx", line 506, in aiohttp._http_parser.HttpParser.feed_eof
aiohttp.http_exceptions.TransferEncodingError: 400, message:
Not enough data to satisfy transfer length header.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/evalscope/perf/plugin/api/default_api.py", line 105, in process_request
async for chunk_bytes in response.content.iter_any():
File "/usr/local/lib/python3.12/dist-packages/aiohttp/streams.py", line 52, in __anext__
rv = await self.read_func()
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/aiohttp/streams.py", line 464, in readany
await self._wait("readany")
File "/usr/local/lib/python3.12/dist-packages/aiohttp/streams.py", line 371, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
……
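On the client side, the `ClientPayloadError` is just aiohttp reporting that the server closed the connection before finishing the chunked body. A generic retry wrapper like the following (hypothetical helper, not part of evalscope; in the aiohttp case `retryable` would be `(aiohttp.ClientPayloadError,)`) can keep a benchmark run alive across transient mid-stream disconnects:

```python
import asyncio


async def stream_with_retry(make_request, retryable=(ConnectionError,),
                            retries=2, delay=1.0):
    """Re-issue a streaming request whose body died mid-transfer.

    Hypothetical helper: `make_request` is any zero-arg coroutine factory
    that performs the full request and returns the collected result.
    """
    for attempt in range(retries + 1):
        try:
            return await make_request()
        except retryable:
            if attempt == retries:
                raise  # exhausted retries; surface the original error
            await asyncio.sleep(delay * (attempt + 1))  # linear backoff
```

This only masks the symptom, of course; the server-side pool crash is the real problem.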
Has anyone encountered a similar issue? Could this be a known bug in the memory pool management for distributed deployments? Any suggestions for workarounds or fixes would be greatly appreciated.
Environment
OS: Ubuntu 24.04.2 LTS
Python: 3.12.3
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 560.35.03
PyTorch: 2.9.1+cu129
sglang: 0.5.8
sgl_kernel: 0.3.21
flashinfer_python: 0.6.1
flashinfer_cubin: 0.6.1
flashinfer_jit_cache: 0.6.1+cu129
triton: 3.5.1
transformers: 4.57.1
Thanks!