Support wake up/sleep in router #35
Conversation
Summary of Changes (Gemini Code Assist)

This pull request enhances the diffusion router by integrating GPU sleep and wake functionality as a core control-plane feature. It allows the router to broadcast sleep/wake commands to workers, maintain an accurate view of which workers are sleeping, and adjust routing decisions to prevent sending inference requests to inactive GPUs. This ensures efficient resource utilization and prevents false positives in worker health monitoring during intentional memory release.
It's just a first version without tests; unit tests will be added soon.
Code Review
This pull request introduces functionality for workers to sleep and wake up, managed by the router, including new API endpoints and state tracking. However, it critically lacks essential security controls, notably the absence of authentication on these sensitive routes and the insecure forwarding of all request headers to potentially untrusted workers. The implementation is also incomplete, as the core logic for routing and health-checking does not properly exclude sleeping workers, which could lead to service instability and Denial of Service. Additionally, there's a state leak when workers are deregistered, and automated tests for this new functionality are missing. Consider enhancing observability by including the sleeping status in the _build_worker_payload.
```python
self.app.post("/release_memory_occupation")(self.release_memory_occupation)
self.app.post("/resume_memory_occupation")(self.resume_memory_occupation)
```
The new endpoints /release_memory_occupation and /resume_memory_occupation lack any authentication or authorization checks. Since these endpoints perform administrative actions (broadcasting sleep/wake commands to all workers and modifying the router's internal state), exposing them without access control allows any user with network access to the router to disrupt the service by putting all workers to sleep. This is a Missing Function-Level Access Control vulnerability.
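One possible mitigation is to gate both admin routes behind a shared token, sketched below. The token source and the bearer scheme are assumptions for illustration, not part of this PR; a real deployment should plug into its existing auth stack.

```python
import hmac


def is_admin_request(headers: dict, expected_token: str) -> bool:
    """Return True only when the Authorization header carries the expected
    bearer token. hmac.compare_digest gives a constant-time comparison."""
    if not expected_token:
        # Fail closed: if no token is configured, reject all admin calls.
        return False
    supplied = headers.get("authorization", "")
    return hmac.compare_digest(supplied, f"Bearer {expected_token}")
```

The router could call this at the top of both handlers and return a 401 `JSONResponse` when it fails.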
```python
headers = dict(request.headers)
headers.pop("content-length", None)
headers.setdefault("content-type", "application/json")
```
The release_memory_occupation and resume_memory_occupation methods insecurely forward all request headers to all healthy workers without sanitization. This poses a risk of sensitive headers (e.g., Authorization, Cookie) being leaked to malicious workers, especially given the unauthenticated /workers endpoint. It is crucial to only forward necessary headers or use an allow-list. Additionally, these methods are nearly identical; refactoring their common logic into a private helper method would significantly improve maintainability and reduce code duplication.
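A minimal sketch of the allow-list approach, assuming a small fixed set of forwardable headers (the exact set is a guess, not something this PR defines):

```python
# Headers that workers legitimately need; everything else (Authorization,
# Cookie, ...) is dropped. The set itself is an assumption for illustration.
FORWARDABLE_HEADERS = {"content-type", "accept", "x-request-id"}


def sanitize_headers(incoming: dict) -> dict:
    """Keep only allow-listed headers, normalized to lowercase names."""
    kept = {
        name.lower(): value
        for name, value in incoming.items()
        if name.lower() in FORWARDABLE_HEADERS
    }
    kept.setdefault("content-type", "application/json")
    return kept
```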
```python
async def release_memory_occupation(self, request: Request):
    """Broadcast sleep to all healthy workers and mark them as sleeping on success."""
    healthy_workers = [
        url for url in self.worker_request_counts if url not in self.dead_workers
    ]
    if not healthy_workers:
        return JSONResponse(
            status_code=503,
            content={"error": "No healthy workers available in the pool"},
        )

    body = await request.body()
    headers = dict(request.headers)
    headers.pop("content-length", None)
    headers.setdefault("content-type", "application/json")

    results = await self._broadcast_to_workers(
        "release_memory_occupation", body, headers
    )

    for item in results:
        if item.get("status_code") == 200:
            self.sleeping_workers.add(item["worker_url"])

    return JSONResponse(content={"results": results})

async def resume_memory_occupation(self, request: Request):
    """Broadcast wake to all healthy workers and unmark sleeping on success."""
    healthy_workers = [
        url for url in self.worker_request_counts if url not in self.dead_workers
    ]
    if not healthy_workers:
        return JSONResponse(
            status_code=503,
            content={"error": "No healthy workers available in the pool"},
        )

    body = await request.body()
    headers = dict(request.headers)
    headers.pop("content-length", None)
    headers.setdefault("content-type", "application/json")

    results = await self._broadcast_to_workers(
        "resume_memory_occupation", body, headers
    )

    for item in results:
        if item.get("status_code") == 200:
            self.sleeping_workers.discard(item["worker_url"])
            # Reset health failure counter on successful wake:
            # waking is an explicit recovery point and should not inherit failures
            # accumulated during intentional sleep.
```
The PR introduces a sleeping_workers state to track workers that have released memory, but this state is never actually used to gate routing or health checks in the provided code. Specifically, _select_worker_by_routing and _health_check_loop still only check dead_workers. This means inference traffic will continue to be sent to workers that are supposed to be 'sleeping', which likely leads to errors or crashes on the workers, causing a Denial of Service. The implementation is incomplete and fails to achieve its stated security/stability goal.
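The fix the review asks for amounts to filtering candidates on both sets wherever the router chooses a worker. A sketch (the helper name is hypothetical; `dead_workers` and `sleeping_workers` follow the PR):

```python
def routable_workers(all_workers, dead_workers, sleeping_workers):
    """A worker is eligible for inference traffic only when it is neither
    dead nor sleeping; _select_worker_by_routing and the health-check loop
    would both draw candidates from this filtered list."""
    return [
        url
        for url in all_workers
        if url not in dead_workers and url not in sleeping_workers
    ]
```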
@zhaochenyang20 done. Right now wake up/sleep applies to all workers in the router; I am not sure whether I should also support sleeping/waking a specific subset of workers.
Overall Assessment

Overall rating: Request Changes. The core design idea is sound (an independent `sleeping_workers` set), but several blockers remain.

P0 Blockers

- The health check loop still only filters `dead_workers`; sleeping workers are not excluded, so they keep accumulating health-check failures.
- Deregistering a worker does not clean up `sleeping_workers`, leaving a stale entry that a re-registered worker with the same URL would inherit.
- Sleeping workers are still considered video-capable.
- Requests mapped to a sleeping worker via `video_id` will still be forwarded.
- If the health check already marked the worker dead before resume, the wake path does not recover it.

P2 Maintainability / AI-Generated Code Detection

The code shows clear signs of AI-assisted generation (patterned copy-paste between the two handlers).

Conclusion

- Must fix (P0): 5 blockers (health check loop, deregister, video capability, `video_id` mapping, dead-before-resume recovery).
- Strongly recommended (P2): extract a shared broadcast helper method; expose an `is_sleeping` flag.
- Additional tests (P4): empty-pool 503, health check excluding sleeping workers, all workers …
Check the new comment in the PR summary about the Worker State Model & Transitions.
Summary
This PR integrates GPU sleep / wake into the diffusion router as a first-class control-plane feature.
The router now broadcasts sleep/wake requests to all healthy workers, tracks worker sleep state locally, and gates routing to prevent inference traffic from being sent to sleeping workers.
Key Changes
- Add router-level APIs: `POST /release_memory_occupation` and `POST /resume_memory_occupation`
- Broadcast sleep/wake to all non-dead workers using the existing `_broadcast_to_workers`
- Track sleeping workers separately from dead workers
- Exclude sleeping workers from routing decisions
- Exclude sleeping workers from health check failure accumulation
- Reset health failure counters on successful wake
- Determine success from the HTTP status code (200), not response body fields
Behavior
- After `release_memory_occupation`, inference requests are no longer routed to sleeping workers
- After `resume_memory_occupation`, routing and inference resume normally

Testing
Manually tested end-to-end:
Test scripts:
- Launch router
- Generation
- Sleep
- Wake up
Worker State Model & Transitions
This PR introduces a clear and minimal worker state model to support GPU sleep / wake without conflating intentional sleep with worker failure.
States
Each worker can independently be in the following orthogonal states:
Alive (default)
Registered worker, eligible for routing and health checks.
Sleeping (`sleeping_workers`)

The worker has intentionally released GPU memory via `release_memory_occupation`. Sleeping is not a failure.
Dead (`dead_workers`)

The worker has failed consecutive health checks and is quarantined from routing.
State Transitions
1. Sleep (`release_memory_occupation`)

Rules:
- Only non-dead workers are eligible to enter sleeping.
- Sleeping workers are excluded from routing and from health-check failure accumulation.
- A failed sleep request does not change state.
2. Wake (`resume_memory_occupation`)

Rules:
- Wake requests are sent only to sleeping workers.
- On successful wake, the worker is removed from `sleeping_workers`, and from `dead_workers` if it was previously marked dead.

This allows a worker that was incorrectly marked dead during sleep (e.g. due to missed health checks) to be fully recovered.
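The wake-side bookkeeping above can be sketched as a plain function over the router's state. The set names follow the PR; `health_check_failures` is an assumed name for the failure counter, not confirmed by the diff.

```python
def on_wake_success(router, url):
    """Apply the wake transition: clear sleeping, recover a mis-marked
    dead worker, and reset its health-check failure counter."""
    router.sleeping_workers.discard(url)
    router.dead_workers.discard(url)        # recover if marked dead during sleep
    router.health_check_failures[url] = 0   # explicit recovery point
```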
3. Health Check Failure

Rules:
- A worker that fails consecutive health checks is marked dead.
- Sleeping workers do not accumulate health-check failures.

4. Deregistration
Rules:
- Deregistration clears all of the worker's runtime state, including its request count and its `dead_workers` and `sleeping_workers` membership.
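A sketch of the cleanup rule, written as a standalone function over the router's state (the function name is an assumption; the attribute names follow the PR):

```python
def deregister_worker(router, url):
    """Drop every piece of per-worker runtime state so that re-registering
    the same URL starts from a clean slate (no stale sleeping/dead entries)."""
    router.worker_request_counts.pop(url, None)
    router.dead_workers.discard(url)
    router.sleeping_workers.discard(url)
```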
Behavioral Guarantees
Data Plane (Generation / Routing)
Control Plane (Sleep / Wake)
Health Plane
Design Rationale
This design explicitly separates intentional unavailability (sleeping) from failure (dead).

By keeping these states orthogonal, a worker's sleep status never interferes with its failure accounting, and each state can be entered and cleared independently.

No additional intermediate or "pending" states are introduced, keeping the model minimal and easy to reason about.
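Since the two sets are orthogonal, a worker's effective state can be derived on demand rather than stored. A small illustrative helper (the function, and the choice to report dead before sleeping, are assumptions for display purposes):

```python
def worker_state(url, dead_workers, sleeping_workers):
    """Classify a registered worker from the two orthogonal sets; 'dead'
    is reported first since a dead worker is unroutable regardless of sleep."""
    if url in dead_workers:
        return "dead"
    if url in sleeping_workers:
        return "sleeping"
    return "alive"
```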