Support wake up/sleep in router #35
Conversation
Summary of Changes (Gemini Code Assist)

This pull request enhances the diffusion router by integrating GPU sleep and wake functionality as a core control-plane feature. It allows the router to broadcast sleep/wake commands to workers, maintain an accurate view of which workers are sleeping, and adjust routing decisions to prevent sending inference requests to inactive GPUs. This ensures efficient resource utilization and prevents false positives in worker health monitoring during intentional memory release.
It's just a first version without tests; unit tests will be added soon.
Code Review
This pull request introduces functionality for workers to sleep and wake up, managed by the router, including new API endpoints and state tracking. However, it critically lacks essential security controls, notably the absence of authentication on these sensitive routes and the insecure forwarding of all request headers to potentially untrusted workers. The implementation is also incomplete, as the core logic for routing and health-checking does not properly exclude sleeping workers, which could lead to service instability and Denial of Service. Additionally, there's a state leak when workers are deregistered, and automated tests for this new functionality are missing. Consider enhancing observability by including the sleeping status in the _build_worker_payload.
```python
self.app.post("/release_memory_occupation")(self.release_memory_occupation)
self.app.post("/resume_memory_occupation")(self.resume_memory_occupation)
```
The new endpoints /release_memory_occupation and /resume_memory_occupation lack any authentication or authorization checks. Since these endpoints perform administrative actions (broadcasting sleep/wake commands to all workers and modifying the router's internal state), exposing them without access control allows any user with network access to the router to disrupt the service by putting all workers to sleep. This is a Missing Function-Level Access Control vulnerability.
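One possible mitigation is to gate both admin routes behind a shared token, sketched below. The token source and the bearer scheme are assumptions for illustration, not part of this PR; a real deployment should plug into its existing auth stack.

```python
import hmac


def is_admin_request(headers: dict, expected_token: str) -> bool:
    """Return True only when the Authorization header carries the expected
    bearer token. hmac.compare_digest gives a constant-time comparison."""
    if not expected_token:
        # Fail closed: if no token is configured, reject all admin calls.
        return False
    supplied = headers.get("authorization", "")
    return hmac.compare_digest(supplied, f"Bearer {expected_token}")
```

The router could call this at the top of both handlers and return a 401 `JSONResponse` when it fails.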
```python
headers = dict(request.headers)
headers.pop("content-length", None)
headers.setdefault("content-type", "application/json")
```
The release_memory_occupation and resume_memory_occupation methods insecurely forward all request headers to all healthy workers without sanitization. This poses a risk of sensitive headers (e.g., Authorization, Cookie) being leaked to malicious workers, especially given the unauthenticated /workers endpoint. It is crucial to only forward necessary headers or use an allow-list. Additionally, these methods are nearly identical; refactoring their common logic into a private helper method would significantly improve maintainability and reduce code duplication.
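A minimal sketch of the allow-list approach, assuming a small fixed set of forwardable headers (the exact set is a guess, not something this PR defines):

```python
# Headers that workers legitimately need; everything else (Authorization,
# Cookie, ...) is dropped. The set itself is an assumption for illustration.
FORWARDABLE_HEADERS = {"content-type", "accept", "x-request-id"}


def sanitize_headers(incoming: dict) -> dict:
    """Keep only allow-listed headers, normalized to lowercase names."""
    kept = {
        name.lower(): value
        for name, value in incoming.items()
        if name.lower() in FORWARDABLE_HEADERS
    }
    kept.setdefault("content-type", "application/json")
    return kept
```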
```python
async def release_memory_occupation(self, request: Request):
    """Broadcast sleep to all healthy workers and mark them as sleeping on success."""
    healthy_workers = [
        url for url in self.worker_request_counts if url not in self.dead_workers
    ]
    if not healthy_workers:
        return JSONResponse(
            status_code=503,
            content={"error": "No healthy workers available in the pool"},
        )

    body = await request.body()
    headers = dict(request.headers)
    headers.pop("content-length", None)
    headers.setdefault("content-type", "application/json")

    results = await self._broadcast_to_workers(
        "release_memory_occupation", body, headers
    )

    for item in results:
        if item.get("status_code") == 200:
            self.sleeping_workers.add(item["worker_url"])

    return JSONResponse(content={"results": results})

async def resume_memory_occupation(self, request: Request):
    """Broadcast wake to all healthy workers and unmark sleeping on success."""
    healthy_workers = [
        url for url in self.worker_request_counts if url not in self.dead_workers
    ]
    if not healthy_workers:
        return JSONResponse(
            status_code=503,
            content={"error": "No healthy workers available in the pool"},
        )

    body = await request.body()
    headers = dict(request.headers)
    headers.pop("content-length", None)
    headers.setdefault("content-type", "application/json")

    results = await self._broadcast_to_workers(
        "resume_memory_occupation", body, headers
    )

    for item in results:
        if item.get("status_code") == 200:
            self.sleeping_workers.discard(item["worker_url"])
            # Reset health failure counter on successful wake:
            # waking is an explicit recovery point and should not inherit failures
            # accumulated during intentional sleep.
```
The PR introduces a sleeping_workers state to track workers that have released memory, but this state is never actually used to gate routing or health checks in the provided code. Specifically, _select_worker_by_routing and _health_check_loop still only check dead_workers. This means inference traffic will continue to be sent to workers that are supposed to be 'sleeping', which likely leads to errors or crashes on the workers, causing a Denial of Service. The implementation is incomplete and fails to achieve its stated security/stability goal.
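The fix the review asks for amounts to filtering candidates on both sets wherever the router chooses a worker. A sketch (the helper name is hypothetical; `dead_workers` and `sleeping_workers` follow the PR):

```python
def routable_workers(all_workers, dead_workers, sleeping_workers):
    """A worker is eligible for inference traffic only when it is neither
    dead nor sleeping; _select_worker_by_routing and the health-check loop
    would both draw candidates from this filtered list."""
    return [
        url
        for url in all_workers
        if url not in dead_workers and url not in sleeping_workers
    ]
```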
@zhaochenyang20 done. Right now wake up/sleep applies to all workers in the router; I am not sure whether I should also support sleeping/waking a specific subset of workers.
Overall Assessment

Overall rating: Request Changes. The core design idea is sound (an independent `sleeping_workers` set), but several blockers remain.

P0 Blockers

- The health check loop still only filters `dead_workers`; sleeping workers are not excluded, so they keep accumulating health-check failures.
- Deregistering a worker does not clean up `sleeping_workers`, leaving a stale entry that a re-registered worker with the same URL would inherit.
- Sleeping workers are still considered video-capable.
- Requests mapped to a sleeping worker via `video_id` will still be forwarded.
- If the health check already marked the worker dead before resume, the wake path does not recover it.

P2 Maintainability / AI-Generated Code Detection

The code shows clear signs of AI-assisted generation (patterned copy-paste between the two handlers).

Conclusion

- Must fix (P0): 5 blockers (health check loop, deregister, video capability, `video_id` mapping, dead-before-resume recovery).
- Strongly recommended (P2): extract a shared broadcast helper method; expose an `is_sleeping` flag.
- Additional tests (P4): empty-pool 503, health check excluding sleeping workers, all workers …
Check the new comment in the PR summary about the Worker State Model & Transitions.
Summary
This PR integrates GPU sleep / wake into the diffusion router as a first-class control-plane feature.
The router now broadcasts sleep/wake requests to all healthy workers, tracks worker sleep state locally, and gates routing to prevent inference traffic from being sent to sleeping workers.
Key Changes
- Add router-level APIs: `POST /release_memory_occupation` and `POST /resume_memory_occupation`
- Broadcast sleep/wake to all non-dead workers using the existing `_broadcast_to_workers`
- Track sleeping workers separately from dead workers
- Exclude sleeping workers from routing decisions
- Exclude sleeping workers from health check failure accumulation
- Reset health failure counters on successful wake
- Determine success from the HTTP status code (200), not response body fields
Behavior
- After `release_memory_occupation`, inference requests are no longer routed to sleeping workers
- After `resume_memory_occupation`, routing and inference resume normally

Testing
Manually tested end-to-end:
Test scripts:
- Launch router
- Generation
- Sleep
- Wake up
Worker State Model & Transitions
This PR introduces a clear and minimal worker state model to support GPU sleep / wake without conflating intentional sleep with worker failure.
States
Each worker can independently be in the following orthogonal states:
Alive (default)
Registered worker, eligible for routing and health checks.
Sleeping (`sleeping_workers`)

The worker has intentionally released GPU memory via `release_memory_occupation`. Sleeping is not a failure.
Dead (`dead_workers`)

The worker has failed consecutive health checks and is quarantined from routing.
State Transitions
1. Sleep (`release_memory_occupation`)

Rules:
- Only non-dead workers are eligible to enter sleeping.
- Sleeping workers are excluded from routing and from health-check failure accumulation.
- A failed sleep request does not change state.
2. Wake (`resume_memory_occupation`)

Rules:
- Wake requests are sent only to sleeping workers.
- On successful wake, the worker is removed from `sleeping_workers`, and from `dead_workers` if it was previously marked dead.

This allows a worker that was incorrectly marked dead during sleep (e.g. due to missed health checks) to be fully recovered.
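The wake-side bookkeeping above can be sketched as a plain function over the router's state. The set names follow the PR; `health_check_failures` is an assumed name for the failure counter, not confirmed by the diff.

```python
def on_wake_success(router, url):
    """Apply the wake transition: clear sleeping, recover a mis-marked
    dead worker, and reset its health-check failure counter."""
    router.sleeping_workers.discard(url)
    router.dead_workers.discard(url)        # recover if marked dead during sleep
    router.health_check_failures[url] = 0   # explicit recovery point
```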
3. Health Check Failure

Rules:
- A worker that fails consecutive health checks is marked dead.
- Sleeping workers do not accumulate health-check failures.

4. Deregistration
Rules:
- Deregistration clears all of the worker's runtime state, including its request count and its `dead_workers` and `sleeping_workers` membership.
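A sketch of the cleanup rule, written as a standalone function over the router's state (the function name is an assumption; the attribute names follow the PR):

```python
def deregister_worker(router, url):
    """Drop every piece of per-worker runtime state so that re-registering
    the same URL starts from a clean slate (no stale sleeping/dead entries)."""
    router.worker_request_counts.pop(url, None)
    router.dead_workers.discard(url)
    router.sleeping_workers.discard(url)
```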
Behavioral Guarantees
Data Plane (Generation / Routing)
Control Plane (Sleep / Wake)
Health Plane
Design Rationale
This design explicitly separates intentional unavailability (sleeping) from failure (dead).

By keeping these states orthogonal, a worker's sleep status never interferes with its failure accounting, and each state can be entered and cleared independently.

No additional intermediate or "pending" states are introduced, keeping the model minimal and easy to reason about.
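Since the two sets are orthogonal, a worker's effective state can be derived on demand rather than stored. A small illustrative helper (the function, and the choice to report dead before sleeping, are assumptions for display purposes):

```python
def worker_state(url, dead_workers, sleeping_workers):
    """Classify a registered worker from the two orthogonal sets; 'dead'
    is reported first since a dead worker is unroutable regardless of sleep."""
    if url in dead_workers:
        return "dead"
    if url in sleeping_workers:
        return "sleeping"
    return "alive"
```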