[Server] Recycle guaranteed executor workers via max_tasks_per_child #9964

Open
aylei wants to merge 1 commit into master from yelei/worker-max-tasks-per-child

@aylei aylei commented Jun 26, 2026

Collaborator

Summary

Guaranteed executor workers (PoolExecutor) live for the lifetime of the API server. A worker's RSS therefore only ever grows to the high-water mark of the heaviest request it has handled and is never reclaimed, so under sustained load the pool's cumulative memory creeps toward the container limit.

This adds an opt-in SKYPILOT_API_SERVER_WORKER_MAX_TASKS_PER_CHILD environment variable that maps to ProcessPoolExecutor's max_tasks_per_child (added in Python 3.11): a guaranteed worker is recycled after it has handled that many requests, returning its memory to the OS. This is the official replacement for the existing DisposableExecutor workaround noted in process.py (TODO(aylei): use the official max_tasks_per_child when upgrade to 3.11).

  • Unset by default → no behavior change.
  • Ignored with a warning on Python < 3.11 (the parameter does not exist there).
  • Applies to the guaranteed pool only; burst workers (DisposableExecutor) are already disposed of after each task.
  • The setting is preserved across a BrokenProcessPool rebuild.
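
For reviewers who want the gating rules in one place, here is a minimal sketch of how the environment variable could be parsed and turned into executor kwargs. It is not the code in this PR: the helper names `parse_max_tasks_per_child` and `guaranteed_pool_kwargs` are hypothetical, and treating non-integer or non-positive values as "ignored with a warning" is an assumption (the actual change may handle invalid input differently).

```python
import logging
import sys
from typing import Any, Dict, Mapping, Optional, Tuple

logger = logging.getLogger(__name__)

ENV_VAR = 'SKYPILOT_API_SERVER_WORKER_MAX_TASKS_PER_CHILD'


def parse_max_tasks_per_child(
    environ: Mapping[str, str],
    python_version: Tuple[int, int] = sys.version_info[:2],
) -> Optional[int]:
    """Return the recycling threshold, or None when recycling is disabled.

    Hypothetical helper: the invalid-value policy below is an assumption.
    """
    raw = environ.get(ENV_VAR)
    if raw is None or not raw.strip():
        return None  # Unset: no behavior change.
    try:
        value = int(raw)
    except ValueError:
        logger.warning('Ignoring %s=%r: not an integer.', ENV_VAR, raw)
        return None
    if value < 1:
        # ProcessPoolExecutor itself rejects values below 1.
        logger.warning('Ignoring %s=%r: must be >= 1.', ENV_VAR, raw)
        return None
    if python_version < (3, 11):
        # max_tasks_per_child does not exist before Python 3.11.
        logger.warning('Ignoring %s: requires Python 3.11+.', ENV_VAR)
        return None
    return value


def guaranteed_pool_kwargs(max_workers: int,
                           max_tasks_per_child: Optional[int]) -> Dict[str, Any]:
    """Build kwargs for the guaranteed pool only (hypothetical helper)."""
    kwargs: Dict[str, Any] = {'max_workers': max_workers}
    if max_tasks_per_child is not None:
        # Only pass the parameter when set, so older Pythons never see it.
        kwargs['max_tasks_per_child'] = max_tasks_per_child
    return kwargs
```

Calling `parse_max_tasks_per_child(os.environ)` once at startup and passing the result to `guaranteed_pool_kwargs` keeps the version check in a single place; burst workers would simply never receive the kwarg.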

Test plan

  • tests/unit_tests/test_sky/server/requests/test_process.py:
    • test_pool_executor_recycles_after_max_tasks — with max_tasks_per_child=2 and a single worker, submitting 4 tasks yields worker PIDs [A, A, B, B] (recycled after every 2 tasks). Skipped on Python < 3.11.
    • test_pool_executor_no_recycle_by_default — without the setting, all tasks run on one PID.
    • test_burstable_executor_max_tasks_per_child_routing — the setting reaches the guaranteed pool kwargs and is not forwarded to burst workers.
  • tests/unit_tests/test_sky/server/test_config.py: env parsing (valid / invalid / unset), the Python < 3.11 gate, and propagation into both worker configs.
  • pytest tests/unit_tests/test_sky/server/requests/test_process.py tests/unit_tests/test_sky/server/requests/test_executor.py tests/unit_tests/test_sky/server/test_config.py all pass on Python 3.11.
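
The recycling behavior that the first test checks can be reproduced outside the repo with a plain `ProcessPoolExecutor`. The sketch below is standalone and is not the test added in this PR; the function name `worker_pids_for` is hypothetical. It submits tasks one at a time to a single-worker pool and records which PID served each one.

```python
import os
import sys
from concurrent.futures import ProcessPoolExecutor
from typing import List


def worker_pids_for(num_tasks: int, max_tasks_per_child: int) -> List[int]:
    """Run num_tasks sequentially on one worker and return each task's PID."""
    pids: List[int] = []
    # max_tasks_per_child requires Python 3.11+. When it is set and no
    # mp_context is given, the executor uses the "spawn" start method.
    with ProcessPoolExecutor(max_workers=1,
                             max_tasks_per_child=max_tasks_per_child) as pool:
        for _ in range(num_tasks):
            # os.getpid is a builtin, so it pickles by reference and can be
            # resolved inside a spawned child.
            pids.append(pool.submit(os.getpid).result())
    return pids


if __name__ == '__main__':
    # The guard matters with "spawn": children re-import the main module.
    if sys.version_info >= (3, 11):
        # With 4 tasks and a limit of 2, the PIDs should come in two pairs.
        print(worker_pids_for(num_tasks=4, max_tasks_per_child=2))
    else:
        print('max_tasks_per_child is unavailable before Python 3.11')
```

Waiting for each result before submitting the next task keeps the ordering deterministic, which is what makes a PID-based assertion practical in a unit test.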

Guaranteed executor workers (the PoolExecutor pool) live for the lifetime of the API server, so a worker's RSS only ever grows to the high-water mark of the heaviest request it has served and is never reclaimed. Under sustained load the pool's cumulative memory creeps toward the container limit.

Add an opt-in `SKYPILOT_API_SERVER_WORKER_MAX_TASKS_PER_CHILD` env var that maps to ProcessPoolExecutor's `max_tasks_per_child` (added in Python 3.11), recycling a worker after it has handled that many requests so its memory is returned to the OS. Unset by default (no behavior change); ignored with a warning on Python < 3.11. Applies to the guaranteed pool only — burst workers are already disposed after each task.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for recycling guaranteed worker processes after they have handled a specified number of tasks, bounding their high-water-mark RSS. This is achieved by exposing a new environment variable, SKYPILOT_API_SERVER_WORKER_MAX_TASKS_PER_CHILD, which maps to ProcessPoolExecutor's max_tasks_per_child parameter (available in Python 3.11+). The configuration is propagated through the server configuration to the worker executors, and comprehensive unit tests are added to verify the recycling behavior and environment variable parsing. There are no review comments to assess, and I have no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
