
feat(gateway): make vLLM sampler requests asynchronous and rate-limited #119

Closed
droot wants to merge 1 commit into gke-labs:main from droot:feature/async-vllm-sampling

droot (Collaborator) commented Jun 10, 2026

  • Dispatch vLLM token generation requests to a background task instead of blocking the FastAPI handler, aligning its async behavior with the Torch sampler backend.

  • Introduce VLLM_CONCURRENCY_LIMIT (default 512) and _vllm_semaphore to prevent socket/file-descriptor exhaustion and connection drop errors under heavy surges.

  • Maintain a global _background_tasks set to hold strong references to running background tasks and prevent premature garbage collection.

droot (Collaborator, Author) commented Jun 10, 2026

Please do not merge this yet. I haven't tested it.

droot force-pushed the feature/async-vllm-sampling branch from 4f7d79e to 7c1a533 on June 11, 2026 15:00
droot closed this Jun 17, 2026