Add a health check for the worker processes #686

Merged
jaredoconnell merged 2 commits into vllm-project:main from jaredoconnell:feat/process-health-check
Apr 8, 2026
@jaredoconnell
Collaborator

Summary

Creates an asyncio task that polls the worker processes for failure.

This is necessary because, at present, a worker process failure goes undetected: the parent keeps waiting indefinitely for the process to become ready, which causes a hang.

Details

  • Polls the worker processes
  • In the event of a failure, it creates a human-readable error message with the exit code or the type of failure.
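
The polling idea described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names `describe_exit`, `monitor_processes`, and `wait_ready_or_fail` are hypothetical, and the sketch assumes the workers are `multiprocessing.Process` objects, where a negative `exitcode` means the process was terminated by that signal number.

```python
import asyncio
import signal


def describe_exit(pid, exitcode):
    """Turn a multiprocessing-style exit code into a readable message."""
    if exitcode < 0:
        # multiprocessing reports death-by-signal as the negated signal number.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = "unknown signal"
        return f"Worker process {pid} died unexpectedly (signal {-exitcode}, {name})"
    return f"Worker process {pid} exited unexpectedly (exit code {exitcode})"


async def monitor_processes(processes, poll_interval=0.5):
    """Poll until at least one process has exited, then return the messages.

    Any exit is treated as a failure here, because the caller is still
    waiting for the workers to report that they are ready.
    """
    while True:
        failures = [
            describe_exit(p.pid, p.exitcode)
            for p in processes
            if p.exitcode is not None
        ]
        if failures:
            return failures
        await asyncio.sleep(poll_interval)


async def wait_ready_or_fail(ready_coro, processes):
    """Race the (hypothetical) readiness wait against the health monitor."""
    ready = asyncio.create_task(ready_coro)
    monitor = asyncio.create_task(monitor_processes(processes, poll_interval=0.05))
    done, pending = await asyncio.wait(
        {ready, monitor}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    if monitor in done:
        detail = "; ".join(monitor.result())
        raise RuntimeError(f"Worker process group startup failed: {detail}")
    return ready.result()
```

Racing the two tasks means the readiness wait is abandoned as soon as a worker dies, so the caller gets an error instead of a hang.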

Here is what it looks like with a segmentation fault, which I've been getting.

╭─ Benchmarks ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [--:--:--] ⠏   0% constant@1.00 (pending )                                                                                                                                                                                                                  │
│                                                                                                                                                                                                                                                             │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Generating... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (1/1) [ 0:00:28 < -:--:-- ]26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13820 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13821 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13822 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13823 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13824 died unexpectedly (signal 11)
26-04-03 12:41:20|ERROR            |guidellm.scheduler.worker_group:_process_health_monitor:292 - Worker process 13825 died unexpectedly (signal 11)
Traceback (most recent call last):
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/bin/guidellm", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/.venv3.12/lib/python3.12/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/__main__.py", line 476, in run
    asyncio.run(
  File "/opt/homebrew/Cellar/python@3.12/3.12.10_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.10_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/benchmark/entrypoints.py", line 554, in benchmark_generative_text
    async for benchmark in benchmarker.run(
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/benchmark/benchmarker.py", line 133, in run
    async for (
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/scheduler/scheduler.py", line 143, in run
    raise err
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/scheduler/scheduler.py", line 126, in run
    await worker_group.create_processes()
  File "/Users/joconnel/Documents/projects/ai/guidellm/src/guidellm/scheduler/worker_group.py", line 273, in create_processes
    raise RuntimeError(f"Worker process group startup failed: {detail}")
RuntimeError: Worker process group startup failed: Worker process 13820 died unexpectedly (signal 11); Worker process 13821 died unexpectedly (signal 11); Worker process 13822 died unexpectedly (signal 11); Worker process 13823 died unexpectedly (signal
11); Worker process 13824 died unexpectedly (signal 11); Worker process 13825 died unexpectedly (signal 11); Worker process 13826 died unexpectedly (signal 11); Worker process 13827 died unexpectedly (signal 11); Worker process 13828 died unexpectedly
(signal 11); Worker process 13829 died unexpectedly (signal 11). Check system logs for details. Consider an alternative multiprocessing start method (spawn, fork, forkserver) via the GUIDELLM__MP_CONTEXT_TYPE environment variable

In this case, all of the worker processes hit a segmentation fault at the same time, for a reason I have not identified. The operating system also reported the segmentation fault to me.

Test Plan

  • Killing a worker process manually should be enough to see the health check work.
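
As a rough sketch of what killing a worker looks like from the parent's side, the snippet below starts a stand-in child process, kills it, and reads back the exit code that a poller would see. It is an illustration only: `kill_and_observe` is a hypothetical helper, the child just sleeps instead of running a real worker, and the `fork` start method plus `SIGKILL` make it POSIX-only.

```python
import multiprocessing
import os
import signal
import time


def kill_and_observe(sig=signal.SIGKILL):
    """Start a sleeping child, kill it with `sig`, and return its exit code."""
    # Assumption: "fork" is available (Linux/macOS); it is not on Windows.
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=time.sleep, args=(60,))
    proc.start()
    os.kill(proc.pid, sig)
    # join() reaps the child so that exitcode is populated.
    proc.join(timeout=10)
    return proc.exitcode
```

The returned value is the negated signal number, which is the same information the health check turns into messages such as "died unexpectedly (signal 11)".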

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Generated-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
sjmonson previously approved these changes Apr 4, 2026
Collaborator

@sjmonson sjmonson left a comment

Didn't test but looks fine. See minor nits below.

Signed-off-by: Jared O'Connell <joconnel@redhat.com>
@dbutenhof added the bug label (Represents a user-visible defect) on Apr 7, 2026
@jaredoconnell jaredoconnell merged commit 328103f into vllm-project:main Apr 8, 2026
18 checks passed
@jaredoconnell jaredoconnell deleted the feat/process-health-check branch April 8, 2026 02:47