
[Serve][LLM] Fractional GPU: auto-derived VLLM_RAY_PER_WORKER_GPUS never reaches the vLLM engine #63875

@mbartholet

What happened + What you expected to happen

When serving a model with a fractional placement_group_config (e.g. bundles: [{GPU: 0.25}]) via Ray Serve LLM, the vLLM engine fails to start:

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

It only starts if I manually set VLLM_RAY_PER_WORKER_GPUS in llm_config.runtime_env.env_vars. The value that Ray auto-derives for fractional serving never reaches the engine, so fractional GPU serving fails out of the box, and the error message gives no hint that this env var is the cause.

Expected: the value derived from the fractional bundle is visible to the vLLM engine process, so fractional serving works without manually setting the env var.

Root cause

  1. VLLMEngineConfig.get_runtime_env_with_local_env_vars() correctly derives VLLM_RAY_PER_WORKER_GPUS from the fractional bundle:
    python/ray/llm/_internal/serve/engines/vllm/vllm_models.py:138-143 (via _detect_fractional_gpu_from_pg).

  2. But that runtime_env is only applied to helper tasks — the engine-config probe task (.../engines/vllm/vllm_engine.py:438) and node initialization (.../serve/utils/node_initialization_utils.py:61).

  3. The replica (LLMServer) actor's runtime_env is built separately and only from llm_config.runtime_env — the derived value is never merged in:
    python/ray/llm/_internal/serve/core/server/llm_server.py:730-736

    ray_actor_options["runtime_env"] = {
        **default_runtime_env,
        **ray_actor_options.get("runtime_env", {}),
        **(llm_config.runtime_env if llm_config.runtime_env else {}),
    }
  4. The vLLM EngineCore runs as a multiprocessing child of the replica actor (AsyncLLM → make_async_mp_client → AsyncMPClient → launch_core_engines) and therefore inherits only the replica's os.environ. It reads envs.VLLM_RAY_PER_WORKER_GPUS (default 1.0) to size each worker's GPU request:
    vllm/v1/executor/ray_executor.py (line 161 in vLLM v0.18.0 / 152 in v0.22.0):

    num_gpus = envs.VLLM_RAY_PER_WORKER_GPUS   # default 1.0
    ...
    worker = ray.remote(num_gpus=num_gpus, scheduling_strategy=...)

Because the derived value never lands on the replica actor, EngineCore reads 1.0, requests a whole GPU per worker, and that worker cannot be placed in the GPU: 0.25 bundle. The engine hangs and aborts — surfaced as the unhelpful Failed core proc(s): {} (empty, because it never registered a proc).
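
The inheritance step can be illustrated without Ray or vLLM. The following is a minimal sketch, assuming only that a child process sees exactly the environment of its parent: a plain subprocess stands in for EngineCore, and the fallback of 1.0 mirrors the default described above. The names CHILD_CODE and gpus_seen_by_child are hypothetical and do not exist in Ray or vLLM.

```python
import os
import subprocess
import sys

# Child stand-in for EngineCore: read the env var, falling back to 1.0
# (the default this issue attributes to vLLM). Not vLLM code.
CHILD_CODE = 'import os; print(os.environ.get("VLLM_RAY_PER_WORKER_GPUS", "1.0"))'


def gpus_seen_by_child(parent_env):
    """Launch a child with the given environment and return the value it reads."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=parent_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


# Parent environment without the variable, like the replica actor in this issue.
base_env = {k: v for k, v in os.environ.items() if k != "VLLM_RAY_PER_WORKER_GPUS"}

without_var = gpus_seen_by_child(base_env)
with_var = gpus_seen_by_child({**base_env, "VLLM_RAY_PER_WORKER_GPUS": "0.25"})

bundle_gpu = 0.25  # GPU fraction of the placement group bundle
print(without_var, without_var <= bundle_gpu)  # request does not fit the bundle
print(with_var, with_var <= bundle_gpu)        # request fits the bundle
```

Only the second call produces a per-worker request that fits inside the 0.25 bundle, which matches the behavior seen with and without the manual env var.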

Workaround

Set VLLM_RAY_PER_WORKER_GPUS explicitly in llm_config.runtime_env so that it flows through the merge at llm_server.py:735 onto the replica actor, where EngineCore then inherits it:

runtime_env:
  env_vars:
    VLLM_RAY_PER_WORKER_GPUS: "0.25"   # must match the bundle GPU fraction
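
Until this is fixed, one way to keep the manual value in sync with the bundle is to compute it from the same placement_group_config. This is a rough sketch only: per_worker_gpus_env is a hypothetical helper written for this issue, not Ray's _detect_fractional_gpu_from_pg, and it assumes the first fractional GPU entry is the one that matters.

```python
def per_worker_gpus_env(placement_group_config):
    """Return env_vars for the first fractional GPU bundle, or {} if there is none.

    Hypothetical helper; it does not reproduce Ray's internal detection logic.
    """
    for bundle in placement_group_config.get("bundles", []):
        gpu = bundle.get("GPU", 0)
        if 0 < gpu < 1:
            # Env var values must be strings.
            return {"VLLM_RAY_PER_WORKER_GPUS": str(gpu)}
    return {}


pg_config = dict(bundles=[dict(GPU=0.25)], strategy="STRICT_PACK")
runtime_env = dict(env_vars=per_worker_gpus_env(pg_config))
print(runtime_env)
```

The resulting runtime_env dict can be passed to LLMConfig alongside the same pg_config, so the fraction is written in one place only.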

Suggested fix

In _get_deployment_options (llm_server.py:730-736), merge the auto-derived env var into the replica's ray_actor_options["runtime_env"]["env_vars"] — e.g. fold engine_config.get_runtime_env_with_local_env_vars() into that merge so the derived VLLM_RAY_PER_WORKER_GPUS lands on the replica actor, not only on the helper tasks. That makes the existing auto-derivation actually reach the engine.
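
One detail to watch in such a fix: the existing merge uses dict unpacking, which replaces the whole "env_vars" key rather than merging it, so adding the derived runtime_env as one more `**` spread would drop the derived value whenever the user sets any env var. Below is a minimal sketch of a key-by-key merge, assuming user-supplied values should take precedence over derived ones. merge_runtime_envs and the sample values are hypothetical, not part of Ray.

```python
def merge_runtime_envs(*runtime_envs):
    """Merge runtime_env dicts left to right; later ones win.

    Top-level keys are overwritten, but "env_vars" is merged key by key so
    that a derived variable survives unless a later dict overrides it.
    Hypothetical helper for illustration only.
    """
    merged = {}
    env_vars = {}
    for runtime_env in runtime_envs:
        if not runtime_env:
            continue
        merged.update(runtime_env)
        env_vars.update(runtime_env.get("env_vars") or {})
    if env_vars:
        merged["env_vars"] = env_vars
    return merged


derived_env = {"env_vars": {"VLLM_RAY_PER_WORKER_GPUS": "0.25"}}
user_env = {"env_vars": {"HF_HOME": "/models"}, "pip": ["some-package"]}

shallow = {**derived_env, **user_env}            # derived value is lost
deep = merge_runtime_envs(derived_env, user_env)  # derived value is kept

print(shallow["env_vars"])
print(deep["env_vars"])
```

Placing the derived dict first keeps the current workaround valid: an explicit VLLM_RAY_PER_WORKER_GPUS in llm_config.runtime_env still overrides the derived one.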

Reproduction

Deploy any TP=1 model via build_openai_app with a fractional bundle and no manual env var:

from ray.serve.llm import LLMConfig, build_openai_app
from ray import serve

cfg = LLMConfig(
    model_loading_config=dict(model_id="Qwen/Qwen3-Embedding-0.6B"),
    engine_kwargs=dict(gpu_memory_utilization=0.2, enforce_eager=True, max_model_len=4096),
    placement_group_config=dict(bundles=[dict(GPU=0.25)], strategy="STRICT_PACK"),
)
serve.run(build_openai_app({"llm_configs": [cfg]}), blocking=True)
  • Without VLLM_RAY_PER_WORKER_GPUS in runtime_env.env_vars → engine fails to start (Failed core proc(s): {}).
  • With runtime_env=dict(env_vars={"VLLM_RAY_PER_WORKER_GPUS": "0.25"}) → starts and serves correctly.

Versions and environment

  • Ray: 2.55.1
  • vLLM: 0.18.0 (confirmed identical code path on 0.22.0: ray_executor.py reads envs.VLLM_RAY_PER_WORKER_GPUS the same way, so this is independent of vLLM version)
  • CUDA: 12.9
  • Cluster: on-prem, multi-node, multi-GPU
  • vLLM engine: V1 (AsyncLLM), distributed_executor_backend="ray" (forced by Ray Serve LLM for GPU mode)

Issue Severity

Medium: fractional GPU serving — a documented feature — does not work as documented without an undocumented manual env var. There is a working workaround.
