What happened + What you expected to happen
When serving a model with a fractional placement_group_config (e.g. bundles: [{GPU: 0.25}]) via Ray Serve LLM, the vLLM engine fails to start:
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
It only starts if I manually set VLLM_RAY_PER_WORKER_GPUS in llm_config.runtime_env.env_vars. The auto-derivation that Ray added for fractional serving has no effect on the engine, so fractional GPU serving is silently broken out of the box.
Expected: the value derived from the fractional bundle is visible to the vLLM engine process, so fractional serving works without manually setting the env var.
Root cause
- VLLMEngineConfig.get_runtime_env_with_local_env_vars() correctly derives VLLM_RAY_PER_WORKER_GPUS from the fractional bundle:
  python/ray/llm/_internal/serve/engines/vllm/vllm_models.py:138-143 (via _detect_fractional_gpu_from_pg).
- But that runtime_env is only applied to helper tasks: the engine-config probe task (.../engines/vllm/vllm_engine.py:438) and node initialization (.../serve/utils/node_initialization_utils.py:61).
- The replica (LLMServer) actor's runtime_env is built separately and only from llm_config.runtime_env; the derived value is never merged in:
  python/ray/llm/_internal/serve/core/server/llm_server.py:730-736
ray_actor_options["runtime_env"] = {
    **default_runtime_env,
    **ray_actor_options.get("runtime_env", {}),
    **(llm_config.runtime_env if llm_config.runtime_env else {}),
}
- The vLLM EngineCore runs as a multiprocessing child of the replica actor (AsyncLLM → make_async_mp_client → AsyncMPClient → launch_core_engines) and therefore inherits only the replica's os.environ. It reads envs.VLLM_RAY_PER_WORKER_GPUS (default 1.0) to size each worker's GPU request:
vllm/v1/executor/ray_executor.py (line 161 in vLLM v0.18.0 / 152 in v0.22.0):
num_gpus = envs.VLLM_RAY_PER_WORKER_GPUS # default 1.0
...
worker = ray.remote(num_gpus=num_gpus, scheduling_strategy=...)
Because the derived value never lands on the replica actor, EngineCore reads 1.0, requests a whole GPU per worker, and that worker cannot be placed in the GPU: 0.25 bundle. The engine hangs and aborts — surfaced as the unhelpful Failed core proc(s): {} (empty, because it never registered a proc).
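To make the mismatch concrete, here is a minimal sketch of the arithmetic involved. It is not vLLM's or Ray's code: per_worker_gpus and worker_fits_bundle are hypothetical helpers, and the plain dicts stand in for the replica actor's os.environ. It only assumes what is described above, namely that the variable defaults to 1.0 when unset and that a worker's GPU request must fit inside the bundle.

```python
import os
from typing import Mapping, Optional


def per_worker_gpus(environ: Optional[Mapping[str, str]] = None) -> float:
    """Read VLLM_RAY_PER_WORKER_GPUS, falling back to 1.0 when it is unset.

    Hypothetical stand-in for the lookup described above, not vLLM's code.
    """
    environ = os.environ if environ is None else environ
    return float(environ.get("VLLM_RAY_PER_WORKER_GPUS", "1.0"))


def worker_fits_bundle(environ: Mapping[str, str], bundle_gpu: float) -> bool:
    """Return True if the per-worker GPU request fits in a bundle of bundle_gpu GPUs."""
    return per_worker_gpus(environ) <= bundle_gpu


# Stand-ins for the replica actor's environment (illustrative only).
replica_env_without_var = {}
replica_env_with_var = {"VLLM_RAY_PER_WORKER_GPUS": "0.25"}

fits_without = worker_fits_bundle(replica_env_without_var, bundle_gpu=0.25)
fits_with = worker_fits_bundle(replica_env_with_var, bundle_gpu=0.25)
```

With the variable missing from the replica's environment the request is 1.0 GPU and cannot fit a 0.25 bundle; with it present the request matches the bundle.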
Workaround
Set it explicitly so it flows through llm_server.py:735 onto the replica actor (then inherited by EngineCore):
runtime_env:
  env_vars:
    VLLM_RAY_PER_WORKER_GPUS: "0.25"  # must match the bundle GPU fraction
Suggested fix
In _get_deployment_options (llm_server.py:730-736), merge the auto-derived env var into the replica's ray_actor_options["runtime_env"]["env_vars"] — e.g. fold engine_config.get_runtime_env_with_local_env_vars() into that merge so the derived VLLM_RAY_PER_WORKER_GPUS lands on the replica actor, not only on the helper tasks. That makes the existing auto-derivation actually reach the engine.
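One possible shape for that merge, as a rough sketch rather than a patch against Ray: merge_runtime_envs is a hypothetical helper, and the example dicts are made up. It merges env_vars key by key instead of relying on a shallow ** merge, which would replace the whole env_vars dict as soon as a later source defines one, and it applies the derived values first so that anything the user sets explicitly still wins.

```python
from typing import Any, Dict, Optional


def merge_runtime_envs(*runtime_envs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge runtime_env dicts left to right; later dicts take precedence.

    Top-level keys are overwritten, except "env_vars", which is merged
    key by key so earlier variables survive unless explicitly overridden.
    Hypothetical helper for illustration, not part of Ray.
    """
    merged: Dict[str, Any] = {}
    env_vars: Dict[str, str] = {}
    for runtime_env in runtime_envs:
        if not runtime_env:
            continue
        for key, value in runtime_env.items():
            if key == "env_vars":
                env_vars.update(value or {})
            else:
                merged[key] = value
    if env_vars:
        merged["env_vars"] = env_vars
    return merged


# Illustrative inputs: the derived env var and a user-supplied runtime_env.
derived_env = {"env_vars": {"VLLM_RAY_PER_WORKER_GPUS": "0.25"}}
user_env = {"env_vars": {"MY_FLAG": "1"}, "working_dir": "."}

# Derived values go first so an explicit user setting can override them.
replica_runtime_env = merge_runtime_envs(derived_env, user_env)
```

Whether the derived value or the user's value should win on conflict is a design choice for the maintainers; the ordering above keeps the current workaround working unchanged.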
Reproduction
Deploy any TP=1 model via build_openai_app with a fractional bundle and no manual env var:
from ray.serve.llm import LLMConfig, build_openai_app
from ray import serve

cfg = LLMConfig(
    model_loading_config=dict(model_id="Qwen/Qwen3-Embedding-0.6B"),
    engine_kwargs=dict(gpu_memory_utilization=0.2, enforce_eager=True, max_model_len=4096),
    placement_group_config=dict(bundles=[dict(GPU=0.25)], strategy="STRICT_PACK"),
)
serve.run(build_openai_app({"llm_configs": [cfg]}), blocking=True)
- Without VLLM_RAY_PER_WORKER_GPUS in runtime_env.env_vars → engine fails to start (Failed core proc(s): {}).
- With runtime_env=dict(env_vars={"VLLM_RAY_PER_WORKER_GPUS": "0.25"}) → starts and serves correctly.
Versions and environment
- Ray: 2.55.1
- vLLM: 0.18.0 (confirmed identical code path on 0.22.0: ray_executor.py reads envs.VLLM_RAY_PER_WORKER_GPUS the same way, so this is independent of vLLM version)
- CUDA: 12.9
- Cluster: on-prem, multi-node, multi-GPU
- vLLM engine: V1 (AsyncLLM), distributed_executor_backend="ray" (forced by Ray Serve LLM for GPU mode)
Issue Severity
Medium: fractional GPU serving — a documented feature — does not work as documented without an undocumented manual env var. There is a working workaround.