[Serve][LLM] Fractional GPU: auto-derived VLLM_RAY_PER_WORKER_GPUS never reaches the vLLM engine #63875

### What happened + What you expected to happen

When serving a model with a **fractional** `placement_group_config` (e.g. `bundles: [{GPU: 0.25}]`) via Ray Serve LLM, the vLLM engine fails to start:

```
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

It only starts if I **manually** set `VLLM_RAY_PER_WORKER_GPUS` in `llm_config.runtime_env.env_vars`. The auto-derivation that Ray added for fractional serving has no effect on the engine, so fractional GPU serving is silently broken out of the box.

**Expected:** the value derived from the fractional bundle is visible to the vLLM engine process, so fractional serving works without manually setting the env var.

### Root cause

1. `VLLMEngineConfig.get_runtime_env_with_local_env_vars()` correctly derives `VLLM_RAY_PER_WORKER_GPUS` from the fractional bundle:
   `python/ray/llm/_internal/serve/engines/vllm/vllm_models.py:138-143` (via `_detect_fractional_gpu_from_pg`).

2. But that runtime_env is only applied to **helper tasks** — the engine-config probe task (`.../engines/vllm/vllm_engine.py:438`) and node initialization (`.../serve/utils/node_initialization_utils.py:61`).

3. The **replica (`LLMServer`) actor's** runtime_env is built separately and **only** from `llm_config.runtime_env` — the derived value is never merged in:
   `python/ray/llm/_internal/serve/core/server/llm_server.py:730-736`
   ```python
   ray_actor_options["runtime_env"] = {
       **default_runtime_env,
       **ray_actor_options.get("runtime_env", {}),
       **(llm_config.runtime_env if llm_config.runtime_env else {}),
   }
   ```

4. The vLLM `EngineCore` runs as a **multiprocessing child of the replica actor** (`AsyncLLM` → `make_async_mp_client` → `AsyncMPClient` → `launch_core_engines`) and therefore inherits only the replica's `os.environ`. It reads `envs.VLLM_RAY_PER_WORKER_GPUS` (default **1.0**) to size each worker's GPU request:
   `vllm/v1/executor/ray_executor.py` (line **161** in vLLM v0.18.0 / **152** in v0.22.0):
   ```python
   num_gpus = envs.VLLM_RAY_PER_WORKER_GPUS   # default 1.0
   ...
   worker = ray.remote(num_gpus=num_gpus, scheduling_strategy=...)
   ```

Because the derived value never lands on the replica actor, `EngineCore` reads the default `1.0` and requests a **whole GPU** per worker, which cannot be placed in the `GPU: 0.25` bundle. The engine hangs and then aborts, which surfaces as the unhelpful `Failed core proc(s): {}` (the set is empty because no core process was ever registered).
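The inheritance behavior in step 4 can be illustrated without Ray or vLLM. The following is a minimal sketch, not vLLM's actual code: it uses a plain `subprocess` child as a stand-in for the `EngineCore` process, and the `child_sees` helper is hypothetical. It assumes only that the value is read from the process environment with a `1.0` fallback, as described above.

```python
import os
import subprocess
import sys

# Child code mimics the read described above: env var with a 1.0 fallback.
CHILD = (
    "import os; "
    "print(float(os.environ.get('VLLM_RAY_PER_WORKER_GPUS', '1.0')))"
)


def child_sees(parent_env: dict) -> float:
    """Spawn a child process with `parent_env` and return the value it reads."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=parent_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


# Start from the current environment, minus the variable under test.
base = {k: v for k, v in os.environ.items() if k != "VLLM_RAY_PER_WORKER_GPUS"}

# Parent without the variable (the replica actor today): child falls back to 1.0.
unset_value = child_sees(base)

# Parent with the variable (the manual workaround): child reads the fraction.
set_value = child_sees({**base, "VLLM_RAY_PER_WORKER_GPUS": "0.25"})

print(unset_value, set_value)
```

The child only ever sees what its parent's environment contains, which is why setting the variable on helper tasks has no effect on the engine.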

### Workaround

Set `VLLM_RAY_PER_WORKER_GPUS` explicitly in `llm_config.runtime_env.env_vars` so that it flows through the merge at `llm_server.py:735` onto the replica actor, where `EngineCore` inherits it:

```yaml
runtime_env:
  env_vars:
    VLLM_RAY_PER_WORKER_GPUS: "0.25"   # must match the bundle GPU fraction
```
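To keep the workaround value in sync with the bundle, the env var can be computed from the same bundle list that is passed to `placement_group_config`. This is a rough sketch: `per_worker_gpu_env` is a hypothetical helper (not part of Ray), and it assumes a single-bundle config where the first bundle's `GPU` entry is the fraction to use.

```python
from typing import Dict, List


def per_worker_gpu_env(bundles: List[Dict[str, float]]) -> Dict[str, str]:
    """Return env_vars for a fractional first bundle, or {} otherwise.

    Hypothetical helper: assumes the first bundle's GPU value is the
    per-worker fraction, which holds for the TP=1 case in this issue.
    """
    if not bundles:
        return {}
    gpu = float(bundles[0].get("GPU", 0))
    if 0 < gpu < 1:
        return {"VLLM_RAY_PER_WORKER_GPUS": str(gpu)}
    return {}


bundles = [{"GPU": 0.25}]

# Pass `bundles` to placement_group_config and `runtime_env` to LLMConfig.
runtime_env = {"env_vars": per_worker_gpu_env(bundles)}
print(runtime_env)
```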

### Suggested fix

In `_get_deployment_options` (`llm_server.py:730-736`), merge the auto-derived env var into the replica's `ray_actor_options["runtime_env"]["env_vars"]` — e.g. fold `engine_config.get_runtime_env_with_local_env_vars()` into that merge so the derived `VLLM_RAY_PER_WORKER_GPUS` lands on the replica actor, not only on the helper tasks. That makes the existing auto-derivation actually reach the engine.
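One detail worth noting for the fix: the existing merge uses top-level dict unpacking, so if the derived runtime_env were simply added as another `**` entry, a later `env_vars` dict would replace an earlier one wholesale. Below is a minimal sketch of a key-wise merge. `merge_runtime_envs` is a hypothetical helper, and the shape of the derived dict (an `env_vars` key holding the derived variable) is an assumption about what `get_runtime_env_with_local_env_vars()` returns.

```python
from typing import Any, Dict, Optional


def merge_runtime_envs(*envs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge runtime_env dicts left to right; later entries win.

    Unlike plain `{**a, **b}`, the nested `env_vars` dicts are merged
    key by key instead of being replaced as a whole.
    """
    merged: Dict[str, Any] = {}
    env_vars: Dict[str, str] = {}
    for env in envs:
        if not env:
            continue
        for key, value in env.items():
            if key == "env_vars":
                env_vars.update(value or {})
            else:
                merged[key] = value
    if env_vars:
        merged["env_vars"] = env_vars
    return merged


# Assumed shape of the auto-derived runtime_env.
derived = {"env_vars": {"VLLM_RAY_PER_WORKER_GPUS": "0.25"}}
# Example user-supplied runtime_env with an unrelated variable.
user = {"env_vars": {"MY_FLAG": "1"}}

shallow = {**derived, **user}             # derived variable is lost
deep = merge_runtime_envs(derived, user)  # both variables survive
print(shallow["env_vars"], deep["env_vars"])
```

Placing the user's `llm_config.runtime_env` last keeps an explicit `VLLM_RAY_PER_WORKER_GPUS` override working, so the current workaround would continue to take precedence over the derived value.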

### Reproduction

Deploy any **TP=1** model via `build_openai_app` with a fractional bundle and no manual env var:

```python
from ray.serve.llm import LLMConfig, build_openai_app
from ray import serve

cfg = LLMConfig(
    model_loading_config=dict(model_id="Qwen/Qwen3-Embedding-0.6B"),
    engine_kwargs=dict(gpu_memory_utilization=0.2, enforce_eager=True, max_model_len=4096),
    placement_group_config=dict(bundles=[dict(GPU=0.25)], strategy="STRICT_PACK"),
)
serve.run(build_openai_app({"llm_configs": [cfg]}), blocking=True)
```

- **Without** `VLLM_RAY_PER_WORKER_GPUS` in `runtime_env.env_vars` → engine fails to start (`Failed core proc(s): {}`).
- **With** `runtime_env=dict(env_vars={"VLLM_RAY_PER_WORKER_GPUS": "0.25"})` → starts and serves correctly.

### Versions and environment

- Ray: **2.55.1**
- vLLM: **0.18.0** (same code path confirmed on **0.22.0**: `ray_executor.py` reads `envs.VLLM_RAY_PER_WORKER_GPUS` the same way, so the bug does not appear to depend on the vLLM version)
- CUDA: 12.9
- Cluster: on-prem, multi-node, multi-GPU
- vLLM engine: V1 (`AsyncLLM`), `distributed_executor_backend="ray"` (forced by Ray Serve LLM for GPU mode)

### Issue Severity

Medium: fractional GPU serving — a documented feature — does not work as documented without an undocumented manual env var. There is a working workaround.

