Description
What happened + What you expected to happen
Hello, I'm trying to deploy a RayService on AKS with multiple models deployed through LLMConfig.
No matter what resource allocation I give in ray_actor_options, when I deploy the service Ray always asks for more than 1 GPU, and so resources are never allocated to my deployments.
I tried different fractional GPU sizes in ray_actor_options (it was 0.9 for the LLM model when I first hit this issue, then I reduced it to 0.5, but nothing changed). However, no matter what fraction of a GPU I give each of the two deployments, there is always a placement group whose demand sums to more than 1 GPU, so everything stays stuck.
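For context, here is the arithmetic I expect versus what the autoscaler actually demands (values taken from the reproduction script and the ray status output below):
  chat replica:      0.5 GPU + 12 CPUs
  embedding replica: 0.1 GPU +  2 CPUs
  expected total:    0.6 GPU + 14 CPUs  (fits on the single 1-GPU A100 worker)
  observed demand:   a pending placement group of {'GPU': 1.1, 'CPU': 2.0}, i.e. more than the whole GPU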
Versions / Dependencies
stock rayproject/ray-llm:2.46.0-py311-cu124
Ray 2.46.0
py311
cu124
Differences in libraries from the original image:
"vllm>=0.8.5" "transformers>=4.56.0"
Hardware:
A100 node pool on AKS (Azure)
Reproduction script
This is my Python script:
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app
# =========================
# Qwen3 8B Chat Model
# =========================
chat_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-8B",
    },
    engine_kwargs={
        "max_model_len": 8000,               # full long context
        "dtype": "bfloat16",
        "gpu_memory_utilization": 0.5,       # use 50% of A100 GPU memory
        "trust_remote_code": True,
        "enable_auto_tool_choice": True,      # enables automatic tool usage
        "tool_call_parser": "hermes",         # for function/tool-call reasoning
    },
    deployment_config={
        "ray_actor_options": {
            "num_gpus": 0.5,
            "num_cpus": 12,
        },
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)
# =========================
# Qwen3 0.6B Embedding Model
# =========================
embedding_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-Embedding-0.6B",
    },
    engine_kwargs={
        "max_model_len": 1000,
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "task": "embed",
    },
    deployment_config={
        "ray_actor_options": {
            "num_gpus": 0.1,
            "num_cpus": 2,
        },
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)
# =========================
# Build one OpenAI-compatible app
# =========================
llm_app = build_openai_app({
    "llm_configs": [chat_llm, embedding_llm]
})
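(Not part of the deployed file, just a minimal sketch of how the same app can be started directly for local debugging; it assumes the script is saved as serve_qwen3_openai_app.py so that the import path in the YAML below still matches.)
if __name__ == "__main__":
    # serve is already imported at the top of the script;
    # deploy the app on the connected Ray cluster and keep the driver alive
    serve.run(llm_app, blocking=True)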
This is the ray status output I get when the pods are deployed:
(base) ray@ray-qwen3-openai-llm-embed-6nq7t-head-thw9q:/serve_app$ ray status
======== Autoscaler status: 2025-10-30 15:53:43.266595 ========
Node status
---------------------------------------------------------------
Active:
 1 headgroup
Idle:
 1 gpu-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
---------------------------------------------------------------
Total Usage:
 1.0/24.0 CPU
 0.0/1.0 GPU
 0B/37.31GiB memory
 0B/11.84GiB object_store_memory
From request_resources:
 (none)
Pending Demands:
 {'CPU': 2.0, 'GPU': 0.1}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 12.0, 'GPU': 0.5}: 1+ pending tasks/actors (1+ using placement groups)
 {'GPU': 1.1, 'CPU': 2.0} * 1 (PACK): 2+ pending placement groups
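To see the exact bundles behind those pending placement groups, something like this can be run from a driver attached to the cluster (a minimal sketch using ray.util.placement_group_table(); exact key names in the returned dicts may vary by Ray version):
import ray
from ray.util import placement_group_table

ray.init(address="auto")  # attach to the running cluster
for pg_id, info in placement_group_table().items():
    # each entry describes one placement group: its state and the per-bundle resource dicts
    print(pg_id, info.get("state"), info.get("bundles"))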
My YAML config file:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-qwen3-openai-llm-embed
spec:
  serveConfigV2: |
    applications:
    - name: qwen3
      import_path: serve_qwen3_openai_app:llm_app
      route_prefix: "/"
      deployments:
        # --- Deployment 1: The Chat/LLM Model ---
        - name: Qwen3-Chat
          # We explicitly define the resources needed for this deployment
          ray_actor_options:
            num_gpus: 0.6
            num_cpus: 12
        # --- Deployment 2: The Embedder Model ---
        - name: EmbeddingService
          num_replicas: 1
          ray_actor_options:
            num_gpus: 0.1
            num_cpus: 2
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          annotations:
            ray.io/disable-probes: "true"   # ✅ Prevent operator from overwriting probes
        spec:
          containers:
          - name: ray-head
            image: <container_registry>/ray-qwen3-llm-embed-openai:latest
            env:
            - name: PYTHONPATH
              value: /serve_app
            command: ["/bin/bash", "-c"]
            args:
              - |
                ray start --head --dashboard-host=0.0.0.0 --port=6379 && \
                serve run serve_qwen3_openai_app:llm_app
            resources:
              limits:
                cpu: 4
                memory: 8Gi
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265 # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            # Dummy probes (won’t be used if annotation disables them)
            livenessProbe:
              exec:
                command: ["/bin/sh", "-c", "echo live"]
              initialDelaySeconds: 3600
              periodSeconds: 600
              timeoutSeconds: 5
              failureThreshold: 120
            readinessProbe:
              exec:
                command: ["/bin/sh", "-c", "echo ready"]
              initialDelaySeconds: 3600
              periodSeconds: 600
              timeoutSeconds: 5
              failureThreshold: 120
    workerGroupSpecs:
    - groupName: gpu-group
      replicas: 1
      rayStartParams:
        num-gpus: "1"
        #resources: '{"accelerator_type:A100": 1}'
      template:
        metadata:
          annotations:
            ray.io/disable-probes: "true"   # ✅ Disable probes for worker too
        spec:
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "present"
            effect: "NoSchedule"
          containers:
          - name: ray-worker
            image: <container_registry>/ray-qwen3-llm-embed-openai:latest
            env:
            - name: PYTHONPATH
              value: /serve_app
            resources:
              limits:
                nvidia.com/gpu: "1"
                cpu: 20
                memory: 32Gi
            # Dummy probes (won’t be active due to annotation)
            livenessProbe:
              exec:
                command: ["/bin/sh", "-c", "echo live"]
              initialDelaySeconds: 3600
              periodSeconds: 600
              timeoutSeconds: 5
              failureThreshold: 120
            readinessProbe:
              exec:
                command: ["/bin/sh", "-c", "echo ready"]
              initialDelaySeconds: 3600
              periodSeconds: 600
              timeoutSeconds: 5
              failureThreshold: 120
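To check whether the deployments are simply stuck waiting for resources, I also query the Serve status from inside the head pod (a minimal sketch using serve.status(); the application name "qwen3" matches the serveConfigV2 above):
from ray import serve

status = serve.status()
app = status.applications["qwen3"]
print("application:", app.status)
for name, dep in app.deployments.items():
    # each deployment reports its status plus a message that usually explains why it is not healthy
    print(name, dep.status, dep.message)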
Issue Severity
None