Model Not Properly Registered to Gateway: Shows as "random/model" Instead of "Qwen3-0.6B" #233

@GGGsk

Description

I'm experiencing an issue with model registration in our LLM deployment. Despite configuring the model name as "Qwen3-0.6B" in the values.yaml file, the model is showing up as "random/model" when querying the /v1/models endpoint.

Configuration

In my values.yaml, I have:

  • modelArtifacts.name: "Qwen3-0.6B"
  • routing.modelName: Qwen3-0.6B
  • vLLM startup command with --served-model-name Qwen3-0.6B
  • inferencePool.modelName: Qwen3-0.6B
  • Both decode and prefill containers configured with the same model name

What I've Tried

  1. Verified all model name configurations are consistent
  2. Redeployed the service multiple times
  3. Restarted both vLLM and routing service pods
  4. Checked vLLM logs for model loading errors (no errors found)
  5. Verified model files exist at the expected path /models/Qwen3-0___6B
  6. Enabled inferenceModel.create: true as suggested

Expected Behavior

The model should be registered and accessible as "Qwen3-0.6B" when querying the /v1/models endpoint.

Actual Behavior

The /v1/models endpoint returns:

{
  "data": [
    {
      "created": 1758531410,
      "id": "random/model",
      "object": "model",
      "owned_by": "vllm",
      "parent": null,
      "root": "random/model"
    }
  ],
  "object": "list"
}
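For reference, the mismatch is easy to surface programmatically by comparing the `id` fields in the /v1/models payload against the expected served name. A minimal sketch, using the response above as sample data:

```python
import json

EXPECTED = "Qwen3-0.6B"

# The actual /v1/models response returned by the gateway, pasted from above.
response = json.loads("""
{
  "data": [
    {
      "created": 1758531410,
      "id": "random/model",
      "object": "model",
      "owned_by": "vllm",
      "parent": null,
      "root": "random/model"
    }
  ],
  "object": "list"
}
""")

served_ids = [m["id"] for m in response["data"]]
print("served ids:", served_ids)
print("expected name registered:", EXPECTED in served_ids)  # False for this response
```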

The Helm release is installed in the "works" namespace, while the gateway runs in "llm-d-infra". My values.yaml:


modelArtifacts:
  uri: "pvc://pvc-2d4821f257a24dcdaadde41a2433d94d/Qwen3-0___6B"
  name: "Qwen3-0.6B"
  mountPath: "/models"

routing:
  modelName: Qwen3-0.6B
  servicePort: 8000
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: llm-d-infra-inference-gateway
      namespace: llm-d-infra

  proxy:
    image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.2.0
    targetPort: 8200
    connector: nixlv2
    secure: false

  inferenceModel:
    create: true

  inferencePool:
    create: true
    name: test-llmd-llm-d-modelservice
    targetPortNumber: 8200
    modelServerType: vllm
    modelName: Qwen3-0.6B
    modelServers:
      matchLabels:
        llm-d.ai/inferenceServing: "true"

  httpRoute:
    create: false

  epp:
    create: true
    service:
      type: ClusterIP
      port: 9002
      targetPort: 9002
      appProtocol: http2
    image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.2.1
    replicas: 1
    debugLevel: 4
    disableReadinessProbe: false
    disableLivenessProbe: false
    pluginsConfigFile: "prefix-cache-tracking-config.yaml"
    env: []
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 1000m
        memory: 1Gi

decode:
  create: true
  replicas: 1
  monitoring:
    podmonitor:
      enabled: true
      portName: "metrics"
      path: "/metrics"
      interval: "30s"
  containers:
    - name: "vllm"
      image: "ghcr.io/llm-d/llm-d-dev:pr-170"
      modelCommand: custom
      command:
        - "/bin/sh"
        - "-c"
      args:
        - "vllm serve /models/Qwen3-0___6B --host 0.0.0.0 --port 8200 --served-model-name Qwen3-0.6B --max-model-len 1024"
      env:
        - name: UCX_TLS
          value: "cuda_ipc,cuda_copy,tcp"
        - name: VLLM_NIXL_SIDE_CHANNEL_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: VLLM_NIXL_SIDE_CHANNEL_PORT
          value: "5557"
        - name: VLLM_LOGGING_LEVEL
          value: DEBUG
      ports:
        - containerPort: 5557
          protocol: TCP
        - containerPort: 8200
          name: metrics
          protocol: TCP
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
      mountModelVolume: true
      volumeMounts:
        - name: metrics-volume
          mountPath: /.config
        - name: torch-compile-cache
          mountPath: /.cache
  volumes:
    - name: metrics-volume
      emptyDir: {}
    - name: torch-compile-cache
      emptyDir: {}

prefill:
  create: true
  replicas: 1
  monitoring:
    podmonitor:
      enabled: true
      portName: "metrics"
      path: "/metrics"
      interval: "30s"
  containers:
    - name: "vllm-prefill"
      image: "ghcr.io/llm-d/llm-d-dev:pr-170"
      modelCommand: custom
      command:
        - "/bin/sh"
        - "-c"
      args:
        - "vllm serve /models/Qwen3-0___6B --host 0.0.0.0 --port 8200 --served-model-name Qwen3-0.6B --max-model-len 1024"
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: UCX_TLS
          value: "cuda_ipc,cuda_copy,tcp"
        - name: VLLM_NIXL_SIDE_CHANNEL_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: VLLM_NIXL_SIDE_CHANNEL_PORT
          value: "5558"
        - name: VLLM_LOGGING_LEVEL
          value: DEBUG
      ports:
        - containerPort: 5558
          protocol: TCP
        - containerPort: 8300
          name: metrics
          protocol: TCP
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
      mountModelVolume: true
      volumeMounts:
        - name: metrics-volume
          mountPath: /.config
        - name: torch-compile-cache
          mountPath: /.cache
  volumes:
    - name: metrics-volume
      emptyDir: {}
    - name: torch-compile-cache
      emptyDir: {}

accelerator:
  type: "nvidia"
  resources:
    nvidia: "nvidia.com/gpu"
  env: {}
