RAGEngine status condition updates #1641

@pauldotyu

Describe the bug

The RAGEngine controller does not update observedGeneration in its status conditions after reconciling spec changes. As a result, kubectl wait --for=condition=ServiceReady runs until its timeout expires, even when the service is actually ready and all conditions show Status: True.

This makes it difficult to write robust deployment scripts and CI/CD pipelines that need to wait for RAGEngine resources to be fully reconciled after updates. The standard Kubernetes pattern of checking observedGeneration == generation is broken, as kubectl wait --for=condition=X expects the condition to be satisfied at the current generation.
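
To make the failure mode concrete, here is a minimal sketch of the staleness check as I understand kubectl wait to apply it: a condition counts only if its status matches and, when observedGeneration is present, it is not behind metadata.generation. The function name condition_satisfied is hypothetical and is not part of kubectl.

```shell
# condition_satisfied GENERATION OBSERVED_GENERATION STATUS
# Hypothetical helper approximating the check described above.
# Returns 0 only when STATUS is "True" and the condition is not stale.
# An empty OBSERVED_GENERATION means the field is absent, so staleness is not checked.
condition_satisfied() {
  local generation="$1" observed="$2" status="$3"
  if [ -n "$observed" ] && [ "$observed" -lt "$generation" ]; then
    return 1  # stale: the condition describes an older version of the spec
  fi
  [ "$status" = "True" ]
}
```

With the values from the describe output in this issue (generation 4, observedGeneration 1, status True), the function returns non-zero, which matches the observed timeout.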

Steps To Reproduce

  1. Deploy a RAGEngine resource:
kubectl apply -f - <<EOF
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: rag-gpt-oss-20b
spec:
  compute:
    count: 1
    instanceType: Standard_D8ds_v6
    labelSelector:
      matchLabels:
        apps: rag-gpt-oss-20b
  embedding:
    local:
      modelID: BAAI/bge-small-en-v1.5
  inferenceService:
    contextWindowSize: 131072
    url: http://my-llm-service/v1/chat/completions
EOF
  2. Wait for initial deployment to complete successfully (generation: 1, observedGeneration: 1):
kubectl wait --for=condition=ServiceReady ragengine/rag-gpt-oss-20b --timeout=10m
  3. Update the RAGEngine spec (e.g., change the inference service URL):
kubectl patch ragengine rag-gpt-oss-20b --type=merge -p '{"spec":{"inferenceService":{"url":"http://updated-llm-service/v1/chat/completions"}}}'
  4. Attempt to wait for the ServiceReady condition after the update:
kubectl wait --for=condition=ServiceReady ragengine/rag-gpt-oss-20b --timeout=10m
  5. The command times out after 10 minutes, even though kubectl describe shows the resource is ready.

The RAGEngine resource has generation: X (reflecting the spec updates), but all status conditions still show an older observedGeneration: Y (Y < X). This suggests the controller reconciles the spec changes successfully but does not update the observedGeneration field in the status subresource.
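
The mismatch can be checked directly. The following is a sketch only, assuming kubectl is pointed at the affected cluster and namespace; stale_conditions is a hypothetical helper name. It lists every condition whose observedGeneration lags metadata.generation.

```shell
# stale_conditions RESOURCE
# Hypothetical helper: prints the type of each condition whose
# observedGeneration is lower than the resource's metadata.generation.
stale_conditions() {
  local resource="$1" generation
  generation="$(kubectl get "$resource" -o jsonpath='{.metadata.generation}')"
  # Emit one "Type observedGeneration" pair per line, then filter the stale ones.
  kubectl get "$resource" \
    -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.observedGeneration}{"\n"}{end}' |
    while read -r ctype observed; do
      if [ -n "$observed" ] && [ "$observed" -lt "$generation" ]; then
        echo "$ctype"
      fi
    done
}
```

Running stale_conditions ragengine/rag-gpt-oss-20b against the resource in this report would be expected to list all four condition types; an empty result would mean no condition is lagging.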

Expected behavior

After the controller reconciles the spec changes, it should update the observedGeneration field in all status conditions to match the current metadata.generation value. The kubectl wait command should succeed once the condition is satisfied at the current generation.

Logs

$ kubectl describe ragengine rag-gpt-oss-20b
Name:         rag-gpt-oss-20b
Namespace:    default
Labels:       <none>
Annotations:  ragengine.kaito.io/hash: 1575fce61279e41c20eea0200f1162429654063de7e9b31d9f981e3b232e348c
              ragengine.kaito.io/revision: 3
API Version:  kaito.sh/v1alpha1
Kind:         RAGEngine
Metadata:
  Creation Timestamp:  2025-11-05T00:33:56Z
  Finalizers:
    ragengine.finalizer.kaito.sh
  Generation:        4                              # <-- Current generation
  Resource Version:  576316
  UID:               841bf389-f626-4500-bb57-479c6e4cbc7f
Spec:
  Compute:
    Count:          1
    Instance Type:  Standard_D8ds_v6
    Label Selector:
      Match Labels:
        Apps:  rag-gpt-oss-20b
  Embedding:
    Local:
      Model ID:  BAAI/bge-small-en-v1.5
  Inference Service:
    Context Window Size:  131072
    URL:                  http://updated-llm-service/v1/chat/completions
Status:
  Conditions:
    Last Transition Time:  2025-11-05T00:39:23Z
    Message:               nodeClaim plugins have been installed successfully
    Observed Generation:   1                                               # <-- Last observed generation
    Reason:                installNodePluginsSuccess
    Status:                True
    Type:                  NodeClaimReady
    Last Transition Time:  2025-11-05T00:39:23Z
    Message:               ragengine resource is ready
    Observed Generation:   1                                              # <-- Last observed generation
    Reason:                ragengineResourceStatusSuccess
    Status:                True
    Type:                  ResourceReady
    Last Transition Time:  2025-11-05T00:42:11Z
    Message:               ragengine succeeds
    Observed Generation:   1                                              # <-- Last observed generation
    Reason:                ragengineSucceeded
    Status:                True
    Type:                  RAGEngineSucceeded
    Last Transition Time:  2025-11-05T00:42:11Z
    Message:               Inference has been deployed successfully
    Observed Generation:   1                                              # <-- Last observed generation
    Reason:                RAGEngineServiceSuccess
    Status:                True
    Type:                  ServiceReady
  Worker Nodes:
    aks-ws651a575a7-38124524-vmss000000
Events:  <none>

All conditions show Status: True (indicating successful reconciliation), but observedGeneration remains at 1 despite the resource being at generation 4. The spec changes are applied (visible when checking the resource YAML), but the status is not properly updated.

Environment

  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Install tools:
  • Others:

Additional context

Workaround: Using JSONPath-based wait works because it doesn't check observedGeneration:

kubectl wait --for=jsonpath='{.status.conditions[?(@.type=="ServiceReady")].status}'=True ragengine/rag-gpt-oss-20b --timeout=10m
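
For scripts that wait on more than one condition, the same workaround can be wrapped in a small function. This is a sketch; wait_condition_ignoring_generation is a hypothetical name, and the 10m default timeout simply mirrors the commands above.

```shell
# wait_condition_ignoring_generation RESOURCE CONDITION [TIMEOUT]
# Hypothetical wrapper around the JSONPath-based wait shown above.
# It waits for the condition's status to be True and does not look at observedGeneration.
wait_condition_ignoring_generation() {
  local resource="$1" condition="$2" timeout="${3:-10m}"
  kubectl wait \
    "--for=jsonpath={.status.conditions[?(@.type==\"${condition}\")].status}=True" \
    "$resource" "--timeout=${timeout}"
}
```

For example, wait_condition_ignoring_generation ragengine/rag-gpt-oss-20b ServiceReady issues the same command as the workaround. Because it ignores observedGeneration, it can succeed on a condition that still describes the previous spec, so it is a stopgap rather than a fix.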

Labels: bug (Something isn't working)
