Describe the bug
RAGEngine controller does not update observedGeneration in status conditions after reconciling spec changes, causing kubectl wait --for=condition=ServiceReady to timeout indefinitely even when the service is actually ready and all conditions show Status: True.
This makes it difficult to write robust deployment scripts and CI/CD pipelines that need to wait for RAGEngine resources to be fully reconciled after updates. The standard Kubernetes pattern of checking observedGeneration == generation is broken, as kubectl wait --for=condition=X expects the condition to be satisfied at the current generation.
Steps To Reproduce
- Deploy a RAGEngine resource:
kubectl apply -f - <<EOF
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
name: rag-gpt-oss-20b
spec:
compute:
count: 1
instanceType: Standard_D8ds_v6
labelSelector:
matchLabels:
apps: rag-gpt-oss-20b
embedding:
local:
modelID: BAAI/bge-small-en-v1.5
inferenceService:
contextWindowSize: 131072
url: http://my-llm-service/v1/chat/completions
EOF

- Wait for the initial deployment to complete successfully (generation: 1, observedGeneration: 1):

kubectl wait --for=condition=ServiceReady ragengine/rag-gpt-oss-20b --timeout=10m

- Update the RAGEngine spec (e.g., change the inference service URL):

kubectl patch ragengine rag-gpt-oss-20b --type=merge -p '{"spec":{"inferenceService":{"url":"http://updated-llm-service/v1/chat/completions"}}}'

- Attempt to wait for the ServiceReady condition after the update:

kubectl wait --for=condition=ServiceReady ragengine/rag-gpt-oss-20b --timeout=10m

- The command times out after 10 minutes, even though kubectl describe shows the resource is ready.
The RAGEngine resource has generation: 4 (indicating four modifications to the spec), but all status conditions still show observedGeneration: 1, suggesting the controller successfully reconciles the spec changes but fails to update the observedGeneration field in the status subresource.
Expected behavior
After the controller reconciles the spec changes, it should update the observedGeneration field in all status conditions to match the current metadata.generation value. The kubectl wait command should succeed once the condition is satisfied at the current generation.
Logs
$ k describe ragengine rag-gpt-oss-20b
Name: rag-gpt-oss-20b
Namespace: default
Labels: <none>
Annotations: ragengine.kaito.io/hash: 1575fce61279e41c20eea0200f1162429654063de7e9b31d9f981e3b232e348c
ragengine.kaito.io/revision: 3
API Version: kaito.sh/v1alpha1
Kind: RAGEngine
Metadata:
Creation Timestamp: 2025-11-05T00:33:56Z
Finalizers:
ragengine.finalizer.kaito.sh
Generation: 4 # <-- Current generation
Resource Version: 576316
UID: 841bf389-f626-4500-bb57-479c6e4cbc7f
Spec:
Compute:
Count: 1
Instance Type: Standard_D8ds_v6
Label Selector:
Match Labels:
Apps: rag-gpt-oss-20b
Embedding:
Local:
Model ID: BAAI/bge-small-en-v1.5
Inference Service:
Context Window Size: 131072
URL: http://updated-llm-service/v1/chat/completions
Status:
Conditions:
Last Transition Time: 2025-11-05T00:39:23Z
Message: nodeClaim plugins have been installed successfully
Observed Generation: 1 # <-- Last observed generation
Reason: installNodePluginsSuccess
Status: True
Type: NodeClaimReady
Last Transition Time: 2025-11-05T00:39:23Z
Message: ragengine resource is ready
Observed Generation: 1 # <-- Last observed generation
Reason: ragengineResourceStatusSuccess
Status: True
Type: ResourceReady
Last Transition Time: 2025-11-05T00:42:11Z
Message: ragengine succeeds
Observed Generation: 1 # <-- Last observed generation
Reason: ragengineSucceeded
Status: True
Type: RAGEngineSucceeded
Last Transition Time: 2025-11-05T00:42:11Z
Message: Inference has been deployed successfully
Observed Generation: 1 # <-- Last observed generation
Reason: RAGEngineServiceSuccess
Status: True
Type: ServiceReady
Worker Nodes:
aks-ws651a575a7-38124524-vmss000000
Events: <none>
All conditions show Status: True (indicating successful reconciliation), but observedGeneration remains at 1 despite the resource being at generation 4. The spec changes are applied (visible when checking the resource YAML), but the status is not properly updated.
Environment
- Kubernetes version (use kubectl version):
- OS (e.g: cat /etc/os-release):
- Install tools:
- Others:
Additional context
Workaround: Using JSONPath-based wait works because it doesn't check observedGeneration:
kubectl wait --for=jsonpath='{.status.conditions[?(@.type=="ServiceReady")].status}'=True ragengine/rag-gpt-oss-20b --timeout=10m