fix(rhai): fix e2e test failures for failing jobs and no-metrics jobs
Two bugs were causing RHAI progression tracking e2e tests to fail:
1. TrainJobFailed condition never persisted (status overwritten by patch response)
When ReconcileProgression patches the TrainJob annotations, the Kubernetes API
server responds with the full persisted object, and the client writes that
response back into the in-memory trainJob, overwriting trainJob.Status. The
TrainJobFailed condition set by setTrainJobStatus was therefore silently lost
before the status update ran, so r.client.Status().Patch() was never called
and the condition never reached the API server.
Fix: save reconciledStatus before ReconcileProgression and restore it after.
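The save/restore pattern can be illustrated with a minimal, self-contained
sketch. It does not use the real controller code or client: the TrainJob,
Status and Condition types, fakeServer, the reconcile helper and the
annotation key are all hypothetical stand-ins, and the "patch status only if
it changed" check is an assumption about how the status update is gated.

```go
package main

import "fmt"

// Condition, Status and TrainJob are simplified stand-ins for the real API types.
type Condition struct {
	Type   string
	Status string
}

type Status struct {
	Conditions []Condition
}

type TrainJob struct {
	Annotations map[string]string
	Status      Status
}

// progressKey is a hypothetical annotation key used only in this sketch.
const progressKey = "example.com/progress"

func cloneStatus(s Status) Status {
	return Status{Conditions: append([]Condition(nil), s.Conditions...)}
}

func cloneAnnotations(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func sameStatus(a, b Status) bool {
	if len(a.Conditions) != len(b.Conditions) {
		return false
	}
	for i := range a.Conditions {
		if a.Conditions[i] != b.Conditions[i] {
			return false
		}
	}
	return true
}

func hasCondition(s Status, condType string) bool {
	for _, c := range s.Conditions {
		if c.Type == condType {
			return true
		}
	}
	return false
}

// fakeServer holds the persisted copy of the object.
type fakeServer struct{ stored TrainJob }

// patchAnnotations persists only the annotations, then overwrites the caller's
// object with the persisted state, mimicking a patch response.
func (s *fakeServer) patchAnnotations(job *TrainJob) {
	s.stored.Annotations = cloneAnnotations(job.Annotations)
	job.Annotations = cloneAnnotations(s.stored.Annotations)
	job.Status = cloneStatus(s.stored.Status)
}

// patchStatus persists the status, like a status subresource patch.
func (s *fakeServer) patchStatus(job *TrainJob) {
	s.stored.Status = cloneStatus(job.Status)
}

// reconcile sets a failure condition, patches annotations, and then patches
// status only if it changed (assumed gating). It reports whether the status
// patch was sent. With restore=false the condition is lost to the overwrite.
func reconcile(s *fakeServer, job *TrainJob, restore bool) bool {
	original := cloneStatus(job.Status)
	job.Status.Conditions = append(job.Status.Conditions,
		Condition{Type: "TrainJobFailed", Status: "True"})
	reconciledStatus := cloneStatus(job.Status) // save before the annotation patch

	job.Annotations[progressKey] = "hypothetical-progress"
	s.patchAnnotations(job) // overwrites job.Status in memory

	if restore {
		job.Status = reconciledStatus // restore after the annotation patch
	}
	if sameStatus(original, job.Status) {
		return false // nothing to persist, status patch skipped
	}
	s.patchStatus(job)
	return true
}

func main() {
	for _, restore := range []bool{false, true} {
		s := &fakeServer{}
		job := &TrainJob{Annotations: map[string]string{}}
		patched := reconcile(s, job, restore)
		fmt.Printf("restore=%v statusPatched=%v persisted=%v\n",
			restore, patched, hasCondition(s.stored.Status, "TrainJobFailed"))
	}
}
```

Without the restore step the status patch is skipped and the condition is
never persisted; with it, the condition reaches the fake server.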
2. trainerStatus annotation synthesized when metrics were never reachable
updateFinalStatus was creating an annotation even when no prior annotation
existed (i.e. the metrics endpoint was completely unreachable during the job).
The test expects no annotation in that case.
Fix: updateFinalStatus returns early if no existing annotation is found.
PollAndUpdateFinalProgress skips the patch and returns (true, nil) when
updateFinalStatus made no changes, preventing an infinite requeue loop.
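The early-return behaviour can be sketched as follows. This is an
illustration under assumptions, not the actual implementation: the function
signatures, the annotation key, the plain-string annotation value, the patch
callback, and the reading of the boolean result as "done, do not requeue" are
all hypothetical.

```go
package main

import "fmt"

// trainerStatusKey is a hypothetical annotation key used only in this sketch.
const trainerStatusKey = "example.com/trainerStatus"

// TrainJob is a simplified stand-in for the real API type.
type TrainJob struct {
	Annotations map[string]string
}

// updateFinalStatus rewrites an existing annotation to its final value and
// reports whether it changed anything. If no annotation exists (metrics were
// never reachable), it returns early and does not synthesize one.
func updateFinalStatus(job *TrainJob, final string) bool {
	current, ok := job.Annotations[trainerStatusKey] // reading a nil map is safe
	if !ok || current == final {
		return false
	}
	job.Annotations[trainerStatusKey] = final
	return true
}

// pollAndUpdateFinalProgress returns (done, err). When updateFinalStatus made
// no changes, it skips the patch and reports done so the caller does not
// requeue forever.
func pollAndUpdateFinalProgress(job *TrainJob, final string, patch func(*TrainJob) error) (bool, error) {
	if !updateFinalStatus(job, final) {
		return true, nil
	}
	if err := patch(job); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	patchCalls := 0
	patch := func(*TrainJob) error { patchCalls++; return nil }

	noMetrics := &TrainJob{}
	done, err := pollAndUpdateFinalProgress(noMetrics, "final", patch)
	fmt.Println(done, err, patchCalls, len(noMetrics.Annotations))

	withMetrics := &TrainJob{Annotations: map[string]string{trainerStatusKey: "running"}}
	done, err = pollAndUpdateFinalProgress(withMetrics, "final", patch)
	fmt.Println(done, err, patchCalls, withMetrics.Annotations[trainerStatusKey])
}
```

In the no-metrics case the job ends with no annotation and no patch call; in
the other case the existing annotation is updated and patched once.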
Fixes: RHOAIENG-59039
Made-with: Cursor