Accept Succeeded pods for final metrics polling in progression tracking #29
abhijeet-dhumal wants to merge 1 commit into opendatahub-io:main from
…racking Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Sorry I didn't think of this earlier, but are you sure the … Shouldn't opendatahub-io/kubeflow-sdk#33 be enough to ensure that the training job pod stays around long enough for the final scrape?
Closing this PR; I will follow up in #33.
When a TrainJob completes successfully, the final training metrics (100% progress, 0s remaining) are not captured in the TrainJob annotations. Instead, annotations show stale values from the last poll before completion (e.g., 95% progress, 12s remaining).
Timeline of the issue:

The `GetPrimaryPod()` function only accepts pods in the `Running` phase with `PodReady=true`. When training completes:

1. `on_train_end()` sets the final metrics.
2. The pod transitions to the `Succeeded` phase.
3. `GetPrimaryPod()` rejects the pod because:
   - `pod.Status.Phase == PodSucceeded` (not `PodRunning`)
   - `isPodReady(pod) == false` (Succeeded pods are never "ready")
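The selection behaviour described in this PR can be sketched as follows. This is a minimal illustration, not the PR's actual diff: `Pod`, `PodPhase`, `isPollable`, `getPrimaryPod`, and the `acceptSucceeded` flag are hypothetical stand-ins, and the types mirror only a small subset of the Kubernetes `core/v1` API so the example compiles without external dependencies.

```go
package main

// PodPhase mirrors a subset of the Kubernetes core/v1 pod phases.
// These are local stand-ins, not the real corev1 types.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// Pod is a hypothetical, trimmed-down stand-in for corev1.Pod.
type Pod struct {
	Name  string
	Phase PodPhase
	Ready bool // stand-in for the PodReady condition being True
}

// isPollable reports whether a pod may be used for a metrics poll.
// Running pods must also be ready. Succeeded pods are never ready,
// so they are accepted only when acceptSucceeded is set (assumed
// here to be the "final poll" case).
func isPollable(p Pod, acceptSucceeded bool) bool {
	switch p.Phase {
	case PodRunning:
		return p.Ready
	case PodSucceeded:
		return acceptSucceeded
	default:
		return false
	}
}

// getPrimaryPod returns the first pollable pod in the slice and
// false when no pod qualifies.
func getPrimaryPod(pods []Pod, acceptSucceeded bool) (Pod, bool) {
	for _, p := range pods {
		if isPollable(p, acceptSucceeded) {
			return p, true
		}
	}
	return Pod{}, false
}
```

With `acceptSucceeded` set to `false`, a pod that has just finished training is skipped, which matches the stale-annotation symptom described above. With it set to `true`, the same pod is returned and a final poll can be attempted. Whether a poll against a `Succeeded` pod can still reach the metrics endpoint is a separate question that this sketch does not address.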