Accept Succeeded pods for final metrics polling in progression tracking#29

Closed
abhijeet-dhumal wants to merge 1 commit into opendatahub-io:main from abhijeet-dhumal:fix-termination-edgecase

@abhijeet-dhumal
Member

When a TrainJob completes successfully, the final training metrics (100% progress, 0s remaining) are not captured in the TrainJob annotations. Instead, annotations show stale values from the last poll before completion (e.g., 95% progress, 12s remaining).

Timeline of the issue:
The GetPrimaryPod() function only accepts pods in the Running phase with PodReady=true. When training completes:

  1. Training finishes - on_train_end() sets final metrics
  2. The SDK waits 45s to keep the metrics server alive (see kubeflow-sdk#33, "fix: Wait 45s after training completion to allow final metrics capture")
  3. Wait ends → Python process exits (code 0)
  4. Pod transitions to Succeeded phase
  5. Controller attempts final metrics poll
  6. GetPrimaryPod() rejects the pod because:
    • pod.Status.Phase == PodSucceeded (not PodRunning)
    • isPodReady(pod) == false (Succeeded pods are never "ready")
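The rejection in step 6, and the relaxed check that the PR title describes, can be sketched as follows. This is a minimal illustration, not the controller's actual code: the `Pod` and `PodPhase` types are local stand-ins for the `k8s.io/api/core/v1` types so the example runs without dependencies, and `strictEligible`, `relaxedEligible`, and `pickPrimary` are hypothetical names chosen for this sketch.

```go
package main

import "fmt"

// PodPhase and Pod are local stand-ins for the k8s.io/api/core/v1 types.
// They are assumptions made so this sketch compiles on its own.
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
)

type Pod struct {
	Name  string
	Phase PodPhase
	Ready bool // stands in for the PodReady condition being true
}

// strictEligible models the behaviour described above:
// only pods that are Running and Ready are accepted.
func strictEligible(p Pod) bool {
	return p.Phase == PodRunning && p.Ready
}

// relaxedEligible (hypothetical) additionally accepts Succeeded pods,
// which are never Ready, so that a final poll can still select them.
func relaxedEligible(p Pod) bool {
	return strictEligible(p) || p.Phase == PodSucceeded
}

// pickPrimary returns the first pod accepted by the given predicate.
func pickPrimary(pods []Pod, eligible func(Pod) bool) (Pod, bool) {
	for _, p := range pods {
		if eligible(p) {
			return p, true
		}
	}
	return Pod{}, false
}

func main() {
	// State after step 4: the process exited with code 0.
	finished := []Pod{{Name: "trainer-0", Phase: PodSucceeded, Ready: false}}

	_, okStrict := pickPrimary(finished, strictEligible)
	_, okRelaxed := pickPrimary(finished, relaxedEligible)
	fmt.Println("strict:", okStrict, "relaxed:", okRelaxed)
}
```

With the strict predicate no pod is selected once the phase is Succeeded, so the final poll has nothing to target. The relaxed predicate only changes which pod is selected; whether the metrics endpoint is still reachable at that point is a separate question.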

Checklist:

  • Docs included if any changes are user facing

…racking

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@coderabbitai

coderabbitai Bot commented Nov 26, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@robert-bell
Collaborator

Sorry I didn't think of this earlier, but are you sure the preStop hook is necessary? My understanding is that the hook isn't called on normal pod termination (e.g. when training has completed), so it won't help ensure the final progress status is scraped.

Shouldn't opendatahub-io/kubeflow-sdk#33 be enough to ensure that the training job pod stays around long enough for the final scrape?

@abhijeet-dhumal
Member Author

Closing this PR; I will follow up in #33.
