Skip to content

Termination message capture#33

Merged
abhijeet-dhumal merged 3 commits intoopendatahub-io:mainfrom
abhijeet-dhumal:termination-message-capture
Dec 5, 2025
Merged

Termination message capture#33
abhijeet-dhumal merged 3 commits intoopendatahub-io:mainfrom
abhijeet-dhumal:termination-message-capture

Conversation

@abhijeet-dhumal
Copy link
Copy Markdown
Member

@abhijeet-dhumal abhijeet-dhumal commented Dec 2, 2025

Add Termination Message-Based Final Metrics Capture

Summary

Add robust final metrics capture by reading pod termination messages, providing a reliable fallback when HTTP metrics endpoints become unavailable during pod shutdown. Includes comprehensive unit tests.

This is a 2nd part of #30, I'm targeting this entire fix into smaller atomic fixes via separate PRs here, : #32, #33, #34

Problem

Training progress metrics are currently captured via HTTP polling during execution. However, when pods terminate (normal completion, failure, preemption, OOM), the HTTP endpoint becomes unavailable before final metrics can be captured, leading to:

  • Lost final progress updates (especially for epoch-based training)
  • Inaccurate completion status
  • Missing failure metrics when training crashes

Solution

Implement termination message-based capture as specified in the Kubeflow Training SDK:

  1. Add CaptureMetricsFromTerminationMessage() function
  • Update PollAndUpdateFinalProgress() to read from terminated pod containers
  • Improve IsFinalStatusCaptured() with better grace period logic (3 polling cycles + buffer)
  • Fix patch handling in PollAndUpdateProgress() for safer updates
  • Support both "node" and "trainer" container names
  1. Unit tests

Checklist:

  • Docs included if any changes are user facing

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Dec 2, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review December 2, 2025 14:11
Comment thread pkg/rhai/progression/progression.go Outdated
// Look for the trainer container (typically named "node")
for _, containerStatus := range pod.Status.ContainerStatuses {
// Check if this is the trainer container
if containerStatus.Name != "node" && containerStatus.Name != "trainer" {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment on #30. That makes sense.

Should we replace "node" with constants.Node here? And when would the container be called "trainer"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @robert-bell great catch! 🙌

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have address this concern please check!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks Abhijeet!

Copy link
Copy Markdown
Collaborator

@robert-bell robert-bell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks for this.

Just a minor comment which should be addressed, but I'm happy for you to merge this without another round of review.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Copy link
Copy Markdown

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks!

@abhijeet-dhumal abhijeet-dhumal merged commit e9add5e into opendatahub-io:main Dec 5, 2025
9 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants