Termination message capture #33
// Look for the trainer container (typically named "node")
for _, containerStatus := range pod.Status.ContainerStatuses {
    // Check if this is the trainer container
    if containerStatus.Name != "node" && containerStatus.Name != "trainer" {
Thanks for your comment on #30. That makes sense.
Should we replace "node" with constants.Node here? And when would the container be called "trainer"?
I have addressed this concern, please check!
robert-bell left a comment
LGTM, thanks for this.
Just a minor comment which should be addressed, but I'm happy for you to merge this without another round of review.
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Add Termination Message-Based Final Metrics Capture
Summary
Add robust final metrics capture by reading pod termination messages, providing a reliable fallback when HTTP metrics endpoints become unavailable during pod shutdown. Includes comprehensive unit tests.
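As a rough illustration of the fallback described above, the sketch below reads the termination message from a terminated trainer container and parses it as final metrics. It is not the code in this PR: the local structs only mirror the `State.Terminated.Message` field of the Kubernetes container status so the example compiles without external modules, and the `FinalMetrics` JSON shape, the `trainerContainerName` constant, and the `finalMetricsFromStatuses` helper are all hypothetical names chosen for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins that mirror the relevant corev1 container status fields.
// Real code would use k8s.io/api/core/v1 types instead.
type ContainerStateTerminated struct {
	ExitCode int32
	Message  string // contents of the container's termination message
}

type ContainerState struct {
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

// FinalMetrics is an assumed payload shape, not the format used by the PR.
type FinalMetrics struct {
	Step int     `json:"step"`
	Loss float64 `json:"loss"`
}

// trainerContainerName is a hypothetical constant standing in for the
// container name discussed in the review ("node").
const trainerContainerName = "node"

// finalMetricsFromStatuses returns the metrics found in the trainer
// container's termination message. The boolean is false when the container
// is missing, has not terminated, or wrote no parseable message.
func finalMetricsFromStatuses(statuses []ContainerStatus) (*FinalMetrics, bool) {
	for _, cs := range statuses {
		if cs.Name != trainerContainerName {
			continue // not the trainer container
		}
		term := cs.State.Terminated
		if term == nil || term.Message == "" {
			return nil, false
		}
		var m FinalMetrics
		if err := json.Unmarshal([]byte(term.Message), &m); err != nil {
			return nil, false
		}
		return &m, true
	}
	return nil, false
}

func main() {
	statuses := []ContainerStatus{
		{Name: "sidecar"},
		{Name: "node", State: ContainerState{Terminated: &ContainerStateTerminated{
			ExitCode: 0,
			Message:  `{"step": 100, "loss": 0.25}`,
		}}},
	}
	if m, ok := finalMetricsFromStatuses(statuses); ok {
		fmt.Printf("step=%d loss=%.2f\n", m.Step, m.Loss)
	}
}
```

Returning a boolean rather than an error keeps the caller free to fall back to the last HTTP-polled metrics when no usable termination message is present.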
This is the second part of #30. I'm splitting the entire fix into smaller atomic fixes via separate PRs: #32, #33, #34.
Problem
Training progress metrics are currently captured via HTTP polling during execution. However, when pods terminate (normal completion, failure, preemption, OOM), the HTTP endpoint becomes unavailable before final metrics can be captured, leading to:
Solution
Implement termination message-based capture as specified in the Kubeflow Training SDK:
Checklist: