fix: Fix JobSet Immutability and Add Termination Message-Based Metrics Capture #30
Conversation
Force-pushed from 6b098e3 to f9e2f89, then from f9e2f89 to 408f150.
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
robert-bell left a comment:

lgtm, though I'd be keen to split this into smaller PRs, to make it easier to revert things separately if necessary.
```go
		ginkgo.GinkgoWriter.Printf("Warning: Failed to delete namespace %s: %v\n", testNs.Name, err)
	}
} else {
	ginkgo.GinkgoWriter.Printf("✗ Tests failed - keeping namespace for debugging: %s\n", testNs.Name)
```
Is this going to leave a bunch of namespaces in the cluster? How do we clean them up?

Can we also move these changes to a separate PR so we can easily revert them if necessary?

No, it will run all tests serially in a single test namespace.
```go
// Look for the trainer container (typically named "node")
for _, containerStatus := range pod.Status.ContainerStatuses {
	// Check if this is the trainer container
	if containerStatus.Name != "node" && containerStatus.Name != "trainer" {
```
question: isn't the container name set by the TrainingRuntime? Can the user not set the name to something completely arbitrary? Should we just select the first container that has a termination message?

We don't need to worry too much about the approach if the feature ends up upstream, as I've proposed a slightly different approach there that doesn't have this ambiguity.
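The name-agnostic alternative suggested here — picking the first container that reports a termination message — could be sketched roughly as below. The struct types are minimal hypothetical stand-ins for the `corev1.ContainerStatus` fields the real code would range over:

```go
package main

import "fmt"

// containerState is a stand-in for the relevant part of
// corev1.ContainerState (State.Terminated.Message).
type containerState struct {
	TerminationMessage string
}

// containerStatus is a stand-in for corev1.ContainerStatus.
type containerStatus struct {
	Name  string
	State containerState
}

// firstTerminationMessage returns the message of the first container that
// reported one, avoiding any assumption about the trainer container's name.
func firstTerminationMessage(statuses []containerStatus) (string, bool) {
	for _, cs := range statuses {
		if msg := cs.State.TerminationMessage; msg != "" {
			return msg, true
		}
	}
	return "", false
}

func main() {
	statuses := []containerStatus{
		{Name: "sidecar"},
		{Name: "node", State: containerState{TerminationMessage: `{"loss":0.12}`}},
	}
	if msg, ok := firstTerminationMessage(statuses); ok {
		fmt.Println(msg) // {"loss":0.12}
	}
}
```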
Good catch! You're right that the TrainingRuntime defines the container name, but no, users can't set it arbitrarily.
The TrainingRuntime webhook enforces specific container names at admission time: trainer containers must be named "node" (defined in constants.Node).
If a TrainingRuntime tries to use a different container name for the trainer, the webhook rejects it.
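A minimal sketch of that admission-time check — the constant and function names here are hypothetical stand-ins for the real webhook code, which validates the whole TrainingRuntime object:

```go
package main

import "fmt"

// nodeContainerName is a stand-in for constants.Node in the trainer codebase.
const nodeContainerName = "node"

// validateTrainerContainer mimics the webhook behavior described above:
// a TrainingRuntime whose trainer container is not named "node" is rejected.
func validateTrainerContainer(trainerContainerName string) error {
	if trainerContainerName != nodeContainerName {
		return fmt.Errorf("trainer container must be named %q, got %q",
			nodeContainerName, trainerContainerName)
	}
	return nil
}

func main() {
	fmt.Println(validateTrainerContainer("node"))   // <nil>
	fmt.Println(validateTrainerContainer("custom")) // rejection error
}
```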
I'd be happy to discuss if I've misunderstood you!

No that makes sense thanks!
```go
}

// Do not update the JobSet if it already exists and is not suspended
// Check if JobSet already exists
```
nit: can you pull this into a separate PR please, just in case we need to revert the other changes?
```go
oldTrainJob := trainJob.DeepCopy()
if err := UpdateTrainerStatusAnnotation(trainJob, annotationStatus); err != nil {
	return false, fmt.Errorf("failed to update trainer status annotation: %w", err)
}
patch := client.MergeFrom(oldTrainJob)
if err := c.Patch(ctx, trainJob, patch); err != nil {
	return false, fmt.Errorf("failed to patch TrainJob annotations: %w", err)
}
```
nit: maybe move the status patching logic outside the if statements, to avoid duplicating it between the termination case and the polling one.
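One way to de-duplicate that, sketched with hypothetical stand-ins (the real code would take the controller-runtime client and build a `client.MergeFrom` patch; here the patch call is abstracted behind a function so the shape is testable without a cluster):

```go
package main

import "fmt"

// TrainJob is a minimal stand-in for the real TrainJob object.
type TrainJob struct {
	Annotations map[string]string
}

// patchFn abstracts c.Patch(ctx, job, patch) for illustration.
type patchFn func(job *TrainJob) error

// patchTrainerStatus is the single place that updates the status annotation
// and applies the patch, so the termination and polling paths can share it.
// The annotation key here is hypothetical.
func patchTrainerStatus(job *TrainJob, status string, patch patchFn) error {
	if job.Annotations == nil {
		job.Annotations = map[string]string{}
	}
	job.Annotations["trainer-status"] = status
	if err := patch(job); err != nil {
		return fmt.Errorf("failed to patch TrainJob annotations: %w", err)
	}
	return nil
}

func main() {
	job := &TrainJob{}
	err := patchTrainerStatus(job, "Complete", func(*TrainJob) error { return nil })
	fmt.Println(err, job.Annotations["trainer-status"]) // <nil> Complete
}
```

Both call sites would then reduce to a one-line call with their own status value.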
```go
if oldJobSet != nil {
	oldSuspend := ptr.Deref(oldJobSet.Spec.Suspend, false)
	newSuspend := ptr.Deref(trainJob.Spec.Suspend, false)

	// Use strategic merge patch for suspend changes to avoid immutable field validation
	if oldSuspend != newSuspend {
		patch := client.MergeFrom(oldJobSet.DeepCopy())
		oldJobSet.Spec.Suspend = ptr.To(newSuspend)
		if err := j.client.Patch(ctx, oldJobSet, patch); err != nil {
			return nil, fmt.Errorf("failed to patch JobSet suspend field: %w", err)
		}
		return nil, nil
	}

	// Skip update if both TrainJob and JobSet are already running
	if !newSuspend && !oldSuspend {
		return nil, nil
	}
}
```
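For reference, the branch logic in that hunk reduces to a small decision function — a sketch with hypothetical names, useful for reasoning about which cases fall through to a full update:

```go
package main

import "fmt"

// reconcileAction summarizes the suspend handling above: a suspend toggle is
// patched in place, an already-running pair is skipped, and only the
// remaining case (both still suspended) falls through to a full update.
func reconcileAction(oldSuspend, newSuspend bool) string {
	switch {
	case oldSuspend != newSuspend:
		return "patch-suspend"
	case !oldSuspend && !newSuspend:
		return "skip"
	default:
		return "full-update"
	}
}

func main() {
	fmt.Println(reconcileAction(true, false))  // patch-suspend
	fmt.Println(reconcileAction(false, false)) // skip
	fmt.Println(reconcileAction(true, true))   // full-update
}
```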
Is this still needed now that we've moved away from trying to inject a preStop hook?
Hi @astefanutti

This PR enhances the progression-tracking feature by adding support for capturing final metrics from pod termination messages (written by the SDK), removes the pre-stop hook dependency, and significantly improves test coverage with comprehensive unit and e2e tests replicating TransformersTrainer-based progression callback wrappers.

Related Kubeflow SDK wrapper instrumentation in the on_train_end callback: opendatahub-io/kubeflow-sdk#35

I'm splitting this fix into smaller atomic PRs: #32, #33, #34
Testing

Unit Tests

```shell
go test ./pkg/rhai/progression -v
```

E2E Tests

```shell
go test ./pkg/rhai/e2e/... -v -timeout 30m
```

All tests passed ✅


Checklist: