Skip to content

fix: Fix JobSet Immutability and Add Termination Message-Based Metric…#30

Closed
abhijeet-dhumal wants to merge 1 commit intoopendatahub-io:mainfrom
abhijeet-dhumal:fix-jobset-suspend-prestop-progression
Closed

fix: Fix JobSet Immutability and Add Termination Message-Based Metric…#30
abhijeet-dhumal wants to merge 1 commit intoopendatahub-io:mainfrom
abhijeet-dhumal:fix-jobset-suspend-prestop-progression

Conversation

@abhijeet-dhumal
Copy link
Copy Markdown
Member

@abhijeet-dhumal abhijeet-dhumal commented Dec 1, 2025

This PR enhances the progression tracking feature by adding support for capturing final metrics from pod termination messages (written by SDK), removes the pre-stop hook dependency, and significantly improves test coverage with comprehensive unit and e2e tests replicating TransfomersTrainer based progression callback wrappers.

Related Kubeflow SDK wrapper instrumentation in on_train_end callback : opendatahub-io/kubeflow-sdk#35

I'm targeting this entire fix into smaller atomic fixes via separate PRs here, : #32, #33, #34

Testing

Unit Tests

go test ./pkg/rhai/progression -v

E2E Tests

go test ./pkg/rhai/e2e/... -v -timeout 30m
go test -v ./pkg/rhai/e2e/... --ginkgo.focus "should capture final status even when job fails" --ginkgo.v

All tests passed ✅
Screenshot 2025-12-02 at 11 43 08 AM
Screenshot 2025-12-02 at 11 44 09 AM

Screenshot 2025-12-01 at 8 55 23 PM

Checklist:

  • Docs included if any changes are user facing

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Dec 1, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review December 2, 2025 05:31
@abhijeet-dhumal abhijeet-dhumal force-pushed the fix-jobset-suspend-prestop-progression branch from 6b098e3 to f9e2f89 Compare December 2, 2025 06:56
…s Capture

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal abhijeet-dhumal force-pushed the fix-jobset-suspend-prestop-progression branch from f9e2f89 to 408f150 Compare December 2, 2025 06:58
Copy link
Copy Markdown
Collaborator

@robert-bell robert-bell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, though I'd be keen to split this into smaller PRs, to make it easier to revert things separately if necessary.

ginkgo.GinkgoWriter.Printf("Warning: Failed to delete namespace %s: %v\n", testNs.Name, err)
}
} else {
ginkgo.GinkgoWriter.Printf("✗ Tests failed - keeping namespace for debugging: %s\n", testNs.Name)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this going to leave a bunch of namespaces in the cluster? How do we clean them up?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also move theese changes to a separate PR so we can easily revert it if necessary?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, it will run all tests serially in the single test namespace

// Look for the trainer container (typically named "node")
for _, containerStatus := range pod.Status.ContainerStatuses {
// Check if this is the trainer container
if containerStatus.Name != "node" && containerStatus.Name != "trainer" {
Copy link
Copy Markdown
Collaborator

@robert-bell robert-bell Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: isn't the container name set by the TrainingRuntime? Can the user not set the name to something completely arbitrary? Should we just select the first container that has a termination message?

We need to worry too much about the approach if the feature ends up upstream as I've proposed a slightly different approach there that doesn't have this ambiguity.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! You're right that the TrainingRuntime defines the container name, but no, users can't set it arbitrarily.
The TrainingRuntime webhook enforces specific container names at admission time: Trainer containers must be named "node" (defined in constants.Node)
If a TrainingRuntime tries to use a different container name for the trainer, the webhook rejects it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to discuss if I've misunderstood you!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that makes sense thanks!

}

// Do not update the JobSet if it already exists and is not suspended
// Check if JobSet already exists
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you pull this into a separate PR please just in case we need to revert the other changes?

@robert-bell robert-bell mentioned this pull request Dec 2, 2025
1 task
Comment on lines +619 to +626
oldTrainJob := trainJob.DeepCopy()
if err := UpdateTrainerStatusAnnotation(trainJob, annotationStatus); err != nil {
return false, fmt.Errorf("failed to update trainer status annotation: %w", err)
}
patch := client.MergeFrom(oldTrainJob)
if err := c.Patch(ctx, trainJob, patch); err != nil {
return false, fmt.Errorf("failed to patch TrainJob annotations: %w", err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe move the status patching logic outside the if statements to avoid duplicating the logic between the termination case and the polling one.

Comment on lines +251 to 269
if oldJobSet != nil {
oldSuspend := ptr.Deref(oldJobSet.Spec.Suspend, false)
newSuspend := ptr.Deref(trainJob.Spec.Suspend, false)

// Use strategic merge patch for suspend changes to avoid immutable field validation
if oldSuspend != newSuspend {
patch := client.MergeFrom(oldJobSet.DeepCopy())
oldJobSet.Spec.Suspend = ptr.To(newSuspend)
if err := j.client.Patch(ctx, oldJobSet, patch); err != nil {
return nil, fmt.Errorf("failed to patch JobSet suspend field: %w", err)
}
return nil, nil
}

// Skip update if both TrainJob and JobSet are already running
if !newSuspend && !oldSuspend {
return nil, nil
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still need now we moved away from trying to inject a preStop hook?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhijeet-dhumal
Copy link
Copy Markdown
Member Author

Hi @astefanutti
As suggested by @robert-bell , I'm targeting this entire fix into smaller atomic fixes via separate PRs here, : #32, #33, #34

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants