Skip to content

RHOAIENG-63097: Add spec integrity checks to Trainer upgrade tests#920

Merged
openshift-merge-bot[bot] merged 3 commits into
opendatahub-io:mainfrom
sutaakar:RHOAIENG-63097-trainer-upgrade-spec-integrity
Jun 10, 2026
Merged

RHOAIENG-63097: Add spec integrity checks to Trainer upgrade tests#920
openshift-merge-bot[bot] merged 3 commits into
opendatahub-io:mainfrom
sutaakar:RHOAIENG-63097-trainer-upgrade-spec-integrity

Conversation

@sutaakar

@sutaakar sutaakar commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add generation + spec JSON baseline tracking to all Trainer v2 upgrade tests (TrainingRuntime, custom CTR + Kueue, specific runtime + Kueue, default runtime + Kueue)
  • Pre-upgrade tests store resource generation, serialized spec, and RHOAI version in ConfigMaps
  • Post-upgrade tests compare generation; on mismatch, log before/after specs and assert via an explicit allowlist of upgrade paths where spec mutations are expected (utils_upgrade.go)
  • RHOAI version is read from DSCI (status.release.version) to determine the upgrade path
  • Shared helpers (verifySpecIntegrity, storeUpgradeBaseline, addResourceBaseline) reduce duplication across all test pairs

Test plan

  • TestSetupTrainingRuntime / TestVerifyTrainingRuntime — passed on RHOAI 3.5
  • TestSetupCustomRuntimeUpgradeTrainJob / TestRunCustomRuntimeUpgradeTrainJob — passed on RHOAI 3.5
  • Generation unchanged assertions verified end-to-end
  • go vet ./tests/trainer/... passes
  • gofmt clean

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Enhanced upgrade tests to capture pre-upgrade baselines (generation/spec) and verify post-upgrade integrity across default, specific-runtime, and custom-runtime scenarios
    • Added JSON baseline snapshotting and verification steps to ensure spec changes are expected per release version
    • Added a test helper to obtain the platform/runtime release version for version-aware checks
  • Utilities
    • Added version-aware logic to determine when spec mutations are expected between releases
  • Documentation
    • Added an "Upgrade Tests" section describing the baseline/verification flow, limitations, and maintenance notes

@openshift-ci openshift-ci Bot requested review from chipspeak and kapil27 June 9, 2026 11:57
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 38137707-551c-48fe-a918-e6f7ad63f4d6

📥 Commits

Reviewing files that changed from the base of the PR and between 0623ad4 and 6ed2c89.

📒 Files selected for processing (1)
  • tests/trainer/trainer_kueue_upgrade_training_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/trainer/trainer_kueue_upgrade_training_test.go

📝 Walkthrough

Walkthrough

This PR instruments Trainer upgrade tests to snapshot resource generation and serialized specs into ConfigMaps before upgrades, persist the cluster RHOAI version from DSCI, and compare generation/spec after upgrade. It adds version-normalization utilities to declare allowed spec-mutation upgrade paths, helper APIs to store/read baselines, and updates default, specific, custom runtime, and standalone TrainingRuntime tests to use these baselines for post-upgrade verification. README docs describing the two-phase upgrade test workflow were also added.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding spec integrity checks to Trainer upgrade tests, which is the core focus across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
tests/trainer/trainer_kueue_upgrade_training_test.go (1)

169-178: 💤 Low value

Consider adding ConfigMap cleanup for consistency.

Lines 313 and 468 explicitly delete the baseline ConfigMap in deferred cleanup, but this test omits it. While namespace deletion will cascade, explicit cleanup improves consistency across the test suite.

♻️ Add ConfigMap cleanup

In the deferred cleanup block around line 167, add:

 defer test.Client().Kueue().KueueV1beta2().ResourceFlavors().Delete(test.Ctx(), resourceFlavorName, metav1.DeleteOptions{})
 defer test.Client().Kueue().KueueV1beta2().ClusterQueues().Delete(test.Ctx(), clusterQueueName, metav1.DeleteOptions{})
+defer test.Client().Core().CoreV1().ConfigMaps(upgradeNamespaceName).Delete(test.Ctx(), upgradeConfigMapName, metav1.DeleteOptions{})
 defer DeleteTestNamespace(test, namespace)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/trainer_kueue_upgrade_training_test.go` around lines 169 - 178,
Add an explicit cleanup to delete the baseline ConfigMap in the deferred cleanup
block for this test: call
test.Client().Core().CoreV1().ConfigMaps(upgradeNamespaceName).Delete(test.Ctx(),
upgradeConfigMapName, metav1.DeleteOptions{}) and assert no error (e.g.,
test.Expect(err).NotTo(HaveOccurred() or allow NotFound) ) so the ConfigMap
created for verifySpecIntegrity (configMap) is removed consistently like other
tests; place this alongside other deferred deletions in the same setup block.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/common/support/dscInitialization.go`:
- Around line 56-58: The function calls unstructured.NestedString(dsci.Object,
"status", "release", "version") and currently returns "" on both parse error and
missing field without logging; update the block to log a clear diagnostic before
returning by using the DSCI object's identity (e.g., dsci.GetName() or
dsci.Object.GetName()) and include whether err is non-nil or found==false so
callers can distinguish malformed DSCI vs missing key; ensure you keep the
existing return "" behavior but add a processLogger (or t.Logf/appropriate
logger) message that includes the DSCI name, the error string (if err != nil),
and the found flag.

In `@tests/trainer/trainer_kueue_upgrade_training_test.go`:
- Around line 604-610: The ConfigMap named by configMapName is created without
OwnerReferences (see variable configMap and corev1.ConfigMap ObjectMeta), so add
an OwnerReference in ObjectMeta.OwnerReferences tying it to the intended owner
(for example the test's owning resource or the test namespace) using the owner's
APIVersion, Kind, Name and UID and set Controller and BlockOwnerDeletion as
appropriate, or if omission is intentional add a clear comment above the
configMap declaration explaining why OwnerReferences are omitted; update
references where configMap is created to include this OwnerReference or the
explanatory comment.
- Line 524: The code currently ignores the error from json.Marshal(spec) on the
json.Marshal call that assigns currentSpecJSON; change this to capture and
handle the error (e.g., currentSpecJSON, err := json.Marshal(spec); if err !=
nil { t.Fatalf("failed to marshal spec: %v", err) } or use require.NoError(t,
err) if testify is used) so that failures to marshal are surfaced instead of
producing a nil/empty currentSpecJSON; update uses of currentSpecJSON (the
subsequent log) to assume a valid byte slice after the check.
- Around line 522-529: GetRHOAIVersionFromDSCI can return an empty string
causing IsSpecMutationExpected to false-positive; before calling
trainerutils.IsSpecMutationExpected, check if preVersion == "" || postVersion ==
"" (variables set from configMap.Data[rhoaiVersionKey] and
GetRHOAIVersionFromDSCI(test)) and handle explicitly — e.g., call
test.T().Fatalf("RHOAI version unavailable: pre=%q post=%q; aborting upgrade
spec check") to fail fast (or optionally Skip/Mark as inconclusive) so the
assertion at trainerutils.IsSpecMutationExpected is not hit with unknown
versions.
- Around line 526-527: The two test log statements that print full spec JSON
(test.T().Logf("Pre-upgrade %s spec: %s", resourceName, configMap.Data[specKey])
and test.T().Logf("Post-upgrade %s spec: %s", resourceName, currentSpecJSON))
expose sensitive fields; replace them by either removing direct JSON logging or
passing the spec through a sanitizer that strips/redacts keys like Password,
Token, APIKey, Secret, Secret.Data (e.g., implement a sanitizeSpecJSON function
used for both pre- and post-upgrade logs), or log only a non-sensitive
summary/diff (resourceName and allowed metadata) instead of the raw
configMap.Data[specKey] and currentSpecJSON.

In `@tests/trainer/utils/utils_upgrade.go`:
- Around line 23-26: The test helper specMutationExpectedPaths currently
documents the known mutating upgrade pair but left it commented out, causing
IsSpecMutationExpected to return false and trigger false integrity failures;
update the specMutationExpectedPaths declaration (and the other similar list
around the second occurrence) to include the {"3.4","3.5"} pair (i.e., remove
the comment and add the tuple) so the known mutation path is recognized by
IsSpecMutationExpected (refer to the specMutationExpectedPaths variable and the
IsSpecMutationExpected usage to locate the change).

---

Nitpick comments:
In `@tests/trainer/trainer_kueue_upgrade_training_test.go`:
- Around line 169-178: Add an explicit cleanup to delete the baseline ConfigMap
in the deferred cleanup block for this test: call
test.Client().Core().CoreV1().ConfigMaps(upgradeNamespaceName).Delete(test.Ctx(),
upgradeConfigMapName, metav1.DeleteOptions{}) and assert no error (e.g.,
test.Expect(err).NotTo(HaveOccurred() or allow NotFound) ) so the ConfigMap
created for verifySpecIntegrity (configMap) is removed consistently like other
tests; place this alongside other deferred deletions in the same setup block.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 66ab7d83-8dbd-4b51-96d9-319da77b83f3

📥 Commits

Reviewing files that changed from the base of the PR and between 30a27e6 and e59c4c0.

📒 Files selected for processing (4)
  • tests/common/support/dscInitialization.go
  • tests/trainer/trainer_kueue_upgrade_training_test.go
  • tests/trainer/trainer_trainingruntime_upgrade_test.go
  • tests/trainer/utils/utils_upgrade.go

Comment thread tests/common/support/dscInitialization.go
Comment thread tests/trainer/trainer_kueue_upgrade_training_test.go
Comment thread tests/trainer/trainer_kueue_upgrade_training_test.go
Comment thread tests/trainer/trainer_kueue_upgrade_training_test.go
Comment thread tests/trainer/trainer_kueue_upgrade_training_test.go
Comment thread tests/trainer/utils/utils_upgrade.go
Store pre-upgrade resource generation, spec JSON, and RHOAI version in
ConfigMaps. Post-upgrade, compare generation and conditionally fail based
on an explicit allowlist of upgrade paths where spec mutations are expected
(e.g., API changes across minor versions). Before/after specs are logged
on mismatch for analyzer comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sutaakar sutaakar force-pushed the RHOAIENG-63097-trainer-upgrade-spec-integrity branch from e59c4c0 to 243ecec Compare June 9, 2026 12:06

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/trainer/utils/utils_upgrade.go (1)

23-26: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Known mutation path is documented but disabled, causing false integrity failures.

Line 24 references a known mutating upgrade (3.4 → 3.5 per kubeflow/trainer#3309), but the pair is commented out. If generation changes on that path, IsSpecMutationExpected returns false and fails the test despite expected behavior.

Proposed fix
 var specMutationExpectedPaths = [][2]string{
 	// kubeflow/trainer#3309: PodTemplateOverrides → RuntimePatches
-	//	{"3.4", "3.5"},
+	{"3.4", "3.5"},
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/trainer/utils/utils_upgrade.go` around lines 23 - 26,
specMutationExpectedPaths array documents known mutating upgrade paths but the
3.4→3.5 pair is commented out, causing IsSpecMutationExpected to misreport
expected mutations; re-enable the documented pair by adding {"3.4", "3.5"} into
specMutationExpectedPaths (remove the comment markers) so the function
IsSpecMutationExpected will return true for that upgrade path and prevent false
integrity failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/trainer/utils/utils_upgrade.go`:
- Around line 23-26: specMutationExpectedPaths array documents known mutating
upgrade paths but the 3.4→3.5 pair is commented out, causing
IsSpecMutationExpected to misreport expected mutations; re-enable the documented
pair by adding {"3.4", "3.5"} into specMutationExpectedPaths (remove the comment
markers) so the function IsSpecMutationExpected will return true for that
upgrade path and prevent false integrity failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d9bd1889-9e6d-4ea6-891f-4bbaaa227d30

📥 Commits

Reviewing files that changed from the base of the PR and between e59c4c0 and 243ecec.

📒 Files selected for processing (4)
  • tests/common/support/dscInitialization.go
  • tests/trainer/trainer_kueue_upgrade_training_test.go
  • tests/trainer/trainer_trainingruntime_upgrade_test.go
  • tests/trainer/utils/utils_upgrade.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/trainer/trainer_trainingruntime_upgrade_test.go
  • tests/trainer/trainer_kueue_upgrade_training_test.go

sutaakar and others added 2 commits June 9, 2026 14:53
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exercises wider TrainJob API surface in the upgrade test so spec
integrity checks can detect API migrations like kubeflow/trainer#3309
(PodTemplateOverrides → RuntimePatches).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sutaakar sutaakar requested review from ChughShilpa and abhijeet-dhumal and removed request for chipspeak and kapil27 June 10, 2026 13:01

@ChughShilpa ChughShilpa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@sutaakar

Copy link
Copy Markdown
Contributor Author

/approve

@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ChughShilpa, sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 8683d89 into opendatahub-io:main Jun 10, 2026
8 checks passed
@sutaakar sutaakar deleted the RHOAIENG-63097-trainer-upgrade-spec-integrity branch June 11, 2026 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants