RHOAIENG-63097: Add spec integrity checks to Trainer upgrade tests#920
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Central YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR instruments Trainer upgrade tests to snapshot resource generation and serialized specs into ConfigMaps before upgrades, persist the cluster RHOAI version from DSCI, and compare generation/spec after upgrade. It adds version-normalization utilities to declare allowed spec-mutation upgrade paths, helper APIs to store/read baselines, and updates default, specific, custom runtime, and standalone TrainingRuntime tests to use these baselines for post-upgrade verification. README docs describing the two-phase upgrade test workflow were also added. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
tests/trainer/trainer_kueue_upgrade_training_test.go (1)
169-178: 💤 Low valueConsider adding ConfigMap cleanup for consistency.
Lines 313 and 468 explicitly delete the baseline ConfigMap in deferred cleanup, but this test omits it. While namespace deletion will cascade, explicit cleanup improves consistency across the test suite.
♻️ Add ConfigMap cleanup
In the deferred cleanup block around line 167, add:
defer test.Client().Kueue().KueueV1beta2().ResourceFlavors().Delete(test.Ctx(), resourceFlavorName, metav1.DeleteOptions{}) defer test.Client().Kueue().KueueV1beta2().ClusterQueues().Delete(test.Ctx(), clusterQueueName, metav1.DeleteOptions{}) +defer test.Client().Core().CoreV1().ConfigMaps(upgradeNamespaceName).Delete(test.Ctx(), upgradeConfigMapName, metav1.DeleteOptions{}) defer DeleteTestNamespace(test, namespace)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/trainer/trainer_kueue_upgrade_training_test.go` around lines 169 - 178, Add an explicit cleanup to delete the baseline ConfigMap in the deferred cleanup block for this test: call test.Client().Core().CoreV1().ConfigMaps(upgradeNamespaceName).Delete(test.Ctx(), upgradeConfigMapName, metav1.DeleteOptions{}) and assert no error (e.g., test.Expect(err).NotTo(HaveOccurred() or allow NotFound) ) so the ConfigMap created for verifySpecIntegrity (configMap) is removed consistently like other tests; place this alongside other deferred deletions in the same setup block.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/common/support/dscInitialization.go`:
- Around line 56-58: The function calls unstructured.NestedString(dsci.Object,
"status", "release", "version") and currently returns "" on both parse error and
missing field without logging; update the block to log a clear diagnostic before
returning by using the DSCI object's identity (e.g., dsci.GetName() or
dsci.Object.GetName()) and include whether err is non-nil or found==false so
callers can distinguish malformed DSCI vs missing key; ensure you keep the
existing return "" behavior but add a processLogger (or t.Logf/appropriate
logger) message that includes the DSCI name, the error string (if err != nil),
and the found flag.
In `@tests/trainer/trainer_kueue_upgrade_training_test.go`:
- Around line 604-610: The ConfigMap named by configMapName is created without
OwnerReferences (see variable configMap and corev1.ConfigMap ObjectMeta), so add
an OwnerReference in ObjectMeta.OwnerReferences tying it to the intended owner
(for example the test's owning resource or the test namespace) using the owner's
APIVersion, Kind, Name and UID and set Controller and BlockOwnerDeletion as
appropriate, or if omission is intentional add a clear comment above the
configMap declaration explaining why OwnerReferences are omitted; update
references where configMap is created to include this OwnerReference or the
explanatory comment.
- Line 524: The code currently ignores the error from json.Marshal(spec) on the
json.Marshal call that assigns currentSpecJSON; change this to capture and
handle the error (e.g., currentSpecJSON, err := json.Marshal(spec); if err !=
nil { t.Fatalf("failed to marshal spec: %v", err) } or use require.NoError(t,
err) if testify is used) so that failures to marshal are surfaced instead of
producing a nil/empty currentSpecJSON; update uses of currentSpecJSON (the
subsequent log) to assume a valid byte slice after the check.
- Around line 522-529: GetRHOAIVersionFromDSCI can return an empty string
causing IsSpecMutationExpected to false-positive; before calling
trainerutils.IsSpecMutationExpected, check if preVersion == "" || postVersion ==
"" (variables set from configMap.Data[rhoaiVersionKey] and
GetRHOAIVersionFromDSCI(test)) and handle explicitly — e.g., call
test.T().Fatalf("RHOAI version unavailable: pre=%q post=%q; aborting upgrade
spec check") to fail fast (or optionally Skip/Mark as inconclusive) so the
assertion at trainerutils.IsSpecMutationExpected is not hit with unknown
versions.
- Around line 526-527: The two test log statements that print full spec JSON
(test.T().Logf("Pre-upgrade %s spec: %s", resourceName, configMap.Data[specKey])
and test.T().Logf("Post-upgrade %s spec: %s", resourceName, currentSpecJSON))
expose sensitive fields; replace them by either removing direct JSON logging or
passing the spec through a sanitizer that strips/redacts keys like Password,
Token, APIKey, Secret, Secret.Data (e.g., implement a sanitizeSpecJSON function
used for both pre- and post-upgrade logs), or log only a non-sensitive
summary/diff (resourceName and allowed metadata) instead of the raw
configMap.Data[specKey] and currentSpecJSON.
In `@tests/trainer/utils/utils_upgrade.go`:
- Around line 23-26: The test helper specMutationExpectedPaths currently
documents the known mutating upgrade pair but left it commented out, causing
IsSpecMutationExpected to return false and trigger false integrity failures;
update the specMutationExpectedPaths declaration (and the other similar list
around the second occurrence) to include the {"3.4","3.5"} pair (i.e., remove
the comment and add the tuple) so the known mutation path is recognized by
IsSpecMutationExpected (refer to the specMutationExpectedPaths variable and the
IsSpecMutationExpected usage to locate the change).
---
Nitpick comments:
In `@tests/trainer/trainer_kueue_upgrade_training_test.go`:
- Around line 169-178: Add an explicit cleanup to delete the baseline ConfigMap
in the deferred cleanup block for this test: call
test.Client().Core().CoreV1().ConfigMaps(upgradeNamespaceName).Delete(test.Ctx(),
upgradeConfigMapName, metav1.DeleteOptions{}) and assert no error (e.g.,
test.Expect(err).NotTo(HaveOccurred() or allow NotFound) ) so the ConfigMap
created for verifySpecIntegrity (configMap) is removed consistently like other
tests; place this alongside other deferred deletions in the same setup block.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 66ab7d83-8dbd-4b51-96d9-319da77b83f3
📒 Files selected for processing (4)
tests/common/support/dscInitialization.gotests/trainer/trainer_kueue_upgrade_training_test.gotests/trainer/trainer_trainingruntime_upgrade_test.gotests/trainer/utils/utils_upgrade.go
Store pre-upgrade resource generation, spec JSON, and RHOAI version in ConfigMaps. Post-upgrade, compare generation and conditionally fail based on an explicit allowlist of upgrade paths where spec mutations are expected (e.g., API changes across minor versions). Before/after specs are logged on mismatch for analyzer comparison. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
e59c4c0 to
243ecec
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/trainer/utils/utils_upgrade.go (1)
23-26:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKnown mutation path is documented but disabled, causing false integrity failures.
Line 24 references a known mutating upgrade (
3.4 → 3.5per kubeflow/trainer#3309), but the pair is commented out. If generation changes on that path,IsSpecMutationExpectedreturns false and fails the test despite expected behavior.Proposed fix
var specMutationExpectedPaths = [][2]string{ // kubeflow/trainer#3309: PodTemplateOverrides → RuntimePatches - // {"3.4", "3.5"}, + {"3.4", "3.5"}, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/trainer/utils/utils_upgrade.go` around lines 23 - 26, specMutationExpectedPaths array documents known mutating upgrade paths but the 3.4→3.5 pair is commented out, causing IsSpecMutationExpected to misreport expected mutations; re-enable the documented pair by adding {"3.4", "3.5"} into specMutationExpectedPaths (remove the comment markers) so the function IsSpecMutationExpected will return true for that upgrade path and prevent false integrity failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/trainer/utils/utils_upgrade.go`:
- Around line 23-26: specMutationExpectedPaths array documents known mutating
upgrade paths but the 3.4→3.5 pair is commented out, causing
IsSpecMutationExpected to misreport expected mutations; re-enable the documented
pair by adding {"3.4", "3.5"} into specMutationExpectedPaths (remove the comment
markers) so the function IsSpecMutationExpected will return true for that
upgrade path and prevent false integrity failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d9bd1889-9e6d-4ea6-891f-4bbaaa227d30
📒 Files selected for processing (4)
tests/common/support/dscInitialization.gotests/trainer/trainer_kueue_upgrade_training_test.gotests/trainer/trainer_trainingruntime_upgrade_test.gotests/trainer/utils/utils_upgrade.go
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/trainer/trainer_trainingruntime_upgrade_test.go
- tests/trainer/trainer_kueue_upgrade_training_test.go
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exercises wider TrainJob API surface in the upgrade test so spec integrity checks can detect API migrations like kubeflow/trainer#3309 (PodTemplateOverrides → RuntimePatches). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ChughShilpa, sutaakar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary
utils_upgrade.go)status.release.version) to determine the upgrade pathverifySpecIntegrity,storeUpgradeBaseline,addResourceBaseline) reduce duplication across all test pairsTest plan
TestSetupTrainingRuntime/TestVerifyTrainingRuntime— passed on RHOAI 3.5TestSetupCustomRuntimeUpgradeTrainJob/TestRunCustomRuntimeUpgradeTrainJob— passed on RHOAI 3.5go vet ./tests/trainer/...passesgofmtclean🤖 Generated with Claude Code
Summary by CodeRabbit