RHOAIENG-63118 - Add upgrade tests for Kueue resource management#939
RHOAIENG-63118 - Add upgrade tests for Kueue resource management#939sukumars321 wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Warning Review limit reached
Next review available in: 52 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Central YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughCWE-284 (Improper Access Control) note: this PR adds no RBAC/auth checks for ConfigMap read/write in the new Estimated code review effort: 3 (Moderate) | ~25 minutes Related issues: None referenced in provided diff. Related PRs: None referenced in provided diff. Suggested labels: test, kueue, gpu, e2e Suggested reviewers: No reviewer metadata provided in raw summary — CWE-940 (Improper Verification of Source of a Communication Channel) is not applicable here, but I cannot fabricate reviewer identities. Poem: 🚥 Pre-merge checks | ✅ 10✅ Passed checks (10 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
tests/common/support/kueue_upgrade.go (1)
54-56: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low valueDelete error silently swallowed.
StoreUpgradeBaselineignores the error fromDelete(Line 54). If deletion fails for a reason other than "not found" (e.g. RBAC denial), the subsequentCreatewill fail with an unhelpfulAlreadyExists/permission error, masking the root cause.♻️ Suggested fix
- _ = test.Client().Core().CoreV1().ConfigMaps(namespace).Delete(test.Ctx(), configMapName, metav1.DeleteOptions{}) + if err := test.Client().Core().CoreV1().ConfigMaps(namespace).Delete(test.Ctx(), configMapName, metav1.DeleteOptions{}); err != nil && !k8serrors.IsNotFound(err) { + test.T().Logf("Failed to delete existing baseline ConfigMap %s/%s: %v", namespace, configMapName, err) + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/common/support/kueue_upgrade.go` around lines 54 - 56, StoreUpgradeBaseline is ignoring the result of ConfigMaps(...).Delete, which can hide real failures before the create step. Update the delete-and-create flow to check the Delete error explicitly in StoreUpgradeBaseline, using the existing test.Client().Core().CoreV1().ConfigMaps(namespace) calls, and fail the test immediately on any unexpected delete error (while still allowing a not-found case if intended) before proceeding to Create.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/common/support/kueue_upgrade_test.go`:
- Around line 90-106: The current test only covers the happy path where both
generation and spec match, so it does not expose the missing spec validation in
VerifyUpgradeResourceSpecIntegrity. Update
TestVerifyUpgradeResourceSpecIntegrity to add a mismatched ResourceFlavorSpec
with the same generation and assert it fails, and add a separate case where the
generation value differs to verify the expected failure behavior; use
VerifyUpgradeResourceSpecIntegrity, ResourceFlavorSpec, and the ConfigMap keys
already in the test to keep the coverage targeted.
- Around line 19-28: The import block in kueue_upgrade_test.go is not in the
order expected by openshift-goimports, causing verify-imports to fail. Reformat
the imports in the package for the test file so they are sorted and grouped
consistently with goimports conventions, keeping the existing dependencies but
adjusting their order in the import list.
In `@tests/common/support/kueue_upgrade.go`:
- Around line 60-75: VerifyUpgradeResourceSpecIntegrity only checks generation
and never asserts that the spec content stayed the same. In
VerifyUpgradeResourceSpecIntegrity, compare the pre-upgrade spec from
configMap.Data[specKey] against the post-upgrade spec serialized from spec, and
keep the generation check only as supplemental context. Use the existing
resourceName, genKey, and specKey flow to locate the mismatch, and add a spec
equality assertion so the function actually validates resource spec integrity
during upgrades.
In `@tests/kfto/kfto_kueue_gpu_upgrade_test.go`:
- Around line 155-163: `VerifyUpgradeResourceSpecIntegrity` is only checking the
stored generation and not validating that the spec content matches what was
saved. Update the helper in `kueue_upgrade.go` so it compares the marshalled
`spec` (`json.Marshal(spec)`) against the corresponding
`configMap.Data[specKey]` in addition to the generation check, and keep the
existing call sites like `VerifyUpgradeResourceSpecIntegrity` for
`ResourceFlavor` and `ClusterQueue` unchanged.
---
Nitpick comments:
In `@tests/common/support/kueue_upgrade.go`:
- Around line 54-56: StoreUpgradeBaseline is ignoring the result of
ConfigMaps(...).Delete, which can hide real failures before the create step.
Update the delete-and-create flow to check the Delete error explicitly in
StoreUpgradeBaseline, using the existing
test.Client().Core().CoreV1().ConfigMaps(namespace) calls, and fail the test
immediately on any unexpected delete error (while still allowing a not-found
case if intended) before proceeding to Create.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e952e18d-938e-4b62-beef-8fbbf9717445
📒 Files selected for processing (3)
tests/common/support/kueue_upgrade.gotests/common/support/kueue_upgrade_test.gotests/kfto/kfto_kueue_gpu_upgrade_test.go
| import ( | ||
| "encoding/json" | ||
| "testing" | ||
|
|
||
| "github.com/onsi/gomega" | ||
|
|
||
| corev1 "k8s.io/api/core/v1" | ||
| "k8s.io/apimachinery/pkg/api/resource" | ||
| kueuev1beta2 "sigs.k8s.io/kueue/apis/kueue/v1beta2" | ||
| ) |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Imports not sorted — pipeline failing.
verify-imports reports this file fails openshift-goimports sorting.
make imports🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/common/support/kueue_upgrade_test.go` around lines 19 - 28, The import
block in kueue_upgrade_test.go is not in the order expected by
openshift-goimports, causing verify-imports to fail. Reformat the imports in the
package for the test file so they are sorted and grouped consistently with
goimports conventions, keeping the existing dependencies but adjusting their
order in the import list.
Source: Pipeline failures
| func TestVerifyUpgradeResourceSpecIntegrity(t *testing.T) { | ||
| test := NewTest(t) | ||
| configMap := &corev1.ConfigMap{ | ||
| Data: map[string]string{ | ||
| "gen-key": "5", | ||
| "spec-key": `{"nodeLabels":{"nvidia.com/gpu.present":"true"}}`, | ||
| UpgradeRHOAIVersionKey: "3.4.0", | ||
| }, | ||
| } | ||
| spec := kueuev1beta2.ResourceFlavorSpec{ | ||
| NodeLabels: map[string]string{ | ||
| "nvidia.com/gpu.present": "true", | ||
| }, | ||
| } | ||
|
|
||
| VerifyUpgradeResourceSpecIntegrity(test, "ResourceFlavor", 5, spec, configMap, "gen-key", "spec-key") | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Test doesn't cover the actual bug in VerifyUpgradeResourceSpecIntegrity.
This only exercises the case where generation and spec both match, so it passes even though the function under test never checks spec content (see comment on kueue_upgrade.go Lines 60-75). Add a case with a mismatched spec but matching generation to catch the gap, and a case for mismatched generation to verify failure behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/common/support/kueue_upgrade_test.go` around lines 90 - 106, The
current test only covers the happy path where both generation and spec match, so
it does not expose the missing spec validation in
VerifyUpgradeResourceSpecIntegrity. Update
TestVerifyUpgradeResourceSpecIntegrity to add a mismatched ResourceFlavorSpec
with the same generation and assert it fails, and add a separate case where the
generation value differs to verify the expected failure behavior; use
VerifyUpgradeResourceSpecIntegrity, ResourceFlavorSpec, and the ConfigMap keys
already in the test to keep the coverage targeted.
| func VerifyUpgradeResourceSpecIntegrity(test Test, resourceName string, generation int64, spec interface{}, | ||
| configMap *corev1.ConfigMap, genKey, specKey string) { | ||
| test.T().Helper() | ||
|
|
||
| expectedGen := configMap.Data[genKey] | ||
| actualGen := fmt.Sprintf("%d", generation) | ||
| if actualGen != expectedGen { | ||
| currentSpecJSON, _ := json.Marshal(spec) | ||
| test.T().Logf("%s generation changed during upgrade (%s to %s)", resourceName, expectedGen, actualGen) | ||
| test.T().Logf("Pre-upgrade %s spec: %s", resourceName, configMap.Data[specKey]) | ||
| test.T().Logf("Post-upgrade %s spec: %s", resourceName, currentSpecJSON) | ||
| } | ||
| test.Expect(actualGen).To(gomega.Equal(expectedGen), | ||
| "%s spec should be unchanged after upgrade (generation %s, expected %s)", resourceName, actualGen, expectedGen) | ||
| test.T().Logf("%s generation unchanged after upgrade: %s", resourceName, actualGen) | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
VerifyUpgradeResourceSpecIntegrity never actually verifies the spec, only the generation.
currentSpecJSON (Line 67) is computed solely for the mismatch log message; the only assertion is on actualGen == expectedGen (Line 72-73). If a resource's spec silently drifts while Kueue leaves the generation unchanged (or the generation happens to coincide), this function reports success despite the spec being wrong — defeating the stated purpose ("Verify... Resource Spec Integrity") and the PR objective of validating that "resource specifications ... are preserved correctly during upgrades."
🐛 Proposed fix to compare spec content
func VerifyUpgradeResourceSpecIntegrity(test Test, resourceName string, generation int64, spec interface{},
configMap *corev1.ConfigMap, genKey, specKey string) {
test.T().Helper()
expectedGen := configMap.Data[genKey]
actualGen := fmt.Sprintf("%d", generation)
+ currentSpecJSON, err := json.Marshal(spec)
+ test.Expect(err).NotTo(gomega.HaveOccurred())
if actualGen != expectedGen {
- currentSpecJSON, _ := json.Marshal(spec)
test.T().Logf("%s generation changed during upgrade (%s to %s)", resourceName, expectedGen, actualGen)
test.T().Logf("Pre-upgrade %s spec: %s", resourceName, configMap.Data[specKey])
test.T().Logf("Post-upgrade %s spec: %s", resourceName, currentSpecJSON)
}
test.Expect(actualGen).To(gomega.Equal(expectedGen),
"%s spec should be unchanged after upgrade (generation %s, expected %s)", resourceName, actualGen, expectedGen)
+ test.Expect(string(currentSpecJSON)).To(gomega.MatchJSON(configMap.Data[specKey]),
+ "%s spec content should be unchanged after upgrade", resourceName)
test.T().Logf("%s generation unchanged after upgrade: %s", resourceName, actualGen)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func VerifyUpgradeResourceSpecIntegrity(test Test, resourceName string, generation int64, spec interface{}, | |
| configMap *corev1.ConfigMap, genKey, specKey string) { | |
| test.T().Helper() | |
| expectedGen := configMap.Data[genKey] | |
| actualGen := fmt.Sprintf("%d", generation) | |
| if actualGen != expectedGen { | |
| currentSpecJSON, _ := json.Marshal(spec) | |
| test.T().Logf("%s generation changed during upgrade (%s to %s)", resourceName, expectedGen, actualGen) | |
| test.T().Logf("Pre-upgrade %s spec: %s", resourceName, configMap.Data[specKey]) | |
| test.T().Logf("Post-upgrade %s spec: %s", resourceName, currentSpecJSON) | |
| } | |
| test.Expect(actualGen).To(gomega.Equal(expectedGen), | |
| "%s spec should be unchanged after upgrade (generation %s, expected %s)", resourceName, actualGen, expectedGen) | |
| test.T().Logf("%s generation unchanged after upgrade: %s", resourceName, actualGen) | |
| } | |
| func VerifyUpgradeResourceSpecIntegrity(test Test, resourceName string, generation int64, spec interface{}, | |
| configMap *corev1.ConfigMap, genKey, specKey string) { | |
| test.T().Helper() | |
| expectedGen := configMap.Data[genKey] | |
| actualGen := fmt.Sprintf("%d", generation) | |
| currentSpecJSON, err := json.Marshal(spec) | |
| test.Expect(err).NotTo(gomega.HaveOccurred()) | |
| if actualGen != expectedGen { | |
| test.T().Logf("%s generation changed during upgrade (%s to %s)", resourceName, expectedGen, actualGen) | |
| test.T().Logf("Pre-upgrade %s spec: %s", resourceName, configMap.Data[specKey]) | |
| test.T().Logf("Post-upgrade %s spec: %s", resourceName, currentSpecJSON) | |
| } | |
| test.Expect(actualGen).To(gomega.Equal(expectedGen), | |
| "%s spec should be unchanged after upgrade (generation %s, expected %s)", resourceName, actualGen, expectedGen) | |
| test.Expect(string(currentSpecJSON)).To(gomega.MatchJSON(configMap.Data[specKey]), | |
| "%s spec content should be unchanged after upgrade", resourceName) | |
| test.T().Logf("%s generation unchanged after upgrade: %s", resourceName, actualGen) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/common/support/kueue_upgrade.go` around lines 60 - 75,
VerifyUpgradeResourceSpecIntegrity only checks generation and never asserts that
the spec content stayed the same. In VerifyUpgradeResourceSpecIntegrity, compare
the pre-upgrade spec from configMap.Data[specKey] against the post-upgrade spec
serialized from spec, and keep the generation check only as supplemental
context. Use the existing resourceName, genKey, and specKey flow to locate the
mismatch, and add a spec equality assertion so the function actually validates
resource spec integrity during upgrades.
|
@sukumars321 is there a specific reason to test using GPU? |
This commit introduces new tests for validating the upgrade process of Kueue resource flavors and cluster queues. The tests ensure that resource specifications and generation numbers are correctly maintained during upgrades. Key functions include `AddUpgradeResourceBaseline`, `StoreUpgradeBaseline`, and `VerifyUpgradeResourceSpecIntegrity`, which handle the storage and verification of resource specifications in ConfigMaps. Additionally, the tests set up necessary Kueue resources and validate their state before and after the upgrade process, ensuring the integrity of the resource management system. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Sukumar Subramani <suksubra@redhat.com>
ff2c1df to
3d45d40
Compare
@sutaakar This is part of the Kueue pre/post upgrade coverage https://redhat.atlassian.net/browse/RHOAIENG-63118, This addresses Task 8 of [RHOAIENG-63117] Kueue Upgrade Test Coverage Audit Report |
chambridge
left a comment
There was a problem hiding this comment.
Noted a couple items upon my review, but I'm not super familar with this code base. I also see several coderabbit.ai code review comments that need resolving if applicable.
| defer test.Client().Kueue().KueueV1beta2().ResourceFlavors().Delete(test.Ctx(), gpuUpgradeResourceFlavorName, metav1.DeleteOptions{}) | ||
| defer test.Client().Kueue().KueueV1beta2().ClusterQueues().Delete(test.Ctx(), gpuUpgradeClusterQueueName, metav1.DeleteOptions{}) | ||
| defer test.Client().Core().CoreV1().ConfigMaps(gpuUpgradeNamespaceName).Delete(test.Ctx(), gpuUpgradeBaselineConfigMap, metav1.DeleteOptions{}) | ||
| defer DeleteTestNamespace(test, namespace) |
There was a problem hiding this comment.
defer test.Client().Kueue().KueueV1beta2().ResourceFlavors().Delete(...) // runs 4th (LIFO)
defer test.Client().Kueue().KueueV1beta2().ClusterQueues().Delete(...) // runs 3rd
defer test.Client().Core().CoreV1().ConfigMaps(...).Delete(...) // runs 2nd
defer DeleteTestNamespace(test, namespace) // runs 1stGo defers execute LIFO. DeleteTestNamespace runs first, which deletes the namespace and all its contents (including the ConfigMap). The explicit ConfigMap delete on line 146 will then fail with NotFound. It's harmless but misleading.
The MNIST upgrade test (kfto_kueue_mnist_upgrade_training_test.go) does NOT explicitly clean ConfigMaps — it relies on namespace deletion. Pick one pattern and be consistent.
|
|
||
| const UpgradeRHOAIVersionKey = "rhoai-version" | ||
|
|
||
| func AddUpgradeResourceBaseline(data map[string]string, genKey, specKey string, generation int64, spec interface{}) error { |
There was a problem hiding this comment.
The function returns an error rather than asserting, so error attribution in test output will point to the helper rather than the caller. StoreUpgradeBaseline and VerifyUpgradeResourceSpecIntegrity correctly call test.T().Helper().
| StoreUpgradeBaseline(test, gpuUpgradeNamespaceName, gpuUpgradeBaselineConfigMap, data) | ||
| } | ||
|
|
||
| func TestRunGpuKueueUpgrade(t *testing.T) { |
There was a problem hiding this comment.
TestRunGpuKueueUpgrade verifies ResourceFlavor and ClusterQueue spec integrity post-upgrade but never checks that the LocalQueue survived. Add:
_, err = test.Client().Kueue().KueueV1beta2().LocalQueues(gpuUpgradeNamespaceName).Get(
test.Ctx(), gpuUpgradeLocalQueueName, metav1.GetOptions{})
test.Expect(err).NotTo(HaveOccurred(), "LocalQueue should exist after upgrade")
Description
This commit introduces new tests for validating the upgrade process of Kueue resource flavors and cluster queues. The tests ensure that resource specifications and generation numbers are correctly maintained during upgrades. Key functions include
AddUpgradeResourceBaseline,StoreUpgradeBaseline, andVerifyUpgradeResourceSpecIntegrity, which handle the storage and verification of resource specifications in ConfigMaps.Additionally, the tests set up necessary Kueue resources and validate their state before and after the upgrade process, ensuring the integrity of the resource management system.
This code change addresses Task 8 of [RHOAIENG-63117] Kueue Upgrade Test Coverage Audit Report
JIRA : https://redhat.atlassian.net/browse/RHOAIENG-63118
How Has This Been Tested?
Manually verified the changes against a test cluster
Merge criteria:
Summary by CodeRabbit
New Features
Bug Fixes