Add e2e tests for manageJobsWithoutQueueName #4112

Merged

Conversation

kaisoz
Contributor

@kaisoz kaisoz commented Jan 30, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR adds e2e tests for the manageJobsWithoutQueueName: true configuration. This is one of the tasks required to close #3767

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

The test suite sets manageJobsWithoutQueueName: true during the setup phase of the Describe block (just before running the test specs). It then initiates a rollout of the controller Deployment to ensure that the new configuration is picked up by the pods. To confirm that the rollout has started, the setup function waits for the Deployment to report the condition DeploymentCondition=false.

To make sure that state is reached, the rollout strategy is set to Recreate, which guarantees that the pods are terminated before being recreated, so the condition is temporarily not satisfied (see the DeploymentConditionTypes docs).
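
For illustration only, a minimal sketch of how such a wait could be implemented with controller-runtime and Gomega; the namespace, Deployment name, condition type and timeouts below are assumptions, not necessarily the exact code in this PR:

package e2e_test

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForRolloutStarted blocks until the controller Deployment reports the
// Available condition as False. With the Recreate strategy this means the old
// pods have been terminated and the rollout with the new configuration has
// started. Namespace, name and timeout values are illustrative.
func waitForRolloutStarted(ctx context.Context, c client.Client) {
	key := client.ObjectKey{Namespace: "kueue-system", Name: "kueue-controller-manager"}
	gomega.Eventually(func(g gomega.Gomega) {
		var deploy appsv1.Deployment
		g.Expect(c.Get(ctx, key, &deploy)).To(gomega.Succeed())
		status := corev1.ConditionUnknown
		for _, cond := range deploy.Status.Conditions {
			if cond.Type == appsv1.DeploymentAvailable {
				status = cond.Status
			}
		}
		g.Expect(status).To(gomega.Equal(corev1.ConditionFalse))
	}, 2*time.Minute, time.Second).Should(gomega.Succeed())
}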

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jan 30, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 30, 2025

netlify bot commented Jan 30, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: beef1f7
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67a9fe9a1d8f1000087113e3

@kaisoz
Contributor Author

kaisoz commented Jan 30, 2025

/assign @mimowo

PTAL! Thanks! 😊

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch 2 times, most recently from 49bd6c4 to 7b42d2f Compare January 31, 2025 06:32
Contributor

@mimowo mimowo left a comment


Thank you for the PR. The new testing suite will be great to make sure the configuration works; it is really tricky and has required a lot of manual testing so far.

My main hesitation is about the way we run the new suite, and as pointed out in the comment I would be leaning towards a dedicated CI job. However, let me collect more feedback. I'm ok with the (2.) or (3.) approaches (from the comment) as an interim.

@mbobrovskyi @mszadkow @dgrove-oss PTAL

@mimowo
Contributor

mimowo commented Jan 31, 2025

OTOH, we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

util.ExpectObjectToBeDeleted(ctx, k8sClient, defaultRf, true)
})

ginkgo.It("should suspend it", func() {
Contributor

Can we join it with "should unsuspend it"? It looks like the logic is the same.

Contributor Author

Do you mean merging both test cases (ginkgo.It) so that the "should unsuspend it" test logic becomes part of the "should suspend it" test case?
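
For illustration, if the two specs were merged into one, a rough sketch could look as follows; this is not the PR's actual code: jobKey, ctx and k8sClient stand for the suite-level variables, and expectJobSuspendedState is a hypothetical helper.

// Hypothetical helper asserting the Job's suspended state; assumes imports of
// batchv1 "k8s.io/api/batch/v1", "k8s.io/utils/ptr", gomega and
// controller-runtime's client package.
func expectJobSuspendedState(ctx context.Context, c client.Client, key client.ObjectKey, want bool) {
	gomega.Eventually(func(g gomega.Gomega) {
		var job batchv1.Job
		g.Expect(c.Get(ctx, key, &job)).To(gomega.Succeed())
		g.Expect(ptr.Deref(job.Spec.Suspend, false)).To(gomega.Equal(want))
	}, time.Minute, time.Second).Should(gomega.Succeed())
}

// The two specs merged into a single one inside the suite's Describe block;
// the step that makes the job admissible again is elided here.
ginkgo.It("should suspend the job and then unsuspend it", func() {
	ginkgo.By("checking the job is suspended right after creation", func() {
		expectJobSuspendedState(ctx, k8sClient, jobKey, true)
	})
	ginkgo.By("checking the job is unsuspended once it can be admitted", func() {
		expectJobSuspendedState(ctx, k8sClient, jobKey, false)
	})
})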

@mbobrovskyi
Contributor

mbobrovskyi commented Jan 31, 2025

OTOH, we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

If we need to create custom configurations with the ability to run different setups, I suggest moving the creation and deletion logic of Kueue to the Go part instead of running it in the bash script. We can start Kueue in BeforeSuite and delete it in AfterSuite, allowing each test suite to run with the required configuration.
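
A rough sketch of that idea is below; the kustomize path and the kubectl invocation are illustrative assumptions, not the suite's current mechanism.

// Sketch: install Kueue with the suite-specific configuration once per suite
// and remove it afterwards. Assumes "os/exec", ginkgo and gomega imports;
// the manifest path is an illustrative placeholder.
var _ = ginkgo.BeforeSuite(func() {
	out, err := exec.Command("kubectl", "apply", "--server-side", "-k", "test/e2e/config/customconfigs").CombinedOutput()
	gomega.Expect(err).NotTo(gomega.HaveOccurred(), string(out))
	// Then wait for the controller Deployment to become Available before running the specs.
})

var _ = ginkgo.AfterSuite(func() {
	out, err := exec.Command("kubectl", "delete", "--ignore-not-found", "-k", "test/e2e/config/customconfigs").CombinedOutput()
	gomega.Expect(err).NotTo(gomega.HaveOccurred(), string(out))
})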

@mimowo
Contributor

mimowo commented Jan 31, 2025

Can you elaborate a bit on why? It might be a good idea, but I'm not sure how this relates to the PR. Also, would it affect the singlecluster and multicluster suites?

Also, is this an alternative to the 1-3 ideas or within one of them?

EDIT: IIUC moving the kubectl apply to BeforeSuite introduces overhead, so I would prefer to avoid it for the main singlecluster suite, which is often used against an already running cluster with Kueue.

@PBundyra
Contributor

OTOH, we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

+1

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch 2 times, most recently from 413bf14 to d6c294b Compare February 2, 2025 15:08
@kaisoz
Contributor Author

kaisoz commented Feb 2, 2025

OTOH, we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

Yes, this is similar to what I had in mind (although better put and explained 😆): use this suite and technique to test custom configs.

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

SGTM! What do you think about adding the word job to the test file name so that the reader can also see which kind is being tested? Or maybe naming the test file after the config under test, e.g. e2e/customconfig/managejobswithoutqueuename_test.go?

@kaisoz
Contributor Author

kaisoz commented Feb 2, 2025

@mimowo @mbobrovskyi @PBundyra

Thanks so much for your reviews! I answered most of your comments. PTAL!

@mimowo
Contributor

mimowo commented Feb 3, 2025

SGTM! What do you think about adding the word job to the test file name so that the reader can also see which kind is being tested? Or maybe naming the test file after the config under test, e.g. e2e/customconfig/managejobswithoutqueuename_test.go?

I'm good with including the word job, but for a slightly different reason: to correlate the file name with the config field name:

ManageJobsWithoutQueueName bool `json:"manageJobsWithoutQueueName"`

An alternative name that also works for me: skipjobswithoutqueuename. It is slightly shorter and actually corresponds better to manageJobsWithoutQueueName: false, which is being tested. WDYT?

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 47cbea6 to 8647a49 Compare February 7, 2025 09:47
@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

/retest

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 8647a49 to 6c84b56 Compare February 7, 2025 10:19
@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

I looked at the code after the recent updates and the structure looks great!

Still, we need to figure out how to fix the failures. Seeing that only two configs failed, I suspect it is a flake. I'm wondering if increasing the timeout would help. In e2e tests we often use LongTimeout rather than Timeout.

Yes, I got the same error happening intermittently on my machine. After some research I noticed that the cluster queue does eventually get deleted, but sometimes only after the test timeout. I wanted to make sure that this also happens in CI (and it does 😅).

So I've taken your suggestion and used LongTimeout to wait for the cluster queue to be deleted, and it seems to work! 😊
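
For illustration, a minimal sketch of that longer wait, assuming the suite's util.LongTimeout constant and a clusterQueue object from the test setup; the one-second polling interval and the apierrors/kueue imports are assumptions.

// Sketch: wait for the ClusterQueue to actually disappear, using the longer
// e2e timeout. Assumes apierrors "k8s.io/apimachinery/pkg/api/errors" and
// kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1" are imported.
gomega.Eventually(func(g gomega.Gomega) {
	var cq kueue.ClusterQueue
	err := k8sClient.Get(ctx, client.ObjectKeyFromObject(clusterQueue), &cq)
	g.Expect(apierrors.IsNotFound(err)).To(gomega.BeTrue(), "ClusterQueue should be deleted")
}, util.LongTimeout, time.Second).Should(gomega.Succeed())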

@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

/test all

2 similar comments
@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

/test all

@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

/test all

@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

I've rerun all the tests three times in a row to confirm that the flaky tests are gone 👍🏻

Contributor

@mimowo mimowo left a comment


cc @mbobrovskyi @mszadkow would you like to give it another pass?

ginkgo.AfterEach(func() {
gomega.Expect(util.DeleteAllJobsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
// Force remove workloads to be sure that cluster queue can be removed.
gomega.Expect(util.DeleteWorkloadsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
Contributor

I think we also need to remove Pods, similar to what we do in util.DeleteNamespace, to avoid waiting for LongTimeout on ClusterQueue and ResourceFlavor. I'm wondering if we can merge BeforeEach and AfterEach into the Describe block.
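
For illustration, a sketch of what the suggested cleanup could look like on top of the existing AfterEach; the pod-deletion helper name mirrors the other util.Delete* functions but is an assumption here.

ginkgo.AfterEach(func() {
	gomega.Expect(util.DeleteAllJobsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
	// Also remove leftover Pods so that ClusterQueue/ResourceFlavor deletion
	// does not have to wait out pod finalizers (helper name is an assumption).
	gomega.Expect(util.DeleteAllPodsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
	// Force remove workloads to be sure that the cluster queue can be removed.
	gomega.Expect(util.DeleteWorkloadsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
})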

Contributor Author

I just tried that (merging the BeforeEach and AfterEach into the Describe block and reducing the waiting), but unfortunately the timeout error appears again :(

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 6c84b56 to 64df89c Compare February 10, 2025 10:15
@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 64df89c to dfd1536 Compare February 10, 2025 11:57
@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from dfd1536 to beef1f7 Compare February 10, 2025 13:26
Contributor

@mimowo mimowo left a comment


/lgtm
/approve
This looks great IMO. Please follow up by extracting the tests into a separate CI job.
There should be 2 PRs:

  • first, in test-infra, to introduce the new CI job that runs test-e2e-customconfigs
  • second, in Kueue, to stop running test-e2e-customconfigs as part of run-test-e2e

Then, I would also like to see test cases for Deployment, StatefulSet and LWS showing how Kueue handles the config.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 59e446220ad6b6516bce84584ac5021ac7d78353

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaisoz, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit bef27a1 into kubernetes-sigs:main Feb 10, 2025
18 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.11 milestone Feb 10, 2025
Labels
• approved: Indicates a PR has been approved by an approver from all required OWNERS files.
• cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
• kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
• lgtm: "Looks good to me", indicates that a PR is ready to be merged.
• release-note-none: Denotes a PR that doesn't merit a release note.
• size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add higher-level of testing for "queue-name" handling in pod-based workloads
5 participants