Add e2e tests for manageJobsWithoutQueueName #4112

Merged

Conversation

kaisoz
Contributor

@kaisoz kaisoz commented Jan 30, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR adds e2e tests for the manageJobsWithoutQueueName: true configuration. This is one of the tasks required to close #3767

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

The test suite sets manageJobsWithoutQueueName: true during the setup phase of the Describe block (just before running the test specs). It then initiates a rollout of the controller Deployment to ensure that the new configuration is picked up by the pods. To confirm that the rollout has started, the setup function waits for the Deployment to report the condition DeploymentCondition=false.

To make sure that state is reached, the rollout strategy is set to Recreate, which guarantees that the pods are terminated before being recreated, so the condition is temporarily not satisfied (see the DeploymentConditionTypes docs).
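
For illustration only, a minimal sketch of how such a wait could be implemented with controller-runtime and Gomega; the namespace, Deployment name, condition type and timeouts below are assumptions, not necessarily the exact code in this PR:

package e2e_test

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForRolloutStarted blocks until the controller Deployment reports the
// Available condition as False. With the Recreate strategy this means the old
// pods have been terminated and the rollout with the new configuration has
// started. Namespace, name and timeout values are illustrative.
func waitForRolloutStarted(ctx context.Context, c client.Client) {
	key := client.ObjectKey{Namespace: "kueue-system", Name: "kueue-controller-manager"}
	gomega.Eventually(func(g gomega.Gomega) {
		var deploy appsv1.Deployment
		g.Expect(c.Get(ctx, key, &deploy)).To(gomega.Succeed())
		status := corev1.ConditionUnknown
		for _, cond := range deploy.Status.Conditions {
			if cond.Type == appsv1.DeploymentAvailable {
				status = cond.Status
			}
		}
		g.Expect(status).To(gomega.Equal(corev1.ConditionFalse))
	}, 2*time.Minute, time.Second).Should(gomega.Succeed())
}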

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jan 30, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 30, 2025

netlify bot commented Jan 30, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: beef1f7
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67a9fe9a1d8f1000087113e3

@kaisoz
Contributor Author

kaisoz commented Jan 30, 2025

/assign @mimowo

PTAL! Thanks! 😊

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch 2 times, most recently from 49bd6c4 to 7b42d2f Compare January 31, 2025 06:32
Contributor

@mimowo mimowo left a comment


Thank you for the PR. The new testing suite will be great to make sure the configuration works; it is really tricky and has required a lot of manual testing so far.

My main hesitation is about the way we run the new suite, and as pointed out in the comment I would be leaning towards a dedicated CI job. However, let me collect more feedback. I'm ok with the (2.) or (3.) approaches (from the comment) as an interim.

@mbobrovskyi @mszadkow @dgrove-oss PTAL

@mimowo
Contributor

mimowo commented Jan 31, 2025

OTOH, we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

util.ExpectObjectToBeDeleted(ctx, k8sClient, defaultRf, true)
})

ginkgo.It("should suspend it", func() {
Contributor

Can we join it with "should unsuspend it"? It looks like the logic is the same.

Contributor Author

Do you mean merging both test cases (ginkgo.It) so that the "should unsuspend it" test logic becomes part of the "should suspend it" test case?
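
For illustration, if the two specs were merged into one, a rough sketch could look as follows; this is not the PR's actual code: jobKey, ctx and k8sClient stand for the suite-level variables, and expectJobSuspendedState is a hypothetical helper.

// Hypothetical helper asserting the Job's suspended state; assumes imports of
// batchv1 "k8s.io/api/batch/v1", "k8s.io/utils/ptr", gomega and
// controller-runtime's client package.
func expectJobSuspendedState(ctx context.Context, c client.Client, key client.ObjectKey, want bool) {
	gomega.Eventually(func(g gomega.Gomega) {
		var job batchv1.Job
		g.Expect(c.Get(ctx, key, &job)).To(gomega.Succeed())
		g.Expect(ptr.Deref(job.Spec.Suspend, false)).To(gomega.Equal(want))
	}, time.Minute, time.Second).Should(gomega.Succeed())
}

// The two specs merged into a single one inside the suite's Describe block;
// the step that makes the job admissible again is elided here.
ginkgo.It("should suspend the job and then unsuspend it", func() {
	ginkgo.By("checking the job is suspended right after creation", func() {
		expectJobSuspendedState(ctx, k8sClient, jobKey, true)
	})
	ginkgo.By("checking the job is unsuspended once it can be admitted", func() {
		expectJobSuspendedState(ctx, k8sClient, jobKey, false)
	})
})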

@mbobrovskyi
Contributor

mbobrovskyi commented Jan 31, 2025

OTOH, we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

If we need to create custom configurations with the ability to run different setups, I suggest moving the creation and deletion logic of Kueue to the Go part instead of running it in the bash script. We can start Kueue in BeforeSuite and delete it in AfterSuite, allowing each test suite to run with the required configuration.
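
A rough sketch of that idea is below; the kustomize path and the kubectl invocation are illustrative assumptions, not the suite's current mechanism.

// Sketch: install Kueue with the suite-specific configuration once per suite
// and remove it afterwards. Assumes "os/exec", ginkgo and gomega imports;
// the manifest path is an illustrative placeholder.
var _ = ginkgo.BeforeSuite(func() {
	out, err := exec.Command("kubectl", "apply", "--server-side", "-k", "test/e2e/config/customconfigs").CombinedOutput()
	gomega.Expect(err).NotTo(gomega.HaveOccurred(), string(out))
	// Then wait for the controller Deployment to become Available before running the specs.
})

var _ = ginkgo.AfterSuite(func() {
	out, err := exec.Command("kubectl", "delete", "--ignore-not-found", "-k", "test/e2e/config/customconfigs").CombinedOutput()
	gomega.Expect(err).NotTo(gomega.HaveOccurred(), string(out))
})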

@mimowo
Contributor

mimowo commented Jan 31, 2025

Can you elaborate a bit on why? It might be a good idea, but I'm not sure how this relates to the PR. Also, would it affect the singlecluster and multicluster suites?

Also, is this an alternative to the 1-3 ideas or within one of them?

EDIT: IIUC moving the kubectl apply to BeforeSuite introduces overhead, so I would prefer to avoid it for the main singlecluster suite, which is often used against an already running cluster with Kueue.

@PBundyra
Contributor

OTOH, we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

+1

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch 2 times, most recently from 413bf14 to d6c294b Compare February 2, 2025 15:08
@kaisoz
Contributor Author

kaisoz commented Feb 2, 2025

OTOH, we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

Yes, this is similar to what I had in mind (although better put and explained 😆): use this suite and technique to test custom configs.

For now I would suggest in this PR to put the e2e tests under e2e/customconfig/managedwithoutqueuename. WDYT?

SGTM! What do you think about adding the word job to the test file name so that the reader can also see which kind is being tested? Or maybe naming the test file after the config under test, e.g. e2e/customconfig/managejobswithoutqueuename_test.go?

@kaisoz
Contributor Author

kaisoz commented Feb 2, 2025

@mimowo @mbobrovskyi @PBundyra

Thanks so much for your reviews! I answered most of your comments. PTAL!

@mimowo
Contributor

mimowo commented Feb 3, 2025

SGTM! What do you think about adding the word job to the test file name so that the reader can also see which kind is being tested? Or maybe naming the test file after the config under test, e.g. e2e/customconfig/managejobswithoutqueuename_test.go?

I'm good with including the word job, but for a slightly different reason: to correlate the file name with the config field name:

ManageJobsWithoutQueueName bool `json:"manageJobsWithoutQueueName"`

An alternative name that also works for me: skipjobswithoutqueuename. It is slightly shorter and actually corresponds better to manageJobsWithoutQueueName: false, which is being tested. WDYT?

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 47cbea6 to 8647a49 Compare February 7, 2025 09:47
@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

/retest

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 8647a49 to 6c84b56 Compare February 7, 2025 10:19
@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

I looked at the code after the recent updates and the structure looks great!

Still, we need to figure out how to fix the failures. Seeing that only two configs failed, I suspect it is a flake. I'm wondering if increasing the timeout would help. In e2e tests we often use LongTimeout rather than Timeout.

Yes, I got the same error happening intermittently on my machine. After some research I noticed that the cluster queue does eventually get deleted, but sometimes only after the test timeout. I wanted to make sure that this also happens in CI (and it does 😅).

So I've taken your suggestion and used LongTimeout to wait for the cluster queue to be deleted, and it seems to work! 😊
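
For illustration, a minimal sketch of that longer wait, assuming the suite's util.LongTimeout constant and a clusterQueue object from the test setup; the one-second polling interval and the apierrors/kueue imports are assumptions.

// Sketch: wait for the ClusterQueue to actually disappear, using the longer
// e2e timeout. Assumes apierrors "k8s.io/apimachinery/pkg/api/errors" and
// kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1" are imported.
gomega.Eventually(func(g gomega.Gomega) {
	var cq kueue.ClusterQueue
	err := k8sClient.Get(ctx, client.ObjectKeyFromObject(clusterQueue), &cq)
	g.Expect(apierrors.IsNotFound(err)).To(gomega.BeTrue(), "ClusterQueue should be deleted")
}, util.LongTimeout, time.Second).Should(gomega.Succeed())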

@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

/test all

2 similar comments
@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

/test all

@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

/test all

@kaisoz
Contributor Author

kaisoz commented Feb 7, 2025

I've rerun all the tests three times in a row to confirm that the flaky tests are gone 👍🏻

Contributor

@mimowo mimowo left a comment


cc @mbobrovskyi @mszadkow would you like to give it another pass?

ginkgo.AfterEach(func() {
gomega.Expect(util.DeleteAllJobsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
// Force remove workloads to be sure that cluster queue can be removed.
gomega.Expect(util.DeleteWorkloadsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
Contributor

I think we also need to remove Pods, similar to what we do in util.DeleteNamespace, to avoid waiting for LongTimeout on ClusterQueue and ResourceFlavor. I'm wondering if we can merge BeforeEach and AfterEach into the Describe block.
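
For illustration, a sketch of what the suggested cleanup could look like on top of the existing AfterEach; the pod-deletion helper name mirrors the other util.Delete* functions but is an assumption here.

ginkgo.AfterEach(func() {
	gomega.Expect(util.DeleteAllJobsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
	// Also remove leftover Pods so that ClusterQueue/ResourceFlavor deletion
	// does not have to wait out pod finalizers (helper name is an assumption).
	gomega.Expect(util.DeleteAllPodsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
	// Force remove workloads to be sure that the cluster queue can be removed.
	gomega.Expect(util.DeleteWorkloadsInNamespace(ctx, k8sClient, ns)).Should(gomega.Succeed())
})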

Contributor Author

I just tried that (merging the BeforeEach and AfterEach into the Describe block and reducing the waiting), but unfortunately the timeout error appears again :(

@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 6c84b56 to 64df89c Compare February 10, 2025 10:15
@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 64df89c to dfd1536 Compare February 10, 2025 11:57
@kaisoz kaisoz force-pushed the managejobswithoutqueuename-e2e branch from dfd1536 to beef1f7 Compare February 10, 2025 13:26
Contributor

@mimowo mimowo left a comment


/lgtm
/approve
This looks great IMO. Please follow up by extracting the tests into a separate CI job.
There should be 2 PRs:

  • first, in test-infra, to introduce the new CI job that runs test-e2e-customconfigs
  • second, in Kueue, to stop running test-e2e-customconfigs as part of run-test-e2e

Then, I would also like to see test cases for Deployment, StatefulSet and LWS showing how Kueue handles the config.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 59e446220ad6b6516bce84584ac5021ac7d78353

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaisoz, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit bef27a1 into kubernetes-sigs:main Feb 10, 2025
18 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.11 milestone Feb 10, 2025
Labels
• approved: Indicates a PR has been approved by an approver from all required OWNERS files.
• cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
• kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
• lgtm: "Looks good to me", indicates that a PR is ready to be merged.
• release-note-none: Denotes a PR that doesn't merit a release note.
• size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add higher-level of testing for "queue-name" handling in pod-based workloads
5 participants