Tune KubeRay e2e task durations for faster CI by ikchifo · Pull Request #12213 · kubernetes-sigs/kueue

ikchifo · 2026-06-12T16:06:48Z

What type of PR is this?

/kind cleanup
/area testing

What this PR does / why we need it:

Two specs dominate the kuberay CI shards, per the junit from the prow
run linked in #12176:

Spec	Shard	CI duration
Should run a rayjob with InTreeAutoscaling	kuberay-b	206s
Should run a rayjob with multi scale-up steps	kuberay-a	179s

InTreeAutoscaling: replace the 3 tasks of 60s with a queue of 20
tasks of 10s (the same ~200 task-seconds of work).

The long tasks make the job end late: the deleted worker's task
restarts on the replacement pod, adding ~85s after replacement is
verified. Simply shortening the sleeps is unsafe: scale-up takes ~40s
in CI, so 30s tasks finish before the second worker starts, and the
autoscaler (idleTimeoutSeconds=10, minReplicas=1) scales the idle first
worker down while the "2 running workers" assertions are still being
checked. The queue avoids
both: pending tasks hold worker demand through verification, and the
job drains within ~10s once the queue empties.
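
The effect of task length on retried work can be illustrated with a toy model. This is only a sketch under stated assumptions, not the e2e test or the Ray scheduler: the function `job_end`, the two-worker setup, the 20s replacement delay, and the deletion times are all hypothetical. It assumes each worker runs one task at a time and that a task interrupted by pod deletion restarts from zero.

```python
from collections import deque


def job_end(durations, delete_at, replace_delay, workers=2):
    """Return the second at which the last task finishes (toy model).

    Worker 0 is never disrupted. Worker 1's pod is deleted at `delete_at`:
    its in-flight task is lost and re-queued from scratch, and the worker
    stays unavailable for `replace_delay` seconds.
    """
    queue = deque(durations)
    current = [None] * workers    # full duration of each worker's task
    remaining = [0] * workers     # seconds left on each worker's task
    t = 0
    while True:
        if t == delete_at and current[1] is not None:
            queue.appendleft(current[1])   # the retry restarts from zero
            current[1], remaining[1] = None, 0
        for w in range(workers):
            down = w == 1 and delete_at <= t < delete_at + replace_delay
            if current[w] is None and queue and not down:
                current[w] = remaining[w] = queue.popleft()
        if not queue and all(c is None for c in current):
            return t
        t += 1
        for w in range(workers):
            if current[w] is not None:
                remaining[w] -= 1
                if remaining[w] == 0:
                    current[w] = None


if __name__ == "__main__":
    # Hypothetical deletion times; replace_delay=20 is an assumption.
    for delete_at in (5, 50, 59):
        long_end = job_end([60] * 3, delete_at, replace_delay=20)
        queue_end = job_end([10] * 20, delete_at, replace_delay=20)
        print(f"delete at {delete_at:>2}s: "
              f"3x60s ends at {long_end}s, 20x10s ends at {queue_end}s")
```

In this model the work lost to a deletion is capped at one task duration, so the 60s payload can lose up to a minute while the 10s queue loses at most 10s. The absolute numbers it prints are illustrative only and are not CI measurements.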

Multi scale-up: cut the trailing keep-alive tasks from 32 to 20.
They only need to outlive scale-down verification (the 10s idle
timeout plus an annotation update); 20 tasks (~25s) do.
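
The margin behind that number can be sketched as a back-of-the-envelope calculation. The per-task cost of 1.25s is inferred from "20 tasks (~25s)" above; the 5s annotation-update time is a purely hypothetical placeholder, since the description does not state it, and the helper names are made up for illustration.

```python
IDLE_TIMEOUT_S = 10       # idleTimeoutSeconds from the spec
PER_TASK_S = 25 / 20      # inferred from "20 tasks (~25s)"


def keep_alive_s(tasks: int) -> float:
    """Approximate time the trailing sequential tasks keep the job alive."""
    return tasks * PER_TASK_S


def margin_s(tasks: int, annotation_update_s: float) -> float:
    """Slack left after idle detection plus the annotation update."""
    return keep_alive_s(tasks) - (IDLE_TIMEOUT_S + annotation_update_s)


if __name__ == "__main__":
    # annotation_update_s=5 is an assumed value, not a measurement.
    for tasks in (32, 20):
        print(tasks, keep_alive_s(tasks), margin_s(tasks, annotation_update_s=5))
```

Under these assumptions both task counts leave positive slack, and the smaller count only trims time the job spends alive after scale-down verification is already covered.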

Local A/B runs on a kind cluster:

Spec	main	this PR
multi scale-up	120-123s	107-110s
InTreeAutoscaling	118-121s	116-117s

Locally the pod deletion lands early, so main's tail barely shows.
In CI it lands late (the ~85s tail in the 205.7s baseline); the
queue's tail stays bounded no matter when the deletion lands, so the
spec gets faster and more deterministic in CI.

Which issue(s) this PR fixes:

Part of #12176, contributes to #11606

Special notes for your reviewer:

The verification steps and timeouts are unchanged; only the Ray task
payloads are tuned.

Does this PR introduce a user-facing change?

NONE

The InTreeAutoscaling rayjob spec ran 3 tasks of 60s each. The long sleeps were sized so that tasks survive scale-up and pod replacement verification, but the retried task of the deleted worker alone added a full extra minute after verification was already done. Replace them with a queue of 20 tasks of 10s each: pending tasks hold worker demand through verification no matter how long scale-up or replacement takes, and the job drains quickly once the queue is empty.

The multi scale-up spec ended with 32 sequential 1s tasks to keep the job alive while scale-down is verified. Idle detection takes idleTimeoutSeconds=10 plus the annotation update, so 20 tasks (~25s) are enough with margin.

netlify · 2026-06-12T16:06:53Z

✅ Deploy Preview for kubernetes-sigs-kueue ready!

Name	Link
🔨 Latest commit	`ceccaa7`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a2c2e9a5aad2400088f63d1
😎 Deploy Preview	https://deploy-preview-12213--kubernetes-sigs-kueue.netlify.app

k8s-ci-robot · 2026-06-12T16:06:58Z

Hi @ikchifo. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

coderabbitai · 2026-06-12T16:07:00Z

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (3)

needs-ok-to-test
do-not-merge/work-in-progress
cncf-cla: no

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

tenzen-y · 2026-06-12T16:09:44Z

/ok-to-test

mimowo · 2026-06-12T16:40:37Z

cc @sohankunkerkar ptal

mimowo · 2026-06-12T17:48:12Z

/test all
retry to see stability

sohankunkerkar

I ran these tests locally on kind in a loop, and all runs passed. I can confirm the numbers from the PR description, with stable pod replacement timing. Further squeezing is possible but risks flakes, so this is a good balance.
/lgtm
/hold in case @mimowo has anything to add here

k8s-ci-robot · 2026-06-12T18:46:25Z

LGTM label has been added.

Details

Git tree hash: 51c80746306739578b6aeb8150928adf533ab4ff

mimowo · 2026-06-12T19:59:31Z

Maybe this is over-optimizing, but could we shorten the tasks further if we set the "idleTimeout" to 1s?

In any case, I'm happy to merge this as is, because it is already a great improvement. We can follow up if you like the idea of decreasing the "idleTimeout".

/lgtm
/approve
/cherrypick release-0.18
/cherrypick release-0.17

Thank you, the results look great 👍

k8s-infra-cherrypick-robot · 2026-06-12T19:59:34Z

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.17, release-0.18 in new PRs and assign them to you.

k8s-ci-robot · 2026-06-12T19:59:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ikchifo, mimowo, sohankunkerkar

Needs approval from an approver in each of these files:

~~test/OWNERS~~ [mimowo,sohankunkerkar]

mimowo · 2026-06-12T20:00:08Z

/unhold

k8s-infra-cherrypick-robot · 2026-06-12T20:31:26Z

@mimowo: #12213 failed to apply on top of branch "release-0.17":

Applying: Tune KubeRay e2e task durations for faster CI
Using index info to reconstruct a base tree...
M	test/e2e/singlecluster/extended/kuberay_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/singlecluster/extended/kuberay_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/extended/kuberay_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 Tune KubeRay e2e task durations for faster CI

k8s-infra-cherrypick-robot · 2026-06-12T20:32:03Z

@mimowo: new pull request created: #12224

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. area/testing Testing - related stuff labels Jun 12, 2026

k8s-ci-robot requested review from PBundyra and pajakd June 12, 2026 16:06

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2026

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 12, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2026

sohankunkerkar approved these changes Jun 12, 2026

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2026

k8s-ci-robot assigned sohankunkerkar Jun 12, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 12, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 12, 2026

k8s-ci-robot assigned mimowo Jun 12, 2026

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2026

k8s-ci-robot merged commit 01e5762 into kubernetes-sigs:main Jun 12, 2026
58 checks passed

k8s-ci-robot added this to the v0.19 milestone Jun 12, 2026

k8s-infra-cherrypick-robot mentioned this pull request Jun 12, 2026

[release-0.18] Tune KubeRay e2e task durations for faster CI #12224

Open

ikchifo deleted the e2e-kuberay-task-tuning branch June 13, 2026 01:33
