
Tune KubeRay e2e task durations for faster CI#12213

Merged
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
ikchifo:e2e-kuberay-task-tuning
Jun 12, 2026

Conversation

@ikchifo

@ikchifo ikchifo commented Jun 12, 2026

Contributor

What type of PR is this?

/kind cleanup
/area testing

What this PR does / why we need it:

Two specs dominate the kuberay CI shards, per the junit from the prow
run linked in #12176:

| Spec | Shard | CI duration |
|---|---|---|
| Should run a rayjob with InTreeAutoscaling | kuberay-b | 206s |
| Should run a rayjob with multi scale-up steps | kuberay-a | 179s |

InTreeAutoscaling: replace the 3 tasks of 60s with a queue of 20
tasks of 10s (roughly the same work: ~200 vs. 180 task-seconds).

The long tasks make the job end late: the deleted worker's task
restarts on the replacement pod, adding ~85s after replacement is
verified. Shorter sleeps are unsafe: scale-up takes ~40s in CI, so
30s tasks finish before the second worker starts and the autoscaler
(idleTimeoutSeconds=10, minReplicas=1) scales the idle first worker
down under the "2 running workers" assertions. The queue avoids
both: pending tasks hold worker demand through verification, and the
job drains within ~10s once the queue empties.
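The bounded-tail argument can be illustrated with a toy scheduling model. This is only a sketch under stated assumptions: the two-worker greedy queue, the `finish_time` helper, and the `KILL_AT` / `REPLACE_DELAY` timings are hypothetical and do not reproduce the Ray autoscaler or the CI numbers. It shows only that a retried task costs at most one task length, so shorter queued tasks shrink the tail.

```python
def finish_time(num_tasks, task_seconds, workers, kill_at, replace_delay):
    """Makespan of a greedy task queue when worker 0 is deleted once.

    Toy model (assumption): the task running on worker 0 at `kill_at` is
    lost and retried from scratch; worker 0 is usable again
    `replace_delay` seconds after the deletion.
    """
    free_at = [0.0] * workers  # time at which each worker can start a task
    pending = num_tasks
    killed = False
    finish = 0.0
    while pending > 0:
        # Pick the worker that frees up first (ties go to the lowest index).
        i = min(range(workers), key=lambda k: free_at[k])
        start, end = free_at[i], free_at[i] + task_seconds
        if i == 0 and not killed and start <= kill_at < end:
            # Deletion lands mid-task: the task goes back to the queue.
            killed = True
            free_at[0] = kill_at + replace_delay
            continue
        free_at[i] = end
        finish = max(finish, end)
        pending -= 1
    return finish


# Hypothetical timings, chosen only for illustration.
KILL_AT, REPLACE_DELAY = 50.0, 40.0

long_tasks = finish_time(3, 60.0, 2, KILL_AT, REPLACE_DELAY)
queue = finish_time(20, 10.0, 2, KILL_AT, REPLACE_DELAY)
print(f"3 x 60s: {long_tasks}s, 20 x 10s: {queue}s")
```

In this model the replacement worker is ready at 90s; the 60s-task variant then needs a full extra 60s to rerun the lost task, while the queued variant only has short tasks left to drain.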

Multi scale-up: cut the trailing keep-alive tasks from 32 to 20.
They only need to outlive scale-down verification (the 10s idle
timeout plus an annotation update); 20 tasks (~25s) do.
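The keep-alive sizing can be checked with a few lines of arithmetic. This is a rough sketch: the task counts, the 10s idle timeout, and the ~25s figure come from the description, while the constant per-task cost used to extrapolate the old 32-task duration is an assumption.

```python
IDLE_TIMEOUT_S = 10      # idleTimeoutSeconds, from the description
TASKS_BEFORE = 32        # trailing keep-alive tasks on main
TASKS_AFTER = 20         # trailing keep-alive tasks in this PR
KEEPALIVE_AFTER_S = 25   # "~25s" for 20 tasks, from the description

# Assumption: every keep-alive task costs the same wall-clock time.
per_task_s = KEEPALIVE_AFTER_S / TASKS_AFTER
keepalive_before_s = TASKS_BEFORE * per_task_s

saved_s = keepalive_before_s - KEEPALIVE_AFTER_S
margin_s = KEEPALIVE_AFTER_S - IDLE_TIMEOUT_S

print(f"per task: {per_task_s}s, old keep-alive: {keepalive_before_s}s")
print(f"saved: {saved_s}s, margin over idle timeout: {margin_s}s")
```

Under that assumption the change saves about 15s, which is in line with the roughly 13s improvement in the local multi scale-up runs.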

Local A/B runs on a kind cluster:

| Spec | main | this PR |
|---|---|---|
| multi scale-up | 120-123s | 107-110s |
| InTreeAutoscaling | 118-121s | 116-117s |

Locally the pod deletion lands early, so main's tail barely shows.
In CI it lands late (the ~85s tail in the 205.7s baseline); the
queue's tail stays bounded no matter when the deletion lands, so the
spec gets faster and more deterministic in CI.

Which issue(s) this PR fixes:

Part of #12176, contributes to #11606

Special notes for your reviewer:

The verification steps and timeouts are unchanged; only the Ray task
payloads are tuned.

Does this PR introduce a user-facing change?

NONE

The InTreeAutoscaling rayjob spec ran 3 tasks of 60s each. The
long sleeps were sized so that tasks survive scale-up and pod
replacement verification, but the retried task of the deleted
worker alone added a full extra minute after verification was
already done. Replace them with a queue of 20 tasks of 10s each:
pending tasks hold worker demand through verification no matter
how long scale-up or replacement takes, and the job drains
quickly once the queue is empty.

The multi scale-up spec ended with 32 sequential 1s tasks to keep
the job alive while scale-down is verified. Idle detection takes
idleTimeoutSeconds=10 plus the annotation update, so 20 tasks
(~25s) are enough with margin.
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. area/testing Testing - related stuff labels Jun 12, 2026
@netlify

netlify Bot commented Jun 12, 2026


Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit ceccaa7
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a2c2e9a5aad2400088f63d1
😎 Deploy Preview https://deploy-preview-12213--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot requested review from PBundyra and pajakd June 12, 2026 16:06
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2026
@k8s-ci-robot

Contributor

Hi @ikchifo. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026


Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (3)
  • needs-ok-to-test
  • do-not-merge/work-in-progress
  • cncf-cla: no

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cff0a593-0a0b-48bd-93a3-afc28e5f9fcc

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 12, 2026
@tenzen-y

Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2026
@mimowo

mimowo commented Jun 12, 2026

Contributor

cc @sohankunkerkar ptal

@mimowo

mimowo commented Jun 12, 2026

Contributor

/test all
retry to see stability

@sohankunkerkar sohankunkerkar left a comment

Member

I ran these tests locally on Kind (in loop), all passed. I confirm the numbers from the PR description with stable pod replacement timing. Further squeezing is possible but risks flakes. This is a good balance.
/lgtm
/hold in case @mimowo has anything to add here

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 12, 2026
@k8s-ci-robot

Contributor

LGTM label has been added.

Git tree hash: 51c80746306739578b6aeb8150928adf533ab4ff

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 12, 2026
@mimowo

mimowo commented Jun 12, 2026

Contributor

Maybe this is over-optimizing, but could we reduce the task lengths if we set the "idleTimeout" to 1s?

In any case, I'm happy to merge already as is, because this is a great improvement already. We can follow up if you like the idea of decreasing the "idleTimeout".

/lgtm
/approve
/cherrypick release-0.18
/cherrypick release-0.17

Thank you, the results look great 👍

@k8s-infra-cherrypick-robot

Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.17, release-0.18 in new PRs and assign them to you.


@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ikchifo, mimowo, sohankunkerkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mimowo

mimowo commented Jun 12, 2026

Contributor

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2026
@k8s-ci-robot k8s-ci-robot merged commit 01e5762 into kubernetes-sigs:main Jun 12, 2026
58 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.19 milestone Jun 12, 2026
@k8s-infra-cherrypick-robot

Contributor

@mimowo: #12213 failed to apply on top of branch "release-0.17":

Applying: Tune KubeRay e2e task durations for faster CI
Using index info to reconstruct a base tree...
M	test/e2e/singlecluster/extended/kuberay_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/singlecluster/extended/kuberay_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/extended/kuberay_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 Tune KubeRay e2e task durations for faster CI


@k8s-infra-cherrypick-robot

Contributor

@mimowo: new pull request created: #12224



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Testing - related stuff cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.


6 participants