Tune KubeRay e2e task durations for faster CI#12213
Conversation
The InTreeAutoscaling rayjob spec ran 3 tasks of 60s each. The long sleeps were sized so that tasks survive scale-up and pod replacement verification, but the retried task of the deleted worker alone added a full extra minute after verification was already done. Replace them with a queue of 20 tasks of 10s each: pending tasks hold worker demand through verification no matter how long scale-up or replacement takes, and the job drains quickly once the queue is empty. The multi scale-up spec ended with 32 sequential 1s tasks to keep the job alive while scale-down is verified. Idle detection takes idleTimeoutSeconds=10 plus the annotation update, so 20 tasks (~25s) are enough with margin.
✅ Deploy Preview for kubernetes-sigs-kueue ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Hi @ikchifo. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Excluded labels (none allowed) (3)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/ok-to-test |
|
cc @sohankunkerkar ptal |
|
/test all |
There was a problem hiding this comment.
I ran these tests locally on Kind (in loop), all passed. I confirm the numbers from the PR description with stable pod replacement timing. Further squeezing is possible but risks flakes. This is a good balance.
/lgtm
/hold in case @mimowo has anything to add here
|
LGTM label has been added. DetailsGit tree hash: 51c80746306739578b6aeb8150928adf533ab4ff |
|
Maybe this is over-optimizing, but could we reduce the task lengths if we make the "idleTimeout" to 1s? In any case, I'm happy to merge already as is, because this is a great improvement already. We can follow up if you like the idea of decreasing the "idleTimeout". /lgtm Thank you, the results look great 👍 |
|
@mimowo: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ikchifo, mimowo, sohankunkerkar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/unhold |
|
@mimowo: #12213 failed to apply on top of branch "release-0.17": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mimowo: new pull request created: #12224 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind cleanup
/area testing
What this PR does / why we need it:
Two specs dominate the kuberay CI shards, per the junit from the prow
run linked in #12176:
InTreeAutoscaling: replace the 3 tasks of 60s with a queue of 20
tasks of 10s (the same ~200 task-seconds of work).
The long tasks make the job end late: the deleted worker's task
restarts on the replacement pod, adding ~85s after replacement is
verified. Shorter sleeps are unsafe: scale-up takes ~40s in CI, so
30s tasks finish before the second worker starts and the autoscaler
(idleTimeoutSeconds=10, minReplicas=1) scales the idle first worker
down under the "2 running workers" assertions. The queue avoids
both: pending tasks hold worker demand through verification, and the
job drains within ~10s once the queue empties.
Multi scale-up: cut the trailing keep-alive tasks from 32 to 20.
They only need to outlive scale-down verification (the 10s idle
timeout plus an annotation update); 20 tasks (~25s) do.
Local A/B runs on a kind cluster:
Locally the pod deletion lands early, so main's tail barely shows.
In CI it lands late (the ~85s tail in the 205.7s baseline); the
queue's tail stays bounded no matter when the deletion lands, so the
spec gets faster and more deterministic in CI.
Which issue(s) this PR fixes:
Part of #12176, contributes to #11606
Special notes for your reviewer:
The verification steps and timeouts are unchanged; only the Ray task
payloads are tuned.
Does this PR introduce a user-facing change?