[RayService][e2e] fix flaky test TestAutoscalingRayService #4702
andrewsykim merged 4 commits into ray-project:master from
Conversation
…startup timeout Signed-off-by: AndySung320 <andysung0320@gmail.com>
Future-Outlier
left a comment
how do you observe that "The dashboard's MetricsHead module times out during startup with 500m CPU limit, causing the dashboard to crash and RayService to never become ready."?
is there any proof?
Hi, I've updated the reproduction steps with more details.
Reasons I approved:
- our example in https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.sample.yaml#L34-L40 uses 1 CPU for the ray head container too
- before/after comparison
Comparison
┌────────────────────┬───────────────────────┬─────────────────────────┐
│ │ Build #14465 (before) │ Build #14478 (after) │
├────────────────────┼───────────────────────┼─────────────────────────┤
│ Head CPU limit │ 500m │ 1 │
├────────────────────┼───────────────────────┼─────────────────────────┤
│ Dashboard response │ ~34s + 3 timeouts │ < 46s (entire pipeline) │
├────────────────────┼───────────────────────┼─────────────────────────┤
│ Worker pod │ Never created │ Created successfully │
├────────────────────┼───────────────────────┼─────────────────────────┤
│ RayService Ready │ ❌ Timed out at 300s │ ✅ ~46s │
└────────────────────┴───────────────────────┴─────────────────────────┘
@codex review
Codex Review: Didn't find any major issues. Swish!
cc @andrewsykim @rueian to merge
    memory: 1G
  limits:
-   cpu: 500m
+   cpu: "1"
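For context, the resulting limits block would look roughly like this (a sketch: the memory value is taken from the visible diff context, and the surrounding fields of the manifest are omitted):

```yaml
resources:
  limits:
    cpu: "1"      # raised from 500m so the dashboard's MetricsHead module can start in time
    memory: 1G    # assumed unchanged, per the diff context
```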
Can we also consider removing limits entirely? Or does the autoscaling test require it?
cc @AndySung320 to do the investigation, and update the yaml manifest if needed
Why are these changes needed?
There is a flaky test in the e2e RayService suite (TestAutoscalingRayService).
The dashboard's MetricsHead module times out during startup under the 500m CPU limit, causing the dashboard to crash and the RayService to never become ready.
This triggers a chain of failures (see the reproduction steps below).
Thus, increase the head CPU limit to 1.

Once we increase the limit to 1 CPU, the test passes successfully.
Note: I also had to pin gevent==24.11.1 locally to run the test on my ARM Mac, since gevent 26.4.0 doesn't provide a prebuilt wheel for linux/aarch64 and the Ray image lacks a C compiler to build from source. This change is local-only and not included in this PR.
Reproduction Steps
1. Run the e2e test:
   go test ./test/e2erayservice/ -run TestAutoscalingRayService -v -timeout 30m
2. Wait for 30 seconds and open another terminal.
3. You will see RuntimeError: Module MetricsHead failed to start. Timeout after 30.0 seconds, confirming the dashboard process crashed due to insufficient CPU during startup.
4. You will see ConnectionRefusedError, confirming the Serve API is unreachable because the dashboard is dead.
5. It shows Pending Demands: (no resource demands), confirming no serve application was deployed, so the autoscaler has no reason to scale up.
6. Run kubectl get pods -n $NS. It shows only the head pod at 1/2 Running with no worker pods, confirming the autoscaler never triggered a scale-up.
Related issue number
Checks