
[RayService][e2e] fix flaky test TestAutoscalingRayService #4702

Merged
andrewsykim merged 4 commits into ray-project:master from AndySung320:flaky/e2e-rayservice
Apr 15, 2026

Conversation

@AndySung320 (Contributor) commented Apr 13, 2026

Why are these changes needed?

There is a flaky test in the RayService e2e suite (TestAutoscalingRayService).
The dashboard's MetricsHead module times out during startup with a 500m CPU limit, causing the dashboard to crash and the RayService to never become ready.

This triggers a chain of failures:

  1. Dashboard crashes -> port 8265 is not listening -> Serve API is unavailable
  2. No Serve API -> serve applications cannot be deployed
  3. No serve applications -> no resource demands are reported to the autoscaler
  4. No resource demands -> autoscaler does not scale up workers
  5. No workers -> RayService never becomes ready -> test times out

The fix is to raise the head container's CPU limit to 1. With 1 CPU, the test passes reliably.
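Concretely, the change amounts to raising the CPU limit on the ray-head container in the test manifest. A sketch of the relevant fragment (field layout follows the standard Kubernetes container resources spec; the memory value is taken from the review diff below is not implied here, it is an illustrative value):

```yaml
resources:
  limits:
    cpu: "1"     # was 500m; with a full CPU, MetricsHead starts within its timeout
    memory: 1G   # illustrative value, unchanged by this PR
```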
[screenshot: successful test run after the change, 2026-04-13]
note: I also had to pin gevent==24.11.1 locally to run the test on my ARM Mac, since gevent 26.4.0 doesn't provide a prebuilt wheel for linux/aarch64 and the Ray image lacks a C compiler to build from source. This change is local-only and not included in this PR.

Reproduction Steps

First run the e2e test

go test ./test/e2erayservice/ -run TestAutoscalingRayService -v -timeout 30m

Wait about 30 seconds, then open another terminal:

NS=test-ns-xxxx
kubectl exec -n $NS -c ray-head $(kubectl get pods -n $NS -l ray.io/node-type=head -o name) -- bash -c "cat /tmp/ray/session_latest/logs/dashboard.log 2>/dev/null | tail -20"
[screenshot: dashboard.log tail showing the MetricsHead startup timeout]

You will see RuntimeError: Module MetricsHead failed to start. Timeout after 30.0 seconds, confirming the dashboard process crashed due to insufficient CPU during startup.

kubectl exec -n $NS -c ray-head \
  $(kubectl get pods -n $NS -l ray.io/node-type=head -o name) \
  -- python -c "import socket; s=socket.socket(); s.settimeout(2); s.connect(('localhost',8265)); print('8265 OPEN'); s.close()"
[screenshot: ConnectionRefusedError from the port probe]

You will see ConnectionRefusedError, confirming the Serve API is unreachable because the dashboard is dead.
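The inline socket one-liner can also be written out as a small reusable probe for local debugging (a sketch, not part of the PR; the host and port are whatever endpoint you are checking):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection handles DNS resolution and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError, timeouts, unreachable hosts
        return False

print("8265 OPEN" if port_open("localhost", 8265) else "8265 CLOSED")
```

Run it inside the head container (or against a port-forward) the same way as the one-liner above.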

kubectl exec -n $NS -c ray-head \
  $(kubectl get pods -n $NS -l ray.io/node-type=head -o name) \
  -- ray status
[screenshot: ray status output with empty pending demands]

It shows Pending Demands: (no resource demands), confirming no serve application was deployed so the autoscaler has no reason to scale up.

kubectl get pods -n $NS
[screenshot: kubectl get pods output showing only the head pod]

It shows only the head pod at 1/2 Running with no worker pods, confirming the autoscaler never triggered a scale-up.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Future-Outlier (Member) left a comment:

how do you observe that "The dashboard's MetricsHead module times out during startup with 500m CPU limit, causing the dashboard to crash and RayService to never become ready."?

is there any proof?

@AndySung320 (Contributor, Author) replied:

Hi, I've updated the reproduction steps with more details.

@AndySung320 AndySung320 marked this pull request as ready for review April 13, 2026 18:27
Comment thread on ray-operator/test/e2erayservice/testdata/rayservice.autoscaling.yaml (Outdated)
@Future-Outlier (Member) left a comment:

reasons I approved:

  1. our example in https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.sample.yaml#L34-L40 uses 1 CPU for the ray head container too
  2. before/after comparison:

  ┌────────────────────┬───────────────────────┬─────────────────────────┐
  │                    │ Build #14465 (before) │  Build #14478 (after)   │
  ├────────────────────┼───────────────────────┼─────────────────────────┤
  │ Head CPU limit     │ 500m                  │ 1                       │
  ├────────────────────┼───────────────────────┼─────────────────────────┤
  │ Dashboard response │ ~34s + 3 timeouts     │ < 46s (entire pipeline) │
  ├────────────────────┼───────────────────────┼─────────────────────────┤
  │ Worker pod         │ Never created         │ Created successfully    │
  ├────────────────────┼───────────────────────┼─────────────────────────┤
  │ RayService Ready   │ ❌ Timed out at 300s  │ ✅ ~46s                 │
  └────────────────────┴───────────────────────┴─────────────────────────┘

@Future-Outlier (Member):

@codex review

@chatgpt-codex-connector:

Codex Review: Didn't find any major issues. Swish!


@Future-Outlier (Member):

cc @andrewsykim @rueian to merge

Suggested change in the head container resources:

      memory: 1G
    limits:
-     cpu: 500m
+     cpu: "1"
A reviewer (Member) commented:

Can we also consider removing limits entirely? Or does the autoscaling test require it?
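For reference, removing the limits block entirely while keeping a request would look roughly like the fragment below (a sketch only; whether the autoscaling test behaves correctly without limits is exactly the open question, left as follow-up investigation):

```yaml
resources:
  requests:
    cpu: "1"    # assumption: keep a request so the scheduler still reserves a CPU
    memory: 1G  # illustrative value
  # no limits block: the container may burst above its request if the node has spare CPU
```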

Another reviewer (Member) replied:

cc @AndySung320 to do the investigation, and update the yaml manifest if needed

@andrewsykim andrewsykim merged commit 071040a into ray-project:master Apr 15, 2026
31 checks passed
@github-project-automation bot moved this from "can be merged" to "Done" in @Future-Outlier's kuberay project, Apr 15, 2026
