
[RayService][e2e] fix flaky test TestAutoscalingRayService #4702

Merged
andrewsykim merged 4 commits into ray-project:master from AndySung320:flaky/e2e-rayservice
Apr 15, 2026

Conversation

@AndySung320 (Contributor) commented Apr 13, 2026

Why are these changes needed?

There is a flaky test in the RayService e2e suite (TestAutoscalingRayService).
The dashboard's MetricsHead module times out during startup with a 500m CPU limit, causing the dashboard to crash and the RayService to never become ready.

This triggers a chain of failures:

  1. Dashboard crashes -> port 8265 is not listening -> Serve API is unavailable
  2. No Serve API -> serve applications cannot be deployed
  3. No serve applications -> no resource demands are reported to the autoscaler
  4. No resource demands -> autoscaler does not scale up workers
  5. No workers -> RayService never becomes ready -> test times out

The fix is to raise the head container's CPU limit to 1. With 1 CPU, the test passes reliably.
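Concretely, the change amounts to raising the CPU limit on the ray-head container in the test manifest. A sketch of the relevant fragment (field layout follows the standard Kubernetes container resources spec; the memory value is taken from the review diff below is not implied here, it is an illustrative value):

```yaml
resources:
  limits:
    cpu: "1"     # was 500m; with a full CPU, MetricsHead starts within its timeout
    memory: 1G   # illustrative value, unchanged by this PR
```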
[screenshot: successful test run after the change, 2026-04-13]
note: I also had to pin gevent==24.11.1 locally to run the test on my ARM Mac, since gevent 26.4.0 doesn't provide a prebuilt wheel for linux/aarch64 and the Ray image lacks a C compiler to build from source. This change is local-only and not included in this PR.

Reproduction Steps

First run the e2e test

go test ./test/e2erayservice/ -run TestAutoscalingRayService -v -timeout 30m

Wait about 30 seconds, then open another terminal:

NS=test-ns-xxxx
kubectl exec -n $NS -c ray-head $(kubectl get pods -n $NS -l ray.io/node-type=head -o name) -- bash -c "cat /tmp/ray/session_latest/logs/dashboard.log 2>/dev/null | tail -20"
[screenshot: dashboard.log tail showing the MetricsHead startup timeout]

You will see RuntimeError: Module MetricsHead failed to start. Timeout after 30.0 seconds, confirming the dashboard process crashed due to insufficient CPU during startup.

kubectl exec -n $NS -c ray-head \
  $(kubectl get pods -n $NS -l ray.io/node-type=head -o name) \
  -- python -c "import socket; s=socket.socket(); s.settimeout(2); s.connect(('localhost',8265)); print('8265 OPEN'); s.close()"
[screenshot: ConnectionRefusedError from the port probe]

You will see ConnectionRefusedError, confirming the Serve API is unreachable because the dashboard is dead.
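The inline socket one-liner can also be written out as a small reusable probe for local debugging (a sketch, not part of the PR; the host and port are whatever endpoint you are checking):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection handles DNS resolution and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError, timeouts, unreachable hosts
        return False

print("8265 OPEN" if port_open("localhost", 8265) else "8265 CLOSED")
```

Run it inside the head container (or against a port-forward) the same way as the one-liner above.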

kubectl exec -n $NS -c ray-head \
  $(kubectl get pods -n $NS -l ray.io/node-type=head -o name) \
  -- ray status
[screenshot: ray status output with empty pending demands]

It shows Pending Demands: (no resource demands), confirming no serve application was deployed so the autoscaler has no reason to scale up.

kubectl get pods -n $NS
[screenshot: kubectl get pods output showing only the head pod]

It shows only the head pod at 1/2 Running with no worker pods, confirming the autoscaler never triggered a scale-up.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Future-Outlier (Member) left a comment:

how do you observe that "The dashboard's MetricsHead module times out during startup with 500m CPU limit, causing the dashboard to crash and RayService to never become ready."?

is there any proof?

@AndySung320 (Contributor, Author) replied:

Hi, I've updated the reproduction steps with more details.

@AndySung320 AndySung320 marked this pull request as ready for review April 13, 2026 18:27
Comment thread on ray-operator/test/e2erayservice/testdata/rayservice.autoscaling.yaml (Outdated)
@Future-Outlier (Member) left a comment:

reasons I approved:

  1. our example in https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.sample.yaml#L34-L40 uses 1 CPU for the ray head container too
  2. before/after comparison:

  ┌────────────────────┬───────────────────────┬─────────────────────────┐
  │                    │ Build #14465 (before) │  Build #14478 (after)   │
  ├────────────────────┼───────────────────────┼─────────────────────────┤
  │ Head CPU limit     │ 500m                  │ 1                       │
  ├────────────────────┼───────────────────────┼─────────────────────────┤
  │ Dashboard response │ ~34s + 3 timeouts     │ < 46s (entire pipeline) │
  ├────────────────────┼───────────────────────┼─────────────────────────┤
  │ Worker pod         │ Never created         │ Created successfully    │
  ├────────────────────┼───────────────────────┼─────────────────────────┤
  │ RayService Ready   │ ❌ Timed out at 300s  │ ✅ ~46s                 │
  └────────────────────┴───────────────────────┴─────────────────────────┘

@Future-Outlier (Member):

@codex review

@chatgpt-codex-connector:

Codex Review: Didn't find any major issues. Swish!


@Future-Outlier (Member):

cc @andrewsykim @rueian to merge

Suggested change in the head container resources:

      memory: 1G
    limits:
-     cpu: 500m
+     cpu: "1"
A reviewer (Member) commented:

Can we also consider removing limits entirely? Or does the autoscaling test require it?
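For reference, removing the limits block entirely while keeping a request would look roughly like the fragment below (a sketch only; whether the autoscaling test behaves correctly without limits is exactly the open question, left as follow-up investigation):

```yaml
resources:
  requests:
    cpu: "1"    # assumption: keep a request so the scheduler still reserves a CPU
    memory: 1G  # illustrative value
  # no limits block: the container may burst above its request if the node has spare CPU
```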

Another reviewer (Member) replied:

cc @AndySung320 to do the investigation, and update the yaml manifest if needed

@andrewsykim andrewsykim merged commit 071040a into ray-project:master Apr 15, 2026
31 checks passed
@github-project-automation bot moved this from "can be merged" to "Done" in @Future-Outlier's kuberay project, Apr 15, 2026
