
[Fix] Increase resource limits for wait-gcs-ready init container#4705

Open
xiaolin8 wants to merge 1 commit into ray-project:master from xiaolin8:fix/wait-gcs-ready-oom

Conversation

@xiaolin8

Summary

  • Increase resource limits for the wait-gcs-ready init container to prevent OOMKilled errors

Problem

The wait-gcs-ready init container was being OOMKilled due to an insufficient memory limit (256Mi). The ray health-check command consumes approximately 180-190MB of RSS, which, combined with system overhead, exceeds the 256Mi limit.

This caused Ray workers to fail to start, especially in CI environments (kind clusters) and developer machines.

Solution

Increase the init container's resource limits:

  • CPU: 200m → 1
  • Memory: 256Mi → 1Gi

The requests remain unchanged (200m CPU, 256Mi memory), so scheduling and resource accounting are unaffected; only the ceiling is raised.
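The change above can be sketched as the following resource fragment (field names follow the corev1 and resource packages already used in the quoted KubeRay code; the exact surrounding container spec is assumed):

```go
// Sketch of the proposed resources for the wait-gcs-ready init container.
// Requests keep their current values; only the limits are raised.
Resources: corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("200m"),
		corev1.ResourceMemory: resource.MustParse("256Mi"),
	},
	Limits: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("1"),
		corev1.ResourceMemory: resource.MustParse("1Gi"),
	},
},
```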

Test Plan

  • Verify the init container no longer gets OOMKilled in CI environments
  • Verify workers can successfully start and connect to the head node

Fixes #2735

// Therefore, hard-coding the resources is acceptable.
Limits: corev1.ResourceList{
	corev1.ResourceCPU:    resource.MustParse("200m"),
	corev1.ResourceMemory: resource.MustParse("256Mi"),
},
Member


Does this init container even need to set limits? Setting only requests might be better: it reduces the chance of an OOMKill without reserving additional resources just for the init phase.

Member


(removing limits can be considered a breaking change in some scenarios, so let's take that into consideration as well)

Member


On second thought, we probably want to keep the limits. But raising the limit above the request changes the init container's QoS class from Guaranteed to Burstable. Do you anticipate any issues with that? https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#quality-of-service-classes
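To make the QoS concern concrete: Kubernetes grants Guaranteed only when every container's requests equal its limits for both CPU and memory; once a limit exceeds its request, the pod drops to Burstable. A simplified, self-contained sketch of that classification rule (single container only; function and parameter names are illustrative, not KubeRay or Kubernetes APIs):

```go
package main

import "fmt"

// qosClass applies a simplified version of the Kubernetes QoS rules to a
// single container: Guaranteed requires requests == limits for both CPU
// and memory; any set-but-unequal values mean Burstable; nothing set at
// all means BestEffort.
func qosClass(cpuReq, cpuLim, memReq, memLim string) string {
	if cpuReq == "" && cpuLim == "" && memReq == "" && memLim == "" {
		return "BestEffort"
	}
	if cpuReq != "" && cpuReq == cpuLim && memReq != "" && memReq == memLim {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	// Before this PR: requests == limits, so the container is Guaranteed.
	fmt.Println(qosClass("200m", "200m", "256Mi", "256Mi"))
	// After this PR: limits raised above requests, so it becomes Burstable.
	fmt.Println(qosClass("200m", "1", "256Mi", "1Gi"))
}
```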



Development

Successfully merging this pull request may close these issues.

[Bug] wait-gcs-ready init-container going out-of-memory indefinitely (OOMKilled)
