[Fix] Increase resource limits for wait-gcs-ready init container#4705
Open
xiaolin8 wants to merge 1 commit into ray-project:master
The wait-gcs-ready init container was being OOMKilled due to an insufficient memory limit (256Mi). The `ray health-check` command consumes approximately 180-190MB of RSS, which with system overhead exceeds the 256Mi limit.

This fix increases the init container's resource limits to:
- CPU: 200m -> 1
- Memory: 256Mi -> 1Gi

This ensures the init container can reliably run the health-check command without being OOMKilled.

Fixes ray-project#2735
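To make the headroom argument concrete, here is a rough sketch of the arithmetic behind the old and new memory limits. It assumes the upper end of the reported range (190MB) and reads it as MiB for simplicity; `headroomBytes` is a hypothetical helper for illustration, not part of KubeRay.

```go
package main

import "fmt"

// mib is the number of bytes in one mebibyte (the unit behind "Mi").
const mib = 1024 * 1024

// headroomBytes returns how many bytes remain under a memory limit
// (given in MiB) once a process with the given RSS (also in MiB) is
// resident. A negative result would mean RSS alone exceeds the limit.
func headroomBytes(limitMiB, rssMiB int64) int64 {
	return (limitMiB - rssMiB) * mib
}

func main() {
	// Assumption: upper end of the reported 180-190MB range, read as MiB.
	const rssMiB = 190

	// Old limit (256Mi) and proposed limit (1Gi = 1024Mi).
	for _, limitMiB := range []int64{256, 1024} {
		h := headroomBytes(limitMiB, rssMiB)
		fmt.Printf("limit=%dMi rss=%dMi headroom=%dMi (%d bytes)\n",
			limitMiB, rssMiB, h/mib, h)
	}
}
```

Under these assumptions the old limit leaves about 66Mi for everything other than the process's RSS, while the proposed limit leaves about 834Mi. How much of that margin the system overhead actually consumes is not measured here.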
andrewsykim
requested changes
Apr 14, 2026
    // Therefore, hard-coding the resources is acceptable.
    Limits: corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("200m"),
        corev1.ResourceMemory: resource.MustParse("256Mi"),
Member
Does this init container even need to set limits? Setting only requests may be better: it reduces the chance of OOM without requesting additional resources just for init.
Member
(removing limits can be considered a breaking change in some scenarios, so let's take that into consideration as well)
Member
On second thought, we probably want to keep the limits. But changing the limit means the init container QoS is changing from Guaranteed to Burstable. Do you anticipate any issues with that? https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#quality-of-service-classes
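The QoS concern can be illustrated with a small, simplified model of the classification rules from the linked page. This is a sketch only: the `resources` and `container` types and the `qosClass` function are hypothetical stand-ins rather than Kubernetes or KubeRay APIs, and Kubernetes assigns the class to the pod as a whole, so the other containers in the pod matter too.

```go
package main

import "fmt"

// resources is a simplified stand-in for requests or limits:
// CPU in millicores, memory in MiB. Zero means "unset".
type resources struct {
	cpuMilli int64
	memMiB   int64
}

// container is a hypothetical, minimal container spec.
type container struct {
	requests resources
	limits   resources
}

// effective mirrors the defaulting rule that an unset request
// takes the value of the corresponding limit.
func effective(req, lim int64) int64 {
	if req == 0 {
		return lim
	}
	return req
}

// qosClass applies a simplified version of the QoS rules:
// BestEffort if nothing is set, Guaranteed if every container has
// CPU and memory limits with requests equal to them, else Burstable.
func qosClass(containers []container) string {
	anySet, guaranteed := false, true
	for _, c := range containers {
		if c.requests != (resources{}) || c.limits != (resources{}) {
			anySet = true
		}
		if c.limits.cpuMilli == 0 || c.limits.memMiB == 0 ||
			effective(c.requests.cpuMilli, c.limits.cpuMilli) != c.limits.cpuMilli ||
			effective(c.requests.memMiB, c.limits.memMiB) != c.limits.memMiB {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	// Values from this PR: requests stay at 200m / 256Mi.
	req := resources{cpuMilli: 200, memMiB: 256}
	before := container{requests: req, limits: resources{cpuMilli: 200, memMiB: 256}}
	after := container{requests: req, limits: resources{cpuMilli: 1000, memMiB: 1024}}

	fmt.Println("before:", qosClass([]container{before}))
	fmt.Println("after: ", qosClass([]container{after}))
}
```

Looking at the init container in isolation, the old values (requests equal to limits) classify as Guaranteed, and the proposed values (limits above requests) classify as Burstable, which matches the change described in the comment above.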
Summary
- Increase resource limits for the `wait-gcs-ready` init container to prevent OOMKilled errors

Problem
The `wait-gcs-ready` init container was being OOMKilled due to an insufficient memory limit (256Mi). The `ray health-check` command consumes approximately 180-190MB of RSS, which with system overhead exceeds the 256Mi limit.

This caused Ray workers to fail to start, especially in CI environments (kind clusters) and on developer machines.

Solution
Increase the init container's resource limits:
- CPU: `200m` → `1`
- Memory: `256Mi` → `1Gi`

The Requests remain unchanged (`200m` CPU, `256Mi` memory) to ensure efficient resource allocation.

Test Plan
Fixes #2735