
Commit c9f2f3b

Update website/docs/bestpractices/analytics/spark-oom-kills.md
Co-authored-by: Manabu McCloskey <manabu.mccloskey@gmail.com>
1 parent 1811e20 commit c9f2f3b

1 file changed (+1, −1 lines)


website/docs/bestpractices/analytics/spark-oom-kills.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -9,7 +9,7 @@ import TabItem from '@theme/TabItem';
 # Preventing OOM Kills in Spark on Kubernetes
 Every organization running large scale Spark workloads on Kubernetes has dealt with this: a job runs for hours, processes terabytes of data, completes 80% of its work, and then executors start disappearing. No JVM exception. No heap dump. No warning in Spark UI. Just `exit code 137` and hours of compute burned. The standard response is to throw more memory at it, bump `memoryOverhead` by another 10 GB, and hope for the best. That works until the next data spike.
 
-The root cause is not insufficient memory. It is a design flaw in how **cgroupsv1** handles the Linux page cache. When a Spark executor reads shuffle data from local NVMe, the kernel caches those file pages in RAM. Under cgroupsv1, this page cache counts against the container's memory limit with no mechanism to reclaim it before the OOM killer fires. The kernel kills your executor to free memory it could have simply evicted.
+The root cause is not insufficient memory. It is a design limitation in how **cgroupsv1** handles the Linux page cache. When a Spark executor reads shuffle data from local storage, the kernel caches those file pages in RAM. Under cgroupsv1, this page cache counts against the container's memory limit with no mechanism to reclaim it before the OOM killer fires. The kernel kills your executor to free memory it could have simply evicted.
 
 **cgroupsv2** fixes this with `memory.high`, a throttling boundary that forces page cache eviction before reaching the hard kill limit. Kubernetes exposes this through the **MemoryQoS** feature gate ([KEP-2570](https://github.com/kubernetes/enhancements/issues/2570)). This guide covers the kernel internals behind the problem, the cgroupsv2 solution, and the exact EKS configuration to deploy it.
 
```
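For context on the `memory.high` boundary the changed passage refers to: below is a minimal Python sketch of how MemoryQoS (KEP-2570) positions `memory.high` between a container's memory request and its hard limit. The 0.9 throttling factor, the 4 KiB page size, and the rounding details are assumptions here, not the kubelet's actual implementation.

```python
# Sketch only (not kubelet code): derive a cgroupsv2 memory.high value
# from a container's memory request and limit, following the shape of
# the KEP-2570 formula. Assumed: 4 KiB pages, 0.9 throttling factor.

PAGE_SIZE = 4096  # assumed page size


def memory_high(request_bytes: int, limit_bytes: int,
                throttling_factor: float = 0.9) -> int:
    """memory.high sits between the request and the hard limit,
    rounded down to a page boundary. Above it the kernel throttles
    the cgroup and reclaims page cache instead of OOM-killing."""
    raw = request_bytes + throttling_factor * (limit_bytes - request_bytes)
    return int(raw // PAGE_SIZE) * PAGE_SIZE


gib = 2 ** 30
# A Spark executor with a 4 GiB request and a 10 GiB limit starts
# getting its page cache reclaimed around ~9.4 GiB of usage, well
# before the 10 GiB memory.max hard kill would fire.
print(memory_high(4 * gib, 10 * gib))
```

The point of the formula is that shuffle-heavy executors hit the reclaim boundary first, so the kernel evicts cached file pages instead of invoking the OOM killer at the hard limit.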
Comments (0)