Description
Phase 2 of KEP-4205 (PSI Based Node Conditions) proposes to use the node-level PSI metric to set node conditions and node taints.
We investigated this on a real cluster, and we can conclude that it could be really risky.
We cannot (yet?) use PSI metrics at the node level directly/solely to identify nodes under "pressure" and taint them.
At least not for CPU pressure when we have pods with stringent CPU limits.
This is because PSI is currently not able to distinguish pressure caused by contention for a scarce resource from pressure caused by CPU throttling against the limit that the user explicitly asked for.
As per the kernel documentation, the pressure interface looks like:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
where the "some" line indicates the share of time in which at least some tasks are stalled on a given resource, while the "full" line indicates the share of time in which all non-idle tasks are stalled simultaneously. CPU full is undefined at the system level, but has been reported since 5.13, so it is set to zero for backward compatibility.
So, basically, for CPU at the node level we only have the "some" line, and even a single misconfigured pod already counts as some.
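For reference, here is a minimal sketch (in Go, not taken from the Kubelet or from the KEP) of how such a pressure file, e.g. /proc/pressure/cpu or a cgroup's cpu.pressure, can be parsed into its "some"/"full" averages:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// psiLine holds the values of a single "some" or "full" line of a PSI file.
type psiLine struct {
	Avg10, Avg60, Avg300 float64 // percentages of stalled wall-clock time
	Total                uint64  // cumulative stalled time in microseconds
}

// parsePSI reads a PSI pressure file (e.g. /proc/pressure/cpu or a cgroup's
// cpu.pressure) and returns its "some" and "full" lines keyed by name.
func parsePSI(path string) (map[string]psiLine, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	out := map[string]psiLine{}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line) // e.g. ["some", "avg10=0.00", "avg60=0.00", "avg300=0.00", "total=0"]
		if len(fields) != 5 {
			continue
		}
		var l psiLine
		for _, kv := range fields[1:] {
			parts := strings.SplitN(kv, "=", 2)
			if len(parts) != 2 {
				continue
			}
			switch parts[0] {
			case "avg10":
				l.Avg10, _ = strconv.ParseFloat(parts[1], 64)
			case "avg60":
				l.Avg60, _ = strconv.ParseFloat(parts[1], 64)
			case "avg300":
				l.Avg300, _ = strconv.ParseFloat(parts[1], 64)
			case "total":
				l.Total, _ = strconv.ParseUint(parts[1], 10, 64)
			}
		}
		out[fields[0]] = l
	}
	return out, nil
}

func main() {
	p, err := parsePSI("/proc/pressure/cpu")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("some avg10=%.2f, full avg10=%.2f\n", p["some"].Avg10, p["full"].Avg10)
}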
Let's try, for instance, with a simple pod running stress-ng
with 8 parallel CPU stressors, but with that container limited to 0.02 cores (20 millicores).
apiVersion: v1
kind: Pod
metadata:
  name: stress-ng
spec:
  containers:
  - name: stress-ng
    image: quay.io/tiraboschi/stress-ng
    command: ['/usr/bin/stress-ng', '--temp-path', '/var/tmp/', '--cpu', '8']
    resources:
      requests:
        cpu: "10m"
      limits:
        cpu: "20m"
Kubernetes will translate spec.resources.limits.cpu: "20m" into
2000 100000
in cpu.max for the cgroup slice of the test pod.
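The mapping is straightforward: quota(µs) = millicores * period(µs) / 1000, with the default CFS period of 100000µs, so 20m becomes 20 * 100000 / 1000 = 2000. Below is a tiny sketch (a hypothetical helper, not actual Kubelet code) reproducing this arithmetic:

package main

import "fmt"

// milliCPUToQuota converts a CPU limit expressed in millicores into a cgroup v2
// CFS quota: quota(µs) = millicores * period(µs) / 1000.
// Hypothetical helper for illustration; the Kubelet performs an equivalent conversion.
func milliCPUToQuota(milliCPU, periodUS int64) int64 {
	return milliCPU * periodUS / 1000
}

func main() {
	const defaultPeriodUS = 100000                // default CFS period in microseconds
	quota := milliCPUToQuota(20, defaultPeriodUS) // 20m limit from the pod above
	// Prints "2000 100000", i.e. the content written to cpu.max.
	fmt.Printf("%d %d\n", quota, defaultPeriodUS)
}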
Now if we check the CPU pressure for that pod (reading its cgroup slice) we will find something like:
sh-5.1# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5d95f454_57f6_4fc3_b5d4_6c7cf673b0cd.slice/cpu.pressure
some avg10=99.00 avg60=98.93 avg300=98.83 total=109943161035
full avg10=99.00 avg60=98.93 avg300=98.83 total=109935133476
This is basically correct and accurate when reported at the workload level, since that container is getting far less CPU than it needs and is therefore under significant pressure.
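As a side note on the path used above: assuming cgroup v2, the systemd cgroup driver and a Burstable-QoS pod, the pod UID (with dashes replaced by underscores) is embedded in the slice name. A small sketch of building that path (hypothetical helper, under exactly those layout assumptions):

package main

import (
	"fmt"
	"os"
	"strings"
)

// podCPUPressurePath builds the cpu.pressure path of a Burstable-QoS pod,
// assuming cgroup v2 and the systemd cgroup driver layout shown above.
// Hypothetical helper for illustration only.
func podCPUPressurePath(podUID string) string {
	// Dashes in the pod UID are replaced with underscores, since '-' acts as
	// a hierarchy separator in systemd slice names.
	escaped := strings.ReplaceAll(podUID, "-", "_")
	return fmt.Sprintf(
		"/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod%s.slice/cpu.pressure",
		escaped)
}

func main() {
	// Pod UID taken from the example output above.
	data, err := os.ReadFile(podCPUPressurePath("5d95f454-57f6-4fc3-b5d4-6c7cf673b0cd"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(string(data))
}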
Now the issue is how to read this at the node level: we can safely assume that the node is definitely not overloaded, since the problematic pod is getting only a small amount of CPU due to throttling.
But when we look at the CPU pressure at the Kubelet level we see something like:
sh-5.1# cat /sys/fs/cgroup/kubepods.slice/cpu.pressure
some avg10=88.89 avg60=89.90 avg300=92.84 total=678445993193
full avg10=3.13 avg60=4.25 avg300=17.54 total=645991129996
since "some" of the slices under the Kubelet slice are at that high pressure.
And the same at node level:
sh-5.1# cat /sys/fs/cgroup/cpu.pressure
some avg10=85.02 avg60=88.95 avg300=92.77 total=670759972464
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
(we read exactly the same values from /proc), since some of the slices running on that node (in our corner case just our problematic test pod, but formally still some) are under considerable CPU pressure.
So, although this is absolutely correct according to the pressure interface as reported by the kernel (since at least one pod, so some, was really suffering due to the lack of CPU), we should not really take any action based on that.
In our exaggerated test corner case, the lack of CPU was caused solely by CPU throttling and not by resource contention with other neighbors. In this specific case, tainting the node to prevent scheduling additional load there would provide no benefit.
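To make the point concrete, a deliberately naive check like the one below, which looks only at the node-level "some" value against a hypothetical threshold (this is a sketch, not what the KEP proposes), would have flagged and tainted exactly this node even though it has plenty of idle CPU:

package main

import (
	"fmt"
	"os"
	"strings"
)

// A deliberately naive node-pressure check, only to illustrate the problem:
// it looks exclusively at the node-level "some" CPU pressure and compares it
// against a hypothetical threshold.
const someAvg10Threshold = 80.0 // hypothetical threshold, not from the KEP

// nodeSomeAvg10 returns the avg10 value of the "some" line of /proc/pressure/cpu.
func nodeSomeAvg10() (float64, error) {
	data, err := os.ReadFile("/proc/pressure/cpu")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		// The line looks like: some avg10=85.02 avg60=88.95 avg300=92.77 total=...
		if strings.HasPrefix(line, "some ") {
			var avg10 float64
			if _, err := fmt.Sscanf(line, "some avg10=%f", &avg10); err != nil {
				return 0, err
			}
			return avg10, nil
		}
	}
	return 0, fmt.Errorf("no \"some\" line found in /proc/pressure/cpu")
}

func main() {
	avg10, err := nodeSomeAvg10()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if avg10 > someAvg10Threshold {
		// With the readings above (some avg10=85.02) this branch is taken,
		// yet tainting would be the wrong call: the pressure comes from
		// throttling a single pod, not from contention on the node.
		fmt.Printf("would set a CPU pressure condition / taint the node (some avg10=%.2f)\n", avg10)
	} else {
		fmt.Printf("node looks fine (some avg10=%.2f)\n", avg10)
	}
}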