
[KEP-4205] Concerns on using CPU PSI pressure to taint nodes #5062

Open
@tiraboschi

Description

Phase 2 of KEP-4205 (PSI Based Node Conditions) proposes using the node-level PSI metric to set node conditions and node taints.

We investigated this on a real cluster, and we can conclude that it could be really risky.
We cannot (yet?) use PSI metrics at node level directly, on their own, to identify nodes under "pressure" and taint them.

At least not for CPU pressure when there are pods with stringent CPU limits.
This is because PSI currently cannot distinguish pressure caused by contention for a scarce resource from pressure caused by CPU throttling against the limit the user explicitly asked for.
According to the kernel documentation,
the pressure interface looks like:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

where the “some” line indicates the share of time in which at least some tasks are stalled on a given resource.
CPU “full” is undefined at the system level; it has been reported since kernel 5.13, but it is hardwired to zero for backward compatibility.
So for CPU at node level we effectively only have the "some" line, and even a single misconfigured pod already counts as "some".
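To make the interface concrete, here is a minimal sketch (in Python; `parse_pressure` is a hypothetical helper name, not anything from Kubernetes or the kernel) that parses a cgroup v2 `*.pressure` file of the shape quoted above:

```python
# Minimal sketch: parse a cgroup v2 PSI pressure file (e.g. cpu.pressure)
# with the layout quoted above:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# "parse_pressure" is a hypothetical helper, not a Kubernetes API.

def parse_pressure(text: str) -> dict:
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()      # "some" or "full", then avgN/total
        metrics = {}
        for field in fields:
            key, value = field.split("=")
            # "total" is a cumulative stall time in microseconds (integer);
            # the avg10/avg60/avg300 fields are percentages (float).
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result

sample = """some avg10=85.02 avg60=88.95 avg300=92.77 total=670759972464
full avg10=0.00 avg60=0.00 avg300=0.00 total=0"""

parsed = parse_pressure(sample)
print(parsed["some"]["avg10"])  # 85.02
print(parsed["full"]["total"])  # 0
```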

Let's try, for instance, a simple pod running stress-ng with 8 parallel CPU stressors, but with the container limited to 0.02 cores (20 millicores).

apiVersion: v1
kind: Pod
metadata:
  name: stress-ng
spec:
  containers:
    - name: stress-ng
      image: quay.io/tiraboschi/stress-ng
      command: ['/usr/bin/stress-ng', '--temp-path', '/var/tmp/', '--cpu', '8']
      resources:
        requests:
          cpu: "10m"
        limits:
          cpu: "20m"

Kubernetes will translate spec.resources.limits.cpu: "20m" into
2000 100000 in cpu.max for the cgroup slice of the test pod.
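That translation can be sketched as simple arithmetic (assuming the default 100000 µs CFS period; the helper name below is illustrative):

```python
# Sketch of how a Kubernetes CPU limit maps onto cgroup v2 cpu.max.
# cpu.max holds "<quota> <period>": the cgroup may run quota microseconds
# of CPU time in every period microseconds. Helper name is hypothetical.

CFS_PERIOD_US = 100_000  # default CFS period (100 ms), assumed here

def cpu_max_for_limit(milli_cores: int, period_us: int = CFS_PERIOD_US) -> str:
    # 1000 millicores == 1 core == one full period of quota
    quota_us = milli_cores * period_us // 1000
    return f"{quota_us} {period_us}"

print(cpu_max_for_limit(20))    # "2000 100000"   <- limits.cpu: "20m"
print(cpu_max_for_limit(2000))  # "200000 100000" <- limits.cpu: "2"
```

With a 2000/100000 quota, the pod's 8 stressors share 2% of a single core, which is what produces the near-100% pressure shown below.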

Now if we check the CPU pressure for that pod (reading its cgroup slice) we will find something like:

sh-5.1# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5d95f454_57f6_4fc3_b5d4_6c7cf673b0cd.slice/cpu.pressure
some avg10=99.00 avg60=98.93 avg300=98.83 total=109943161035
full avg10=99.00 avg60=98.93 avg300=98.83 total=109935133476

This is basically correct and accurate when reported at the workload level: that container is getting far less CPU than it needs, so it is under significant pressure.
The issue is how to read this at node level: we can safely assume that the node is definitely not overloaded, since the problematic pod is only getting a small amount of CPU due to the throttling.

But when we look at the CPU pressure at the kubelet level, we see something like:

sh-5.1# cat /sys/fs/cgroup/kubepods.slice/cpu.pressure
some avg10=88.89 avg60=89.90 avg300=92.84 total=678445993193
full avg10=3.13 avg60=4.25 avg300=17.54 total=645991129996

since "some" of the slices under the kubelet slice are under that high pressure.

And the same at node level:

sh-5.1# cat /sys/fs/cgroup/cpu.pressure
some avg10=85.02 avg60=88.95 avg300=92.77 total=670759972464
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

(exactly the same values as reading it from /proc), since some of the slices running on that node (in our corner case just our problematic test pod, but formally still some) are under considerable CPU pressure.

So, although this is absolutely correct according to the pressure interface as reported by the kernel (at least one pod, hence some, was really suffering from lack of CPU), we shouldn't take any action based on it.
In our exaggerated test corner case, the lack of CPU was caused solely by CPU throttling and not by resource contention with other neighbors. In this specific case, tainting the node to prevent scheduling additional load there provides no benefit.
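One conceivable way to discriminate the two cases (a sketch only, not something proposed by the KEP): before acting on node-level "some" pressure, check whether the pods driving it are simply being throttled against their own cpu.max, e.g. by reading nr_throttled and nr_periods from each pod slice's cpu.stat (real cgroup v2 fields). The helper names and the 50% threshold below are illustrative assumptions:

```python
# Hypothetical sketch: decide whether a pod slice's CPU pressure is
# self-inflicted throttling (its own cpu.max) rather than node contention.
# cgroup v2 cpu.stat exposes nr_periods / nr_throttled / throttled_usec.

def parse_kv(text: str) -> dict:
    # cpu.stat is "key value" pairs, one per line
    return {k: int(v) for k, v in (line.split() for line in text.strip().splitlines())}

def looks_throttled(cpu_stat: str, threshold: float = 0.5) -> bool:
    stats = parse_kv(cpu_stat)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return False
    # If most CFS periods ended throttled, the pressure is the limit's doing,
    # not contention with neighbors.
    return stats["nr_throttled"] / periods >= threshold

sample = """usage_usec 123456789
user_usec 120000000
system_usec 3456789
nr_periods 10000
nr_throttled 9900
throttled_usec 980000000"""

print(looks_throttled(sample))  # True: ~99% of periods were throttled
```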
