Big spikes on CPU and memory of the agent under high load

**What happened**:

At one of our customers we see huge spikes in memory and CPU usage of the network policy agent when the cluster is under high load. The customer runs a lot of short-lived jobs. I can reproduce this behavior on our staging cluster by simulating high load with many short jobs.

Under normal operation the network policy agent uses around 75 MB of memory, but under load it spikes to 300-700 MB. Attached is a graph of the aws-node pod during my load test. The CNI never uses more than 50 MB.

![Image](https://github.com/user-attachments/assets/243384a5-e227-4d51-b948-69ec1b999584)
 
CPU usage also spikes, from 10m to 80-100m:

![Image](https://github.com/user-attachments/assets/219a5ef2-94cb-4bba-941a-bfc442abce4c)

This is problematic because we initially limited the memory and CPU of the aws-node pod, but this caused aws-cni to crash-loop and brought the whole cluster down: new pods did not receive an IP address and stayed pending for a long time.

In my opinion such a spread is unworkable. What are the recommended requests and limits for CPU and memory of the network policy agent? Do you have any load test results we can compare our situation to?


**Attach logs**
I ran eks-log-collector.sh on the host and can send the logs if that helps.

**What you expected to happen**:
- Is this expected behavior?
- I would like to understand why the network policy agent uses so much memory and CPU.
- Ideally, I want stable resource consumption from the network policy agent.
- Failing that, I would settle for recommended memory and CPU limits for the component.

**How to reproduce it (as minimally and precisely as possible)**:
- My load test runs 4 threads, each creating 50 pods that just sleep for 10 seconds.
- Once a thread's 50 pods have finished, it launches another 50 pods.
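
The load pattern described above can be sketched roughly as follows. This is not the actual test script, only an illustration of its shape. The function names, the pod naming scheme, and the `busybox` image are assumptions, and the `submit` and `wait_for_batch` callables are hypothetical hooks that would have to wrap whatever client is used to create pods and wait for them to finish.

```python
from concurrent.futures import ThreadPoolExecutor


def build_sleep_pod(name, sleep_seconds=10, namespace="default"):
    """Return a Pod manifest (as a dict) that sleeps and then exits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "sleep",
                "image": "busybox",  # assumption: any image providing `sleep`
                "command": ["sleep", str(sleep_seconds)],
            }],
        },
    }


def run_worker(worker_id, rounds, pods_per_round, submit, wait_for_batch):
    """One thread: create a batch of pods, wait for it, then start the next."""
    for round_no in range(rounds):
        names = [f"load-{worker_id}-{round_no}-{i}" for i in range(pods_per_round)]
        for name in names:
            submit(build_sleep_pod(name))
        # Hypothetical hook: block until every pod in this batch has finished.
        wait_for_batch(names)


def run_load_test(submit, wait_for_batch, workers=4, rounds=2, pods_per_round=50):
    """Run `workers` threads in parallel, each creating `rounds` batches."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_worker, w, rounds, pods_per_round, submit, wait_for_batch)
            for w in range(workers)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker thread


if __name__ == "__main__":
    # Dry run without a cluster: collect manifests instead of creating pods.
    created = []
    run_load_test(created.append, lambda names: None, rounds=1)
    print(len(created), "pod manifests generated")
```

The dry run only builds manifests in memory. Against a real cluster, `submit` would create the pod and `wait_for_batch` would poll until the batch has completed.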

**Anything else we need to know?**:
I looked at the similar issue regarding high CPU and memory usage. We do not use network policies on standalone pods. We use them on StatefulSets, so that our regular workloads cannot reach them and only our custom components running in specific namespaces can.

We noticed the issue after upgrading to Kubernetes v1.32.3, but I do not see how the two are linked.
I ran the same test with an older version of the network policy agent, v1.2.0-eksbuild.1, but saw no difference.

**Environment**:
- Kubernetes version (use `kubectl version`): v1.32.3-eks-4096722
- CNI Version: v1.19.5-eksbuild.1
- Network Policy Agent Version: v1.2.1-eksbuild.1
- OS (e.g: `cat /etc/os-release`):
```
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.7.20250512"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2029-06-30"
```
- Kernel (e.g. `uname -a`):
```
Linux ip-10-2-158-107.eu-west-1.compute.internal 6.1.134-152.225.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Wed May  7 09:10:59 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
```

