Elastic agent uses too much memory per Pod in k8s #5835

@swiatekm

Description

In its default configuration, the agent has the kubernetes provider enabled. In DaemonSet mode, this provider keeps track of data about Pods scheduled on the Node the agent is running on. This issue concerns the fact that the agent process itself uses an excessive amount of memory if the number of these Pods is high (for the purpose of this issue, that means close to the default Kubernetes limit of 110 Pods per Node). This was originally discovered while troubleshooting #4729.

This effect is visible even if we disable all inputs and self-monitoring, leaving the agent to run as a single process without any components. This strongly implies it has to do with configuration variable providers. I used this empty configuration in my testing to limit confounding variables from Beats, but the effect is more pronounced when components using variables are present in the configuration.

Here's a graph of agent memory consumption as the number of Pods on the Node increases from 10 to 110:

[Image: graph of agent memory consumption rising as Pod count grows from 10 to 110]

A couple of observations from looking at how configuration changes affect this behaviour:

  • Making the garbage collector more aggressive makes most of the effect disappear, as does restarting the Pod. This is very likely caused by allocation churn rather than steady-state heap utilization; see the sketch after this list.
  • Disabling the host provider also reduces the effect greatly. When creating variables, each entry from a dynamic provider gets its own copy of data from context providers. On a large node, there can be quite a bit of host provider data. This is also visible when looking at variables.yml in diagnostics.
  • Increasing the debounce time on variable emission in the composable coordinator doesn't help much.
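
The churn-vs-retained-heap point can be illustrated with a small, self-contained Go program. This is not agent code; it just mimics the suspected pattern (rebuilding per-Pod mappings on every event and discarding the previous generation) and shows how a more aggressive GC setting changes the retained heap. All names and sizes here are made up for illustration; exact numbers will vary by machine and Go version.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// churn simulates the suspected allocation pattern: on every "event",
// per-Pod variable mappings are rebuilt from scratch and the previous
// generation becomes garbage.
func churn(events, pods int) {
	for i := 0; i < events; i++ {
		vars := make([]map[string]any, 0, pods)
		for p := 0; p < pods; p++ {
			m := map[string]any{"pod": p}
			// Each per-Pod mapping carries its own copy of "host" metadata.
			for k := 0; k < 200; k++ {
				m[fmt.Sprintf("host.field%d", k)] = "some host metadata value"
			}
			vars = append(vars, m)
		}
		_ = vars // dropped on the next iteration
	}
}

func report(label string) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("%-10s HeapInuse=%6d KiB  HeapAlloc=%6d KiB  NumGC=%d\n",
		label, ms.HeapInuse/1024, ms.HeapAlloc/1024, ms.NumGC)
}

func main() {
	churn(100, 110)
	report("GOGC=100")

	debug.SetGCPercent(20) // roughly equivalent to setting GOGC=20
	churn(100, 110)
	report("GOGC=20")
}
```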

Test setup

  • Single node KiND cluster, default settings.
  • Default standalone kubernetes manifest, all inputs removed from configuration.
  • A single nginx Deployment, starting at 0 replicas, and later scaled up to the maximum the Node allows.

More data

I resized my Deployment a couple of times and looked at a heap profile of the agent process:

[Image: heap profile of the agent process]

The churn appears to come primarily from recreating all the variables whenever a Pod is updated. The call to composable.cloneMap is where we copy data from the host provider.
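
For context, the copy in question is essentially a recursive deep clone of nested maps. The sketch below is an illustrative approximation of what a cloneMap-style helper does, not the actual implementation in the composable package:

```go
package main

import "fmt"

// cloneMap approximates the deep copy performed when context provider data
// (for example, the host provider's fields) is attached to a dynamic
// provider mapping. Every nested map is reallocated, so the cost is paid
// once per Pod, per update.
func cloneMap(src map[string]any) map[string]any {
	if src == nil {
		return nil
	}
	dst := make(map[string]any, len(src))
	for k, v := range src {
		switch val := v.(type) {
		case map[string]any:
			dst[k] = cloneMap(val)
		case []any:
			cp := make([]any, len(val))
			copy(cp, val) // elements themselves are not deep-copied here
			dst[k] = cp
		default:
			dst[k] = val
		}
	}
	return dst
}

func main() {
	host := map[string]any{
		"hostname": "node-1",
		"ip":       []any{"10.0.0.1", "fe80::1"},
		"os":       map[string]any{"family": "debian", "kernel": "6.1.0"},
	}
	clone := cloneMap(host)
	clone["hostname"] = "changed"
	fmt.Println(host["hostname"], clone["hostname"]) // node-1 changed
}
```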

Root cause

The root cause appears to be a combination of behaviours in the variable pipeline:

  1. All variables are recreated and emitted whenever there's any change to the underlying data. In Kubernetes, with a lot of Pods on a Node, changes can be quite frequent, and the amount of data is non-trivial.
  2. We copy all the data from all context providers into every dynamic provider mapping. If there are a lot of dynamic provider mappings (one for each Pod), this can be quite expensive; see the sketch after this list.
  3. I suspect there is also more copying going on in component model generation, but I haven't looked into it in detail.
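
Putting points 1 and 2 together, the per-update cost looks roughly like the sketch below. The names here are hypothetical, not the real coordinator code, and it reuses the cloneMap sketch from the "More data" section above: a change to a single Pod rebuilds every mapping, and every mapping gets its own copy of all context provider data, so each update allocates on the order of pods × context data size.

```go
// rebuildVars is a hypothetical sketch of the suspected per-update behaviour.
// Any change to a single Pod rebuilds every per-Pod mapping, and each mapping
// receives a fresh copy of all context provider data, so allocation per update
// is roughly O(pods * contextDataSize). cloneMap is the sketch shown earlier.
func rebuildVars(contextData map[string]any, pods []map[string]any) []map[string]any {
	vars := make([]map[string]any, 0, len(pods))
	for _, pod := range pods {
		mapping := cloneMap(contextData) // per-Pod copy of host/env/etc. data
		mapping["kubernetes"] = map[string]any{"pod": cloneMap(pod)}
		vars = append(vars, mapping)
	}
	return vars
}
```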
