
Rate Limit Processor #35204

Open
@juergen-kaiser-by

Description


Component(s)

No response

Is your feature request related to a problem? Please describe.

We run an observability backend (Elasticsearch) shared by many teams and services (thousands of them). The services run in Kubernetes clusters, and we want to collect the logs of all pods.

Problem: If a service/pod becomes very noisy for some reason, it can burden the backend so much that all other teams feel it. In short: one team can ruin the day for all others.

We would like to limit the effect a single instance or service can have on the observability backend.

Describe the solution you'd like

  • There should be a method to limit the flow of logs based on some attributes. We apply a standardized set of labels to each Kubernetes deployment, so we would work with those.
  • Dropping log lines would be okay (for us), since we cannot assume that a noisy service will stop being noisy soon. Sampling is not okay because of the nature of logs.
  • If rate limiting happens then we and our users should be able to see it.
    • A single metric ingested into some (other) pipeline would suffice.
    • Let us choose a set of attributes that should be copied from the rate-limited logs to the metric so that we can map it to the affected service. In our case, we would like our standard pod labels copied to the metric.

Considering the points above, we think that there should be a processor for this.

We have no requirements regarding the algorithm backing the rate limiting. It seems that a token bucket filter (example blog entry) is a reasonable choice here.
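To make the idea concrete, here is a minimal Go sketch of per-key token bucket rate limiting, not a proposal for the actual processor API. The names (tokenBucket, keyedLimiter, Allow) and the choice to build the key from the configured attribute values (e.g. our standard pod labels) are our assumptions; drops are counted per key so they could back the visibility metric described above.

```go
package ratelimitsketch

import (
	"strings"
	"sync"
	"time"
)

// tokenBucket refills at `rate` tokens per second, up to `burst`.
type tokenBucket struct {
	tokens   float64
	rate     float64
	burst    float64
	lastFill time.Time
}

func (b *tokenBucket) allow(now time.Time) bool {
	// Refill based on elapsed time, capped at the burst size.
	b.tokens += now.Sub(b.lastFill).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.lastFill = now
	if b.tokens < 1 {
		return false // over the limit: the caller drops the log record
	}
	b.tokens--
	return true
}

// keyedLimiter keeps one bucket per attribute-key combination and counts
// drops so they could be exported as a metric carrying the same attributes.
type keyedLimiter struct {
	mu      sync.Mutex
	buckets map[string]*tokenBucket
	dropped map[string]int64
	rate    float64
	burst   float64
}

func newKeyedLimiter(rate, burst float64) *keyedLimiter {
	return &keyedLimiter{
		buckets: map[string]*tokenBucket{},
		dropped: map[string]int64{},
		rate:    rate,
		burst:   burst,
	}
}

// Allow is called once per log record with the values of the configured
// attributes (hypothetically, our standard pod labels). It returns false
// when the record should be dropped.
func (l *keyedLimiter) Allow(attrValues []string) bool {
	key := strings.Join(attrValues, "|")
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[key]
	if !ok {
		b = &tokenBucket{tokens: l.burst, rate: l.rate, burst: l.burst, lastFill: time.Now()}
		l.buckets[key] = b
	}
	if b.allow(time.Now()) {
		return true
	}
	l.dropped[key]++ // later exported as a "records dropped" metric per key
	return false
}
```

This is only meant to illustrate that the per-service limiting and the visibility metric can share the same attribute key; the real processor would read the key attributes from its configuration.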

Describe alternatives you've considered

Rate Limiter in Receivers

Receiver rate limiting is okay if you only work with attributes available in the receivers. In our case, those are insufficient because we need pod labels. As a workaround, we could inject them into the collector config as environment variables and focus the collector on a single pod by deploying it as a sidecar. However, a sidecar deployment consumes too many resources across all pods because we have large clusters (>= 10K pods).

A benefit of rate limiting in receivers is that it could let collector users choose between dropping incoming telemetry and simply not receiving it, effectively creating backpressure.

Rate limiting in Receivers is discussed in #6908.

Rate Limiter as Extension

We lack the knowledge about how extensions work internally to say much about this option. Rate limiting as an extension is also discussed in #6908.

Additional context
