Conversation

@ericywl
Contributor

@ericywl ericywl commented Nov 5, 2025

Summary

Add a config option to control Gubernator behavior.

Testing

Unit testing is done via Go tests.

Manual testing was done as follows:

  1. Set a go.mod replace directive to use the new ratelimitprocessor:
    replace github.com/elastic/opentelemetry-collector-components/processor/ratelimitprocessor => ../../../opentelemetry-collector-components/processor/ratelimitprocessor/
    
  2. Update the Helm values to add gubernator_behavior (see the sketch after these steps):
    ratelimit:
      serverless:
        ...
        gubernator_behavior: 2
    
  3. Update the benchmarks Helm values for Ingest resources / rate limit rate & burst as desired.
  4. Run the benchmarks:
    make benchmark-aws
    DOCKER_IMAGE_TAG="v0.5.0@sha256:c070470aef97b9cabe5c742cd6f5ac74b30de79ab23a50743a2e793fb309063e" make run-otelbench mode=k8s
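
As a rough illustration of what the numeric gubernator_behavior value means, here is a minimal Go sketch of how it lines up with Gubernator's Behavior bit flags. The constant values mirror Gubernator's protobuf enum, but the Config struct and field names below are assumptions for illustration only, not the processor's actual types:

```go
// Hypothetical sketch: mapping the numeric gubernator_behavior value onto
// Gubernator's Behavior bit flags. The constants mirror Gubernator's protobuf
// enum; the Config struct is purely illustrative.
package main

import "fmt"

type Behavior int32

const (
	BehaviorBatching       Behavior = 0  // default: batch requests to the owning peer
	BehaviorNoBatching     Behavior = 1  // send each request to the owner individually
	BehaviorGlobal         Behavior = 2  // answer from a local cache, sync with the owner asynchronously
	BehaviorDrainOverLimit Behavior = 32 // see Gubernator's Behavior docs
)

// Config stands in for the ratelimitprocessor configuration (illustrative only).
type Config struct {
	GubernatorBehavior Behavior `mapstructure:"gubernator_behavior"`
}

func main() {
	// gubernator_behavior: 2 in the Helm values above corresponds to GLOBAL.
	cfg := Config{GubernatorBehavior: BehaviorGlobal}
	fmt.Println(cfg.GubernatorBehavior&BehaviorGlobal != 0) // true
	// Combining flags is a bitwise OR, e.g. GLOBAL (2) | DRAIN_OVER_LIMIT (32) = 34.
}
```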
    

@ericywl ericywl requested review from a team as code owners November 5, 2025 07:14
@ericywl ericywl changed the title Add gubernator behavior config [processor/ratelimit] Add gubernator behavior config Nov 5, 2025
@ericywl ericywl self-assigned this Nov 5, 2025
Contributor

@marclop marclop left a comment

Have you tested this with unit tests and also manually? In my previous testing I saw some unintended behavior. At a minimum, we should update our tests to cover both BATCHING and GLOBAL.

@ericywl
Contributor Author

ericywl commented Nov 5, 2025

Have you tested this with unit tests and also manually? In my previous testing I saw some unintended behavior. At a minimum, we should update our tests to cover both BATCHING and GLOBAL.

Not yet; I'm currently setting up the environment to test this manually, and also looking at the existing unit tests.

@ericywl ericywl marked this pull request as draft November 5, 2025 08:44
@ericywl
Contributor Author

ericywl commented Nov 5, 2025

It seems the unit tests broken by GLOBAL are mainly the dynamic rate limiter ones. The tests check the remaining hits in the event channel after each rate limit request, but GLOBAL behavior does not guarantee that the remaining hits will be accurate, so the tests fail.

EDIT: It seems the GLOBAL request is somehow double-counting the hits. The event below is supposed to have 940 remaining but has 880 instead.

{name:"requests_per_sec"  unique_key:"default"  hits:60  limit:1000  duration:1000  algorithm:LEAKY_BUCKET  behavior:34  burst:1000  created_at:1762401297012 limit:1000  remaining:880  reset_time:1762401297132}

Contributor

@marclop marclop left a comment

We should update the README.md to include the gubernator_behavior field, and link to the Gubernator Architecture.md.

@ericywl ericywl force-pushed the add-gubernator-behavior branch from 42a77e6 to c6c0c8f on November 7, 2025 03:25
@ericywl ericywl marked this pull request as ready for review November 10, 2025 10:22
Contributor

@simitt simitt left a comment

LGTM, please wait for @marclop 's final approval since he left some questions in his last review.

Contributor

@marclop marclop left a comment

Changes LGTM, thanks for updating all the unit tests. Did you manage to test this manually with GLOBAL behavior to ensure the limits are properly applied as intended?

@ericywl
Contributor Author

ericywl commented Nov 11, 2025

Changes LGTM, thanks for updating all the unit tests. Did you manage to test this manually with GLOBAL behavior to ensure the limits are properly applied as intended?

I have tested that the rate limit is applied, by using a low rate limit threshold and running the benchmark load test against it. But I'm not sure if that counts as verifying that the limits are properly applied as intended 🤔

@ericywl
Contributor Author

ericywl commented Nov 24, 2025

Benchmarked GLOBAL vs. BATCHING behavior, with 2 replicas of the ingest collector.

Performance

With BATCHING:

BenchmarkOTelbench/traces-otlphttp-1500      122         538283832 ns/op                 0 failed_logs/s                 0 failed_metric_points/s                0 failed_requests/s             0 failed_spans/s           0 logs/s                0 metric_points/s            1858 requests/s        18708 spans/s
BenchmarkOTelbench/traces-otlphttp-1500      132         545949767 ns/op                 0 failed_logs/s                 0 failed_metric_points/s                0 failed_requests/s             0 failed_spans/s           0 logs/s                0 metric_points/s            1832 requests/s        18445 spans/s

With GLOBAL:

BenchmarkOTelbench/traces-otlphttp-1500      188         378135965 ns/op                 0 failed_logs/s                 0 failed_metric_points/s                0 failed_requests/s             0 failed_spans/s           0 logs/s                0 metric_points/s            2645 requests/s        26631 spans/s
BenchmarkOTelbench/traces-otlphttp-1500      204         383595610 ns/op                 0 failed_logs/s                 0 failed_metric_points/s                0 failed_requests/s             0 failed_spans/s           0 logs/s                0 metric_points/s            2607 requests/s        26252 spans/s

The GLOBAL behavior seems to consistently perform better, presumably because it answers rate limit requests from a local cache.
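
As a conceptual picture of that difference (not Gubernator's actual code; all names below are invented): with BATCHING each decision waits on a batched RPC to the peer that owns the key, while with GLOBAL the decision comes from a locally cached counter and the hits are shipped to the owner asynchronously, trading some accuracy for latency.

```go
// Conceptual toy contrasting the two request paths.
package main

import (
	"fmt"
	"time"
)

type ownerPeer struct{ remaining int64 }

// apply stands in for the BATCHING path: every decision pays a network round trip.
func (o *ownerPeer) apply(hits int64) bool {
	time.Sleep(time.Millisecond) // simulated RPC latency to the owning peer
	o.remaining -= hits
	return o.remaining >= 0
}

type globalCache struct {
	remaining int64
	pending   chan int64 // hits queued for asynchronous reconciliation with the owner
}

// apply stands in for the GLOBAL path: decide locally, never block on the network.
func (c *globalCache) apply(hits int64) bool {
	c.remaining -= hits
	select {
	case c.pending <- hits:
	default: // queue full; the owner catches up on the next sync
	}
	return c.remaining >= 0
}

func main() {
	owner := &ownerPeer{remaining: 1000}
	cache := &globalCache{remaining: 1000, pending: make(chan int64, 64)}
	fmt.Println(owner.apply(60), cache.apply(60)) // same decision, very different per-request cost
}
```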

Rate Limit Logs

Both also show the rate limit being applied at similar frequencies:

With BATCHING:

[motel-ingest-collector-us-west-2a-5685b7b7c5-hvvkb] {"log.level":"error","@timestamp":"2025-11-24T08:55:47.364Z","message":"request is over the limits defined by the rate limiter","resource":{"cloud.availability_zone":"us-west-2a","k8s.namespace.name":"motel-ingest-collector","k8s.node.name":"ip-172-31-19-140.us-west-2.compute.internal","k8s.pod.name":"motel-ingest-collector-us-west-2a-5685b7b7c5-hvvkb","k8s.pod.uid":"f82510be-51ba-4d09-bd71-d80336485d9b","orchestrator.cluster.name":"default","orchestrator.deploymentslice":"","orchestrator.environment":"default","service.instance.id":"25934b7f-3b0a-47d8-ae89-5238cbd3972b","service.name":"motel-ingest-collector","service.version":"git"},"otelcol.component.id":"ratelimit","otelcol.component.kind":"processor","otelcol.pipeline.id":"logs","otelcol.signal":"logs","hits":100058,"x-elastic-project-id":"local","x-elastic-target-id":"local","x-elastic-target-type":"serverless","error":{"message":"rpc error: code = ResourceExhausted desc = too many requests"},"ecs.version":"1.6.0","log.origin.stack_trace":"github.com/elastic/opentelemetry-collector-components/processor/ratelimitprocessor.rateLimit\n\tgithub.com/elastic/opentelemetry-collector-components/processor/[email protected]/processor.go:259\ngithub.com/elastic/opentelemetry-collector-components/processor/ratelimitprocessor.(*LogsRateLimiterProcessor).ConsumeLogs\n\tgithub.com/elastic/opentelemetry-collector-components/processor/[email protected]/processor.go:267\ngo.opentelemetry.io/collector/service/internal/obsconsumer.obsLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/obsconsumer/logs.go:68\ngo.opentelemetry.io/collector/service/internal/refconsumer.refLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/refconsumer/logs.go:29\ngo.opentelemetry.io/collector/service/internal/obsconsumer.obsLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/obsconsumer/logs.go:68\ngithub.com/elastic/hosted-otel-collector/internal/processor/errorsanitizationprocessor.(*Processor).ConsumeLogs\n\tgithub.com/elastic/hosted-otel-collector/internal/processor/[email protected]/processor.go:100\ngo.opentelemetry.io/collector/service/internal/obsconsumer.obsLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/obsconsumer/logs.go:68\ngo.opentelemetry.io/collector/service/internal/refconsumer.refLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/refconsumer/logs.go:29\ngo.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/logs.go:27\ngo.opentelemetry.io/collector/service/internal/obsconsumer.obsLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/obsconsumer/logs.go:68\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/logs.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/[email protected]/internal/logs/otlp.go:41\ngo.opentelemetry.io/collector/pdata/plog/plogotlp.rawLogsServer.Export\n\tgo.opentelemetry.io/collector/[email protected]/plog/plogotlp/grpc.go:86\ngo.opentelemetry.io/collector/pdata/internal/otelgrpc.logsServiceExportHandler.func1\n\tgo.opentelemetry.io/collector/[email protected]/internal/otelgrpc/logs_service.go:72\ngithub.com/elastic/hosted-otel-collector/internal/extension/authmiddlewareextension.(*authMiddleware).GetGRPCServerOptions.(*authMiddleware).getAuthUnaryServerInterceptor.func1\n\tgithub.com/elastic/hosted-otel-collector/internal/extension/[email 
protected]/middleware.go:107\ngoogle.golang.org/grpc.getChainUnaryHandler.func1.getChainUnaryHandler.1\n\tgoogle.golang.org/[email protected]/server.go:1243\ngithub.com/elastic/hosted-otel-collector/internal/extension/timeoutmiddlewareextension.timeoutMiddleware.grpcUnaryInterceptor\n\tgithub.com/elastic/hosted-otel-collector/internal/extension/[email protected]/middleware.go:62\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\tgoogle.golang.org/[email protected]/server.go:1243\ngithub.com/elastic/hosted-otel-collector/internal/extension/ecproutingmiddlewareextension.(*middleware).GetGRPCServerOptions.func1\n\tgithub.com/elastic/hosted-otel-collector/internal/extension/[email protected]/middleware.go:94\ngoogle.golang.org/grpc.getChainUnaryHandler.func1.getChainUnaryHandler.1\n\tgoogle.golang.org/[email protected]/server.go:1243\ngithub.com/elastic/hosted-otel-collector/internal/extension/telemetrymiddlewareextension.(*telemetryMiddleware).GetGRPCServerOptions.(*telemetryMiddleware).unaryServerInterceptor.func1\n\tgithub.com/elastic/hosted-otel-collector/internal/extension/[email protected]/extension.go:162\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\tgoogle.golang.org/[email protected]/server.go:1243\ngo.opentelemetry.io/collector/config/configgrpc.(*ServerConfig).getGrpcServerOptions.enhanceWithClientInformation.func9\n\tgo.opentelemetry.io/collector/config/[email protected]/configgrpc.go:576\ngoogle.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1\n\tgoogle.golang.org/[email protected]/server.go:1234\ngo.opentelemetry.io/collector/pdata/internal/otelgrpc.logsServiceExportHandler\n\tgo.opentelemetry.io/collector/[email protected]/internal/otelgrpc/logs_service.go:74\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\tgoogle.golang.org/[email protected]/server.go:1431\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/[email protected]/server.go:1842\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/[email protected]/server.go:1061"}

With GLOBAL:

[motel-ingest-collector-us-west-2c-5bcd48c9bb-djtt9] {"log.level":"error","@timestamp":"2025-11-24T10:11:12.817Z","message":"request is over the limits defined by the rate limiter","resource":{"cloud.availability_zone":"us-west-2c","k8s.namespace.name":"motel-ingest-collector","k8s.node.name":"ip-172-31-9-166.us-west-2.compute.internal","k8s.pod.name":"motel-ingest-collector-us-west-2c-5bcd48c9bb-djtt9","k8s.pod.uid":"e8163ba7-ae4f-4d62-94c5-ae9e63ba3cc3","orchestrator.cluster.name":"default","orchestrator.deploymentslice":"","orchestrator.environment":"default","service.instance.id":"73944440-5813-4611-8178-31a0ca11b553","service.name":"motel-ingest-collector","service.version":"git"},"otelcol.component.id":"ratelimit","otelcol.component.kind":"processor","otelcol.pipeline.id":"logs","otelcol.signal":"logs","hits":105658,"x-elastic-project-id":"local","x-elastic-target-id":"local","x-elastic-target-type":"serverless","error":{"message":"rpc error: code = ResourceExhausted desc = too many requests"},"ecs.version":"1.6.0","log.origin.stack_trace":"github.com/elastic/opentelemetry-collector-components/processor/ratelimitprocessor.rateLimit\n\tgithub.com/elastic/opentelemetry-collector-components/processor/[email protected]/processor.go:259\ngithub.com/elastic/opentelemetry-collector-components/processor/ratelimitprocessor.(*LogsRateLimiterProcessor).ConsumeLogs\n\tgithub.com/elastic/opentelemetry-collector-components/processor/[email protected]/processor.go:267\ngo.opentelemetry.io/collector/service/internal/obsconsumer.obsLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/obsconsumer/logs.go:68\ngo.opentelemetry.io/collector/service/internal/refconsumer.refLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/refconsumer/logs.go:29\ngo.opentelemetry.io/collector/service/internal/obsconsumer.obsLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/obsconsumer/logs.go:68\ngithub.com/elastic/hosted-otel-collector/internal/processor/errorsanitizationprocessor.(*Processor).ConsumeLogs\n\tgithub.com/elastic/hosted-otel-collector/internal/processor/[email protected]/processor.go:100\ngo.opentelemetry.io/collector/service/internal/obsconsumer.obsLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/obsconsumer/logs.go:68\ngo.opentelemetry.io/collector/service/internal/refconsumer.refLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/refconsumer/logs.go:29\ngo.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/logs.go:27\ngo.opentelemetry.io/collector/service/internal/obsconsumer.obsLogs.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/internal/obsconsumer/logs.go:68\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/logs.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/[email protected]/internal/logs/otlp.go:41\ngo.opentelemetry.io/collector/pdata/plog/plogotlp.rawLogsServer.Export\n\tgo.opentelemetry.io/collector/[email protected]/plog/plogotlp/grpc.go:86\ngo.opentelemetry.io/collector/pdata/internal/otelgrpc.logsServiceExportHandler.func1\n\tgo.opentelemetry.io/collector/[email protected]/internal/otelgrpc/logs_service.go:72\ngithub.com/elastic/hosted-otel-collector/internal/extension/authmiddlewareextension.(*authMiddleware).GetGRPCServerOptions.(*authMiddleware).getAuthUnaryServerInterceptor.func1\n\tgithub.com/elastic/hosted-otel-collector/internal/extension/[email 
protected]/middleware.go:107\ngoogle.golang.org/grpc.getChainUnaryHandler.func1.getChainUnaryHandler.1\n\tgoogle.golang.org/[email protected]/server.go:1243\ngithub.com/elastic/hosted-otel-collector/internal/extension/timeoutmiddlewareextension.timeoutMiddleware.grpcUnaryInterceptor\n\tgithub.com/elastic/hosted-otel-collector/internal/extension/[email protected]/middleware.go:62\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\tgoogle.golang.org/[email protected]/server.go:1243\ngithub.com/elastic/hosted-otel-collector/internal/extension/ecproutingmiddlewareextension.(*middleware).GetGRPCServerOptions.func1\n\tgithub.com/elastic/hosted-otel-collector/internal/extension/[email protected]/middleware.go:94\ngoogle.golang.org/grpc.getChainUnaryHandler.func1.getChainUnaryHandler.1\n\tgoogle.golang.org/[email protected]/server.go:1243\ngithub.com/elastic/hosted-otel-collector/internal/extension/telemetrymiddlewareextension.(*telemetryMiddleware).GetGRPCServerOptions.(*telemetryMiddleware).unaryServerInterceptor.func1\n\tgithub.com/elastic/hosted-otel-collector/internal/extension/[email protected]/extension.go:162\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\tgoogle.golang.org/[email protected]/server.go:1243\ngo.opentelemetry.io/collector/config/configgrpc.(*ServerConfig).getGrpcServerOptions.enhanceWithClientInformation.func9\n\tgo.opentelemetry.io/collector/config/[email protected]/configgrpc.go:576\ngoogle.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1\n\tgoogle.golang.org/[email protected]/server.go:1234\ngo.opentelemetry.io/collector/pdata/internal/otelgrpc.logsServiceExportHandler\n\tgo.opentelemetry.io/collector/[email protected]/internal/otelgrpc/logs_service.go:74\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\tgoogle.golang.org/[email protected]/server.go:1431\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/[email protected]/server.go:1842\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/[email protected]/server.go:1061"}

@ericywl
Contributor Author

ericywl commented Nov 24, 2025

Test failure seems unrelated to our change:

--- FAIL: TestConcurrentRequestsTelemetry (0.00s)
    processor_test.go:505: 
        	Error Trace:	/home/runner/work/opentelemetry-collector-components/opentelemetry-collector-components/processor/ratelimitprocessor/processor_test.go:505
        	Error:      	Not equal: 
        	            	expected: 2
        	            	actual  : 1
        	Test:       	TestConcurrentRequestsTelemetry
        	Messages:   	expected to observe otelcol_ratelimit.concurrent_requests == 2

@vigneshshanmugam
Member

The GLOBAL behavior seems to be consistently performing better, presumably due to using local cache for rate limit requests.

Can you also link the resource usage across the Ingest collector when you performed the test?

Test failure seems unrelated to our change:

Might it be related, since we are using the value from the cache? It would be good to validate that.

@ericywl
Contributor Author

ericywl commented Nov 25, 2025

Might be related? Since we are using the value from cache? Would be good to validate that.

TestConcurrentRequestsTelemetry uses the local rate limiter, so it shouldn't be affected by the Gubernator GLOBAL behavior.

@ericywl
Contributor Author

ericywl commented Nov 25, 2025

Can you also link the resource usage across the Ingest collector when you performed the test?

With BATCHING, CPU usage is around 95% (of 500m) and memory around 12% (of 1Gi).

{
  "@timestamp": [
    "2025-11-25T05:34:48.298Z"
  ],
  "csp": [
    "aws"
  ],
  "data_stream.dataset": [
    "kubeletstatsreceiver"
  ],
  "data_stream.namespace": [
    "default"
  ],
  "data_stream.type": [
    "metrics"
  ],
  "k8s.pod.cpu_limit_utilization": [
    0.954177
  ],
  "k8s.pod.memory_limit_utilization": [
    0.1245166
  ],
  "kubernetes.namespace": [
    "motel-ingest-collector"
  ],
  "kubernetes.pod.name": [
    "motel-ingest-collector-us-west-2d-5c778bdccb-bznf8"
  ],
  "kubernetes.pod.uid": [
    "17319341-838c-4339-8ec0-9e5188098968"
  ],
  "run_id": [
    "local"
  ],
  "_id": "UwqCuZoB6bLp83H8LBFT",
  "_index": ".ds-metrics-kubeletstatsreceiver-default-2025.10.29-000004",
  "_score": null
}

With GLOBAL, CPU usage is around 95% (of 500m) and memory around 20% (of 1Gi).

{
  "@timestamp": [
    "2025-11-25T05:29:58.301Z"
  ],
  "csp": [
    "aws"
  ],
  "data_stream.dataset": [
    "kubeletstatsreceiver"
  ],
  "data_stream.namespace": [
    "default"
  ],
  "data_stream.type": [
    "metrics"
  ],
  "k8s.pod.cpu_limit_utilization": [
    0.9515024
  ],
  "k8s.pod.memory_limit_utilization": [
    0.2022789
  ],
  "kubernetes.namespace": [
    "motel-ingest-collector"
  ],
  "kubernetes.pod.name": [
    "motel-ingest-collector-us-west-2d-664b97b967-qnr6d"
  ],
  "kubernetes.pod.uid": [
    "ebf66552-4948-4b02-bcd4-abc4a87a89a1"
  ],
  "run_id": [
    "local"
  ],
  "_id": "DgJ9uZoBohrT-wfumJOE",
  "_index": ".ds-metrics-kubeletstatsreceiver-default-2025.10.29-000004",
  "_score": null
}

@vigneshshanmugam
Member

With GLOBAL, CPU usage around 95% (of 500m) and memory around 20% (of 1Gi).

Do we know the reason for the memory growth? It seems like a big difference if it comes down to caching the limits locally with the GLOBAL behaviour.
