@dricross dricross commented Oct 24, 2025

Description of the issue

Integrate with the new algorithm for converting OTel classic histograms to CloudWatch's Values/Counts. This significantly improves percentile metrics for classic histograms.

Description of changes

Integrate with new algorithm in contrib repo: amazon-contributing/opentelemetry-collector-contrib#376

Associated integ test update to fix histogram definitions: aws/amazon-cloudwatch-agent-test#617

New Algorithm

This algorithm converts each bucket of the input histogram into at most 10 value/count pairs, aka "inner buckets". The values of the inner buckets are spread evenly across the bucket span. The counts of the inner buckets are determined using an exponential mapping algorithm: counts are weighted more heavily toward one side of the bucket according to an exponential function that depends on how the density of the neighboring buckets is changing.
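As a rough illustration of the idea (not the implementation from the contrib PR), the sketch below splits one bucket into evenly spaced inner values and distributes its count with exponential weights. The function name `splitBucket` and the `skew` parameter are assumptions made for this example; the description above only says the weighting depends on how the density of nearby buckets changes, not how that is computed.

```go
package main

import (
	"fmt"
	"math"
)

// maxInnerBuckets mirrors the "at most 10" limit from the description.
const maxInnerBuckets = 10

// splitBucket spreads count observations from the bucket (lower, upper]
// across evenly spaced inner values. skew is a hypothetical knob: positive
// values push counts toward upper, negative values toward lower, and zero
// gives a uniform spread.
func splitBucket(lower, upper float64, count uint64, skew float64) ([]float64, []uint64) {
	if count == 0 || upper <= lower {
		return nil, nil
	}
	n := maxInnerBuckets
	if count < uint64(n) {
		n = int(count)
	}

	values := make([]float64, n)
	weights := make([]float64, n)
	total := 0.0
	width := (upper - lower) / float64(n)
	for i := 0; i < n; i++ {
		// Each value sits at the midpoint of its slice of the bucket span.
		values[i] = lower + width*(float64(i)+0.5)
		// Exponential weight over the relative position within the bucket.
		weights[i] = math.Exp(skew * float64(i) / float64(n))
		total += weights[i]
	}

	// Round the cumulative share rather than each share on its own, so the
	// inner counts always add up to exactly count.
	counts := make([]uint64, n)
	var cum float64
	var prev uint64
	for i := range counts {
		cum += weights[i]
		target := uint64(math.Round(float64(count) * (cum / total)))
		counts[i] = target - prev
		prev = target
	}
	return values, counts
}

func main() {
	values, counts := splitBucket(10, 20, 100, 2.0)
	for i := range values {
		fmt.Printf("value=%.1f count=%d\n", values[i], counts[i])
	}
}
```

With a strongly skewed weighting, some inner buckets can end up with a count of zero; a real implementation would likely drop those pairs before sending them.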

Aggregation

Currently, the cloudwatch output plugin converts input histograms to the internal type RegularDistribution, which contains a series of value/count pairs in a map. When a new histogram datapoint arrives for the same metric within the aggregation interval, it is converted to a RegularDistribution and the two resulting maps are combined, e.g. (where weight is 1):

for bucketNumber, bucketCounts := range fromDistribution.buckets {
	regularDist.buckets[bucketNumber] += bucketCounts * weight
}

With this new conversion algorithm, we delay converting the OTel histogram to a series of value/count pairs until the aggregation interval is complete, to avoid ballooning the number of unique values in the aggregated metric datapoint. Instead, the original OTel histogram datapoint is preserved and incoming histogram datapoints are merged into it, similarly to how the OTel deltatocumulative processor works: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/deltatocumulativeprocessor/internal/data/add.go#L46. This makes the aggregation logic a little messy.
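The merge step can be pictured with a simplified stand-in for pmetric.HistogramDataPoint. This is only a sketch: the `histogram` struct and `merge` function are hypothetical names, min/max handling is omitted, and returning an error on mismatched bounds is an assumption, since the description does not say how the agent handles that case.

```go
package main

import (
	"errors"
	"fmt"
)

// histogram is a simplified, hypothetical stand-in for
// pmetric.HistogramDataPoint that keeps only the fields needed here.
type histogram struct {
	Bounds []float64 // explicit bucket bounds, strictly increasing
	Counts []uint64  // one more entry than Bounds
	Count  uint64
	Sum    float64
}

var errBoundsMismatch = errors.New("histogram bucket bounds differ")

// merge adds from into into bucket by bucket. It leaves into untouched and
// returns an error when the two bucket layouts differ (an assumption made
// for this sketch).
func merge(into *histogram, from histogram) error {
	if len(into.Bounds) != len(from.Bounds) || len(into.Counts) != len(from.Counts) {
		return errBoundsMismatch
	}
	for i := range into.Bounds {
		if into.Bounds[i] != from.Bounds[i] {
			return errBoundsMismatch
		}
	}
	for i := range into.Counts {
		into.Counts[i] += from.Counts[i]
	}
	into.Count += from.Count
	into.Sum += from.Sum
	return nil
}

func main() {
	agg := histogram{Bounds: []float64{1, 5}, Counts: []uint64{1, 2, 3}, Count: 6, Sum: 20}
	next := histogram{Bounds: []float64{1, 5}, Counts: []uint64{4, 0, 1}, Count: 5, Sum: 12}
	if err := merge(&agg, next); err != nil {
		fmt.Println("merge failed:", err)
		return
	}
	fmt.Println(agg.Counts, agg.Count, agg.Sum)
}
```

Keeping the bucket layout intact during the interval means the number of stored values stays fixed no matter how many datapoints arrive; the conversion to value/count pairs then runs once on the merged result.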

Adapter Receiver Histogram Metrics

The adapter receiver uses OTel histograms as a way to shepherd a series of values/counts to CloudWatch. As an example, StatsD histograms are ingested and stored as values/counts using a map in a distribution type. The distribution is converted to an OTel histogram to send it down the metric pipeline. This conversion stores the values as bucket bounds (which isn't quite right) and the counts as bucket counts. This causes len(bounds) == len(counts), which violates the OTel histogram format (it requires len(counts) == len(bounds) + 1). Additionally, the "values" are not in monotonically increasing order, which also violates the OTel histogram format.
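The two format rules mentioned above can be checked with a few lines of Go. This is a minimal sketch with a hypothetical function name, `validateHistogram`, and it is not the validation code used by the agent or the contrib repo.

```go
package main

import "fmt"

// validateHistogram checks the two explicit-bucket rules discussed above:
// there must be exactly one more count than there are bounds, and the
// bounds must be strictly increasing. A datapoint with no buckets at all
// (both slices empty) is accepted.
func validateHistogram(bounds []float64, counts []uint64) error {
	if len(bounds) == 0 && len(counts) == 0 {
		return nil
	}
	if len(counts) != len(bounds)+1 {
		return fmt.Errorf("got %d bucket counts for %d bounds, want %d",
			len(counts), len(bounds), len(bounds)+1)
	}
	for i := 1; i < len(bounds); i++ {
		// Written as a negated comparison so NaN bounds are rejected too.
		if !(bounds[i] > bounds[i-1]) {
			return fmt.Errorf("bounds not strictly increasing at index %d", i)
		}
	}
	return nil
}

func main() {
	// Adapter-style datapoint: values stored as bounds, one count per value.
	fmt.Println(validateHistogram([]float64{5, 1, 3}, []uint64{2, 4, 1}))
	// Well-formed classic histogram.
	fmt.Println(validateHistogram([]float64{1, 3, 5}, []uint64{1, 2, 3, 4}))
}
```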

All the adapter receiver wants to do is send the input values/counts to CloudWatch so that percentile metrics are available. To keep this functionality, the new conversion algorithm and aggregation logic will be bypassed for adapter receiver histogram metrics. This is achieved by marking adapter receiver histogram datapoints with a special attribute.
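The bypass amounts to a check on the datapoint's attributes before choosing an aggregation path. In the sketch below the attribute key `example.adapter.histogram` and the names `choosePath`, `pathRegularDistribution`, and `pathOTelHistogram` are all made up for illustration; the description does not name the real attribute, and attributes are modeled as a plain map rather than a pcommon.Map.

```go
package main

import "fmt"

// adapterMarkerKey is a hypothetical attribute name; the real key used by
// the agent is not given in this description.
const adapterMarkerKey = "example.adapter.histogram"

type aggregationPath int

const (
	// pathRegularDistribution keeps the existing value/count behavior.
	pathRegularDistribution aggregationPath = iota
	// pathOTelHistogram merges datapoints and converts them at interval end.
	pathOTelHistogram
)

// choosePath routes a classic histogram datapoint based on whether it
// carries the adapter marker attribute.
func choosePath(attrs map[string]string) aggregationPath {
	if _, marked := attrs[adapterMarkerKey]; marked {
		return pathRegularDistribution
	}
	return pathOTelHistogram
}

func main() {
	fromAdapter := map[string]string{adapterMarkerKey: "true"}
	fromOTLP := map[string]string{"service.name": "checkout"}
	fmt.Println(choosePath(fromAdapter) == pathRegularDistribution)
	fmt.Println(choosePath(fromOTLP) == pathOTelHistogram)
}
```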

Summary

  • [Unchanged] Gauge/Counter metrics are aggregated using RegularDistribution data type and output as value/count pairs at the end of the aggregation interval
  • [Unchanged] ExponentialHistogram metrics are aggregated using ExpHistogramDistribution data type and output as value/count pairs at the end of the aggregation interval
  • [Unchanged] Adapter receiver classic histogram metrics are aggregated using RegularDistribution data type and output as value/count pairs at the end of the aggregation interval
  • [NEW] Classic histogram metrics (besides those originating from the adapter receiver) are aggregated using OTel's pmetric.HistogramDataPoint type. The aggregated datapoint is converted from OTel histogram to Values/Counts using the new classic histogram mapping algorithm from the opentelemetry-collector-contrib repo at the end of the aggregation interval.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Use the new tools in the contrib repo to send histogram test cases to CloudWatch and then retrieve the percentile metrics. See amazon-contributing/opentelemetry-collector-contrib#376 for more details.

The existing histogram test fails because it sends invalid OTel histograms to the agent, which the agent now drops. Updated test repo: aws/amazon-cloudwatch-agent-test#617

Full integ test run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/18923198676. Several tests unrelated to histograms are failing, but the failures seem to be consistent with main. Mainline for comparison: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/18854598546

Requirements

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@dricross dricross force-pushed the dricross/classichistograms branch 2 times, most recently from 26e7654 to cc7ab39 Compare October 24, 2025 18:26
@dricross dricross force-pushed the dricross/classichistograms branch from 16632ac to c10ab8f Compare October 27, 2025 19:38
@dricross dricross marked this pull request as ready for review October 27, 2025 19:58
@dricross dricross requested a review from a team as a code owner October 27, 2025 19:58
@dricross dricross added the ready for testing Indicates this PR is ready for integration tests to run label Oct 27, 2025