
Convert OTel Histograms to CloudWatch Values/Counts #376

Open

dricross wants to merge 1 commit into aws-cwa-dev from classichistograms

Conversation


@dricross dricross commented Oct 27, 2025

Description

New implementation for converting OTel histograms to CloudWatch Values/Counts for emission to CloudWatch by the CloudWatch Agent. The OTel histogram format is incompatible with the CloudWatch APIs, so a mapping algorithm is needed to transform OTel histograms into Values/Counts.

OTel histograms are in the format:

  • A series of buckets with:
    • Explicit boundary values. These values denote the lower and upper bounds for buckets and whether or not a given observation would be recorded in this bucket.
    • A count of the number of observations that fell within this bucket.
  • Min (optional)
  • Max (optional)
  • Sum
  • Count
  • Attributes (key/value pairs)

See the following for more details on OTel histogram format: https://opentelemetry.io/docs/specs/otel/metrics/data-model/#histogram
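
For concreteness, a minimal sketch of what such a data point looks like when built with the collector's pmetric package (the boundaries, counts, and attribute values below are made up for illustration, not taken from the PR):

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	dp := pmetric.NewHistogramDataPoint()

	// Explicit boundaries define buckets (-inf, 10], (10, 20], (20, 30], (30, +inf).
	dp.ExplicitBounds().FromRaw([]float64{10, 20, 30})
	// One count per bucket, so len(counts) == len(bounds)+1.
	dp.BucketCounts().FromRaw([]uint64{1, 5, 3, 1})

	dp.SetMin(4)  // optional
	dp.SetMax(42) // optional
	dp.SetSum(210)
	dp.SetCount(10)
	dp.Attributes().PutStr("service.name", "example")

	fmt.Println(dp.Count(), dp.Sum())
}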

For the purposes of this algorithm, the input histograms are assumed to always be in delta temporality, as the CloudWatch Agent will use the cumulativetodelta processor to convert them before emission.

CloudWatch accepts histograms using the Values/Counts model in the PutMetricData API.

  • Values: Array of numbers representing the values for the metric during the period. Each unique value is listed just once in this array, and the corresponding number in the Counts array specifies the number of times that value occurred during the period. You can include up to 150 unique values in each PutMetricData action that specifies a Values array.
  • Counts: Array of numbers that is used along with the Values array. Each number in the Counts array is the number of times the corresponding value in the Values array occurred during the period.
  • StatisticValues, which contains statistic values for the input data set:
    • Min (not optional)
    • Max (not optional)
    • Sum
    • SampleCount
  • Dimensions (key/value pairs)

See the following for more details: https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_MetricDatum.html
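
For reference, a hedged sketch of what a Values/Counts datum looks like with the AWS SDK for Go v2 (the namespace, metric name, and numbers are illustrative; min/max/sum/sample count can alternatively be sent via the StatisticValues field):

package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := cloudwatch.NewFromConfig(cfg)

	// Each value appears once; Counts[i] is how many times Values[i] occurred.
	_, err = client.PutMetricData(context.TODO(), &cloudwatch.PutMetricDataInput{
		Namespace: aws.String("Example/Histograms"),
		MetricData: []types.MetricDatum{{
			MetricName: aws.String("request_latency_ms"),
			Values:     []float64{5, 15, 25},
			Counts:     []float64{2, 7, 1},
			Dimensions: []types.Dimension{{
				Name:  aws.String("service"),
				Value: aws.String("example"),
			}},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}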

This algorithm converts each bucket of the input histogram into (at most) 10 value/count data pairs, or "inner buckets". The values of the inner buckets are spread evenly across the bucket span. The counts of the inner buckets are determined using an exponential mapping algorithm: counts are weighted more heavily toward one side according to an exponential function, depending on how the density of the nearby buckets is changing.
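
As a heavily simplified sketch of the bucket-splitting idea, the function below uses a uniform count split in place of the exponential weighting, and assumes inner values sit at the midpoints of evenly spaced slices; none of this is lifted from the PR's code:

func splitBucket(lower, upper float64, sampleCount uint64, maxInner int) (values, counts []float64) {
	if sampleCount == 0 {
		return nil, nil
	}
	inner := maxInner
	if uint64(inner) > sampleCount {
		inner = int(sampleCount)
	}
	step := (upper - lower) / float64(inner)
	base := sampleCount / uint64(inner)
	remainder := sampleCount % uint64(inner)
	for i := 0; i < inner; i++ {
		// Spread the inner values evenly across the bucket span.
		values = append(values, lower+step*(float64(i)+0.5))
		c := base
		if uint64(i) < remainder { // leftover samples go toward the start of the bucket
			c++
		}
		counts = append(counts, float64(c))
	}
	return values, counts
}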

The following image demonstrates how an example input histogram is converted to the values/count model. The red dots indicate the values/counts that are pushed to CloudWatch.

Testing

Unit testing

Used the new tools introduced previously to send histogram test cases to CloudWatch and then retrieve the percentile metrics.

TestCase                                                   P10         P25         P50         P75         P90         P99       P99.9         Min         Max         Sum       Count
126 Buckets                                             125.95      314.76      629.19      944.75      1132.9      1258.2      1271.5           5        1300  5.2233e+06        8316
176 Buckets                                             175.88      440.04      880.01      1318.6      1583.1      1771.3      1797.1           5        1800  1.0182e+07       11616
225 Buckets                                              226.5      564.39      1128.8      1695.1      2033.5      2239.9      2289.3           5        2300  1.6822e+07       14916
325 Buckets                                             325.96      814.79      1628.7      2443.9      2931.6      3260.9      3296.1           5        3300  3.4983e+07       21516
Basic Histogram                                         17.913      28.327      50.986      73.413      86.886       194.4      199.43          10         200       36000         606
Cumulative bucket starts at 0                         0.010662    0.049403     0.10823     0.23481     0.40067      2.7043      11.867           0          45        6600       19086
Large Numbers                                       3.5613e+05  1.8884e+06  9.4334e+06  4.9984e+07   9.722e+07  7.2107e+08  8.7259e+08       1e+05       1e+09       6e+11        6006
Many Buckets                                            6.0464      35.102      89.752      558.59      889.85      1043.9      1090.7         0.5        1100     2.1e+06        6744
Negative and Positive Boundaries                           N/A         N/A         N/A         N/A         N/A         N/A         N/A         -50          50           0         636
No Min or Max                                           2.1182      18.084      55.369      71.599      180.26       242.8      250.74           0         300       21000         450
No Min/Max with Single Value                            142.82      143.99      145.97      147.97      149.18      149.92      149.99          50         150         600           6
Only Max Defined                                        52.465      118.33      203.07      303.27      367.55      733.64      748.35           0         750    1.05e+05         606
Only Min Defined                                        37.583      56.621      86.121       110.6      128.82      170.21       171.7          25         200       24000         306
Only Negative Boundaries                                   N/A         N/A         N/A         N/A         N/A         N/A         N/A        -200         -10      -60000         606
Positive boundaries but implied Negative Values            N/A         N/A         N/A         N/A         N/A         N/A         N/A        -100          60        1200         606
Single Bucket                                           37.763      38.306       39.23      40.176      40.754      41.106      41.141           5          75        6000         306
Tail Heavy Histogram                                    128.84      139.48       144.7      147.85      149.77      150.93         151          10         151     8.7e+05        6060
Two Buckets                                               1.78      2.6881       4.278       5.429      6.3839      9.9766      9.9977           1          10         900         186
Unbounded Histogram                                          0           0           0           0           0           0           0           0           0       21000         450
Very Small Numbers                                  5.2363e-08  7.2171e-07   1.629e-06  2.7846e-06  3.3259e-06  4.5734e-06  5.9513e-06       1e-08       6e-06      0.0009         606
Zero Counts and Sparse Data                             1.0712      2.8607      7.7614      221.86      983.31      1271.3      1489.5           0        1500     1.5e+05         606

Most percentiles fall within the expected range. A few are off by a percent or two; I believe this is because the back end applies another SEH1 mapping, slightly modifying the values the agent sends to CloudWatch for efficient storage.

For our accuracy tests, we see several improvements:

  • Maximum error reduced from 99% to 9%
  • Average error reduced from 30% to 3%
  • Histogram conversion throughput improved by 60%

Agent integration tests: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/18909364034

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 13, 2025
@github-actions

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 27, 2025
@jefchien jefchien reopened this Dec 3, 2025
@github-actions github-actions bot removed the Stale label Dec 4, 2025
// allocation, processing time, and the maximum number of value/count pairs that are sent to CloudWatch which could
// cause a CloudWatch PutMetricData / PutLogEvent request to be split into multiple requests due to the 100/150
// metric datapoint limit.
const maximumInnerBucketCount = 10

How did we settle on 10? Would it make sense for this to be configurable so we don't have to update this function just to change this value?

ConvertOTelToCloudWatch(dp pmetric.HistogramDataPoint, maximumInnerBucketCount int)
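
For illustration, the configurable variant suggested above would just thread the limit through as a parameter; the default value and return types below are assumptions, not the PR's actual signature:

const defaultMaximumInnerBucketCount = 10 // assumed default, matching the current constant

func ConvertOTelToCloudWatch(dp pmetric.HistogramDataPoint, maximumInnerBucketCount int) (values, counts []float64) {
	if maximumInnerBucketCount <= 0 {
		maximumInnerBucketCount = defaultMaximumInnerBucketCount
	}
	// ... existing conversion logic, bounded by maximumInnerBucketCount per bucket ...
	return values, counts
}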

Comment on lines +7 to +8
1. Remove `t.Skip(...)` from `TestWriteInputHistograms` and run the test to generate json files for the input histograms.
2. Remove `t.Skip(...)` from `TestWriteConvertedHistograms` and run the test to generate json files for the converted histograms.


nit: Could hide them behind a go:build flag, so you don't need to modify the code to be able to run them.
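
For reference, the build-tag approach looks roughly like this (the tag and package names are assumed for illustration; the test name comes from the snippet above):

//go:build generate_histogram_testdata

package converter_test

import "testing"

// Compiled only when the tag is supplied, e.g.
//   go test -tags generate_histogram_testdata -run TestWriteInputHistograms ./...
func TestWriteInputHistograms(t *testing.T) {
	// ... write the input histogram JSON files ...
}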

Comment on lines +156 to +171
// This algorithm creates "inner buckets" between user-defined buckets based on the sample count, up to a
// maximum. A logarithmic ratio (named "magnitude") compares the density between the current bucket and the
// next bucket. This logarithmic ratio is used to decide how to spread samples amongst inner buckets.
//
// case 1: magnitude < 0
// * What this means: Current bucket is denser than the next bucket -> density is decreasing.
// * What we do: Use inverse quadratic distribution to spread the samples. This allocates more samples towards
// the lower bound of the bucket.
// case 2: 0 <= magnitude < 1
// * What this means: Current bucket and next bucket have similar densities -> density is not changing much.
// * What we do: Use uniform distribution to spread the samples. Extra samples that can't be spread evenly are
// (arbitrarily) allocated towards the start of the bucket.
// case 3: 1 <= magnitude
// * What this means: Current bucket is less dense than the next bucket -> density is increasing.
// * What we do: Use quadratic distribution to spread the samples. This allocates more samples toward the end
// of the bucket.


nit: Might be easier for readability if this comment was closer to the switch case.
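
Purely to illustrate the three cases described in that comment, a self-contained sketch; the thresholds mirror the comment, while the type and names are assumptions rather than the PR's code:

type spread int

const (
	spreadTowardLower spread = iota // inverse quadratic: more samples near the lower bound
	spreadUniform                   // uniform: extra samples toward the start
	spreadTowardUpper               // quadratic: more samples near the upper bound
)

// chooseSpread maps the logarithmic density ratio ("magnitude") to a distribution shape.
func chooseSpread(magnitude float64) spread {
	switch {
	case magnitude < 0:
		// current bucket denser than the next: density decreasing
		return spreadTowardLower
	case magnitude < 1:
		// current and next bucket have similar densities
		return spreadUniform
	default: // 1 <= magnitude
		// next bucket denser: density increasing
		return spreadTowardUpper
	}
}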

epsilon := float64(sampleCount) / sigma
entryStart := len(em.counts)

runningSum := 0


nit: More of a runningCount or distributedCount. It's the amount of the sample count that's been distributed. Sum has a different meaning for histograms, so this might be confusing.

Comment on lines +221 to +227
// distribute the remainder towards the front
remainder := sampleCount - runningSum
// make sure there's room for the remainder
if len(em.counts) < entryStart+remainder {
	em.counts = append(em.counts, make([]float64, remainder)...)
	em.values = append(em.values, make([]float64, remainder)...)
}


I'm not sure I follow. How is this distributing the remainder towards the front? Let's say len(em.counts) is 10 and our remainder is somehow 12. entryStart is 0 since it was assigned before the for-loop. If we append em.counts = append(em.counts, make([]float64, remainder)...), won't this pad out 12 new entries of 0.0 making the new len(em.counts) 22? Should it be make([]float64, entryStart+remainder-len(em.counts))? This does seem like an edge case because remainder should be less than the number of entries that were added.
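
To make that distinction concrete, a small standalone example of the two padding computations (the slice length, entryStart, and remainder values are made up to mirror the scenario above):

package main

import "fmt"

func main() {
	counts := make([]float64, 10) // 10 entries already present
	entryStart := 0
	remainder := 12

	// As in the PR snippet: when the guard fires, append `remainder` zeros,
	// giving len == 22 here rather than entryStart+remainder == 12.
	asWritten := append([]float64{}, counts...)
	if len(asWritten) < entryStart+remainder {
		asWritten = append(asWritten, make([]float64, remainder)...)
	}

	// The adjustment suggested above: pad only the shortfall.
	suggested := append([]float64{}, counts...)
	if pad := entryStart + remainder - len(suggested); pad > 0 {
		suggested = append(suggested, make([]float64, pad)...)
	}

	fmt.Println(len(asWritten), len(suggested)) // 22 12
}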

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 25, 2025
@github-actions

github-actions bot commented Jan 8, 2026

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jan 8, 2026
@dricross dricross reopened this Jan 27, 2026
@github-actions github-actions bot removed the Stale label Jan 28, 2026
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 11, 2026