Convert OTel Histograms to CloudWatch Values/Counts #376
dricross wants to merge 1 commit into aws-cwa-dev from
Conversation
Force-pushed f1a277e to 3b43f3b
This PR was marked stale due to lack of activity. It will be closed in 14 days.

Closed as inactive. Feel free to reopen if this PR is still being worked on.
```go
// allocation, processing time, and the maximum number of value/count pairs that are sent to CloudWatch which could
// cause a CloudWatch PutMetricData / PutLogEvent request to be split into multiple requests due to the 100/150
// metric datapoint limit.
const maximumInnerBucketCount = 10
```
How did we settle on 10? Would it make sense for this to be configurable so we don't have to update this function just to change this value?

`ConvertOTelToCloudWatch(dp pmetric.HistogramDataPoint, maximumInnerBucketCount int)`
1. Remove `t.Skip(...)` from `TestWriteInputHistograms` and run the test to generate json files for the input histograms.
2. Remove `t.Skip(...)` from `TestWriteConvertedHistograms` and run the test to generate json files for the converted histograms.
nit: Could hide them behind a go:build flag, so you don't need to modify the code to be able to run them.
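As a sketch of that suggestion, the generator tests could live in a file gated by a build tag, so no code edits are needed to run them. The tag name `generate_testdata` and the file body are hypothetical, not the PR's actual code:

```go
//go:build generate_testdata

// This file only compiles when the tag is supplied, e.g.:
//   go test -tags generate_testdata -run TestWriteInputHistograms ./...
package histograms_test

import (
	"os"
	"testing"
)

// TestWriteInputHistograms would regenerate the input-histogram JSON
// fixtures; the body here is only a placeholder.
func TestWriteInputHistograms(t *testing.T) {
	if err := os.WriteFile("input_histograms.json", []byte("[]"), 0o644); err != nil {
		t.Fatal(err)
	}
}
```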
```go
// This algorithm creates "inner buckets" between user-defined buckets based on the sample count, up to a
// maximum. A logarithmic ratio (named "magnitude") compares the density between the current bucket and the
// next bucket. This logarithmic ratio is used to decide how to spread samples amongst inner buckets.
//
// case 1: magnitude < 0
//   * What this means: Current bucket is denser than the next bucket -> density is decreasing.
//   * What we do: Use inverse quadratic distribution to spread the samples. This allocates more samples towards
//     the lower bound of the bucket.
// case 2: 0 <= magnitude < 1
//   * What this means: Current bucket and next bucket have similar densities -> density is not changing much.
//   * What we do: Use uniform distribution to spread the samples. Extra samples that can't be spread evenly are
//     (arbitrarily) allocated towards the start of the bucket.
// case 3: 1 <= magnitude
//   * What this means: Current bucket is less dense than the next bucket -> density is increasing.
//   * What we do: Use quadratic distribution to spread the samples. This allocates more samples toward the end
//     of the bucket.
```
nit: Might be easier for readability if this comment was closer to the switch case.
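A minimal sketch of the three cases described in that comment. The base of the logarithm and the names `densityCur`/`densityNext` are assumptions for illustration, not the PR's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// chooseDistribution picks a spread strategy from the density ratio of
// the current bucket to the next one, mirroring the three magnitude
// cases. Base-10 log is an assumption here.
func chooseDistribution(densityCur, densityNext float64) string {
	magnitude := math.Log10(densityNext / densityCur)
	switch {
	case magnitude < 0:
		return "inverse-quadratic" // denser now: weight samples toward the lower bound
	case magnitude < 1:
		return "uniform" // similar density: spread samples evenly
	default:
		return "quadratic" // denser next: weight samples toward the upper bound
	}
}

func main() {
	fmt.Println(chooseDistribution(100, 1)) // density decreasing
	fmt.Println(chooseDistribution(10, 12)) // roughly flat
	fmt.Println(chooseDistribution(1, 100)) // density increasing
}
```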
```go
epsilon := float64(sampleCount) / sigma
entryStart := len(em.counts)

runningSum := 0
```
nit: More of a `runningCount` or `distributedCount`. It's the amount of the sample count that's been distributed. "Sum" has a different meaning for histograms, so this might be confusing.
```go
// distribute the remainder towards the front
remainder := sampleCount - runningSum
// make sure there's room for the remainder
if len(em.counts) < entryStart+remainder {
	em.counts = append(em.counts, make([]float64, remainder)...)
	em.values = append(em.values, make([]float64, remainder)...)
}
```
I'm not sure I follow. How is this distributing the remainder towards the front? Let's say `len(em.counts)` is 10 and our remainder is somehow 12. `entryStart` is 0 since it was assigned before the for-loop. If we append `em.counts = append(em.counts, make([]float64, remainder)...)`, won't this pad out 12 new entries of 0.0, making the new `len(em.counts)` 22? Should it be `make([]float64, entryStart+remainder-len(em.counts))`? This does seem like an edge case because `remainder` should be less than the number of entries that were added.
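A sketch of the guard this comment is suggesting. The names mirror the quoted diff, but `padForRemainder` is a hypothetical helper for illustration, not the PR's final code:

```go
package main

import "fmt"

// padForRemainder grows counts only by the shortfall relative to
// entryStart+remainder, instead of unconditionally appending remainder
// new entries when the slice is too short.
func padForRemainder(counts []float64, entryStart, remainder int) []float64 {
	if need := entryStart + remainder - len(counts); need > 0 {
		counts = append(counts, make([]float64, need)...)
	}
	return counts
}

func main() {
	// the edge case from the comment: len 10, entryStart 0, remainder 12
	counts := padForRemainder(make([]float64, 10), 0, 12)
	fmt.Println(len(counts)) // 12, not 22
}
```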
Description
New implementation for converting OTel histograms to CloudWatch Values/Counts for emission to CloudWatch by the CloudWatch Agent. The OTel histogram format is incompatible with the CloudWatch APIs. A mapping algorithm is needed to transform OTel histograms to Values/Counts.
OTel histograms are delivered as explicit bucket boundaries with per-bucket counts. See the following for more details on the OTel histogram format: https://opentelemetry.io/docs/specs/otel/metrics/data-model/#histogram
For the purposes of this algorithm, the input histograms are assumed to always be in delta temporality, as the CloudWatch Agent uses the `cumulativetodelta` processor to convert them before emission.
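As a sketch of what that conversion does to a histogram's bucket counts (purely illustrative; the real work is done by the `cumulativetodelta` processor):

```go
package main

import "fmt"

// toDelta computes per-interval bucket counts from two consecutive
// cumulative data points: delta[i] = cur[i] - prev[i].
func toDelta(prev, cur []uint64) []uint64 {
	delta := make([]uint64, len(cur))
	for i := range cur {
		delta[i] = cur[i] - prev[i]
	}
	return delta
}

func main() {
	// cumulative bucket counts at t1 and t2 -> counts observed in (t1, t2]
	fmt.Println(toDelta([]uint64{1, 4, 9}, []uint64{3, 8, 9})) // [2 4 0]
}
```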
CloudWatch accepts histograms using the Values/Counts model in the PutMetricData API.
See the following for more details: https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_MetricDatum.html. This API accepts:
This algorithm converts each bucket of the input histogram into (at most) 10 value/count pairs, aka "inner buckets". The values of the inner buckets are spread evenly across the bucket span. The counts of the inner buckets are determined using an exponential mapping algorithm: counts are weighted more heavily toward one side, according to an exponential function, depending on how the density of the nearby buckets is changing.
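A sketch of the even value spread for a single bucket. Count weighting (the exponential part) is omitted, and `innerBucketValues` with midpoint placement is an assumption for illustration, not the PR's code:

```go
package main

import "fmt"

// innerBucketValues splits one bucket [lower, upper) into maxInner
// evenly spaced representative values, one at the midpoint of each
// inner bucket.
func innerBucketValues(lower, upper float64, maxInner int) []float64 {
	values := make([]float64, maxInner)
	width := (upper - lower) / float64(maxInner)
	for i := range values {
		values[i] = lower + width*(float64(i)+0.5)
	}
	return values
}

func main() {
	fmt.Println(innerBucketValues(0, 10, 5)) // [1 3 5 7 9]
}
```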
The following image demonstrates how an example input histogram is converted to the values/count model. The red dots indicate the values/counts that are pushed to CloudWatch.

Testing
Unit testing
Used the new tools introduced previously to send histogram test cases to CloudWatch and then retrieve the percentile metrics.
Most percentiles fall within the expected range. A few are off by a percent or two. I believe this is due to the back-end applying another SEH1 mapping, slightly modifying the values that the agent sends to CW for efficient storage.
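For context on that back-end behavior: SEH1 is commonly described as bucketing values into powers of 1.1. A sketch assuming that form (this is an assumption about CloudWatch internals, not the agent's code):

```go
package main

import (
	"fmt"
	"math"
)

// seh1Bucket returns the SEH1 bucket index for a positive value,
// assuming the commonly described mapping floor(log(v) / log(1.1)).
// Values within the same power-of-1.1 band collapse to one bucket,
// which would slightly shift stored sample values.
func seh1Bucket(v float64) int {
	return int(math.Floor(math.Log(v) / math.Log(1.1)))
}

func main() {
	fmt.Println(seh1Bucket(1.0)) // 0
	fmt.Println(seh1Bucket(2.0)) // 7
}
```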
For our accuracy tests, we see several improvements:
Agent integration tests: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/18909364034