
[receiver/awsfirehose] Add support for encoding extensions #37262

Open · wants to merge 11 commits into main from firehose-encoding-extension

Conversation

@axw axw (Contributor) commented Jan 16, 2025

Description

The internal unmarshallers now implement plog.Unmarshaler and pmetric.Unmarshaler. This will enable extracting them later as encoding extensions; for now they remain embedded within the receiver. The existing record_type config has been deprecated and replaced by encoding, which can reference either the internal unmarshallers or encoding extensions.

As a result of the interface change, the unmarshallers now unmarshal a single record at a time, which means we cannot merge resources/metrics as we go, but only after each record. This is achieved by using existing common code: internal/exp/metrics for merging metrics, and internal/pdatautil for merging logs.
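For illustration only, a rough sketch of the per-record flow described above (the function and the `merge` parameter are illustrative stand-ins; in the PR the merging is done by the internal/exp/metrics and internal/pdatautil helpers):

```go
package main

import (
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// consumeRecords sketches the per-record flow: each Firehose record is
// unmarshaled on its own, then merged into the batch built so far. The merge
// parameter stands in for the internal/exp/metrics helper used in the PR.
func consumeRecords(
	unmarshaler pmetric.Unmarshaler,
	merge func(dest, src pmetric.Metrics) pmetric.Metrics,
	records [][]byte,
) (pmetric.Metrics, error) {
	merged := pmetric.NewMetrics()
	for _, record := range records {
		metrics, err := unmarshaler.UnmarshalMetrics(record)
		if err != nil {
			return pmetric.Metrics{}, err
		}
		merged = merge(merged, metrics)
	}
	return merged, nil
}
```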

Due to using the above merging functions, this PR also fixes a bug in the cwmetrics unmarshaller where the unit of a metric was not considered part of its identity, causing two metrics that differed only by unit to be merged.

Link to tracking issue

Fixes #37113

Testing

This should be a non-functional change, so existing unit tests are relied on to catch regressions. Tests have been added for the new extension functionality.

Documentation

Updated README.

@github-actions github-actions bot requested a review from Aneurysm9 January 16, 2025 08:25
@axw axw force-pushed the firehose-encoding-extension branch 3 times, most recently from 7e6a082 to fbf5919 on January 16, 2025 08:47
The internal unmarshalers now implement plog.Unmarshaler
and pmetric.Unmarshaler. This will enable extracting them
later as encoding extensions; for now they remain embedded
within the receiver.

As a result of the interface change, the unmarshalers now
unmarshal a single record at a time, which means we cannot
merge resources/metrics as we go, but only after each record.

This also fixes a bug in the cwmetrics unmarshaller where
the unit of a metric was not considered part of its identity,
and so two metrics that differed only by unit would be merged.
@axw axw force-pushed the firehose-encoding-extension branch from fbf5919 to abc720f on January 16, 2025 08:59
@axw axw marked this pull request as ready for review January 16, 2025 11:25
@axw axw requested a review from a team as a code owner January 16, 2025 11:25
@axw axw requested a review from andrzej-stencel January 16, 2025 11:25
@atoulme atoulme (Contributor) left a comment

LGTM

@@ -19,7 +19,7 @@ import (

func TestLoadConfig(t *testing.T) {
	for _, configType := range []string{
		"cwmetrics", "cwlogs", "otlp_v1", "invalid",
Member

Testing for an invalid record type or encoding is different from testing that both an encoding and record type have been provided. Both tests should remain.

Contributor Author

Because we're now supporting extensions, the record type is only known to be valid/invalid at the time we call Start. There's a test case in there for invalid encoding/record type. See the WithUnknownEncoding test cases for TestLogsReceiver_Start and TestMetricsReceiver_Start.

receiver/awsfirehosereceiver/config_test.go (outdated, resolved)
// If a record type is specified, it must be valid.
// An empty string is acceptable, however, because it will use a telemetry-type-specific default.
if c.RecordType != "" {
	return validateRecordType(c.RecordType)
Member

I don't think validation of the record type or encoding can be deferred. This has to fail fast to alert the user to their configuration error rather than allowing the collector to start and then failing to process received data.

@axw axw (Contributor Author) Jan 17, 2025

The collector will still fail fast. e.g.

$ cat local/config.yaml 
receivers:
  awsfirehose:
    record_type: invalid

exporters:
  debug: {}

service:
  pipelines:
    logs:
      receivers: [awsfirehose]
      processors: []
      exporters: [debug]

$ ./bin/otelcontribcol_linux_amd64 --config local/config.yaml
2025-01-17T10:51:28.527+0800    info    [email protected]/service.go:164   Setting up own telemetry...
2025-01-17T10:51:28.527+0800    info    telemetry/metrics.go:70 Serving metrics {"address": "localhost:8888", "metrics level": "Normal"}
2025-01-17T10:51:28.527+0800    info    builders/builders.go:26 Development component. May change in the future.        {"kind": "exporter", "data_type": "logs", "name": "debug"}
2025-01-17T10:51:28.527+0800    warn    [email protected]/config.go:48       record_type is deprecated, and will be removed in a future version. Use encoding instead.       {"kind": "receiver", "name": "awsfirehose", "data_type": "logs"}
2025-01-17T10:51:28.530+0800    info    [email protected]/service.go:230   Starting otelcontribcol...      {"Version": "0.117.0-dev", "NumCPU": 16}
2025-01-17T10:51:28.530+0800    info    extensions/extensions.go:39     Starting extensions...
2025-01-17T10:51:28.530+0800    error   graph/graph.go:426      Failed to start component       {"error": "unknown encoding extension \"invalid\"", "type": "Receiver", "id": "awsfirehose"}
2025-01-17T10:51:28.530+0800    info    [email protected]/service.go:295   Starting shutdown...
2025-01-17T10:51:28.530+0800    info    extensions/extensions.go:66     Stopping extensions...
2025-01-17T10:51:28.530+0800    info    [email protected]/service.go:309   Shutdown complete.
Error: cannot start pipelines: unknown encoding extension "invalid"
2025/01/17 10:51:28 collector server run finished with error: cannot start pipelines: unknown encoding extension "invalid"

It's doing a bit more work than before it gets to the error, but AFAIK it's not possible to access extensions earlier than the Start method.
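For reference, the Start-time lookup being described follows the usual collector pattern; a minimal sketch (the helper name is hypothetical, not the receiver's actual code):

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/pdata/plog"
)

// findLogsUnmarshaler is a hypothetical helper showing why validation happens
// in Start: extensions are only reachable via host.GetExtensions(), which is
// passed to Start, so an unknown encoding can only be reported at that point.
func findLogsUnmarshaler(host component.Host, id component.ID) (plog.Unmarshaler, error) {
	ext, ok := host.GetExtensions()[id]
	if !ok {
		return nil, fmt.Errorf("unknown encoding extension %q", id)
	}
	unmarshaler, ok := ext.(plog.Unmarshaler)
	if !ok {
		return nil, fmt.Errorf("extension %q is not a logs unmarshaler", id)
	}
	return unmarshaler, nil
}
```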

if logs.ResourceLogs().Len() == 0 {
	return logs, errInvalidRecords
}
pdatautil.GroupByResourceLogs(logs.ResourceLogs())
Member

Is there a benchmark to compare the effect of allocating a new resource log entry for each log record and combining them later? If not, please add one. This feels like something that could be a significant regression.

Contributor Author

There wasn't, so I've added one. This way of grouping is indeed much more expensive than what was there before; I've reverted to grouping by the known CloudWatch attributes.
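For context, a benchmark along the lines requested might look roughly like this (the helper functions are hypothetical stubs, not the benchmark actually added in the PR):

```go
package main

import "testing"

// Hypothetical stand-ins for test helpers: build a Firehose-style payload and
// run it through the logs consumer (unmarshal + group + consume).
func makeTestRecords(resources, records, logsPerRecord int) [][]byte { return nil }
func consumeLogRecords(records [][]byte) (int, error)               { return 0, nil }

// BenchmarkLogsConsumer builds a representative payload once, then measures
// the full unmarshal-and-group path per iteration, reporting allocations so
// grouping strategies can be compared with benchstat.
func BenchmarkLogsConsumer(b *testing.B) {
	records := makeTestRecords(10, 100, 10)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := consumeLogRecords(records); err != nil {
			b.Fatal(err)
		}
	}
}
```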

}

return md, nil
metrics = expmetrics.Merge(pmetric.NewMetrics(), metrics)
Member

Again would like a benchmark. This feels like a lot of duplicated work for inputs of any appreciable size.

Contributor Author

Done, and as with cwlogs I've reverted to grouping by the known CloudWatch attributes rather than using expmetrics.Merge here. Note that there's still merging at the consumer level, since at that level we don't know anything about the data, so it has to merge in a more generic and expensive way.
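Roughly, "grouping by the known CloudWatch attributes" means keying resources on a small fixed identity instead of comparing full attribute sets; a sketch with hypothetical field names:

```go
package main

import "go.opentelemetry.io/collector/pdata/pmetric"

// resourceKey is a hypothetical fixed identity for a CloudWatch resource;
// grouping on a small comparable key like this is cheaper than the generic
// attribute-by-attribute merging done at the consumer level.
type resourceKey struct {
	accountID string
	region    string
	namespace string
}

// resourceFor returns the ResourceMetrics for key, creating it on first use
// so metrics from later records land under the same resource.
func resourceFor(byKey map[resourceKey]pmetric.ResourceMetrics, dest pmetric.Metrics, key resourceKey) pmetric.ResourceMetrics {
	rm, ok := byKey[key]
	if !ok {
		rm = dest.ResourceMetrics().AppendEmpty()
		byKey[key] = rm
	}
	return rm
}
```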

@@ -45,18 +57,16 @@ func TestUnmarshal(t *testing.T) {
"WithSomeInvalidRecords": {
filename: "some_invalid_records",
wantResourceCount: 5,
wantMetricCount: 35,
wantMetricCount: 36,
Member

Why? Was there a latent defect or is this papering over a regression?

Contributor Author

Yeah there was a defect, from the description:

Due to using the above merging functions, this PR also fixes a bug in the cwmetrics unmarshaller where the unit of a metric was not considered part of its identity, causing two metrics that differed only by unit to be merged.

Contributor Author

Specifically, the MetadataNoToken metric in some_invalid_records has different units: in most cases it's "None", in one case it's "Count".
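In other words, the unit has to be part of the metric's identity before datapoints are merged; conceptually something like this (hypothetical key type, not the actual unmarshaller code):

```go
package main

// cwMetricKey is a hypothetical identity key illustrating the fix: two
// CloudWatch metrics may only be merged if they agree on unit as well.
// Previously unit was not part of the key, so MetadataNoToken with unit
// "None" and unit "Count" collapsed into a single metric.
type cwMetricKey struct {
	namespace  string
	metricName string
	unit       string // the field the fix effectively adds to the identity
	// dimensions omitted for brevity
}
```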

receiver/awsfirehosereceiver/logs_receiver.go (outdated, resolved)
receiver/awsfirehosereceiver/metrics_receiver.go (outdated, resolved)
axw added 8 commits January 17, 2025 10:55
Adds a test that covers merging various combinations
of resource/scope identities.
pdatautil.GroupByResourceLogs is much slower,
at least partially because it's generic. Revert
to a simpler grouping by known resource attributes.
- Use json-iterator for decoding JSON
- Use klauspost/compress for decompressing gzip
- Pool gzip readers
- Remove pointer type from cwMetricValue to avoid allocation
- Don't read the whole request body into memory
- Implement more efficient metrics merging
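As an illustration of the gzip reader pooling mentioned in the commits above, a minimal sketch using klauspost/compress (not the receiver's exact code):

```go
package main

import (
	"io"
	"sync"

	"github.com/klauspost/compress/gzip"
)

// gzipPool reuses gzip readers across records instead of allocating a new
// one each time; Reset rebinds a pooled reader to a new compressed source.
var gzipPool = sync.Pool{New: func() any { return new(gzip.Reader) }}

func decompress(src io.Reader) ([]byte, error) {
	zr := gzipPool.Get().(*gzip.Reader)
	defer gzipPool.Put(zr)
	if err := zr.Reset(src); err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```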
@axw axw (Contributor Author) commented Jan 17, 2025

@Aneurysm9 as mentioned in thread replies, I have reverted some of the grouping changes, but grouping across records is still going to be more expensive, and I think that's unavoidable given the signature of the unmarshallers.

To counteract that, I've made various optimisations unrelated to the main changes, so unmarshalling is now faster, albeit using more memory for metrics (and significantly less for logs). I may have gotten a little carried away; let me know if you'd like me to pull these into separate PRs.

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awsfirehosereceiver
cpu: AMD Ryzen 7 PRO 5850U with Radeon Graphics     
                                                              │ /tmp/old.txt  │            /tmp/new.txt             │
                                                              │    sec/op     │    sec/op     vs base               │
LogsConsumer_cwlogs/10resources_10records_1logs-16              153.98µ ±  8%   80.79µ ± 19%  -47.53% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_10records_10logs-16              466.3µ ±  3%   275.5µ ±  3%  -40.92% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_100records_1logs-16             1445.6µ ±  5%   813.0µ ±  3%  -43.76% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_100records_10logs-16             4.293m ± 20%   2.690m ±  6%  -37.34% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_10records_1metrics-16      91.58µ ± 17%   62.62µ ±  4%  -31.62% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_10records_10metrics-16     790.2µ ± 17%   428.1µ ± 11%  -45.82% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_100records_1metrics-16     667.1µ ± 13%   567.8µ ± 20%  -14.88% (p=0.041 n=6)
MetricsConsumer_cwmetrics/10resources_100records_10metrics-16    6.135m ± 14%   4.258m ±  8%  -30.60% (p=0.002 n=6)
geomean                                                          776.4µ         486.8µ        -37.29%

                                                              │ /tmp/old.txt  │             /tmp/new.txt             │
                                                              │     B/op      │     B/op      vs base                │
LogsConsumer_cwlogs/10resources_10records_1logs-16              417.99Ki ± 0%   54.74Ki ± 0%   -86.90% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_10records_10logs-16              512.8Ki ± 0%   146.9Ki ± 2%   -71.34% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_100records_1logs-16             4153.6Ki ± 0%   549.6Ki ± 0%   -86.77% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_100records_10logs-16             4.954Mi ± 0%   1.427Mi ± 2%   -71.20% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_10records_1metrics-16      22.88Ki ± 7%   80.91Ki ± 9%  +253.57% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_10records_10metrics-16     162.8Ki ± 2%   275.8Ki ± 2%   +69.43% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_100records_1metrics-16     128.5Ki ± 0%   677.8Ki ± 0%  +427.44% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_100records_10metrics-16    1.137Mi ± 0%   2.435Mi ± 1%  +114.19% (p=0.002 n=6)
geomean                                                          473.3Ki        353.5Ki        -25.32%

                                                              │ /tmp/old.txt │            /tmp/new.txt            │
                                                              │  allocs/op   │  allocs/op   vs base               │
LogsConsumer_cwlogs/10resources_10records_1logs-16                331.0 ± 5%    407.0 ± 0%  +22.96% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_10records_10logs-16              1.955k ± 0%   2.715k ± 2%  +38.85% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_100records_1logs-16              2.655k ± 0%   3.904k ± 0%  +47.04% (p=0.002 n=6)
LogsConsumer_cwlogs/10resources_100records_10logs-16             18.19k ± 0%   26.24k ± 2%  +44.25% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_10records_1metrics-16       493.0 ± 6%    618.0 ± 4%  +25.35% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_10records_10metrics-16     3.743k ± 2%   4.761k ± 2%  +27.21% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_100records_1metrics-16     3.018k ± 0%   5.638k ± 0%  +86.80% (p=0.002 n=6)
MetricsConsumer_cwmetrics/10resources_100records_10metrics-16    27.73k ± 0%   46.15k ± 0%  +66.41% (p=0.002 n=6)
geomean                                                          2.887k        4.142k       +43.47%

@axw axw (Contributor Author) commented Jan 20, 2025

> @Aneurysm9 as mentioned in thread replies, I have reverted some of the grouping changes, but grouping across records is still going to be more expensive, and I think that's unavoidable given the signature of the unmarshallers.

Another option: we can keep the merging in the unmarshallers, which would operate on the Firehose record level, and drop the merging across records. That would be a change in behaviour, but would avoid the overhead.

@axw axw (Contributor Author) commented Jan 21, 2025

> Another option: we can keep the merging in the unmarshallers, which would operate on the Firehose record level, and drop the merging across records. That would be a change in behaviour, but would avoid the overhead.

I've done this in #37361

Development

Successfully merging this pull request may close these issues.

[receiver/awsfirehose] Implement encoding extension framework
4 participants