Inject component-identifying scope attributes #12617
Conversation
We discussed offline adding a feature gate for this and all other internal-telemetry-related changes; I intend to merge this once the comments are addressed and the feature gate has been added.
54c13a9
…behind feature gate (#12933)

#### Context

PR #12617 introduced logic to inject new instrumentation scope attributes in all internal telemetry to identify which Collector component it came from. These attributes had already been added to internal logs as regular log attributes, and this PR switched them to scope attributes for consistency. The new logic was placed behind an Alpha stage feature gate, `telemetry.newPipelineTelemetry`.

Unfortunately, the default "off" state of the feature gate disabled the injection of component-identifying attributes entirely, which was a regression since they had been present in internal logs in previous releases. See issue #12870 for an in-depth discussion of this issue. To correct this, PR #12856 was filed, which stabilized the feature gate, making it on by default with no way to disable it, and removed the logic that the feature gate used to toggle. This was thought to be the simplest way to mitigate the regression in the "off" state, since we planned to stabilize the feature eventually anyway.

Unfortunately, it was found that the "on" state of the feature gate causes a different issue: [the Prometheus exporter](https://github.com/open-telemetry/opentelemetry-go/tree/main/exporters/prometheus) is the default way of exporting the Collector's internal metrics, accessible at `collector:8888/metrics`. This exporter does not currently have any support for instrumentation scope attributes, meaning that metric streams differentiated by said attributes but not by any other identifying property will appear as aliases to Prometheus, which causes an error. This completely breaks the export of Collector metrics through Prometheus under some simple configurations, which is a release blocker.

#### Description

To fix this issue, this PR sets the `telemetry.newPipelineTelemetry` feature gate back to "Alpha" (off by default), and reintroduces logic to disable the injection of the new instrumentation scope attributes when the gate is off, but only in internal metrics. Note that the new logic is still used unconditionally for logs and traces, to avoid reintroducing the logs issue (#12870). This should avoid breaking the Collector in its default configuration while we try to get a fix in the Prometheus exporter.

#### Link to tracking issue

No tracking issue currently, will probably file one later.

#### Testing

I performed some simple manual testing with a config file like the following:

```yaml
receivers:
  otlp: [...]
processors:
  batch:
exporters:
  debug: [...]
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
  telemetry:
    metrics:
      level: detailed
    traces: [...]
    logs: [...]
```

The two batch processors create aliased metric streams, which are only differentiated by the new component attributes. I checked that:
1. this config causes an error in the Prometheus exporter on main;
2. the error is resolved by default after applying this PR;
3. the error reappears when enabling the feature gate (this is expected);
4. scope attributes are added on the traces and logs no matter the state of the gate.
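Editor's note: as a hedged illustration of the gating pattern described in the commit message above, here is a minimal, standalone sketch of how an Alpha-stage gate is declared and consulted with the Collector's `featuregate` package. The gate ID matches the one discussed, but this is not the Collector's actual wiring; registering the same ID again inside a real Collector binary would conflict with the existing registration.

```go
// Standalone sketch of declaring and checking an Alpha-stage feature gate.
// Illustrative only: the real gate is registered inside the Collector itself.
package gatesketch

import "go.opentelemetry.io/collector/featuregate"

var newPipelineTelemetryGate = featuregate.GlobalRegistry().MustRegister(
	"telemetry.newPipelineTelemetry",
	featuregate.StageAlpha, // Alpha: off unless --feature-gates=telemetry.newPipelineTelemetry
	featuregate.WithRegisterDescription("Inject component-identifying scope attributes into internal telemetry."),
)

// injectScopeAttributesIntoMetrics mirrors the behavior described above:
// logs and traces always carry the new scope attributes, while metrics
// only carry them when the gate is enabled.
func injectScopeAttributesIntoMetrics() bool {
	return newPipelineTelemetryGate.IsEnabled()
}
```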
…peline components (#12812)

Depends on #12856
Resolves #12676

This is a reboot of #11311, incorporating metrics defined in the [component telemetry RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md) and attributes added in #12617. The basic pattern is:
- When building any pipeline component which produces data, wrap the "next consumer" with instrumentation to measure the number of items being passed. This wrapped consumer is then passed into the constructor of the component.
- When building any pipeline component which consumes data, wrap the component itself. This wrapped consumer is saved onto the graph node so that it can be retrieved during graph assembly.

---------

Co-authored-by: Pablo Baeyens <[email protected]>
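Editor's note: the consumer-wrapping pattern described in the commit message above can be sketched as follows. This is a minimal illustration, not the Collector's actual auto-instrumentation code; the metric name is hypothetical.

```go
// Minimal sketch of wrapping a "next consumer" to count items passed
// between pipeline components, shown for the logs signal.
package consumersketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/otel/metric"
)

type countingLogs struct {
	next    consumer.Logs
	counter metric.Int64Counter
}

// newCountingLogs wraps next so that every batch passed through it is counted.
func newCountingLogs(next consumer.Logs, meter metric.Meter) (consumer.Logs, error) {
	// Hypothetical metric name, for illustration only.
	c, err := meter.Int64Counter("sketch.consumed.items")
	if err != nil {
		return nil, err
	}
	return &countingLogs{next: next, counter: c}, nil
}

func (c *countingLogs) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}

func (c *countingLogs) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	// Count items before handing them to the wrapped consumer.
	c.counter.Add(ctx, int64(ld.LogRecordCount()))
	return c.next.ConsumeLogs(ctx, ld)
}
```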
```diff
@@ -54,7 +54,7 @@ type otlpReceiver struct {
 // responsibility to invoke the respective Start*Reception methods as well
 // as the various Stop*Reception methods to end it.
 func newOtlpReceiver(cfg *Config, set *receiver.Settings) (*otlpReceiver, error) {
-	set.Logger = telemetry.LoggerWithout(set.TelemetrySettings, componentattribute.SignalKey)
+	set.TelemetrySettings = telemetry.WithoutAttributes(set.TelemetrySettings, componentattribute.SignalKey)
```
@jade-guiton-dd Apologies for asking on an old PR, but was there any specific reason for not including the signal info on the exposed metrics from the OTLP receiver? This would be really useful to see the amount of data ingested based on the signal type.
We already had this info exposed in the older metrics via `otelcol_receiver_accepted_log_records`, `otelcol_receiver_accepted_metric_points`, and `otelcol_receiver_accepted_spans`. Trying to understand if we have a plan to add them down the line? Thanks.
The OTLP receiver is internally a single object, even when configured in multiple pipelines for multiple signals. For that reason, the telemetry it emits can't easily be associated with a single signal, so it removes the "otelcol.signal" attribute from its set of attributes on startup. If we didn't do that, all telemetry from the component would be associated with whichever signal pipeline happened to be created first, which would not be helpful.
However, the OTLP receiver could manually add back a signal attribute on specific metric points which are associated with a specific signal. But I don't believe this is currently needed:
- The older `otelcol_receiver_X` metrics (which aren't going anywhere for the foreseeable future) already differentiate between signals in their name.
- The new metrics emitted by pipeline auto-instrumentation (implemented in a later PR) use the original attribute set of the component before startup, which includes "otelcol.signal".

Do you have any examples of internal metrics emitted by the OTLP receiver which are lacking association with a specific signal (and which could be associated with one despite the singleton architecture)?
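Editor's note: for illustration, "manually adding back a signal attribute on specific metric points" could look like the following minimal sketch using the OpenTelemetry Go metric API. Only the `otelcol.signal` key comes from the discussion above; the function and counter are hypothetical.

```go
// Minimal sketch: a singleton receiver that strips the otelcol.signal
// scope attribute can still tag individual metric points with the signal
// they relate to, as a point-level attribute.
package signalsketch

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func recordAccepted(ctx context.Context, counter metric.Int64Counter, n int64, signal string) {
	// Attach the signal per data point, since the receiver instance
	// itself serves several signals at once.
	counter.Add(ctx, n, metric.WithAttributes(attribute.String("otelcol.signal", signal)))
}
```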
Thanks for the detailed answer. I was going off of this PR, and upon testing the different metrics/logs, I do see `otelcol.signal` being present in all of the emitted telemetry, even the custom ones generated via mdatagen, which is super cool. Thanks for making that happen 👍🏽

One thing I noticed is that since we are currently treating Middlewares as part of the Extension interface, some of the pipeline attributes like `signal` and `outcome` are missing from the reported metrics. Should we treat them similarly to Receivers, as they run as part of the Receiver end?
I'm not extremely familiar with the middleware interface considering how new it is (are there even any implementations of it yet?), but I think this would be of questionable use and difficult to accomplish:
- Because middlewares act at the HTTP/gRPC request level, there's no generic, reliable way to know which signal (if any) they're processing. This is determined later by the receiver, after the request has been processed by the middleware. The only case where I think this would be doable is if the receiver only handles a single type of signal, in which case `otelcol.signal` is much less useful anyway.
- We only use the `outcome` attribute on auto-instrumented pipeline metrics, not arbitrary receiver telemetry, because the auto-instrumentation layer is at the right place to know whether the next component succeeded or not. I think adding a similar instrumentation layer inside middlewares would be difficult, but I'm not familiar enough with the middleware API to tell for sure.

However, I think we could make a stronger case for adding an attribute to middleware telemetry to know which receiver instance it's used in. This would be doable, though we would need to redesign the middleware API to pass in a new `TelemetrySettings` with the appropriate attributes for each call to `GetHTTPHandler` / `GetGRPCServerOptions`.
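Editor's note: purely as a hypothetical sketch of the redesign mentioned above, not the current `extensionmiddleware` API, one could imagine a per-receiver `TelemetrySettings` being passed into each handler request. The method signature below is an assumption for illustration.

```go
// Hypothetical sketch of a middleware API that receives per-receiver
// telemetry settings. This is NOT the current Collector API; the method
// signature is an assumption for illustration only.
package middlewaresketch

import (
	"net/http"

	"go.opentelemetry.io/collector/component"
)

// HTTPServerMiddleware is a hypothetical variant of the middleware
// extension interface where each receiver passes its own
// TelemetrySettings (already carrying its component-identifying
// attributes) when requesting a handler, so middleware telemetry can be
// attributed to the receiver it serves.
type HTTPServerMiddleware interface {
	GetHTTPHandler(set component.TelemetrySettings, next http.Handler) (http.Handler, error)
}
```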
Thanks for the details again. Just to give some context, I am working with the middleware-based extension to add additional attributes to custom telemetry records without relying on some of the proposed alternatives like Baggage propagation, Processors, etc. (#12316).

> This is determined later by the receiver, after the request has been processed by the middleware.

Makes sense 👍🏽.

> However, I think we could make a stronger case for adding an attribute to middleware telemetry to know which receiver instance it's used in

That would be useful, since the middleware extension can basically run on any HTTP/gRPC receiver.

Do you think it's worth creating an issue for this so we can move the discussion there?
Yes, I think that would make sense. It would be good to make the contributors working on the middleware interface aware that this need exists.
#### Context

PR #12617, which implemented the injection of component-identifying attributes into the `zap.Logger` provided to components, introduced significant additional memory use when the Collector's pipelines contain many components (#13014). This was because we would call `zapcore.NewSamplerWithOptions` to wrap the specialized logger core of each Collector component, which allocates half a megabyte's worth of sampling counters.

This problem was mitigated in #13015 by moving the sampling layer to a different location in the logger core hierarchy. This meant that Collector users who do not export their logs through OTLP and only use stdout-based logs no longer saw the memory increase.

#### Description

This PR aims to provide a better solution to this issue, by using the `reflect` library to clone zap's sampler core and set a new inner core, while reusing the counter allocation. (This may also be "more correct" from a sampling point of view, i.e. we only have one global instance of the counters instead of one for console logs and one for each component's OTLP-exported logs, but I'm not sure if anyone noticed the difference anyway.)

#### Link to tracking issue

Fixes #13014

#### Testing

A new test was added which checks that the log counters are shared between two sampler cores with different attributes.
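Editor's note: the reflection approach described above can be sketched roughly as follows. This is a minimal sketch, not the Collector's actual implementation, and it relies on the assumption that zap's unexported sampler struct embeds its wrapped `zapcore.Core` as an exported embedded field.

```go
// Minimal sketch of cloning a sampler core while swapping its inner core,
// so the sampling counters are shared rather than reallocated per clone.
package samplersketch

import (
	"reflect"

	"go.uber.org/zap/zapcore"
)

// withInnerCore returns a shallow copy of sampler whose embedded Core
// field is replaced by inner; all other fields (including the sampling
// counters) are shared with the original. It falls back to returning
// inner if sampler does not have the assumed shape.
func withInnerCore(sampler, inner zapcore.Core) zapcore.Core {
	v := reflect.ValueOf(sampler)
	if v.Kind() != reflect.Pointer || v.Elem().Kind() != reflect.Struct {
		return inner
	}
	clone := reflect.New(v.Elem().Type()) // new pointer to the same struct type
	clone.Elem().Set(v.Elem())            // shallow copy: counters are shared
	field := clone.Elem().FieldByName("Core")
	if !field.IsValid() || !field.CanSet() {
		return inner
	}
	field.Set(reflect.ValueOf(inner))
	out, ok := clone.Interface().(zapcore.Core)
	if !ok {
		return inner
	}
	return out
}
```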
#### Description
Fork of #12384 to showcase how component attributes can be injected into scope attributes instead of log/metric/span attributes. See that PR for more context.
To see the diff from the previous PR, filter changes starting from the "Prototype using scope attributes" commit.
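Editor's note: for readers unfamiliar with scope attributes, the idea is that component-identifying keys are attached once to the instrumentation scope of the signal provider handed to a component, instead of being stamped onto every log record, metric point, or span. A minimal, hedged sketch of the general OpenTelemetry Go mechanism follows; this is not the Collector's internal wiring, and apart from `otelcol.signal` (mentioned in the conversation above) the attribute keys and values are illustrative.

```go
// Minimal sketch of attaching component-identifying attributes to an
// instrumentation scope rather than to individual telemetry records.
package scopesketch

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/trace"
)

func scopedProviders(mp metric.MeterProvider, tp trace.TracerProvider) (metric.Meter, trace.Tracer) {
	scopeAttrs := []attribute.KeyValue{
		attribute.String("otelcol.component.id", "otlp"), // illustrative keys/values
		attribute.String("otelcol.component.kind", "receiver"),
		attribute.String("otelcol.signal", "logs"),
	}
	// Scope attributes are set once when obtaining the Meter/Tracer, so
	// every metric point or span produced through them carries the same
	// identifying scope, without per-record attributes.
	meter := mp.Meter("scopesketch",
		metric.WithInstrumentationAttributes(scopeAttrs...))
	tracer := tp.Tracer("scopesketch",
		trace.WithInstrumentationAttributes(scopeAttrs...))
	return meter, tracer
}
```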
#### Link to tracking issue
Resolves #12217
Also incidentally resolves #12213 and resolves #12117
#### Testing
I updated the existing tests to check for scope attributes, and did some manual testing with a debug exporter to check that the scope attributes are added/removed properly.