Clarify event naming for different outcomes#2010
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2010 +/- ##
==========================================
- Coverage 86.42% 86.42% -0.01%
==========================================
Files 525 525
Lines 167144 167144
==========================================
- Hits 144457 144450 -7
- Misses 22153 22160 +7
Partials 534 534
🚀 New features to boost your workflow:
|
|
@lquerel @albertlockett Requesting more eyes for this. |
| Regarding severity, choose the log level that best reflects the significance of | ||
| the event. For example, `node.shutdown.complete` at INFO for a graceful | ||
| shutdown and `node.shutdown.fail` at ERROR for a critical failure -- these are | ||
| distinct events, not the same event at different severity levels. |
There was a problem hiding this comment.
FWIW, I have a slight preference for the previous guidance (i.e. same event name, different level based on outcome) only because it can be easier to review in a downstream dashboard, etc...
When filtering by event name (e.g. node.shutdown) the user sees both the info and warning/error messages interleaved without needing to get into the internals of the event names.
Of course, the counter argument is that a fuzzy search on node.shutdown would yield the same results either way.
There was a problem hiding this comment.
https://opentelemetry.io/docs/specs/otel/logs/data-model/#field-eventname
I will ask if OTel semantic conventions has thoughts on this as well. From what I can remember from old discussions, each occurrence of an Event should have same structure - but it does not explicitly say if Severity is part of the structure! (it does say attributes and body are)
There was a problem hiding this comment.
I'll readily admit that I do not know what the best practices here!
The OTel data model says the event name identifies the "class / type of event", and I suppose it's conceivable that events of the same type/class could be more/less severe depending on the context in which they occurred.
Also, does this example not still conflict with the guidance in event naming?
otel-arrow/rust/otap-dataflow/docs/telemetry/events-guide.md
Lines 123 to 148 in 1f22e63
node.shutdown.complete is of the form <entity>.<verb>.<verb> isn't it? Or is shutdown now the "thing"?
|
I'm so, so very tempted to share my opinion, but then I realized it's wasteful of all of our time. I agree with Cijo we should follow OTel and there is a nuanced question about whether Level is considered independent of event identity to be addressed. I approve if @AaronRM and @cijothomas and others agree. |
I agree. 👍 My main concern is that the guidelines are consistent to know how to address comments properly in #1988. |
I'll follow-up separately for this one. Only goal of this PR is to clarify if we should use same EventName for different outcomes of an operation. One event: vs
We currently have former, and this PR is changing to the latter. Both approaches are valid, but distinct event names better reflect that a successful shutdown and a failed shutdown are fundamentally different event types with different operational significance and mental models—success is a routine lifecycle transition while failure is an anomalous condition requiring investigation. This could be a time consuming discussion without much returns! Lets give it another day to see if anyone has strong preference either way. |
|
There are indeed a few ambiguities in our guidelines around events that need clarification. I believe the semantic conventions lean toward Approach 1, but I'm adding @lmolkova and @jsuereth (semconv and weaver maintainers) to get their perspective on this topic. Concretely, my thinking is that the event name would be: where I've intentionally left the |
|
https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/OpenTelemetryProtocolExporterEventSource.cs#L98-L114 |
An interesting argument to take into consideration. |
|
I think we should merge this. I do not think we should wait for the philosophical question or OTel to change its mind. |
| the event. For example, `node.shutdown.complete` at INFO for a graceful | ||
| shutdown and `node.shutdown.fail` at ERROR for a critical failure -- these are |
There was a problem hiding this comment.
this is not a formal guidance (at least not yet), but event naming follows the same principles as metric naming - one name for specific kind of event, outcome recorded as an attribute.
E.g. node.shutdown with attribute error.type = something_specific.
Caveat: assuming there are events along the way, during the process of shutdown, then there could be
node.shutdown.error event(s) and the process of shutdown can also be recorded as a span (or an event if spans don't work).
TL;DR: node.shutdown event represents the shutdown process with any outcome (success, failure, something in betwwen). node.shutdown.error represents a specific occurrence of an error during shutdown process. Naming comes from what you want to record.
Severity considerations (in review): open-telemetry/semantic-conventions#3311
TL;DR: if it's the end of the world situation - FATAL, if it clearly affects user experience - ERROR, if it's retriable and effect on user experience is minimal/not known - WARN or below
There was a problem hiding this comment.
Thanks for sharing this @lmolkova. When looking at the OTel Collector's internal metrics, I see they use separate metric names for different outcomes:
otelcol_receiver_accepted_spans and otelcol_receiver_refused_spans
Rather than:
otelcol_receiver_spans{status="accepted|refused"}
Of course it is Collector, and while it was being built, semantic conventions may not have existed. Anyway, I can open an issue with sem.conv repo to continue this discussion in a better forum for this.
There was a problem hiding this comment.
TL;DR: node.shutdown event represents the shutdown process with any outcome (success, failure, something in betwwen). node.shutdown.error represents a specific occurrence of an error during shutdown process. Naming comes from what you want to record.
This is quite reasonable.
There was a problem hiding this comment.
I believe there is an issue on collector - open-telemetry/opentelemetry-collector#14350. Having arrow interested in a similar set of metrics is a good justification to move the issue to semconv (or create a new one).
We've gone through a round of internal OTel SDK metric renames recently - https://github.com/open-telemetry/semantic-conventions/blob/main/docs/otel/sdk-metrics.md and I imagine collector would do something similar at some point
There was a problem hiding this comment.
SDK and Collectors/pipelines can use the same metrics, but I've begun to see the SDK case as special compared with the pipeline case.
@lmolkova please see https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md which I have been working to emulate.
There was a problem hiding this comment.
Thank you, @lmolkova, for the clarification. We will incorporate this into our guidelines.
The "Severity and placement" section previously suggested using the same event name (node.shutdown) at different severity levels to distinguish a graceful shutdown (INFO) from a critical failure (ERROR). This conflicts with the guidance in Event Naming and Verbs, which recommends distinct event names for different outcomes. Updated the example to use node.shutdown.complete (INFO) and node.shutdown.fail (ERROR) so the guide is internally consistent. This is an intentional semantic change, not just a wording tweak. The guide now consistently says: different outcomes → different event names. Severity reflects significance, but should not be the sole way to distinguish success from failure. Fixes open-telemetry#1972
The "Severity and placement" section previously suggested using the same event name (node.shutdown) at different severity levels to distinguish a graceful shutdown (INFO) from a critical failure (ERROR).
This conflicts with the guidance in Event Naming and Verbs, which recommends distinct event names for different outcomes. Updated the example to use node.shutdown.complete (INFO) and node.shutdown.fail (ERROR) so the guide is internally consistent.
This is an intentional semantic change, not just a wording tweak. The guide now consistently says: different outcomes → different event names. Severity reflects significance, but should not be the sole way to distinguish success from failure.
Fixes #1972