Refine recording errors documentation to include logs and avoid span events #3228
Changes from 13 commits
ed8f108
d96eedf
13f95d2
0baed85
6b5bc87
6390bfd
b352235
b8ca828
023348a
00d8769
02744ca
aa38670
1be560d
36405a0
060846c
6e232d4
521e340
14f21c6
c07f05c
266c084
c52e846
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| # Use this changelog template to create an entry for release notes. | ||
| # | ||
| # If your change doesn't affect end users you should instead start | ||
| # your pull request title with [chore] or use the "Skip Changelog" label. | ||
|
|
||
| # One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix' | ||
| change_type: enhancement | ||
|
|
||
| # The name of the area of concern in the attributes-registry, (e.g. http, cloud, db) | ||
| component: general | ||
|
|
||
| # A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`). | ||
| note: Refine recording errors documentation to include logs and avoid span events. | ||
|
|
||
| # Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists. | ||
| # The values here must be integers. | ||
| issues: [3228] | ||
|
|
||
| # (Optional) One or more lines of additional information to render under the primary note. | ||
| # These lines will be padded with 2 spaces and then inserted directly into the document. | ||
| # Use pipe (|) for multiline entries. | ||
| subtext: |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,26 +5,26 @@ | |
| <!-- toc --> | ||
|
|
||
| - [What constitutes an error](#what-constitutes-an-error) | ||
| - [What constitutes a failed operation](#what-constitutes-a-failed-operation) | ||
| - [Recording errors](#recording-errors) | ||
| - [Recording errors on spans](#recording-errors-on-spans) | ||
| - [Recording errors on metrics](#recording-errors-on-metrics) | ||
| - [Recording exceptions](#recording-exceptions) | ||
| - [Recording errors on logs](#recording-errors-on-logs) | ||
|
|
||
| <!-- tocstop --> | ||
|
|
||
| This document provides recommendations to semantic convention and instrumentation authors | ||
| on how to record errors on spans and metrics. | ||
| This document provides recommendations to semantic convention | ||
| and instrumentation authors on how to record errors on spans, metrics, and logs. | ||
|
|
||
| Individual semantic conventions are encouraged to provide additional guidance. | ||
|
|
||
| ## What constitutes an error | ||
|
|
||
| An operation SHOULD be considered as failed if any of the following is true: | ||
| In the scope of this document, an error occurs when: | ||
|
|
||
| - an exception is thrown by the instrumented method (API, block of code, or another instrumented unit) | ||
| - the instrumented method returns an error in another way, for example, via an error code | ||
|
|
||
| Semantic conventions that define domain-specific status codes SHOULD specify | ||
| which status codes should be reported as errors by a general-purpose instrumentation. | ||
| - an exception is thrown by an instrumented operation, | ||
| - the instrumented operation returns an error in another way, | ||
| for example, via an error object or status code. | ||
|
|
||
| > [!NOTE] | ||
| > | ||
|
|
@@ -33,38 +33,42 @@ An operation SHOULD be considered as failed if any of the following is true: | |
| > expected the resource to be available. However, it is not an error when the | ||
| > application is simply checking whether the resource exists. | ||
| > | ||
| > Instrumentations that have additional context about a specific request MAY use | ||
| > this context to set the span status more precisely. | ||
| > Instrumentations that have additional context about a specific request SHOULD | ||
| > use this context to classify whether the status code is an error. | ||
|
|
||
| Errors that were retried or handled (allowing an operation to complete gracefully) SHOULD NOT | ||
| be recorded on spans or metrics that describe this operation. | ||
| ## What constitutes a failed operation | ||
|
|
||
| ## Recording errors on spans | ||
| An operation SHOULD be considered as failed when it ends with an error. | ||
|
|
||
| [Span Status Code][SpanStatus] MUST be left unset if the instrumented operation has | ||
| ended without any errors. | ||
| Errors that were retried or handled (allowing an operation to complete gracefully) | ||
| SHOULD NOT be recorded on spans or metrics that describe this operation. | ||
|
|
||
| When the operation ends with an error, instrumentation: | ||
| ## Recording errors | ||
|
|
||
| - SHOULD set the span status code to `Error` | ||
| - SHOULD set the [`error.type`](/docs/registry/attributes/error.md#error-type) attribute | ||
| - SHOULD set the span status description when it has additional information | ||
| about the error which is not expected to contain sensitive details and aligns | ||
| with [Span Status Description][SpanStatus] definition. | ||
| Instrumentation SHOULD ensure that, for a given error, | ||
| the same [`error.type`][ErrorType] attribute value is used across all signals. | ||
|
|
||
| ## Recording errors on spans | ||
|
|
||
| When the instrumented operation fails, the instrumentation: | |
|
|
||
| It's NOT RECOMMENDED to duplicate status code or `error.type` in span status description. | ||
| - SHOULD set the span status code to `Error`, | ||
| - SHOULD set the [`error.type`][ErrorType] attribute, | ||
| - SHOULD set the span status description when it has additional information | |
| about the error that aligns with the [Span Status Description][SpanStatus] | |
| definition, for example, an exception message. | |
|
|
||
| When the operation fails with an exception, the span status description SHOULD be set to | ||
| the exception message. | ||
| Note that [Span Status Code][SpanStatus] MUST be left unset if the instrumented | ||
| operation has ended without any errors. | ||
|
|
||
| Refer to the [recording exceptions](#recording-exceptions) on capturing exception | ||
| details. | ||
| It is NOT RECOMMENDED to record the error via a span event, | ||
| for example, by using [`Span.RecordException`][RecordException]. | ||
|
|
||
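The span rules above can be sketched in Java as follows. This is only an illustration under stated assumptions: `SpanErrorSketch`, `SpanOutcome`, and `outcomeFor` are hypothetical stand-ins, not the OpenTelemetry Java API, and using the exception's canonical class name as the `error.type` value is one possible choice rather than something this change mandates.

```java
/** Hypothetical stand-in types; this is not the OpenTelemetry Java API. */
final class SpanErrorSketch {
    enum StatusCode { UNSET, ERROR }

    /** The values an instrumentation would set on the span when the operation ends. */
    static final class SpanOutcome {
        final StatusCode status;
        final String errorType;    // value for the `error.type` attribute, or null
        final String description;  // span status description, or null

        SpanOutcome(StatusCode status, String errorType, String description) {
            this.status = status;
            this.errorType = errorType;
            this.description = description;
        }
    }

    /** Maps the exception that ended the operation (null on success) to span fields. */
    static SpanOutcome outcomeFor(Throwable failure) {
        if (failure == null) {
            // No error: the status stays unset and `error.type` is not added.
            return new SpanOutcome(StatusCode.UNSET, null, null);
        }
        // Assumption: the canonical class name serves as the `error.type` value.
        String errorType = failure.getClass().getCanonicalName();
        // The exception message is used as the span status description.
        return new SpanOutcome(StatusCode.ERROR, errorType, failure.getMessage());
    }
}
```

The sketch deliberately has no counterpart to a span event: the error is carried by the status code, `error.type`, and the status description only.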
| ## Recording errors on metrics | ||
|
|
||
| Semantic conventions for operations usually define an operation duration histogram | ||
| metric. This metric SHOULD include the `error.type` attribute. This enables users to derive | ||
| throughput and error rates. | ||
| metric. This metric SHOULD include the [`error.type`][ErrorType] attribute. | ||
| This enables users to derive throughput and error rates. | ||
|
|
||
| Operations that complete successfully SHOULD NOT include the `error.type` attribute, | ||
| allowing users to filter out errors. | ||
|
|
@@ -76,50 +80,43 @@ messaging operation may involve sending multiple messages) and includes `error.t | |
| It's RECOMMENDED to report one metric that includes successes and failures as opposed | ||
| to reporting two (or more) metrics depending on the operation status. | ||
|
|
||
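As a rough Java sketch of the single-metric approach: one attribute set is built for the duration histogram, and `error.type` is present only when the operation failed. `DurationMetricAttributes`, `attributesFor`, and the `acme.operation.name` key are hypothetical names introduced for illustration; no real metrics API is used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; every key other than `error.type` is made up. */
final class DurationMetricAttributes {
    /**
     * Builds attributes for one measurement of an operation duration histogram.
     * Pass errorType as null when the operation completed successfully.
     */
    static Map<String, String> attributesFor(String operationName, String errorType) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("acme.operation.name", operationName); // hypothetical attribute
        if (errorType != null) {
            // Only failed operations carry `error.type`, so successes can be
            // selected by filtering on the absence of this attribute.
            attributes.put("error.type", errorType);
        }
        return attributes;
    }
}
```

Passing the same `errorType` value that was set on the span keeps the two signals consistent for a given error.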
| Instrumentation SHOULD ensure `error.type` is applied consistently across spans | ||
| and metrics when both are reported. A span and its corresponding metric for a single | ||
| operation SHOULD have the same `error.type` value if the operation failed and SHOULD NOT | ||
| include it if the operation succeeded. | ||
|
|
||
| ## Recording exceptions | ||
|
|
||
| When an instrumented operation fails with an exception, instrumentation SHOULD record | ||
| this exception as a [span event](/docs/exceptions/exceptions-spans.md) or a [log record](/docs/exceptions/exceptions-logs.md). | ||
|
|
||
| It's RECOMMENDED to use the `Span.recordException` API or logging library API that takes exception instance | ||
| instead of providing individual attributes. This enables the OpenTelemetry SDK to | ||
| control what information is recorded based on application configuration. | ||
|
|
||
| It's NOT RECOMMENDED to record the same exception more than once. | ||
| It's NOT RECOMMENDED to record exceptions that are handled by the instrumented library. | ||
|
|
||
| For example, in this code-snippet, `ResourceAlreadyExistsException` is handled and the corresponding | ||
| native instrumentation should not record it. Exceptions which are propagated | ||
| to the caller should be recorded (or logged) once. | ||
|
|
||
| ```java | ||
| public boolean createIfNotExists(String resourceId) throws IOException { | ||
| Span span = startSpan(); | ||
| try { | ||
| create(resourceId); | ||
| return true; | ||
| } catch (ResourceAlreadyExistsException e) { | ||
| // not recording exception and not setting span status to error - exception is handled | ||
| // but we can set attributes that capture additional details | ||
| span.setAttribute(AttributeKey.stringKey("acme.resource.create.status"), "already_exists"); | ||
| return false; | ||
| } catch (IOException e) { | ||
| // recording exception here (assuming it was not recorded inside `create` method) | ||
| span.recordException(e); | ||
| // or | ||
| // logger.warn(e); | ||
|
|
||
| span.setAttribute(AttributeKey.stringKey("error.type"), e.getClass().getCanonicalName()) | ||
| span.setStatus(StatusCode.ERROR, e.getMessage()); | ||
| throw e; | ||
| } | ||
| } | ||
| ``` | ||
| ## Recording errors on logs | ||
|
|
||
| When recording an error using logs ([event records][EventRecord]), the instrumentation: | |
|
|
||
| - SHOULD set [`EventName`][EventName] to a value that helps indicate | |
| which operation failed, e.g. `socket.connection_failed`, | |
| - SHOULD set the [`error.type`][ErrorType] attribute, | ||
| - SHOULD set the [`error.message`][ErrorMessage] attribute to provide additional | |
| information about the error, for example, an exception message. | |
|
|
||
| When an error occurs outside the context of any span | ||
| and it causes an operation to fail, | ||
| the instrumentation SHOULD record it as an event record. | ||
| In such a scenario, [`SeverityNumber`][SeverityNumber] MUST be greater than | |
| or equal to 17 (ERROR). | |
|
|
||
| When an error occurs inside the context of a span | ||
| and it causes an operation to fail, | ||
| the instrumentation SHOULD NOT additionally record it as an event record. | ||
|
|
||
| > [!NOTE] | ||
| > | ||
| > Applications that also want event records for errors that are already | ||
| > recorded on spans can use a span processor (or equivalent component) | ||
| > that emits corresponding error event records. This is an optional, | ||
| > user-configured mechanism and is not required by these conventions. | ||
|
|
||
| When an error is retried or handled, even when the overall operation completes | ||
| successfully, the instrumentation MAY record it as an event record for | ||
| diagnostic purposes. In such a scenario, [`SeverityNumber`][SeverityNumber] | |
| MUST be below 17 (ERROR). | ||
|
|
||
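The three cases above (failure outside a span, failure inside a span, handled or retried error) can be summarized in a small Java decision helper. This is a sketch under assumptions: `ErrorEventPolicy`, `Decision`, `decide`, and `severityFor` are hypothetical names, the helper chooses to emit the optional record for handled errors, and WARN (13) is just one possible severity below 17.

```java
/** Hypothetical policy helper; not part of any OpenTelemetry API. */
final class ErrorEventPolicy {
    static final int SEVERITY_WARN = 13;   // WARN in the log data model
    static final int SEVERITY_ERROR = 17;  // ERROR in the log data model

    enum Decision { RECORD_AT_ERROR, RECORD_BELOW_ERROR, SKIP }

    /**
     * @param operationFailed true when the error caused the operation to fail,
     *                        false when it was retried or handled
     * @param insideSpan      true when the error occurred in the context of a span
     */
    static Decision decide(boolean operationFailed, boolean insideSpan) {
        if (!operationFailed) {
            // Retried or handled error: an event record is optional (MAY);
            // this sketch opts to emit it for diagnostics.
            return Decision.RECORD_BELOW_ERROR;
        }
        // Failed operation: when a span exists it already carries the error,
        // so no additional event record is emitted.
        return insideSpan ? Decision.SKIP : Decision.RECORD_AT_ERROR;
    }

    /** Returns the SeverityNumber to use, or -1 when no event record is emitted. */
    static int severityFor(Decision decision) {
        if (decision == Decision.RECORD_AT_ERROR) {
            return SEVERITY_ERROR;
        }
        if (decision == Decision.RECORD_BELOW_ERROR) {
            return SEVERITY_WARN; // assumption: any value below 17 would satisfy the rule
        }
        return -1;
    }
}
```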
| [DocumentStatus]: https://opentelemetry.io/docs/specs/otel/document-status | ||
| [SpanStatus]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.52.0/specification/trace/api.md#set-status | ||
| [RecordException]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.52.0/specification/trace/api.md#record-exception | ||
| [ErrorType]: /docs/registry/attributes/error.md#error-type | ||
| [ErrorMessage]: /docs/registry/attributes/error.md#error-message | ||
| [EventRecord]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.52.0/specification/logs/data-model.md#log-and-event-record-definition | ||
| [EventName]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.52.0/specification/logs/data-model.md#field-eventname | ||
| [SeverityNumber]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.52.0/specification/logs/data-model.md#field-severitynumber | ||
nit: does it need a separate section?
We have a similar section for defining "error". I think that the "failed operations" is a very important term used in this document that deserves also being defined. I can refactor it to a single section like "Definitions used in this document". However, I would prefer doing such structural changes in a separate PR.
Removing the section after changes.
I removed the whole section so that we can stabilize each section independently. Also as noted #3228 (comment) there are cases when we want to add attribute to spans even if the operation has not (semantically) failed. The metrics already has