-
Notifications
You must be signed in to change notification settings - Fork 299
Refine recording errors documentation to include logs and avoid span events #3228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refines the error recording documentation to align with the Span Event API deprecation plan (OTEP 4430), which is necessary for stabilizing the otelhttp instrumentation library in OpenTelemetry Go.
Key changes:
- Adds comprehensive guidance on recording errors via logs/events instead of span events
- Clarifies the distinction between "error" and "failed operation" concepts
- Deprecates the recommendation to use
Span.RecordExceptionfor error recording - Establishes consistency requirements for
error.typeacross all signals (spans, metrics, logs)
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/general/recording-errors.md | Major restructure of error recording guidance: adds new sections for logs, clarifies when operations are failed vs encountering errors, removes Java exception handling example, discourages span events for error recording, and adds guidance for severity levels in log-based error recording |
| .chloggen/3228.yaml | Adds changelog entry documenting the enhancement to error recording documentation |
…ording errors documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
docs/general/recording-errors.md
Outdated
|
|
||
| When the instrumented operation failed, the instrumentation: | ||
|
|
||
| - SHOULD set the span status code to `Error`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the definitions above, this would happen any time the span ends with an exception thrown by the instrumented operation. But the instrumentation may know that the exception doesn't actually represent an error, e.g. that the exception is expected to be handled outside the span (see pydantic/logfire#1361 for example).
We still want to be able to record info about these kinds of exceptions, including a traceback. But they shouldn't be marked as errors. That means that there's also no place here to store the exception message, since the span status description can't be set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL 6e232d4
docs/general/recording-errors.md
Outdated
| - SHOULD set [`SeverityNumber`][SeverityNumber]. | ||
|
|
||
| When the error happens during an operation, | ||
| it is RECOMMENDED to set [`SeverityNumber`][SeverityNumber] below 9 (INFO). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you're saying here is "when the error happens before the operation has ended and is handled, so the operation is still expected to complete successfully". That should be clarified.
In that situation, I find it surprising to consider this as severity below INFO. I would usually log this as a warning, it's concerning enough to warrant some attention. A user is likely to configure logging to filter out everything below INFO, meaning they wouldn't see any sign of transient errors at all. If there was internal retrying to overcome the error, you'd likely see a span taking mysteriously long and think there was a performance problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you're saying here is "when the error happens before the operation has ended and is handled, so the operation is still expected to complete successfully". That should be clarified.
It is also when it has ended and when it is not handled (operation fails).
This tries to address #3228 (comment)
I would usually log this as a warning, it's concerning enough to warrant some attention.
I disagree. Transient errors are happening all the time. I would consider it as logs overuse.
A user is likely to configure logging to filter out everything below INFO, meaning they wouldn't see any sign of transient errors at all. If there was internal retrying to overcome the error, you'd likely see a span taking mysteriously long and think there was a performance problem.
This is not true. Note that the user would get the information about the transient errors via metrics. Then if the users want diagnose it further they can always set even "TRACE" severity for given error.type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. Transient errors are happening all the time. I would consider it as logs overuse.
Based on what? We record transient errors in our application as warnings and I'm glad we do.
Note that the user would get the information about the transient errors via metrics.
Unless you're lucky, there would be no way to determine from metrics what happened for a specific given span. And the user might not even think to look at metrics at all, or know how.
Then if the users want diagnose it further
But they can't diagnose past events.
they can always set even "TRACE" severity for given error.type.
Is this particularly easy to do? In any case I think it would be better if the user had the option to set the severity of the log to something higher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to address it together with #3228 (comment)
Saying "above info" or "below info" without enough context will seem confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL 14f21c6
| - SHOULD set the [`error.type`][ErrorType] attribute, | ||
| - SHOULD set [`error.message`][ErrorMessage] attribute to add additional | ||
| information about the error, for example, an exception message, | ||
| - SHOULD set the `error.stacktrace` attribute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought you didn't want to introduce new attributes. why not exception.stacktrace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing "error" into "exception" terminology (which better aligns with the specification e.g. Span.RecordException and current semantic conventions) can be done in a separate PR.
There are too many open discussions around this and it is not clear yet which way we want to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd only set the stacktrace in the case of an exception, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is implementation/language specific. E.g. usually in Go errors do not have stacktraces. However, they may have it e.g. when using github.com/pkg/errros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so will you add a new attribute to the registry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in the scope of this PR or at least not until there is a clear agreement.
There are too many open discussions around this and it is not clear yet which way we want to go.
Note that this document is in Development status.
docs/general/recording-errors.md
Outdated
| When the instrumented operation failed, the instrumentation: | ||
|
|
||
| It's NOT RECOMMENDED to duplicate status code or `error.type` in span status description. | ||
| - SHOULD set the span status code to `Error` if this is a semantical error, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the condition needs to be elaborated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any proposal how? Here is an example what constitutes a semantic error for HTTP spans: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-spans.md#status. Maybe a hyperlink as this is a good example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHOULD set the span status code to Error, unless this is an exception that doesn't actually represent an error, e.g. it is being used for control flow, or is expected to be handled gracefully outside the span. For example, a 400 HTTP status code is not an error on an HTTP server span because it indicates a problem with the user's request, not the application. But a 500 HTTP status code is an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note. Problably when we change exception to error terminology it would be more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar feedback in #3228 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL 14f21c6
Co-authored-by: Alex Hall <[email protected]>
| - SHOULD set the span status code to `Error` if this is a semantic error, | ||
| - SHOULD set the [`error.type`][ErrorType] attribute, | ||
| - SHOULD set [`error.message`][ErrorMessage] attribute to add additional | ||
| information about the error, for example, an exception message, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incompatible with:
semantic-conventions/docs/registry/attributes/error.md
Lines 19 to 21 in 502202e
| It is also NOT RECOMMENDED to duplicate the value of `exception.message` in `error.message`. | |
| `error.message` is NOT RECOMMENDED for metrics or spans due to its unbounded cardinality and overlap with span status. |
one of them needs to change
…ntions into recording-errors
…tion and clarifying recording practices for retried errors
it it renduant with the the line below
|
SIG meeting notes: Personal opinion: I think it was a good as this PR initiated some important discussions and helps us planning next steps. |
|
SIG meeting notes: |
Towards OTEP: Span Event API deprecation plan
Towards open-telemetry/opentelemetry-specification#4429
Towards open-telemetry/opentelemetry-go-contrib#7254 (comment)
Towards open-telemetry/opentelemetry-go-contrib#7470
Supersedes #2296
"Prototype" (do this across all instrumentation libraries in Go): open-telemetry/opentelemetry-go-contrib#7254 (comment)
Note that this is needed toward stabilization of
otelhttpwhich is the OTel Go instrumentation library for HTTP server and client as we want to have a clear and stable way on how to record errors.Changes
Span.RecordExceptionfor error recordingerror.typeacross all signals (spans, metrics, logs)Other
I was also considering deprecating Semantic conventions for exceptions on spans, but decided to hold off and better scope it to a separate PR.
An next steps (probably after stabilizing parts of this document) we could deprecate:
Changing "error" into "exception" terminology (which better aligns with the specification e.g.
Span.RecordExceptionand current semantic conventions) can be done in a separate PRI was thinking about adding
error.stacktraceper 2025-12-09 Logs SIG meeting, but I felt that it would be better to scope it to a separate PR as this is already big enough. But probably we would prefer to useexception.*attributes anyway.