Skip to content

Conversation

@pellared
Copy link
Member

@pellared pellared commented Dec 19, 2025

Towards OTEP: Span Event API deprecation plan

Towards open-telemetry/opentelemetry-specification#4429

Towards open-telemetry/opentelemetry-go-contrib#7254 (comment)
Towards open-telemetry/opentelemetry-go-contrib#7470

Supersedes #2296

"Prototype" (do this across all instrumentation libraries in Go): open-telemetry/opentelemetry-go-contrib#7254 (comment)

Note that this is needed toward stabilization of otelhttp which is the OTel Go instrumentation library for HTTP server and client as we want to have a clear and stable way on how to record errors.

Changes

  • Adds comprehensive guidance on recording errors via logs (event records) instead of span events
  • Clarifies the distinction between "error" and "failed operation" concepts
  • Deprecates the recommendation to use Span.RecordException for error recording
  • Establishes consistency requirements for error.type across all signals (spans, metrics, logs)

Other

I was also considering deprecating Semantic conventions for exceptions on spans, but decided to hold off and better scope it to a separate PR.

An next steps (probably after stabilizing parts of this document) we could deprecate:

Changing "error" into "exception" terminology (which better aligns with the specification e.g. Span.RecordException and current semantic conventions) can be done in a separate PR

I was thinking about adding error.stacktrace per 2025-12-09 Logs SIG meeting, but I felt that it would be better to scope it to a separate PR as this is already big enough. But probably we would prefer to use exception.* attributes anyway.

@pellared pellared requested review from a team as code owners December 19, 2025 13:21
@pellared pellared changed the title Refine recording errors documentation towards Span Event API deprecat… Refine recording errors documentation towards Span Event API deprecation plan Dec 19, 2025
@pellared pellared changed the title Refine recording errors documentation towards Span Event API deprecation plan Refine recording errors documentation to include logs and avoid span events Dec 19, 2025
@github-actions github-actions bot added the enhancement New feature or request label Dec 19, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines the error recording documentation to align with the Span Event API deprecation plan (OTEP 4430), which is necessary for stabilizing the otelhttp instrumentation library in OpenTelemetry Go.

Key changes:

  • Adds comprehensive guidance on recording errors via logs/events instead of span events
  • Clarifies the distinction between "error" and "failed operation" concepts
  • Deprecates the recommendation to use Span.RecordException for error recording
  • Establishes consistency requirements for error.type across all signals (spans, metrics, logs)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
docs/general/recording-errors.md Major restructure of error recording guidance: adds new sections for logs, clarifies when operations are failed vs encountering errors, removes Java exception handling example, discourages span events for error recording, and adds guidance for severity levels in log-based error recording
.chloggen/3228.yaml Adds changelog entry documenting the enhancement to error recording documentation

@pellared pellared requested a review from Copilot December 19, 2025 13:52
@pellared pellared self-assigned this Dec 19, 2025
@pellared pellared moved this to In progress in Logs SIG Dec 19, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

@pellared pellared requested a review from dashpole January 7, 2026 11:09

When the instrumented operation failed, the instrumentation:

- SHOULD set the span status code to `Error`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the definitions above, this would happen any time the span ends with an exception thrown by the instrumented operation. But the instrumentation may know that the exception doesn't actually represent an error, e.g. that the exception is expected to be handled outside the span (see pydantic/logfire#1361 for example).

We still want to be able to record info about these kinds of exceptions, including a traceback. But they shouldn't be marked as errors. That means that there's also no place here to store the exception message, since the span status description can't be set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL 6e232d4

- SHOULD set [`SeverityNumber`][SeverityNumber].

When the error happens during an operation,
it is RECOMMENDED to set [`SeverityNumber`][SeverityNumber] below 9 (INFO).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you're saying here is "when the error happens before the operation has ended and is handled, so the operation is still expected to complete successfully". That should be clarified.

In that situation, I find it surprising to consider this as severity below INFO. I would usually log this as a warning, it's concerning enough to warrant some attention. A user is likely to configure logging to filter out everything below INFO, meaning they wouldn't see any sign of transient errors at all. If there was internal retrying to overcome the error, you'd likely see a span taking mysteriously long and think there was a performance problem.

Copy link
Member Author

@pellared pellared Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you're saying here is "when the error happens before the operation has ended and is handled, so the operation is still expected to complete successfully". That should be clarified.

It is also when it has ended and when it is not handled (operation fails).
This tries to address #3228 (comment)

I would usually log this as a warning, it's concerning enough to warrant some attention.

I disagree. Transient errors are happening all the time. I would consider it as logs overuse.

A user is likely to configure logging to filter out everything below INFO, meaning they wouldn't see any sign of transient errors at all. If there was internal retrying to overcome the error, you'd likely see a span taking mysteriously long and think there was a performance problem.

This is not true. Note that the user would get the information about the transient errors via metrics. Then if the users want diagnose it further they can always set even "TRACE" severity for given error.type.

Copy link
Contributor

@alexmojaki alexmojaki Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. Transient errors are happening all the time. I would consider it as logs overuse.

Based on what? We record transient errors in our application as warnings and I'm glad we do.

Note that the user would get the information about the transient errors via metrics.

Unless you're lucky, there would be no way to determine from metrics what happened for a specific given span. And the user might not even think to look at metrics at all, or know how.

Then if the users want diagnose it further

But they can't diagnose past events.

they can always set even "TRACE" severity for given error.type.

Is this particularly easy to do? In any case I think it would be better if the user had the option to set the severity of the log to something higher.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to address it together with #3228 (comment)

Saying "above info" or "below info" without enough context will seem confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL 14f21c6

@github-project-automation github-project-automation bot moved this from Untriaged to Blocked in Semantic Conventions Triage Jan 7, 2026
@pellared pellared requested a review from alexmojaki January 7, 2026 13:27
- SHOULD set the [`error.type`][ErrorType] attribute,
- SHOULD set [`error.message`][ErrorMessage] attribute to add additional
information about the error, for example, an exception message,
- SHOULD set the `error.stacktrace` attribute.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought you didn't want to introduce new attributes. why not exception.stacktrace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing "error" into "exception" terminology (which better aligns with the specification e.g. Span.RecordException and current semantic conventions) can be done in a separate PR.

There are too many open discussions around this and it is not clear yet which way we want to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd only set the stacktrace in the case of an exception, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implementation/language specific. E.g. usually in Go errors do not have stacktraces. However, they may have it e.g. when using github.com/pkg/errros.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so will you add a new attribute to the registry?

Copy link
Member Author

@pellared pellared Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in the scope of this PR or at least not until there is a clear agreement.

There are too many open discussions around this and it is not clear yet which way we want to go.

Note that this document is in Development status.

When the instrumented operation failed, the instrumentation:

It's NOT RECOMMENDED to duplicate status code or `error.type` in span status description.
- SHOULD set the span status code to `Error` if this is a semantical error,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the condition needs to be elaborated

Copy link
Member Author

@pellared pellared Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any proposal how? Here is an example what constitutes a semantic error for HTTP spans: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-spans.md#status. Maybe a hyperlink as this is a good example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHOULD set the span status code to Error, unless this is an exception that doesn't actually represent an error, e.g. it is being used for control flow, or is expected to be handled gracefully outside the span. For example, a 400 HTTP status code is not an error on an HTTP server span because it indicates a problem with the user's request, not the application. But a 500 HTTP status code is an error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note. Problably when we change exception to error terminology it would be more clear.

Copy link
Member Author

@pellared pellared Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar feedback in #3228 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL 14f21c6

@pellared pellared requested a review from alexmojaki January 7, 2026 13:41
- SHOULD set the span status code to `Error` if this is a semantic error,
- SHOULD set the [`error.type`][ErrorType] attribute,
- SHOULD set [`error.message`][ErrorMessage] attribute to add additional
information about the error, for example, an exception message,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incompatible with:

It is also NOT RECOMMENDED to duplicate the value of `exception.message` in `error.message`.
`error.message` is NOT RECOMMENDED for metrics or spans due to its unbounded cardinality and overlap with span status.

one of them needs to change

@pellared pellared requested a review from alexmojaki January 7, 2026 16:58
…tion and clarifying recording practices for retried errors
it it renduant with the the line below
@pellared
Copy link
Member Author

pellared commented Jan 7, 2026

SIG meeting notes:
Some changes done here are seen as too big and too controversial.
Going to make smaller steps on things that we agreed on.

Personal opinion: I think it was a good as this PR initiated some important discussions and helps us planning next steps.

@pellared pellared marked this pull request as draft January 7, 2026 20:11
@pellared
Copy link
Member Author

pellared commented Jan 7, 2026

SIG meeting notes:
We also agreed that in otelhttp we should not use Record Exception API nor "recording exception" guidelines. We should just follow "recording errors on spans" and "recording errors on metrics" guidelines.
CC @open-telemetry/go-maintainers

@pellared pellared closed this Jan 7, 2026
@github-project-automation github-project-automation bot moved this from In progress to Done in Logs SIG Jan 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done
Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants