OTEP: Recording exceptions as log based events #4333

Open. @lmolkova wants to merge 33 commits into base: main

lmolkova
Contributor

@lmolkova lmolkova commented Dec 10, 2024

Related to open-telemetry/semantic-conventions#1536

Changes

Recording exceptions as span events is problematic because it:

  • ties recording exceptions to tracing/sampling
  • duplicates exceptions recorded by instrumented libraries on logs
  • does not leverage log features such as typical log filtering based on severity

This OTEP provides guidance on how to record exceptions using OpenTelemetry logs, focusing on minimizing duplication and providing context to reduce noise.
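To illustrate the severity-filtering point from the list above, here is a minimal sketch using only Python's standard `logging` module rather than the OpenTelemetry API. The `ListHandler` class and the logger name are made up for the example; the assumption is that a handled, retried exception is logged at WARNING and a terminal one at ERROR.

```python
import logging


class ListHandler(logging.Handler):
    """Illustrative handler that keeps records in memory."""

    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("exceptions_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False

# Only ERROR and above reach this handler.
handler = ListHandler(level=logging.ERROR)
logger.addHandler(handler)

try:
    raise TimeoutError("attempt 1 timed out")
except TimeoutError as exc:
    # Handled and retried: recorded at WARNING, so the handler drops it.
    logger.warning("retrying request", exc_info=exc)

try:
    raise ConnectionError("all attempts failed")
except ConnectionError as exc:
    # Terminal failure: recorded at ERROR, so the handler keeps it.
    logger.error("request failed", exc_info=exc)

kept = [(r.levelname, r.exc_info[0].__name__) for r in handler.records]
print(kept)
```

A span event has no severity to filter on, so the same choice would have to be made at the call site instead of in configuration.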

If accepted, the follow-up spec changes are expected to replace existing (stable) documents:


@lmolkova changed the title from "OTEP: Recording exceptions and errors with OpenTelemetry" to "OTEP: Recording exceptions as log based events" on Dec 10, 2024
@pellared
Member

pellared commented Dec 10, 2024

I think this is a related issue:

@tedsuo tedsuo added the OTEP OpenTelemetry Enhancement Proposal (OTEP) label Dec 12, 2024
@lmolkova lmolkova force-pushed the exceptions-on-logs-otep branch 2 times, most recently from b06a09f to 76c7d85 Compare December 17, 2024 17:30
@lmolkova lmolkova force-pushed the exceptions-on-logs-otep branch from db27087 to e9f38aa Compare January 3, 2025 01:42
@carlosalberto
Contributor

A small doubt:

If this instrumentation supports tracing, it should capture the error in the scope of the processing span.

Although (I think) it's not called out, my understanding is that exceptions should now be explicitly reported as both 1) a Span.Event and 2) a Log/Event? I.e., code-wise, you should do this:

currentSpan.recordException(e);
logger.logRecordBuilder()
    .addException(e) // builder method name as written in this question, not a confirmed API
    .emit();

Is this the case?

Contributor

@jsuereth jsuereth left a comment


Overall I'm very supportive. Just some nits and one mitigation I'd like to see called out/addressed.

@adriangb

adriangb commented Feb 6, 2025

it is still possible to send exceptions reported via Logging API to be added to Span by writing LogRecordProcessor.

Yes that's what I meant by:

can not (easily, without buffering, etc.)

Just because it's possible doesn't mean it's easy. At the SDK level it's doable, you can assume there are no network failures, etc. Even then it's not simple because you're still buffering stuff in memory you didn't have to previously which can result in increased memory usage, having to sequence flushing correctly to avoid flushing the logs before you process the spans that they relate to, etc. Let's say you do write something that works pretty well: now you have to convince all of your users to install it and configure it correctly. That's not an easy thing to do.

Not sure I follow this. If exceptions are sent via spans today, then you are already buffering and dealing with all of this. Using a LogRecordProcessor to attach the exception as a SpanEvent won't change that: exceptions continue to work the same way as before.

When I receive a span today I know that I've also received all of the exceptions associated with that span: they come in the same payload, at the same time. If the exceptions are sent as logs (or child spans, anything except the same span) they may arrive at any time, e.g. an hour before or after the span was received in the case of a networking issue. Or they may never arrive at all.
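The arrival-order concern can be sketched with a toy backend that joins exception logs to spans by span ID. Everything here is hypothetical (`CorrelatingStore`, `max_wait_s`, the span IDs and timestamps); it only illustrates that a backend has to buffer logs until the span shows up and eventually give up on them.

```python
from dataclasses import dataclass, field


@dataclass
class StoredSpan:
    span_id: str
    name: str
    exceptions: list = field(default_factory=list)


class CorrelatingStore:
    """Hypothetical backend store that joins exception logs to spans by span_id."""

    def __init__(self, max_wait_s):
        self.max_wait_s = max_wait_s
        self.spans = {}    # span_id -> StoredSpan
        self.pending = {}  # span_id -> [(arrival_time, exception_type)]
        self.orphans = []  # exception logs the store gave up on

    def on_log(self, span_id, exception_type, now):
        span = self.spans.get(span_id)
        if span is not None:
            span.exceptions.append(exception_type)  # span already stored
        else:
            # Span not seen yet: the log has to be buffered in memory.
            self.pending.setdefault(span_id, []).append((now, exception_type))

    def on_span(self, span_id, name, now):
        self.expire(now)
        span = StoredSpan(span_id, name)
        span.exceptions = [exc for _, exc in self.pending.pop(span_id, [])]
        self.spans[span_id] = span

    def expire(self, now):
        """Give up on buffered logs older than max_wait_s."""
        for span_id, entries in list(self.pending.items()):
            fresh = [(t, exc) for t, exc in entries if now - t <= self.max_wait_s]
            self.orphans += [
                (span_id, exc) for t, exc in entries if now - t > self.max_wait_s
            ]
            if fresh:
                self.pending[span_id] = fresh
            else:
                del self.pending[span_id]


store = CorrelatingStore(max_wait_s=60)

# Log and span arrive close together: the join works.
store.on_log("span-a", "ConnectionError", now=0)
store.on_span("span-a", "GET /orders", now=5)

# The span is delayed by an hour: the buffered log expires first.
store.on_log("span-b", "TimeoutError", now=10)
store.on_span("span-b", "GET /invoices", now=3610)

print(store.spans["span-a"].exceptions)
print(store.spans["span-b"].exceptions)
print(store.orphans)
```

With span events none of this bookkeeping exists, because the exception travels inside the span payload.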

@cijothomas
Member

it is still possible to send exceptions reported via Logging API to be added to Span by writing LogRecordProcessor.

Yes that's what I meant by:

can not (easily, without buffering, etc.)

Just because it's possible doesn't mean it's easy. At the SDK level it's doable, you can assume there are no network failures, etc. Even then it's not simple because you're still buffering stuff in memory you didn't have to previously which can result in increased memory usage, having to sequence flushing correctly to avoid flushing the logs before you process the spans that they relate to, etc. Let's say you do write something that works pretty well: now you have to convince all of your users to install it and configure it correctly. That's not an easy thing to do.

Not sure I follow this. If exceptions are sent via spans today, then you are already buffering and dealing with all of this. Using a LogRecordProcessor to attach the exception as a SpanEvent won't change that: exceptions continue to work the same way as before.

When I receive a span today I know that I've also received all of the exceptions associated with that span: they come in the same payload, at the same time. If the exceptions are sent as logs (or child spans, anything except the same span) they may arrive at any time, e.g. an hour before or after the span was received in the case of a networking issue. Or they may never arrive at all.

Sorry, I don't follow...Here's an example showing my thinking:

my_method()
{
    let span = tracer.startActiveSpan("my span");
    try
    {
        doSomething();
    }
    catch (ex)
    {
        log(ex);
        // span.recordException(ex); is commented out, i.e. the exception is *not* reported as a SpanEvent by the instrumentation.
    }

    span.end();
}

LogRecordProcessor
{
    OnEmit(logRecord)
    {
        ex = logRecord.ex;
        span = GetActiveSpan();
        span.recordException(ex); // adds the exception to the span itself, achieving the same effect as the instrumentation doing this natively
    }
}

^ This will have the same effect as the commented-out part (span.recordException).
The span payload will contain the exception information, just as before (when the instrumentation called span.recordException).

@adriangb

adriangb commented Feb 6, 2025

Sorry, I don't follow...Here's an example showing my thinking:

Now convince all of your users to put that into their code. Configuring OTEL correctly is already ~50 LOC, now it's 100 LOC. Users won't do it. And as explained above it's not really possible to do this reliably in a collector / backend.

@cijothomas
Member

Sorry, I don't follow...Here's an example showing my thinking:

Now convince all of your users to put that into their code. Configuring OTEL correctly is already ~50 LOC, now it's 100 LOC. Users won't do it. And as explained above it's not really possible to do this reliably in a collector / backend.

If a user wants to do it (see exception inside Span), they can. I just showed it is possible. Whether users do it or not - I can't comment.

If the number of lines is a concern, OTel SDKs can provide an option to do it automatically. #4333 (comment) has a link to a (Log->SpanEvent) option in one of the languages.

@adriangb

adriangb commented Feb 6, 2025

In my opinion what matters is not what's possible, what matters is what the real world experience is going to be for users and backend implementers. And the reality is that this change may have a serious negative impact on both.

@cijothomas
Member

In my opinion what matters is not what's possible, what matters is what the real world experience is going to be for users and backend implementers. And the reality is that this change may have a serious negative impact on both.

I don't think OTel ever recommended that exceptions MUST/SHOULD be reported via SpanEvents. It had conventions for reporting exceptions via both SpanEvents and Logs (the logs convention came later than the span one), but never recommended one over the other.
This was also my first comment on this OTEP.

This OTEP would be the first time OTel officially makes a recommendation on the preferred way of reporting exceptions.

(That is my read. Happy to be corrected!)

@adriangb

adriangb commented Feb 6, 2025

I agree with your interpretation. My point is that recording events via logs does have downsides. Right now you have to opt into it so it's okay to have to do something like the LogRecordProcessor you proposed to get that to work correctly (essentially duplicating the info). But if it becomes the recommendation and presumably eventually the default, as this OTEP is proposing, that results in a worse user experience for both users and OTEL backend engineers.

@cijothomas
Member

I agree with your interpretation. My point is that recording events via logs does have downsides. Right now you have to opt into it so it's okay to have to do something like the LogRecordProcessor you proposed to get that to work correctly (essentially duplicating the info). But if it becomes the recommendation and presumably eventually the default, as this OTEP is proposing, that results in a worse user experience for both users and OTEL backend engineers.

that results in a worse user experience for both users and OTEL backend engineers.

I think this depends on the backend/vendor.

Recording exceptions via logs has downsides. Recording exceptions via SpanEvents has downsides too. Having no recommendation from OTel is also not good. It is definitely possible to have something in the OTel spec for SDKs: a feature flag to control this based on user preference. (The feature flag can convert exceptions from LogRecord to SpanEvents or vice versa, with or without duplicating.) It is also possible to have a feature flag for instrumentations, but it may be easier to have an SDK-level option that ensures consistent behavior, irrespective of the instrumentation used.
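As a rough illustration of what such a flag could look like, here is a small Python sketch. The `ExceptionMode` enum, its values and `record_exception` are hypothetical names chosen for the example; nothing like this is defined in the spec or the OTEP.

```python
from enum import Enum


class ExceptionMode(Enum):
    """Hypothetical SDK-level flag; names and values are illustrative only."""

    LOG_ONLY = "log_only"
    SPAN_EVENT_ONLY = "span_event_only"
    BOTH = "both"


def record_exception(mode, exc, span_events, log_records):
    """Route one exception to logs, span events, or both, depending on the flag."""
    entry = {
        "exception.type": type(exc).__name__,
        "exception.message": str(exc),
    }
    if mode in (ExceptionMode.LOG_ONLY, ExceptionMode.BOTH):
        log_records.append(dict(entry))
    if mode in (ExceptionMode.SPAN_EVENT_ONLY, ExceptionMode.BOTH):
        span_events.append(dict(entry))


# Show how many span events and log records each mode produces.
results = {}
for mode in ExceptionMode:
    span_events, log_records = [], []
    record_exception(mode, ValueError("boom"), span_events, log_records)
    results[mode.value] = (len(span_events), len(log_records))

print(results)
```

Putting the switch in one place, rather than in every instrumentation, is what would keep the behavior consistent across libraries.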

@lmolkova
Contributor Author

lmolkova commented Feb 6, 2025

to @alexmojaki

With span events, it's easy for the backend to store the span and the exception together because OTLP keeps them together. If recording an exception on a span generates a separate log event then those are guaranteed to be in separate exports. For our backend at least it's very hard to then bring them together in storage.

You could bring them together at the SDK level with an events-to-span-events processor. It's a great question what the span-events -> logs migration would look like and whether such a processor should be provided by a contrib component, by individual distros, or by default.

If a span is excluded by sampling, what happens to its child log events by default? Even if it's included, it will be missing the parent span describing the context of the exception. It's better than nothing, but this doesn't seem like a real solution to exceptions being lost by sampling. Tail sampling seems like the way to deal with that.

Correlated logs with errors and other details without parent span are pretty valuable regardless of tracing or sampling.

I understand that it's useful to be able to log exceptions without the need to create a span, but that's different from saying that spans should no longer contain exceptions, which is where this OTEP seems headed even if it's not requiring it yet. Why remove span events?

This is part of the previous OTEP on Events - https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events. In the long term, we hope to replace span events with log-based events, with a migration story for consumers.

We extract special columns like is_exception and exception_type from events which make it trivial for users to query exceptions, they don't have to know about events at all.

We report error.type on spans and it contains the exception type if an exception was the reason. PTAL at https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/recording-errors.md.
If you have scenarios in mind where it's useful to capture that the error manifested as an exception (vs. some status code), please bring them up in semantic conventions.

Span events themselves are not special, but storing exception data directly inside relevant spans is important to us.

Does it include the stack trace? The rest is (or can be) stored for a single 'terminal' exception that the span ends with. Or is your goal to store the exception chain and/or handled exceptions that happened during the span's lifetime?

The counter-argument I have is that there are a lot of exceptions that happen during span execution and most of them are already recorded as logs by runtimes, client libs, frameworks, etc. I.e. you're already in a world where exceptions are not exported along with spans.

It's a fair point, though, that a user app may want to associate arbitrary exceptions with a span, as it can today, and we're taking that away when moving to logs.

@trask
Member

trask commented Feb 7, 2025

@adriangb @alexmojaki @samuelcolvin would you mind summarizing where we are with each of your concerns above as separate review comments (on the Files changed tab at the top, pick a relevant-ish line and add the summary of where we're at there) so we can have "threaded" discussions on each one of them? I really want to follow along and see what needs to be addressed, but am having trouble following with everything as mainline comments 😅 thank you ❤️

2. Recording exceptions as logs will result in UX degradation for users
leveraging trace-only backends such as Jaeger.

3. Having exceptions exported and stored along with span is beneficial for some backends.


Commenting here to follow the request of commenting somewhere vaguely relevant for a threaded discussion.

Let me check if I understand things correctly. Currently this code:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from requests import ConnectionError

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = provider.get_tracer(__name__)

with tracer.start_as_current_span("foo"):
    raise ConnectionError("bar")

prints a span containing:

{
  "status": {
    "status_code": "ERROR",
    "description": "ConnectionError: bar"
  },
  "attributes": {},
  "events": [
    {
      "name": "exception",
      "timestamp": "2025-02-07T10:35:52.726398Z",
      "attributes": {
        "exception.type": "requests.exceptions.ConnectionError",
        "exception.message": "bar",
        "exception.stacktrace": "Traceback...",
        "exception.escaped": "False"
      }
    }
  ]
}

Am I correct that the goal is to instead emit the following?

{
  "status": {
    "status_code": "ERROR",
    "description": "bar"
  },
  "attributes": {
    "error.type": "requests.exceptions.ConnectionError"
  }
}

The differences being:

  1. Span events are gone, and the stacktrace will only be found in a child event-log.
  2. The span event attribute exception.type containing the fully qualified exception type name is now in the span attribute error.type
  3. The status description no longer contains the (unqualified) exception name.
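To make the three differences concrete, here is a stdlib-only Python sketch that builds the span shape asked about above, plus a separate log record carrying the stack trace. It encodes this comment's reading of the proposal, not a confirmed outcome of the OTEP. The helper names and the `severity_text` field are invented for the example, and dropping the `builtins` module prefix is an assumption about how type names would be rendered.

```python
import json
import traceback


def qualified_type_name(exc):
    """Fully qualified exception type; assumes the builtins prefix is dropped."""
    cls = type(exc)
    if cls.__module__ and cls.__module__ != "builtins":
        return f"{cls.__module__}.{cls.__qualname__}"
    return cls.__qualname__


def proposed_span_fields(exc):
    """Span shape from the question above: no events, error.type as an attribute."""
    return {
        "status": {"status_code": "ERROR", "description": str(exc)},
        "attributes": {"error.type": qualified_type_name(exc)},
    }


def exception_log_record(exc):
    """Separate log record that would carry the stack trace instead of a span event."""
    stacktrace = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return {
        "severity_text": "ERROR",
        "attributes": {
            "exception.type": qualified_type_name(exc),
            "exception.message": str(exc),
            "exception.stacktrace": stacktrace,
        },
    }


try:
    json.loads("{")  # raises json.decoder.JSONDecodeError
except json.JSONDecodeError as exc:
    caught = exc

proposed = proposed_span_fields(caught)
log = exception_log_record(caught)
print(proposed["attributes"]["error.type"])
print(qualified_type_name(ValueError("bar")))
```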

@ChristopherKlinge

Hi,

I have a question regarding the proposed error.type attribute. As far as I can tell from the current spec and this proposal, this is the "technical" type of the exception (e.g. java.lang.NullPointerException). How is this supposed to translate to error codes for business errors?

In a more traditional logging-based stack, when logging an exception, the developer should add a human-readable error message explaining the underlying exception's effect on the current process. Without these messages, an observer has a very hard time figuring out what broke from a business perspective.

{
  "level": "ERROR",
  "message": "Could not generate invoice for order XYZ.",
  "exception": "java.lang.NullPointerException",
  "stack_trace": "..."
}

If I understand this OTEP correctly, this log message would not change much and the corresponding trace would look like this:

{
  "status": {
    "status_code": "ERROR",
    "description": "Could not generate invoice for order XYZ."
  },
  "attributes": {
    "error.type": "java.lang.NullPointerException"
  }
}

Is this correct?
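For reference, a minimal sketch of the scenario in the question, using only Python's standard `logging` module. `CaptureHandler`, `generate_invoice` and the `order` data are invented for the example, and the `span_fields` dictionary reflects the shape assumed in the question, not something the OTEP confirms.

```python
import logging


class CaptureHandler(logging.Handler):
    """Illustrative handler that keeps records in memory for inspection."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def generate_invoice(order):
    # Fails with TypeError when the customer is missing (None is not subscriptable).
    return "Invoice for " + order["customer"]["name"]


logger = logging.getLogger("invoice_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
capture = CaptureHandler()
logger.addHandler(capture)

order = {"id": "XYZ", "customer": None}
span_fields = None

try:
    generate_invoice(order)
except TypeError as exc:
    # Business-level message written by the developer; exception attached as context.
    logger.error("Could not generate invoice for order %s.", order["id"], exc_info=exc)
    record = capture.records[-1]
    # Span shape assumed in the question: business message as the status
    # description, technical exception type as error.type.
    span_fields = {
        "status": {"status_code": "ERROR", "description": record.getMessage()},
        "attributes": {"error.type": type(exc).__qualname__},
    }

print(span_fields)
```

The business message and the technical type end up in different fields, so neither has to be squeezed into error.type.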
