Let SDK users turn off panic reporting #8368

@natefinch

Problem Statement

At work, we use a lot of spans, which is great for observability. However, we've noticed a problem: when we get a panic, our stack traces end up looking like this:

Unwrapped error: panic caught in middleware: panic: runtime error: invalid memory address or nil pointer dereference
stacktrace:
goroutine 4177365 [running]:
runtime/debug.Stack()
	/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/debug/stack.go:26 +0x5e
<elided>/middleware/recovery.(*Recovery).Handler-fm.(*Recovery).Handler.func1.1()
	<elided>/middleware/recovery/recovery.go:30 +0x59
panic({0x262c020?, 0x5872b50?})
	/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/panic.go:860 +0x13a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
	/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:478 +0x1b
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x23ac7988780, {0x0, 0x0, 0x23a88659038?})
	/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:528 +0xc7b
panic({0x262c020?, 0x5872b50?})
	/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/panic.go:860 +0x13a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
	/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:478 +0x1b
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x23aa819ab40, {0x0, 0x0, 0x23a8294e960?})
	/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:528 +0xc7b
panic({0x262c020?, 0x5872b50?})
	/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/panic.go:860 +0x13a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
	/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:478 +0x1b
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x23aa819ad20, {0x0, 0x0, 0x23a8294e960?})
	/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:528 +0xc7b
panic({0x262c020?, 0x5872b50?})
	/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/panic.go:860 +0x13a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
	/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:478 +0x1b
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x23aa819b860, {0x0, 0x0, 0x23a8294e960?})
	/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:528 +0xc7b

And like 100 more of these.

There are so many of these that when we get the stack trace of the panic, Go cuts out the middle of it with the message ...106 frames elided...

Go tries to be smart and remove the middle of the stack trace, on the idea that the top and bottom are usually the most interesting parts. Unfortunately, because OpenTelemetry tacks so much noise onto the top of the stack, we can't see the line where the panic happened or anything near it. The bottom of the stack is the HTTP server and middleware, then 106 frames are elided, and then comes the otel noise.

Proposed Solution

Let panic recording be optional. For how we use otel, it's not terribly useful for us to have that info on our spans. The noise it adds makes our stack traces useless, and those we actually do use.

It wouldn't be hard to include an option that was just "CheckPanics(bool)". In the span end code, if that boolean is false, don't call recover().
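As a rough sketch of what I mean, here is a toy version with no otel imports. Everything in it is hypothetical: `toySpan`, `SpanOption`, `newToySpan`, and this `CheckPanics` are made-up names for illustration, not existing SDK API, and the real option would presumably live wherever the SDK maintainers think it fits.

```go
package main

import "fmt"

// toySpan is a hypothetical stand-in for an SDK span; it is not otel code.
type toySpan struct {
	checkPanics bool     // hypothetical setting from the proposal
	events      []string // what the span recorded
	ended       bool
}

// SpanOption configures a toySpan (hypothetical, for illustration only).
type SpanOption func(*toySpan)

// CheckPanics mirrors the proposed "CheckPanics(bool)" option.
func CheckPanics(enabled bool) SpanOption {
	return func(s *toySpan) { s.checkPanics = enabled }
}

// newToySpan builds a span; panic checking defaults to on, as today.
func newToySpan(opts ...SpanOption) *toySpan {
	s := &toySpan{checkPanics: true}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// End records an in-flight panic only when checkPanics is set. When it is
// unset, recover is never called and the panic passes through untouched.
//
//go:noinline
func (s *toySpan) End() {
	if s.checkPanics {
		if r := recover(); r != nil {
			defer panic(r) // record, then let the panic continue
			s.events = append(s.events, fmt.Sprintf("panic: %v", r))
		}
	}
	s.ended = true
}

// run defers End directly, panics, and reports whether the panic still
// reached the caller of the instrumented code.
func run(s *toySpan) (propagated bool) {
	defer func() { propagated = recover() != nil }()
	defer s.End()
	panic("boom")
}

func main() {
	on, off := newToySpan(), newToySpan(CheckPanics(false))
	fmt.Println(run(on), len(on.events), on.ended)
	fmt.Println(run(off), len(off.events), off.ended)
}
```

In both cases the panic still reaches the caller and the span still ends. The only difference is whether the span recovers, records, and re-panics along the way.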

Alternatives

Copilot came up with a clever (but obscure) workaround, and it's what I'd suggest for anyone else who is also experiencing this issue before there is another solution:

Wrap otel spans in your own type with its own .End() method, defer that method in your functions, and have it call otel's .End() under the hood (don't just embed otel's type). Put a //go:noinline directive on it so the compiler doesn't inline the wrapper away.

That way otel's .End() is no longer the deferred function itself when a panic is hit. Go's recover only returns a non-nil value when it is called directly by a deferred function, so when the wrapper's .End calls otel's .End, otel's recover returns nil. It doesn't recover and re-panic, and it doesn't clog up the stack trace.

The wrapper looks basically like this:

import (
	"go.opentelemetry.io/otel/trace"
	"go.opentelemetry.io/otel/trace/embedded"
)

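// noRecoverSpan wraps a trace.Span so that the SDK's End is never the
// deferred function itself.
type noRecoverSpan struct {
	// embedded.Span satisfies the embedding requirement of trace.Span.
	embedded.Span
	// span is the real span, held in a named field rather than embedded.
	span trace.Span
}

// End forwards to the wrapped span from one call frame further down.
//
//go:noinline
func (s *noRecoverSpan) End(options ...trace.SpanEndOption) {
	s.span.End(options...)
}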

// and then the rest of the functions just pass through to s.span as well

The //go:noinline is important to keep the compiler from just inlining this code, which would defeat the purpose of putting the underlying span.End inside our function, and thus out of the direct line of the panic.
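If you want to see the effect without pulling in otel, here is a small standard-library-only sketch. `fakeSpan` and `wrappedSpan` are hypothetical stand-ins I made up for the demo; `fakeSpan.End` only assumes that the real End calls recover directly in its own body.

```go
package main

import "fmt"

// fakeSpan is a hypothetical stand-in for an SDK span; it is not otel code.
type fakeSpan struct {
	sawPanic bool
}

// End calls recover in its own body, so it can only observe a panic when
// End itself is the deferred function.
//
//go:noinline
func (s *fakeSpan) End() {
	if r := recover(); r != nil {
		s.sawPanic = true
		panic(r) // re-raise so the panic keeps propagating
	}
}

// wrappedSpan puts one extra call frame between the defer and End.
type wrappedSpan struct {
	inner *fakeSpan
}

//go:noinline
func (w *wrappedSpan) End() {
	w.inner.End()
}

// runDirect defers the fake span's End directly and then panics.
func runDirect() (sawPanic bool) {
	s := &fakeSpan{}
	defer func() {
		recover() // stop the panic here so the demo can continue
		sawPanic = s.sawPanic
	}()
	defer s.End()
	panic("boom")
}

// runWrapped defers the wrapper's End instead and then panics.
func runWrapped() (sawPanic bool) {
	s := &fakeSpan{}
	w := &wrappedSpan{inner: s}
	defer func() {
		recover() // stop the panic here so the demo can continue
		sawPanic = s.sawPanic
	}()
	defer w.End()
	panic("boom")
}

func main() {
	fmt.Println("direct defer saw the panic: ", runDirect())
	fmt.Println("wrapped defer saw the panic:", runWrapped())
}
```

Going by the recover rule in the Go spec, the direct case should report true and the wrapped case false, which is the same behavior the wrapper above relies on.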

This is a hack, and it works, but it's super unintuitive, and I really wish I could just configure the otel sdk to not try to recover panics.

Labels: enhancement (New feature or request)
