Problem Statement
At work, we use a lot of spans, which is great for observability. However, we've noticed a problem - when we get a panic, our stack traces end up looking like this:
Unwrapped error: panic caught in middleware: panic: runtime error: invalid memory address or nil pointer dereference
stacktrace:
goroutine 4177365 [running]:
runtime/debug.Stack()
/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/debug/stack.go:26 +0x5e
<elided>/middleware/recovery.(*Recovery).Handler-fm.(*Recovery).Handler.func1.1()
<elided>/middleware/recovery/recovery.go:30 +0x59
panic({0x262c020?, 0x5872b50?})
/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/panic.go:860 +0x13a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:478 +0x1b
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x23ac7988780, {0x0, 0x0, 0x23a88659038?})
/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:528 +0xc7b
panic({0x262c020?, 0x5872b50?})
/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/panic.go:860 +0x13a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:478 +0x1b
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x23aa819ab40, {0x0, 0x0, 0x23a8294e960?})
/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:528 +0xc7b
panic({0x262c020?, 0x5872b50?})
/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/panic.go:860 +0x13a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:478 +0x1b
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x23aa819ad20, {0x0, 0x0, 0x23a8294e960?})
/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:528 +0xc7b
panic({0x262c020?, 0x5872b50?})
/root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.26.3.linux-amd64/src/runtime/panic.go:860 +0x13a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.deferwrap1()
/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:478 +0x1b
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x23aa819b860, {0x0, 0x0, 0x23a8294e960?})
/app/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:528 +0xc7b
...followed by roughly a hundred more of the same.
There are so many of these frames that when we get the stack trace of the panic, Go cuts the middle out of it and replaces it with the message ...106 frames elided...
Go does this on the assumption that the top and bottom of a stack trace are usually the most interesting parts.
Unfortunately, because OpenTelemetry piles so much noise onto the top of the stack, we can't see the line where the panic happened, or anything near it. The bottom of the stack is the HTTP server and middleware, then comes the "106 frames elided" marker, and then the otel noise.
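For anyone who wants to see the effect without otel in the picture, here is a minimal sketch that reproduces the shape of the trace above. It assumes the pattern visible in the trace (a deferred End that recovers and then re-panics); endSpan, nested, captureTrace and countPanicFrames are made-up names for illustration, not SDK code.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"strings"
)

// endSpan mimics the recover-then-re-panic pattern seen in the trace.
// It is a stand-in for illustration, not the SDK's actual code.
func endSpan() {
	if r := recover(); r != nil {
		// Re-raise the panic once this deferred call finishes.
		defer panic(r)
	}
}

// nested opens one "span" per level and panics at the innermost level.
func nested(depth int) {
	defer endSpan()
	if depth == 0 {
		panic("boom")
	}
	nested(depth - 1)
}

// captureTrace runs nested and returns the stack as a recovery
// middleware would see it.
func captureTrace(depth int) (trace string) {
	defer func() {
		if r := recover(); r != nil {
			trace = string(debug.Stack())
		}
	}()
	nested(depth)
	return trace
}

// countPanicFrames counts the stack entries that are calls to panic.
func countPanicFrames(trace string) int {
	n := 0
	for _, line := range strings.Split(trace, "\n") {
		if strings.HasPrefix(line, "panic(") {
			n++
		}
	}
	return n
}

func main() {
	for _, depth := range []int{1, 3, 10} {
		fmt.Printf("span depth %d: %d panic frames\n",
			depth, countPanicFrames(captureTrace(depth)))
	}
}
```

Every extra level of span nesting adds another recover/re-panic round trip to the top of the stack, which is how a deep call chain ends up pushing the real panic site into the elided middle.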
Proposed Solution
Make panic recording optional. For how we use otel, having that information on our spans isn't terribly useful, and the noise it adds makes our stack traces useless. The stack traces are something we actually do rely on.
It wouldn't be hard to add an option along the lines of "CheckPanics(bool)". In the span end code, if that boolean is false, don't call recover().
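To make the proposal concrete, here is a rough sketch of the behavior I have in mind, using a toy span type rather than the real SDK. Everything in it (span, checkPanics, recorded, run) is hypothetical; none of it is an existing otel API, and the real change would live wherever the SDK currently calls recover().

```go
package main

import "fmt"

// span is a toy stand-in for an SDK span; none of these names exist in otel.
type span struct {
	checkPanics bool // hypothetical setting from the proposal
	recorded    any  // what End captured from an in-flight panic, if anything
}

// End is meant to be deferred directly, like a real span's End.
func (s *span) End() {
	if !s.checkPanics {
		return // never call recover(), so the panic passes through untouched
	}
	if r := recover(); r != nil {
		s.recorded = r
		defer panic(r) // record, then re-raise
	}
}

// run panics with a deferred End and reports what the span recorded.
func run(checkPanics bool) (recorded any) {
	s := &span{checkPanics: checkPanics}
	defer func() {
		_ = recover() // stop the panic so the demo can continue
		recorded = s.recorded
	}()
	func() {
		defer s.End()
		panic("boom")
	}()
	return nil
}

func main() {
	fmt.Println(run(true), run(false)) // prints: boom <nil>
}
```

With the flag off, End adds no recover/re-panic frames at all, so the stack trace seen by the recovery middleware starts at the real panic site.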
Alternatives
Copilot came up with a clever (but obscure) workaround, and it's what I'd suggest to anyone else who runs into this issue until there is a better solution:
Wrap otel spans in your own type with its own .End() method, defer that method in your functions, and have it call otel's .End() under the hood (don't just embed otel's type). Put a //go:noinline directive on it so the compiler doesn't inline the wrapper away.
That keeps otel's .End() from being the function that is directly deferred when a panic hits. recover() only returns a non-nil value when it is called directly by the deferred function, so when the wrapper's .End() calls otel's .End(), otel's recover() returns nil. Otel then doesn't recover and re-panic, and doesn't clog up the stack trace.
The wrapper looks basically like this:
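import (
    "go.opentelemetry.io/otel/trace"
    "go.opentelemetry.io/otel/trace/embedded"
)

// noRecoverSpan wraps a span so that otel's End is never the deferred function itself.
type noRecoverSpan struct {
    embedded.Span            // needed to satisfy the trace.Span interface
    span          trace.Span // the real span, held as a field rather than embedded
}

//go:noinline
func (s *noRecoverSpan) End(options ...trace.SpanEndOption) {
    s.span.End(options...)
}

// ...and the rest of the trace.Span methods just pass through to s.span as well.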
The //go:noinline is important to keep the compiler from just inlining this code, which would defeat the purpose of putting the underlying span.End inside our function, and thus out of the direct line of the panic.
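If you want to convince yourself of the recover() behavior the workaround relies on, here is a small standalone sketch with no otel involved. innerEnd stands in for a span End that calls recover() itself, and wrappedEnd plays the role of the wrapper's End; both names, along with observe, are made up for this demo.

```go
package main

import "fmt"

// innerEnd stands in for an End method that calls recover() itself.
func innerEnd(saw *bool) {
	if r := recover(); r != nil {
		*saw = true
		defer panic(r) // re-raise, as in the trace above
	}
}

// wrappedEnd plays the role of the wrapper's End. The directive mirrors
// the one used on the wrapper.
//
//go:noinline
func wrappedEnd(saw *bool) {
	innerEnd(saw)
}

// observe defers end, panics, and reports whether innerEnd's recover()
// saw the panic.
func observe(end func(*bool)) (saw bool) {
	defer func() { _ = recover() }() // stop the panic so the demo continues
	defer end(&saw)
	panic("boom")
}

func main() {
	fmt.Println(observe(innerEnd), observe(wrappedEnd)) // prints: true false
}
```

Deferred directly, innerEnd sees the panic and re-raises it. Called through wrappedEnd, its recover() returns nil, so there is no recover/re-panic round trip and nothing extra lands on the stack.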
This is a hack, and it works, but it's super unintuitive, and I really wish I could just configure the otel sdk to not try to recover panics.