Looking for stuck/deadlocked fibers could be easier #2381

Open
@michoecho

In ScyllaDB, it happens from time to time that some startup/shutdown/maintenance fiber gets stuck. The first step in debugging a problem like this is finding out which fiber got stuck and where, which usually involves repeatedly adding debug logs and hoping that the bug reproduces. But that's time-consuming and not guaranteed to succeed. It would be better to be able to find the stuck fiber in the coredump.

It's already possible to find chains of awaiting `seastar::task`s in the coredump, but AFAIK it's not possible to tell which points in the source code they correspond to. For coroutines, we can find the relevant coroutine, but can't pinpoint the relevant `co_await` inside it. For `then` continuations, we can try to guess the relevant function from the type of the continuation, but it's hard (if possible at all) to do that reliably.

So maybe we could, for example, capture the program counter (`rip`) at the call sites of `then` (and its variants) and `co_await`, and store it in the awaiting task (continuation/coroutine)? Combined with debug info, this should let us see the source locations of the awaiting continuation points.

And it should be cheap enough. (Probably.)
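
A minimal sketch of the capture idea, assuming GCC or Clang (for `__builtin_return_address` and `[[gnu::noinline]]`). `toy_task`, `attach_continuation` and `print_attach_site` are hypothetical names made up for illustration; none of this is Seastar's actual API, and a real change would have to fit into `seastar::task` and the `then`/`co_await` machinery.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <utility>

// Hypothetical stand-in for a continuation task; this is NOT seastar::task.
struct toy_task {
    const void* attach_pc = nullptr;  // code address right after the attaching call
    std::function<void()> body;       // the continuation to run later
};

// Hypothetical stand-in for then(). It must not be inlined: once inlined,
// __builtin_return_address(0) would describe an outer frame instead of the
// call site we want to record.
[[gnu::noinline]] toy_task attach_continuation(std::function<void()> body) {
    assert(body && "continuation must be callable");
    toy_task t;
    // GCC/Clang builtin: the address this call returns to, i.e. the
    // instruction that follows the call in the caller.
    t.attach_pc = __builtin_return_address(0);
    t.body = std::move(body);
    return t;
}

// Print the recorded address so it can be handed to a symbolizer.
inline void print_attach_site(const toy_task& t) {
    std::printf("continuation attached at %p\n", const_cast<void*>(t.attach_pc));
}
```

In this sketch the cost is one extra pointer per task and one store per attach. With debug info available, an address like this can be mapped back to a source line, for example with gdb's `info line *ADDRESS`. Since the recorded value is a return address, it points just past the call and may resolve to the following statement; symbolizers commonly look up the address minus one for that reason.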
