Looking for stuck/deadlocked fibers could be easier #2381

In ScyllaDB, it happens from time to time that some startup/shutdown/maintenance fiber gets stuck. The first step in debugging a problem like this is finding out which fiber got stuck and where, which usually involves repeatedly adding debug logs and hoping that the bug reproduces. That is time-consuming and not guaranteed to succeed. It would be better to be able to find the stuck fiber in a coredump.

It's already possible to find chains of awaiting `seastar::task`s in the coredump, but AFAIK there is no way to tell which points in the source code they correspond to. For coroutines, we can find the relevant coroutine, but not identify the relevant `co_await` inside it. For `then` continuations, we can try to guess the relevant function based on the *type* of the continuation, but it's hard (if possible at all) to do that reliably.

So maybe we could, for example, capture the program counter (`rip` on x86-64) at the call sites of `then` (and its variants) and `co_await`, and store it in the awaiting task (continuation/coroutine)? Combined with debug info, this should allow us to see the source locations of the awaiting continuation points.

And it should be cheap enough. (Probably.)
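
To make the idea concrete, here is a rough sketch of the `then` half. This is not Seastar code: `traced_task`, `toy_future`, `wait_site` and `dump_wait_sites` are hypothetical names made up for illustration. It assumes GCC or Clang, and uses `__builtin_return_address(0)` in a non-inlined `then` as a stand-in for "capture the program counter at the call site".

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for seastar::task: the callback plus the program
// counter of the call site that registered it.
struct traced_task {
    std::function<void()> fn;
    void* wait_site;
};

// Hypothetical stand-in for a future with pending continuations.
struct toy_future {
    std::vector<traced_task> waiters;

    // noinline keeps a real call frame, so the return address lies in the
    // caller, just past the call instruction (GCC/Clang builtin).
    __attribute__((noinline)) void then(std::function<void()> fn) {
        assert(fn && "continuation must be callable");
        waiters.push_back(traced_task{std::move(fn), __builtin_return_address(0)});
    }
};

// Prints the recorded call-site address of every pending continuation.
inline void dump_wait_sites(const toy_future& f) {
    for (const traced_task& t : f.waiters) {
        std::printf("waiting task registered at %p\n", t.wait_site);
    }
}
```

In a coredump, the stored address could then be resolved with something like gdb's `info line *<address>`. Since a return address points just past the call instruction, it may resolve to the following source line, so looking up `address - 1` is a common workaround. The real `then` is a template that is meant to be inlined, so an actual implementation would probably have to read the program counter some other way than forcing `noinline`; that part is left open here.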
