Description
In ScyllaDB, it happens from time to time that some startup/shutdown/maintenance fiber gets stuck. The first step in debugging a problem like this is finding out which fiber got stuck and where, and this usually involves repeatedly adding debug logs and hoping that the bug reproduces again. But that's time-consuming, and not guaranteed to succeed. It would be better to be able to find the stuck fiber in the coredump.
It's already possible to find chains of awaiting `seastar::task`s in the coredump, but AFAIK it's not possible to guess what points in the source code they correspond to. For coroutines, we can find the relevant coroutine, but not point out the relevant `co_await` inside it. For `then` continuations, we can try to guess the relevant function based on the type of the continuation, but it's hard (if possible at all) to do that reliably.
So maybe we could, for example, capture the program counter (`rip`) at the call sites of `then` (and its variants) and `co_await`, and put it into the awaiting task (continuation/coroutine)? Combined with debug info, this should allow us to see the source locations of the awaiting continuation points.
And it should be cheap enough. (Probably.)
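To make the idea concrete, here is a rough sketch of what capturing the call site could look like. It is not Seastar code: `toy_task`, `toy_then`, `wait_site` and `print_wait_site` are hypothetical names made up for illustration, and it assumes GCC or Clang, since `__builtin_return_address` is a compiler builtin rather than standard C++. The cost in this sketch is one pointer-sized store and one extra pointer per task.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <utility>

// Hypothetical stand-in for a suspended continuation; this is NOT seastar::task.
struct toy_task {
    std::function<void(int)> fn;      // the continuation body
    const void* wait_site = nullptr;  // code address just after the call that created it
};

// Hypothetical stand-in for then(). It must not be inlined: the return
// address of this frame is an address inside the caller, right after the
// call instruction. __builtin_return_address is a GCC/Clang builtin.
__attribute__((noinline))
toy_task toy_then(std::function<void(int)> fn) {
    toy_task t;
    t.fn = std::move(fn);
    t.wait_site = __builtin_return_address(0);
    return t;
}

// Print the recorded address so it can be resolved against debug info.
void print_wait_site(const toy_task& t) {
    assert(t.wait_site != nullptr);
    std::printf("task waiting at %p\n", t.wait_site);
}
```

With debug info available, an address recorded this way can be mapped back to a file and line, for example with `addr2line -e <binary> <address>` or gdb's `info line *<address>`. Since a return address points at the instruction after the call, it may be necessary to look up the address minus one to land on the right line. Whether the same trick works well for `co_await` (for example via a non-inlined `await_suspend`) is an assumption that would need checking.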