Description
We got reports:
- from the devops team that mainnet sampling can take more than 30 seconds
- from conduit that they observe "desyncs" with sampling on mainnet
- from mammoth testnet that sampling fails periodically
All of these recover with retries, but the instability still causes a lot of FUD, so we should consider fixing it. The best fix is to migrate off BS, but until we can afford that, we should look into the simplest fixes for BS.
My hypothesis is that this comes from the way we use BS, which differs from the canonical usage. We keep a session pool with long-lived sessions, whereas the canonical way is to use short-lived sessions. If we move to the canonical way, e.g. a session per sampling height, I believe things should stabilize, as we would be hitting the canonical code paths of IPFS, which are well battle-tested. In theory this should be less performant than pooling, causing more WANT_HAVE messages on the network and higher sampling latency, but the success rate should at least become stable.
If this doesn't work, the next avenue to look at is prioritization.