
Conversation

@lexnv (Collaborator) commented Jan 12, 2026

The Connection::poll_next implementation needlessly created an async block only to drop it when returning Poll::Pending. Instead, this PR polls the async_rx and sync_rx receivers manually. Since all Substrate implementations use sync_rx, it takes priority (unlike the previous tokio::select!, which polled the receivers fairly); a rough sketch of the new shape is included below.

Discovered during investigation of:

cc @paritytech/networking
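
For illustration, the new shape is roughly the following sketch (the field names, item type, and Stream impl shown here are assumptions for the example, not the actual litep2p code):

    use std::{
        pin::Pin,
        task::{Context, Poll},
    };

    use futures::Stream;
    use tokio::sync::mpsc::Receiver;

    /// Simplified stand-in for the connection type; field names are assumptions.
    struct Connection {
        sync_rx: Receiver<u64>,
        async_rx: Receiver<u64>,
    }

    impl Stream for Connection {
        type Item = u64;

        fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
            let this = self.get_mut();

            // Poll `sync_rx` first, since it carries virtually all Substrate traffic.
            if let Poll::Ready(Some(notification)) = this.sync_rx.poll_recv(cx) {
                return Poll::Ready(Some(notification));
            }

            // Fall through to `async_rx`; both `poll_recv` calls register this task's waker.
            if let Poll::Ready(Some(notification)) = this.async_rx.poll_recv(cx) {
                return Poll::Ready(Some(notification));
            }

            // Channel-closed handling (`Poll::Ready(None)`) is elided in this sketch.
            Poll::Pending
        }
    }

The point of polling with poll_recv directly is that no temporary future has to be created and dropped on every call to poll_next.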

@lexnv self-assigned this Jan 12, 2026
@lexnv added the bug label Jan 12, 2026

    match future.poll_unpin(cx) {
    Poll::Pending => None,
    None => match this.async_rx.poll_recv(cx) {
tokio::select! chooses randomly which future gets polled first; now async_rx is always polled first, so sync_rx could potentially never be polled. Is fairness necessary here, or is this intentional?

Collaborator Author
Yep, that's a fair point :D In Substrate we are using exclusively sync_rx; IIRC there's no usage of async_rx atm. Will change the order, thanks

Collaborator
Thinking out loud: if the more heavily loaded receiver is sync_rx, we should poll async_rx first, so that we fall through to polling sync_rx as well. In the opposite order we might end up polling only sync_rx, which is always loaded, and starve async_rx.

Maybe we can keep tokio::select!, which does the polling randomization internally?
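
For reference, the tokio::select!-based shape under discussion looks roughly like this (a sketch only; the function name, parameter types, and item type are assumptions). By default tokio::select! picks a random branch to poll first, which is what keeps the two receivers fair:

    use tokio::sync::mpsc::Receiver;

    // Sketch of the `tokio::select!` variant; randomization of the branch order
    // happens inside the macro unless the `biased;` modifier is used.
    async fn next_notification(
        sync_rx: &mut Receiver<u64>,
        async_rx: &mut Receiver<u64>,
    ) -> Option<u64> {
        tokio::select! {
            notification = async_rx.recv() => notification,
            notification = sync_rx.recv() => notification,
        }
    }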

Collaborator Author
That makes sense. We can close this for now; it's no longer needed if we keep the select! 🙏

    None => {
        let future = async {
            tokio::select! {
                notification = this.async_rx.recv() => notification,
Member
https://docs.rs/tokio/1.49.0/tokio/sync/mpsc/struct.Receiver.html#cancel-safety

But this is cancel safe, so I don't follow your argument.

The wakeup should still be registered, and the entire future polled again when there is an event?

Collaborator Author
Yep, the receivers should be cancel safe. The issue here is with the let future = async { }: the context waker is registered by the inner recv calls inside that temporary future, and the future is later dropped if future.poll_unpin returns Poll::Pending. Then, when the sync_rx got a new notification, it would wake the waker corresponding to the dropped future, causing the poll_next to stall.
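
For context, the pattern being described is roughly the following (a simplified, self-contained sketch; the function name, parameter types, and the Box::pin pinning are assumptions for illustration, not the actual Connection::poll_next code):

    use futures::FutureExt;
    use std::task::{Context, Poll};
    use tokio::sync::mpsc::Receiver;

    // A fresh async block is built on every poll, polled once, and dropped
    // again if it is not ready. Whether dropping it loses the registered waker
    // depends on the receiver internals, which is resolved later in this thread.
    fn poll_once(
        cx: &mut Context<'_>,
        sync_rx: &mut Receiver<u64>,
        async_rx: &mut Receiver<u64>,
    ) -> Poll<Option<u64>> {
        let mut future = Box::pin(async {
            tokio::select! {
                notification = async_rx.recv() => notification,
                notification = sync_rx.recv() => notification,
            }
        });

        match future.poll_unpin(cx) {
            Poll::Ready(notification) => Poll::Ready(notification),
            // The temporary future is dropped here on `Pending`.
            Poll::Pending => Poll::Pending,
        }
    }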

Member
> Then, when the sync_rx got a new notification, it would wake the waker corresponding to the dropped future, causing the poll_next to stall.

But the waker is for the entire task and not just the future. So, the waker just wakes up the entire task and not some particular future.

Collaborator Author
I've dug a bit into tokio to figure this out. Indeed, I was mistaken about the "stalled connection", because I assumed recv worked similarly to reserve (the initial issue we noticed in webrtc):

  • Tokio's bounded Receiver keeps a waiting list of context wakers via a wrapper over the Semaphore implementation

  • I assumed that sync_rx.recv() would call into the semaphore's acquire (or similar) to place the context waker into that linked list (obtaining an Acquire)

    • Because the sync_rx.recv() future would get dropped immediately, the waker would be removed from the linked list on Drop
    • When the notification is received, there would be no registered waker left in the list

However, the semaphore is only used for capacity. When we call into recv, the Receiver stores the waker into a separate variable:

    /// Receiver waker. Notified when a value is pushed into the channel.
    rx_waker: CachePadded<AtomicWaker>,

    // inside `recv`:
    self.inner.rx_waker.register_by_ref(cx.waker());

So regardless of whether the temporary future gets dropped, we still wake the proper waker under the hood. This PR just turns into a tiny optimization that avoids creating and dropping a dedicated async block :D
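
A minimal check of this understanding (assuming tokio's bounded channel behaves as described above; the test name and setup are just for the sketch) could look like:

    use futures::FutureExt;
    use std::future::poll_fn;
    use std::task::Poll;

    // Dropping a pending `recv()` future should not prevent the task from being
    // woken, because the waker lives in the channel's `rx_waker`, not in the
    // dropped future.
    #[tokio::test]
    async fn dropped_recv_future_still_wakes_the_task() {
        let (tx, mut rx) = tokio::sync::mpsc::channel::<u32>(1);

        let handle = tokio::spawn(async move {
            poll_fn(|cx| {
                // Build a fresh `recv()` future on every poll and drop it when
                // pending, like the temporary async block in `poll_next`.
                let mut recv = Box::pin(rx.recv());
                let poll = recv.poll_unpin(cx);
                // `recv` is dropped here, even when it returned `Pending`.
                poll
            })
            .await
        });

        // Give the task a chance to register its waker, then push a notification.
        tokio::task::yield_now().await;
        tx.send(42).await.unwrap();

        assert_eq!(handle.await.unwrap(), Some(42));
    }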

@lexnv changed the title from "notification: Fix stalled connection due to dropped wakers in poll_next" to "notification: Replace async block with poll_recv" Jan 14, 2026
@lexnv closed this Jan 21, 2026