
websocket proxy: ping/pong downstream clients and proactively disconnect #264


Closed

Conversation

@haardikk21 (Contributor) commented May 28, 2025

Has the ws proxy send ping frames to downstream clients periodically. If a pong isn't received back within a specified timeout (10s), it disconnects the client.

Previously, we only disconnected clients when message sending failed, but that only catches one-way networking issues.

There are a variety of cases where clients continue to receive packets but aren't actually (able to) process them, adding unnecessary overhead to this service. Devices going to sleep, battery-saving/low-priority mode in browsers, client-side processing blocked by slow code, etc. can all leave the TCP connection open and receiving packets while the client is unable to process the data.

By pinging, we ensure that clients are not just receiving messages but also actively processing them, since they need to respond with a pong.
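A rough sketch of the loop this describes (not the PR's actual code), assuming tokio and tokio-tungstenite; the constant names, values, and `serve_client` function are illustrative only:

```rust
use std::time::{Duration, Instant};

use futures_util::{SinkExt, StreamExt};
use tokio::net::TcpStream;
use tokio_tungstenite::{tungstenite::Message, WebSocketStream};

// Placeholder values; the PR hardcodes a 10s pong timeout.
const PING_INTERVAL: Duration = Duration::from_secs(10);
const PONG_TIMEOUT: Duration = Duration::from_secs(10);

async fn serve_client(ws: WebSocketStream<TcpStream>) {
    let (mut sink, mut stream) = ws.split();
    let mut ping_timer = tokio::time::interval(PING_INTERVAL);
    let mut last_pong = Instant::now();

    loop {
        tokio::select! {
            _ = ping_timer.tick() => {
                // Simplified: the PR uses a separate periodic health check;
                // here we check at each ping tick instead.
                if last_pong.elapsed() > PING_INTERVAL + PONG_TIMEOUT {
                    let _ = sink.close().await; // proactively disconnect
                    break;
                }
                if sink.send(Message::Ping(Vec::<u8>::new().into())).await.is_err() {
                    break; // send failure: the connection is already gone
                }
            }
            msg = stream.next() => {
                match msg {
                    Some(Ok(Message::Pong(_))) => last_pong = Instant::now(),
                    Some(Ok(_)) => { /* regular client traffic */ }
                    _ => break, // read error or stream closed
                }
            }
        }
    }
}
```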


@angel-ding-cb (Contributor) commented May 28, 2025

I am not a Rust expert and I am not familiar with the ping/pong conventions in Rust, but I think the code can be improved in a couple of ways.

1. The health check logic is redundant in this implementation. The only scenario where `last_pong` wouldn't be updated is if the client stops responding to pings. But here's what happens in that case:
   * Ping is sent → no pong response from the client
   * `client.recv()` continues waiting for the pong (or any message)
   * Next ping timer fires → `client.ping().await` is called
   * Ping will fail because the connection is broken/unresponsive
   * Loop breaks immediately on the ping error

Meanwhile, in your implementation, even if the pong never arrives, `client.ping().await` might not fail immediately. The client could enter a zombie state where it continues to receive pings but never responds with a pong. I suggest simplifying this logic by making `client.ping()` wait for the pong response. The overall flow should look something like this:

1. Ping timer fires → calls `client.ping_and_wait(PONG_TIMEOUT)`
2. Ping sent → creates a oneshot channel and sends the ping frame
3. Waits for pong → times out after x seconds if no pong is received
4. Pong received → `handle_pong()` notifies the waiting ping
5. Success or failure → ping returns `Ok(())` or an error, breaking the loop on failure

In this case, we could have cleaner code with no periodic health checks every 2 seconds.
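A minimal sketch of how the suggested `ping_and_wait` / `handle_pong` pairing could be wired together with a oneshot channel and `tokio::time::timeout`; everything besides those two names is a placeholder, not rollup-boost's actual API:

```rust
use std::time::Duration;
use tokio::sync::oneshot;

#[derive(Default)]
struct PongWaiter {
    // Filled when a ping is in flight; consumed when the pong arrives.
    pending: Option<oneshot::Sender<()>>,
}

impl PongWaiter {
    /// Arm the waiter just before sending a ping frame.
    fn arm(&mut self) -> oneshot::Receiver<()> {
        let (tx, rx) = oneshot::channel();
        self.pending = Some(tx);
        rx
    }

    /// Called from the receive path whenever a Pong frame arrives.
    fn handle_pong(&mut self) {
        if let Some(tx) = self.pending.take() {
            let _ = tx.send(());
        }
    }
}

/// Ping-and-wait as the caller would use it: arm the waiter, send the ping,
/// then wait for the pong with a deadline; on error the caller breaks the loop.
async fn ping_and_wait(
    waiter: &mut PongWaiter,
    pong_timeout: Duration,
) -> Result<(), &'static str> {
    let pong = waiter.arm();
    // ... send the actual Ping frame on the socket here ...
    match tokio::time::timeout(pong_timeout, pong).await {
        Ok(Ok(())) => Ok(()),                                   // pong received in time
        Ok(Err(_)) => Err("connection task dropped the pong sender"),
        Err(_) => Err("no pong within the timeout"),
    }
}
```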

@angel-ding-cb (Contributor)

2. I have some personal opinions (not strongly held) on the ping/pong mechanism, as I am basically implementing the same thing in the conductors. I think the ping/pong mechanism should be observational (monitoring health) rather than operational (managing connections). We can afford to keep all the connections open even if they don't respond with a pong within 12 seconds (in your health check case). When we actually deploy this to mainnet, the client of rollup-boost will be one conductor, so even if it is not responsive, we can still keep the connection open. We just need to log error messages instead of closing the connections, which might result in deadlocks.

For more context on the architecture: we will have 5 conductors and 5 rollup-boosts (1:1 mapping), and conductor health is managed separately by the conductor health-check logic, which means unhealthy conductors won't become leaders anyway.

If we only log warnings, we can avoid false positives: network hiccup → missed pong → connection killed → healthy conductor unnecessarily disconnected.

@angel-ding-cb (Contributor)

3. This is more of a question than a comment, since I am not familiar with the Rust conventions for ping/pong mechanisms. In my Go implementation in the conductors, I wanted to make sure it can send 2–3 pings before timing the client out. My pong read timeout is 60s and my ping interval is 30 seconds. I am worried that a 10s ping might be too frequent, but I will defer to you!

@danyalprout (Contributor)

I agree with the `client.ping_and_wait(PONG_TIMEOUT)` approach.

> We can afford to keep all the connections open even if they don't respond with a pong within 12 seconds (in your health check case).

@angel-ding-cb I don't think that is true for the websocket proxy. We could have a huge number of clients.

@haardikk21 (Contributor, Author)

> I am not a Rust expert and I am not familiar with the ping/pong conventions in Rust, but I think the code can be improved in a couple of ways.
>
> 1. The health check logic is redundant in this implementation. The only scenario where `last_pong` wouldn't be updated is if the client stops responding to pings. But here's what happens in that case:
>
>    * Ping is sent → no pong response from the client
>    * `client.recv()` continues waiting for the pong (or any message)
>    * Next ping timer fires → `client.ping().await` is called
>    * Ping will fail because the connection is broken/unresponsive
>    * Loop breaks immediately on the ping error
>
> Meanwhile, in your implementation, even if the pong never arrives, `client.ping().await` might not fail immediately. The client could enter a zombie state where it continues to receive pings but never responds with a pong. I suggest simplifying this logic by making `client.ping()` wait for the pong response. The overall flow should look something like this:
>
> 1. Ping timer fires → calls `client.ping_and_wait(PONG_TIMEOUT)`
> 2. Ping sent → creates a oneshot channel and sends the ping frame
> 3. Waits for pong → times out after x seconds if no pong is received
> 4. Pong received → `handle_pong()` notifies the waiting ping
> 5. Success or failure → ping returns `Ok(())` or an error, breaking the loop on failure
>
> In this case, we could have cleaner code with no periodic health checks every 2 seconds.

It's important to note here that pings can definitely continue to be sent without the client ponging back. It's the same as our flashblocks messages continuing to be sent to them even if they're not actually processing/active. The cases in which this happens are the same: device put to sleep, CPU throttling processes, browsers sleeping their tabs, etc.

So it's not good enough to just assume that the next ping will fail; the health check therefore expects a pong or forces a disconnect.

The `ping_and_wait` approach works though; I can switch to that instead.


> 3. This is more of a question than a comment, since I am not familiar with the Rust conventions for ping/pong mechanisms. In my Go implementation in the conductors, I wanted to make sure it can send 2–3 pings before timing the client out. My pong read timeout is 60s and my ping interval is 30 seconds. I am worried that a 10s ping might be too frequent, but I will defer to you!

I'm considering making these part of the config/env instead of hardcoding them so we can play with the values. I'm not too sure what the right values should be atm either.
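For illustration, something like the following could make the timings configurable; the env var names and defaults here are made up, not actual rollup-boost config:

```rust
use std::time::Duration;

/// Read a duration (in seconds) from an env var, falling back to a default.
fn duration_from_env(var: &str, default_secs: u64) -> Duration {
    std::env::var(var)
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(Duration::from_secs(default_secs))
}

/// Hypothetical ping/pong timing config; defaults mirror the values discussed
/// in this PR (10s ping interval, 10s pong timeout).
fn ping_config() -> (Duration, Duration) {
    let ping_interval = duration_from_env("WS_PING_INTERVAL_SECS", 10);
    let pong_timeout = duration_from_env("WS_PONG_TIMEOUT_SECS", 10);
    (ping_interval, pong_timeout)
}
```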

@angel-ding-cb (Contributor)

@danyalprout I agree that for the websocket proxy we could have a huge number of clients; that's why the conductor will close out stale/unresponsive connections gracefully. Since, in the final state, rollup-boost will only stream messages from the builder to the conductor, I am worried that closing the connections will interrupt the flow when there is a false positive. But this is not a huge issue, since the conductor will retry connecting to rollup-boost indefinitely if it ever loses the connection. So I will defer to you @haardikk21.

@danyalprout (Contributor)

@angel-ding-cb I believe this code is for managing connections to downstream clients from the websocket proxy (i.e. the nodes).

Unless I'm missing something, I don't think it'll impact the conductor setup at all

@haardikk21 closed this May 28, 2025