RFC: Rollup Boost HA Design #185

Open
wants to merge 36 commits into
base: main

0xKitsune
Collaborator

@0xKitsune 0xKitsune commented Apr 21, 2025

Opening this PR to migrate the design doc discussion from #181 so reviewers can leave comments directly on specific sections and suggest edits inline.

- Maintain compatibility with `op-conductor` and its sequencing assumptions.

## Non Goals
- Define how Flashblocks are handled, consumed or streamed to the network.

This is actually a very important part that should be included in this design.

Consider one scenario: in the #2 load-balancing solution below, what if the `get_payload` call returns a payload from a different builder than the one streaming Flashblocks to the public? This would cause Flashblock <-> actual block inconsistencies, which we aim to avoid as much as possible.

Should we update this now? I think we need to take this into consideration when designing HA solutions for Flashblocks.

Collaborator Author

For clarity, on the design review call last week we aligned on focusing this doc on an HA design that is forward compatible with Flashblocks without defining the full Flashblocks HA specifics here.

The plan is to fast follow with a separate document that extends the HA design detailing Flashblocks HA behavior including op-conductor integration and consistency guarantees. Happy to update and clarify to make this more explicit. Let me know if you have any additional thoughts on this.


## 1:1 Rollup Boost to Builder Deployments

In this design, each `rollup-boost` instance is configured with a single external builder and default execution client. When `op-node` sends an FCU containing payload attributes, `rollup-boost` forwards the request to both the default execution client and its paired builder. Upon receiving a `get_payload` request from `op-node`, `rollup-boost` queries both the execution client and the builder. If the builder returns a payload, it is validated via a `new_payload` request sent to the default execution client. If the builder payload is invalid or unavailable, `rollup-boost` falls back to the execution client's payload.
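
The selection rule in the paragraph above can be sketched as follows. This is a minimal illustration, not rollup-boost's actual code: `Selected`, `select_payload`, and the `is_valid` closure are hypothetical names, and the closure stands in for the `new_payload` round trip to the default execution client.

```rust
/// Which source supplied the payload returned to `op-node`.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Selected<P> {
    Builder(P),
    ExecutionClient(P),
}

/// Choose the payload to return for a `get_payload` request.
///
/// * `builder` is `None` when the builder payload is unavailable.
/// * `execution_client` is the default execution client's payload.
/// * `is_valid` stands in for validating the builder payload via
///   `new_payload` on the default execution client (an assumption of
///   this sketch; the real call is an RPC, not a closure).
pub fn select_payload<P>(
    builder: Option<P>,
    execution_client: P,
    is_valid: impl FnOnce(&P) -> bool,
) -> Selected<P> {
    if let Some(payload) = builder {
        if is_valid(&payload) {
            // Builder payload exists and passed validation.
            return Selected::Builder(payload);
        }
    }
    // Builder payload was invalid or unavailable: fall back.
    Selected::ExecutionClient(execution_client)
}
```

With this shape, the fallback path is the default: the builder payload is returned only when it is both present and validated.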

In addition to the current setup, another feature will need to be added here.

We'll need to allow conductor to proxy Flashblocks WebSocket connections / results to the public, depending on whether the current node is the leader. This is a common pattern in conductor: proxy any interaction with the system that strictly requires interacting with the leader.

@0x00101010 0x00101010 Apr 21, 2025

One thing to note here is that, in the current design, Flashblocks reorgs could occur during leadership transfer. They are much more controllable, though, and only happen during leadership transfer.

If we want to prevent reorgs, we'll need to consider committing Flashblocks to the Raft consensus for consistency during leadership transfer; that is the second phase of our planned work.

@jelias2
jelias2 commented Apr 30, 2025

  1. Flashblocks TX Ingress
  • Please correct this statement if I'm wrong, but Flashblocks receipts are provided to the end user when the tx is sent directly to the builder. If this is the case, we may want to discuss how to design ingress solutions that target the active builder without overloading it.
  2. Flashblock Data Exports
  • Base has built the flashblocks-websocket-proxy. Ideally, in a production setup this data should be exportable by every sequencer in a conductor set. Solutions may need to be built for this as well.

Comment on lines +97 to +109
## Health Checks

In high availability deployments, `op-conductor` must assess the full health of the block production path. Rollup Boost will expose a composite `/healthz` endpoint to report on both builder synchronization and payload production status. These checks allow `op-conductor` to detect degraded block building conditions and make informed leadership decisions.

Rollup Boost will continuously monitor two independent conditions to inform the health of the builder and the default execution client:

- **Builder Synchronization**:
A background task periodically queries the builder’s latest unsafe block via `engine_getBlockByNumber`. The task compares the timestamp of the returned block to the local system time. If the difference exceeds a configured maximum unsafe interval (`max_unsafe_interval`), the builder is considered out of sync. Failure to fetch a block from the builder, or detection of an outdated block timestamp, results in the health status being downgraded to Partial. If the builder is responsive and the block timestamp is within the acceptable interval, the builder is considered synchronized and healthy. Alternatively, instead of periodic polling, builder synchronization can be inferred if the builder returns a `VALID` response to a `newPayload` call forwarded from Rollup Boost.

- **Payload Production**:
During each `get_payload` request, Rollup Boost will verify payload availability from both the builder and the execution client. If the builder fails to deliver a payload, Rollup Boost will report partial health. If the execution client fails to deliver a payload, Rollup Boost will report unhealthy.
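
The two conditions above can be combined into a single status roughly as follows. This is a sketch under stated assumptions: `Health`, `builder_in_sync`, and `composite_health` are hypothetical names, timestamps are taken as Unix seconds, and only the `206` status code appears in the design text; `200` and `503` are assumed here for the healthy and unhealthy cases.

```rust
use std::time::Duration;

/// Composite health reported by `/healthz` (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Partial,
    Unhealthy,
}

impl Health {
    /// HTTP status code for the health response.
    /// 206 is from the design text; 200 and 503 are assumptions.
    pub fn status_code(self) -> u16 {
        match self {
            Health::Healthy => 200,
            Health::Partial => 206,
            Health::Unhealthy => 503,
        }
    }
}

/// Builder synchronization check.
/// `latest_block_timestamp` is `None` when fetching the block failed.
pub fn builder_in_sync(
    now_unix_secs: u64,
    latest_block_timestamp: Option<u64>,
    max_unsafe_interval: Duration,
) -> bool {
    match latest_block_timestamp {
        // Out of sync only if the age exceeds the configured interval.
        // `saturating_sub` treats a block timestamp slightly in the
        // future as age zero instead of underflowing.
        Some(ts) => now_unix_secs.saturating_sub(ts) <= max_unsafe_interval.as_secs(),
        None => false,
    }
}

/// Combine builder sync and payload production into one status.
pub fn composite_health(
    builder_synced: bool,
    builder_payload_ok: bool,
    el_payload_ok: bool,
) -> Health {
    if !el_payload_ok {
        // The execution client failed to deliver a payload.
        Health::Unhealthy
    } else if !builder_synced || !builder_payload_ok {
        // Any builder-side degradation is only partial.
        Health::Partial
    } else {
        Health::Healthy
    }
}
```

The execution client check takes precedence: a builder problem alone never yields `Unhealthy`, since the fallback payload path still works.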

`op-conductor` should also be configurable in how it interprets health status for failover decisions. This allows chain operators to define thresholds based on their risk tolerance and operational goals. For example, operators may choose to maintain leadership with a sequencer reporting `206 Partial Content` to avoid unnecessary failovers, or they may configure `op-conductor` to fail over immediately when any degradation is detected. This flexibility allows the chain operator to configure a failover policy that aligns with network performance expectations and builder reliability.
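
To make the two example policies concrete, here is a minimal sketch of how a configurable interpretation of the health status might look. `FailoverPolicy` and `should_fail_over` are hypothetical and do not correspond to any existing `op-conductor` option (`op-conductor` itself is written in Go); the sketch also assumes `200` means fully healthy and treats any status other than `200` or `206` as unhealthy.

```rust
/// How the conductor interprets a partial health report.
/// (Hypothetical; not an existing op-conductor setting.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailoverPolicy {
    /// Keep leadership while the sequencer reports `206 Partial Content`.
    TolerantOfPartial,
    /// Fail over as soon as any degradation is detected.
    Strict,
}

/// Decide whether to start a leadership transfer, given the HTTP
/// status code returned by the sequencer's health endpoint.
pub fn should_fail_over(policy: FailoverPolicy, healthz_status: u16) -> bool {
    match healthz_status {
        // Assumed: 200 means fully healthy.
        200 => false,
        // Partial health: the operator's policy decides.
        206 => policy == FailoverPolicy::Strict,
        // Anything else is treated as unhealthy in this sketch.
        _ => true,
    }
}
```

Under this sketch, the only difference between the two policies is how a `206` response is handled.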

@0x00101010 0x00101010 May 6, 2025

Thinking about this a little bit more, curious is there any specific reason (other than convenience that rollup-boost already connects to both builder and default EL) that we do the health check in rollup-boost, not conductor?

The potential drawback here is that, due to different configurations or health check cadences, there might be a delay between when the builder is reported unhealthy and when conductor learns of it and starts the leadership transfer.

It's not a big deal if we can tolerate a small delay there, but in general conductor (with a new flashblocks / rbuilder health check strategy) feels like the ideal home for this.

Collaborator Author

@0xKitsune 0xKitsune May 7, 2025

> Thinking about this a little bit more, curious is there any specific reason (other than convenience that rollup-boost already connects to both builder and default EL) that we do the health check in rollup-boost, not conductor?

Our current thinking is that rollup-boost health checks should evaluate both whether the builder is synced and whether it is producing valid payloads.

By placing the builder health check in rollup-boost, we can assess payload health during get_payload calls. Each builder payload is validated via a newPayload call to the local execution client, confirming that the builder is producing valid payloads before marking the builder as healthy.

While op-conductor could perform basic sync checks, this setup allows the health check to ensure block production/validity, giving us a stronger signal of health than sync checks alone. Let me know if you have any thoughts on this.

@0x00101010 0x00101010 May 7, 2025

Cool, I think this makes sense. One suggestion (maybe not in the design doc itself) is to mention that, to minimize health check delay, operators should consider configuring matching health check intervals for conductor and rollup-boost.

@0x00101010 0x00101010 left a comment

LGTM!

@jelias2 jelias2 left a comment

lgtm
