RFC: Rollup Boost HA Design #185
- Maintain compatibility with `op-conductor` and its sequencing assumptions.

## Non Goals
- Define how Flashblocks are handled, consumed or streamed to the network.
This is actually a very important part that should be included in this design.
Consider one situation: in the #2 load balancing solution below, what if the `get_payload` call returns a payload from a different builder than the one streaming Flashblocks to the public? This would cause Flashblock <-> actual block inconsistencies, which we aim to avoid as much as possible.
Should we update this now? I think we need to take this into consideration when designing HA solutions for Flashblocks.
For clarity, on the design review call last week we aligned on focusing this doc on an HA design that is forward compatible with Flashblocks without defining the full Flashblocks HA specifics here.
The plan is to fast follow with a separate document that extends the HA design, detailing Flashblocks HA behavior including `op-conductor` integration and consistency guarantees. Happy to update and clarify to make this more explicit. Let me know if you have any additional thoughts on this.
docs/rollup-boost-ha.md
Outdated

## 1:1 Rollup Boost to Builder Deployments

In this design, each `rollup-boost` instance is configured with a single external builder and default execution client. When `op-node` sends an FCU containing payload attributes, `rollup-boost` forwards the request to both the default execution client and its paired builder. Upon receiving a `get_payload` request from `op-node`, `rollup-boost` queries both the execution client and the builder. If the builder returns a payload, it is validated via a `new_payload` request sent to the default execution client. If the builder payload is invalid or unavailable, `rollup-boost` falls back to the execution client's payload.
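
The selection rule described above can be illustrated with a minimal Python sketch. This is not the `rollup-boost` implementation: `Payload`, `select_payload`, and the `validate` callback are hypothetical names, and `validate` stands in for the `new_payload` request sent to the default execution client.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Payload:
    source: str      # "builder" or "default" (hypothetical label)
    block_hash: str  # placeholder for the real payload contents


def select_payload(
    builder_payload: Optional[Payload],
    default_payload: Optional[Payload],
    validate: Callable[[Payload], bool],
) -> Optional[Payload]:
    """Prefer the builder payload if it exists and validates; otherwise fall back.

    `validate` is an assumed stand-in for sending `new_payload` to the
    default execution client and checking that the result is valid.
    """
    if builder_payload is not None and validate(builder_payload):
        return builder_payload
    # Builder payload missing or invalid: use the execution client's payload.
    return default_payload


# Usage: an invalid builder payload falls back to the default payload.
builder = Payload("builder", "0xb")
default = Payload("default", "0xd")
chosen = select_payload(builder, default, validate=lambda p: False)
print(chosen.source)
```

The builder payload is only ever returned after validation, so a misbehaving builder degrades block quality at worst and does not stall block production as long as the default execution client responds.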
In addition to the current setup, another feature will need to be added here.
We'll need to allow conductor to proxy Flashblocks WebSocket connections / results to the public (depending on whether the current node is the leader). This is a common pattern in conductor: proxy any interaction with the system that strictly requires talking to the leader.
One thing to note here is that in the current design, Flashblocks reorgs can occur during leadership transfer, but they are much more controllable because they happen only during leadership transfer.
If we want to prevent reorgs entirely, we'll need to consider committing Flashblocks to the Raft consensus for consistency during leadership transfer; that is the second phase of our planned work.
Co-authored-by: Francis Li <[email protected]>
## Health Checks

In high availability deployments, `op-conductor` must assess the full health of the block production path. Rollup Boost will expose a composite `/healthz` endpoint to report on both builder synchronization and payload production status. These checks allow `op-conductor` to detect degraded block building conditions and make informed leadership decisions.

Rollup Boost continuously monitors two independent conditions to inform the health of the builder and the default execution client:

- **Builder Synchronization**:
A background task periodically queries the builder’s latest unsafe block via `engine_getBlockByNumber` and compares the timestamp of the returned block to the local system time. If the difference exceeds the configured maximum unsafe interval (`max_unsafe_interval`), the builder is considered out of sync. A failed fetch or an outdated block timestamp downgrades the health status to Partial. If the builder is responsive and the block timestamp is within the acceptable interval, the builder is considered synchronized and healthy. Alternatively, instead of periodic polling, builder synchronization can be inferred when the builder returns a `VALID` response to a `newPayload` call forwarded from Rollup Boost.

- **Payload Production**:
During each `get_payload` request, Rollup Boost will verify payload availability from both the builder and the execution client. If the builder fails to deliver a payload, Rollup Boost will report partial health. If the execution client fails to deliver a payload, Rollup Boost will report unhealthy.

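As a rough illustration of how the two conditions above could combine into a single status, here is a minimal Python sketch. The names `Health`, `builder_in_sync`, and `composite_health` are hypothetical, and the `200` and `503` status codes are assumptions; only `206 Partial Content` is named in this document.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = 200    # assumed status code for a fully healthy path
    PARTIAL = 206    # "206 Partial Content", as referenced in this doc
    UNHEALTHY = 503  # assumed status code for an unhealthy path


def builder_in_sync(
    block_timestamp: Optional[int], now: int, max_unsafe_interval: int
) -> bool:
    """Return True if the builder's latest unsafe block is recent enough.

    `block_timestamp` is None when fetching the block from the builder failed.
    """
    if block_timestamp is None:
        return False
    return now - block_timestamp <= max_unsafe_interval


def composite_health(
    builder_synced: bool, builder_payload_ok: bool, default_payload_ok: bool
) -> Health:
    """Combine builder sync and payload production into one health status."""
    if not default_payload_ok:
        # The execution client is the fallback, so its failure is unhealthy.
        return Health.UNHEALTHY
    if not (builder_synced and builder_payload_ok):
        # Blocks can still be produced through the fallback path.
        return Health.PARTIAL
    return Health.HEALTHY


# Usage: a builder 12 seconds behind with a 10 second limit yields partial health.
synced = builder_in_sync(block_timestamp=1_000, now=1_012, max_unsafe_interval=10)
print(composite_health(synced, builder_payload_ok=True, default_payload_ok=True))
```

Checking the execution client first reflects the asymmetry in the text: a failing builder only removes the preferred payload source, while a failing execution client removes the fallback.
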
`op-conductor` should also be configurable in how it interprets health status for failover decisions. This allows chain operators to define thresholds based on their risk tolerance and operational goals. For example, operators may choose to maintain leadership with a sequencer reporting `206 Partial Content` to avoid unnecessary failovers, or they may configure `op-conductor` to fail over immediately when any degradation is detected. This flexibility allows the chain operator to configure a failover policy that aligns with network performance expectations and builder reliability.
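
One way to picture such a configurable policy is the Python sketch below. It is not `op-conductor`'s actual configuration surface: `FailoverPolicy`, `tolerate_partial`, and `should_fail_over` are hypothetical names, and treating `200` as healthy and every other non-`206` status as unhealthy is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPolicy:
    # Hypothetical knob: keep leadership while the health endpoint reports 206.
    tolerate_partial: bool = True


def should_fail_over(status_code: int, policy: FailoverPolicy) -> bool:
    """Decide whether to transfer leadership based on a health status code."""
    if status_code == 200:  # assumed "fully healthy" status
        return False
    if status_code == 206:  # partial health: the operator decides
        return not policy.tolerate_partial
    return True             # anything else is treated as unhealthy


# Usage: a lenient operator stays put on 206, a strict operator fails over.
lenient = FailoverPolicy(tolerate_partial=True)
strict = FailoverPolicy(tolerate_partial=False)
print(should_fail_over(206, lenient), should_fail_over(206, strict))
```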
Thinking about this a little more, I'm curious: is there any specific reason (other than the convenience that rollup-boost already connects to both the builder and the default EL) that we do the health check in rollup-boost rather than conductor?
The potential drawback is that, because of differing configurations or health check cadences, there may be a delay between when the builder is reported unhealthy and when conductor learns of it and starts the leadership transfer.
It's not a big deal if we can tolerate a small delay there, but in general conductor (with a new flashblocks / rbuilder health check strategy) feels like the ideal home for this.
> Thinking about this a little more, I'm curious: is there any specific reason (other than the convenience that rollup-boost already connects to both the builder and the default EL) that we do the health check in rollup-boost rather than conductor?

Our current thinking is that `rollup-boost` health checks should evaluate both whether the builder is synced and whether it is producing valid payloads.
By placing the builder health check in `rollup-boost`, we can assess payload health during `get_payload` calls. Each builder payload is validated via a `newPayload` call to the local execution client, confirming that the builder is producing valid payloads before marking the builder as healthy.
While `op-conductor` could perform basic sync checks, this setup allows the health check to ensure block production/validity, giving us a stronger signal of health than sync checks alone. Let me know if you have any thoughts on this.
Cool, I think this makes sense. One suggestion (maybe not for the design doc itself): to minimize health check delay, I'd suggest configuring matching health check intervals for both conductor and rollup-boost.
LGTM!
lgtm
Opening this PR to migrate the design doc discussion from #181 so reviewers can leave comments directly on specific sections and suggest edits inline.