Slow rollout feature #190
base: flashblocks
@ferranbt @avalonche wanna give a review on this PR?
/// Percentage of blocks built by the builder
#[arg(long, env, default_value = "100")]
pub rollout_pct: u16,
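Because the flag is declared as a `u16`, the type alone accepts values well above 100. Below is a minimal sketch of a range check, assuming out-of-range values should be rejected at startup. `parse_rollout_pct` is a hypothetical helper and is not part of this PR. In clap 4, a function with this shape could probably be attached through `value_parser`, but that wiring is not shown here.

```rust
/// Hypothetical parser for the `rollout_pct` flag: accepts only 0..=100.
fn parse_rollout_pct(raw: &str) -> Result<u16, String> {
    // Parse as u16 first so non-numeric or negative input gets a clear error.
    let pct: u16 = raw
        .trim()
        .parse()
        .map_err(|e| format!("invalid rollout percentage {raw:?}: {e}"))?;
    // u16 allows up to 65535, so the percentage bound must be checked explicitly.
    if pct > 100 {
        return Err(format!("rollout percentage must be between 0 and 100, got {pct}"));
    }
    Ok(pct)
}
```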
With this approach, it seems `rollup-boost` would need to be restarted every time the rollout percentage is updated.
IIUC, the problem this PR is solving is to verify that the builder is healthy and producing blocks correctly before fully enabling it.
It seems like another approach to solve this problem could be using `ExecutionMode::DryRun`. `DryRun` forwards payload-building jobs to the builder without sending `get_payload` requests, allowing operators to evaluate builder health/metrics before fully enabling the builder. This allows us to validate the builder's readiness without needing `rollup-boost` restarts/config changes. Curious to hear your thoughts.
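To make the comparison concrete, here is a hedged model of how the two behaviours route calls. It is based only on the description in this comment. In `DryRun`, the builder still receives payload-building jobs but is never asked for a payload. With full enablement, the builder's payload is also requested. The `Mode` enum, its `Disabled` variant, and its methods are hypothetical and do not mirror rollup-boost's actual `ExecutionMode` type.

```rust
/// Hypothetical simplification of the execution modes discussed in this thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Mode {
    /// Builder is fully enabled: it gets FCUs and its payload is requested.
    Enabled,
    /// Builder gets FCUs with payload attributes, but `get_payload` is never sent to it.
    DryRun,
    /// Builder receives nothing (hypothetical variant for contrast).
    Disabled,
}

impl Mode {
    /// Whether forkchoice updates (payload-building jobs) are forwarded to the builder.
    fn forwards_fcu_to_builder(self) -> bool {
        matches!(self, Mode::Enabled | Mode::DryRun)
    }

    /// Whether `get_payload` is sent to the builder.
    fn requests_builder_payload(self) -> bool {
        matches!(self, Mode::Enabled)
    }
}
```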
> IIUC, the problem this PR is solving is to verify that the builder is healthy and producing blocks correctly before fully enabling it.
Not really; this PR is aimed at using the builder to build real blocks only X% of the time, which is not the same as dry run.
You could argue that `rollout_pct = 0` is the same as dry run, but otherwise it's not. Once you turn off dry run, 100% of blocks are built by the builder, which might not be ideal if you want to evaluate the builder building real blocks into DB state with some production traffic first.
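Here is a minimal sketch of the per-block decision described here. It assumes the selection is derived from the block number, so that any instance makes the same choice for the same block. The PR only says the builder block is fetched "x% of the time", so the hashing scheme (a splitmix64-style mixer) and the `use_builder_for_block` name are assumptions. With `rollout_pct = 0` the builder is never used, and with `100` it always is.

```rust
/// Hypothetical helper: should the builder's payload be used for this block?
/// `rollout_pct` mirrors the CLI flag from this PR (0..=100).
fn use_builder_for_block(block_number: u64, rollout_pct: u16) -> bool {
    if rollout_pct >= 100 {
        return true;
    }
    if rollout_pct == 0 {
        return false;
    }
    // splitmix64-style mixing so consecutive block numbers land in scattered buckets.
    let mut z = block_number.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^= z >> 31;
    // Map to a bucket 0..100 and compare against the percentage.
    (z % 100) < u64::from(rollout_pct)
}
```

Hashing the block number instead of drawing a fresh random number keeps the decision reproducible when reading logs. An RNG-based draw would give the same proportion on average.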
Can this be part of the debug API so the percentage can be configured without restarts?
Good point, I can make it part of the debug API
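One way to make the percentage adjustable at runtime is to keep it in an atomic value shared between a debug API handler and the payload path. This is a sketch that assumes a debug endpoint would call something like `set`. `RolloutConfig` and its methods are hypothetical, and no actual debug API method name is implied.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

/// Hypothetical runtime-adjustable rollout percentage.
#[derive(Debug)]
struct RolloutConfig {
    pct: AtomicU16,
}

impl RolloutConfig {
    /// Start from the CLI value, clamped to 100.
    fn new(initial_pct: u16) -> Self {
        Self { pct: AtomicU16::new(initial_pct.min(100)) }
    }

    /// Called from the (hypothetical) debug API handler; rejects values above 100.
    fn set(&self, pct: u16) -> Result<(), String> {
        if pct > 100 {
            return Err(format!("rollout percentage must be 0..=100, got {pct}"));
        }
        // Relaxed is enough: the percentage is a standalone value with no other data tied to it.
        self.pct.store(pct, Ordering::Relaxed);
        Ok(())
    }

    /// Read on every payload request, so updates take effect without a restart.
    fn get(&self) -> u16 {
        self.pct.load(Ordering::Relaxed)
    }
}
```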
I don't really see an issue with this, but why is this useful functionality to have within `rollup-boost`? `Fallback` execution mode would actually allow you to send FCUs and receive payloads from the builder without propagating those payloads to the CL. This would allow you to fully evaluate the health of the builder (WebSocket streams, etc.) while building 100% of blocks without fully enabling block production on the network, which seems most useful.
Curious why an operator would want the builder to only build x% of blocks at random points in time?
> Not really, this PR is aimed at only using the builder to build real blocks X% amount of time, which is not the same as dry run.
Right, I'm not suggesting that `DryRun` does the same thing as a partial rollout. I'm asking whether they seek to solve the same problem (i.e. how to verify that the builder is healthy and producing blocks correctly before fully enabling the builder). If partial rollout fits your deployment approach better than something like `DryRun`, agreed that we could make it part of the debug API to avoid restarts.
> if you want to evaluate the builder building real blocks into DB state with some production traffic first.
Just noting that if the ultimate goal of partial rollouts is to validate builder payload correctness before fully enabling the builder and propagating these blocks throughout the network, this could be achieved via `DryRun` (or other execution modes like `Fallback`) and inspecting traces/logs to evaluate builder-produced blocks without publishing them to the network.
Within the debug API there is also `Fallback` mode, which sends FCUs with payload attributes to the builder and validates payloads with the default execution client, but ultimately falls back on the default execution client's block. This approach could also be used to de-risk deployments, allowing you not only to inspect the block via logs but also to validate builder blocks via `new_payload` calls to the local execution client. It's worth noting that @ferranbt and I had been discussing simplifying `DryRun` and `Fallback` into a single execution mode, since they are quite similar. We could incorporate the problem that partial rollout is trying to solve into those changes as well.
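The `Fallback` behaviour described in this comment can be sketched as follows. The sketch assumes the builder payload is checked by a `new_payload`-style call against the local execution client and that the local payload is always the one returned. `Payload`, `BuilderCheck`, the `validate` closure, and `fallback_get_payload` are hypothetical stand-ins, not rollup-boost's API.

```rust
/// Hypothetical stand-in for an execution payload.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Payload {
    block_hash: String,
    source: &'static str,
}

/// Outcome recorded for observability while the local block is still used.
#[derive(Debug, PartialEq, Eq)]
enum BuilderCheck {
    Valid,
    Invalid(String),
    Unavailable,
}

/// Fallback-style `get_payload`: inspect the builder payload, always return the local one.
fn fallback_get_payload(
    local: Payload,
    builder: Option<Payload>,
    // Stand-in for validating via `new_payload` against the local execution client.
    validate: impl Fn(&Payload) -> Result<(), String>,
) -> (Payload, BuilderCheck) {
    let check = match builder {
        None => BuilderCheck::Unavailable,
        Some(p) => match validate(&p) {
            Ok(()) => BuilderCheck::Valid,
            Err(reason) => BuilderCheck::Invalid(reason),
        },
    };
    // The builder's block is never propagated in this mode; only the check result is kept.
    (local, check)
}
```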
Let me know if I'm overlooking something here. I won't block on this, just pointing out that we could potentially use or extend existing execution modes to handle builder block correctness and health validation during incremental rollout.
> Curious why an operator would want the builder to only build x% of blocks at random points in time?
Yeah, this is the key question here; internally we are going to revisit this tomorrow to see if this assumption really makes sense.
What I was thinking here was unrelated to correctness: whether this partial rollout could allow us to observe new user behaviours from the new blocks (e.g. users might start raising fees, more or less spam, etc.), because the external builder inherently builds different blocks than the sequencer. But maybe if it's just random blocks it doesn't really help; perhaps some kind of switchback experiment makes more sense here.
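A switchback experiment assigns whole contiguous time windows to treatment or control instead of randomizing individual blocks. During a window, users then interact with a consistent kind of block. Here is a minimal sketch that assumes windows are measured in blocks and simply alternate, with the builder active in even-numbered windows. A real experiment could instead randomize the assignment per window. `builder_active` and `window_len` are hypothetical.

```rust
/// Hypothetical switchback schedule: the builder is active for whole windows of
/// `window_len` consecutive blocks, alternating with windows built by the sequencer.
fn builder_active(block_number: u64, window_len: u64) -> bool {
    assert!(window_len > 0, "window length must be positive");
    // Window index 0, 2, 4, ... -> builder; 1, 3, 5, ... -> default execution client.
    (block_number / window_len) % 2 == 0
}
```

Assuming 2-second blocks, `window_len = 1800` would correspond to one-hour windows.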
Hey, just bumping this to see if there are any updates after revisiting the assumptions around partial rollout, or if we should close this PR.
Adds a slow rollout feature where `rollup-boost` only tries to fetch the builder block x% of the time. This allows the rollout to happen incrementally rather than all or nothing.
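The behaviour in this description can be sketched as a gate in front of the builder call. If a block is not selected for the rollout, the builder is not asked for a payload at all, and the local payload is used. If it is selected but the builder returns nothing, the sketch assumes the usual fallback to the local payload. `choose_payload`, `PayloadSource`, and the string payloads are hypothetical. How a block gets selected is left to the caller.

```rust
/// Where the payload returned to the sequencer came from.
#[derive(Debug, PartialEq, Eq)]
enum PayloadSource {
    Builder,
    Local,
}

/// Hypothetical rollout-gated `get_payload`: the builder is only queried for selected blocks.
fn choose_payload<F>(selected_for_rollout: bool, local: String, fetch_builder: F) -> (String, PayloadSource)
where
    F: FnOnce() -> Option<String>,
{
    if selected_for_rollout {
        // Only selected blocks trigger a request to the builder.
        if let Some(payload) = fetch_builder() {
            return (payload, PayloadSource::Builder);
        }
    }
    // Not selected, or the builder had no payload: use the local block.
    (local, PayloadSource::Local)
}
```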