
Slow rollout feature #190



Open · wants to merge 1 commit into base: flashblocks


cody-wang-cb

Adding a slow rollout feature where rollup-boost only tries to fetch the builder block x% of the time. This allows the rollout to happen on an incremental basis rather than all or nothing.
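A minimal sketch of the per-block gate described above, assuming one uniform draw in 0..100 per block. The names XorShift64 and should_use_builder are hypothetical and not taken from the PR; a tiny xorshift PRNG stands in for a crate such as rand so the example has no dependencies.

```rust
/// Minimal xorshift64 PRNG so the sketch needs no external crates.
/// (Real code would more likely use the `rand` crate.) Seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical gate: returns true if this block should be fetched from the builder.
/// Values of `rollout_pct` above 100 are clamped, so they behave like 100.
fn should_use_builder(rollout_pct: u16, rng: &mut XorShift64) -> bool {
    let pct = rollout_pct.min(100) as u64;
    // Draw a value in 0..100; the modulo bias over a u64 is negligible here.
    rng.next_u64() % 100 < pct
}

fn main() {
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    let blocks = 10_000;
    let from_builder = (0..blocks)
        .filter(|_| should_use_builder(25, &mut rng))
        .count();
    println!("{from_builder} of {blocks} blocks fetched from the builder");
}
```

Because each block is an independent draw, the builder's share only converges to x% over many blocks; individual stretches can deviate noticeably.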


vercel bot commented Apr 28, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name: rollup-boost | Status: ⬜️ Ignored (Inspect) | Preview: Visit Preview | Updated (UTC): Apr 28, 2025 0:52am

@cody-wang-cb
Author

@ferranbt @avalonche wanna give a review on this PR?


/// Percentage of blocks built by the builder
#[arg(long, env, default_value = "100")]
pub rollout_pct: u16,
Collaborator

@0xKitsune · Apr 28, 2025


With this approach, it seems rollup-boost would need to be restarted every time the rollout percentage is updated, since rollout_pct is only read from the CLI flag or environment variable at startup.

IIUC, the problem this PR is solving is to verify that the builder is healthy and producing blocks correctly before fully enabling it.

It seems like another approach to solve this problem could be using ExecutionMode::DryRun. DryRun forwards payload building jobs to the builder without sending get_payload requests, allowing operators to evaluate builder health/metrics before fully enabling the builder.

This allows us to validate the builder’s readiness without needing rollup-boost restarts/config changes. Curious to hear your thoughts.

Author

@cody-wang-cb · Apr 28, 2025


> IIUC, the problem this PR is solving is to verify that the builder is healthy and producing blocks correctly before fully enabling it.

Not really, this PR is aimed at only using the builder to build real blocks X% of the time, which is not the same as dry run.
You could argue that rollout_pct = 0 is the same as dry run, but otherwise it's not. Once you turn off dry run, 100% of blocks are built by the builder, which might not be ideal if you want to evaluate the builder building real blocks into DB state with some production traffic first.
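To make the comparison concrete, here is a simplified model of which block is returned to the CL under each setting. Mode, Source and block_source are hypothetical names, not rollup-boost's actual types, and the model deliberately ignores whether FCUs are still forwarded to the builder; it only captures where the returned block comes from.

```rust
/// Hypothetical, simplified model of where the block returned to the CL comes from.
/// These are not rollup-boost's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    /// Builder receives payload-building jobs, but get_payload is never sent to it.
    DryRun,
    /// Builder's payload is used for roughly `rollout_pct`% of blocks.
    Enabled { rollout_pct: u16 },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Builder,
    LocalExecutionClient,
}

/// `roll` is assumed to be a uniform draw in 0..100, made once per block.
fn block_source(mode: Mode, roll: u16) -> Source {
    match mode {
        Mode::DryRun => Source::LocalExecutionClient,
        Mode::Enabled { rollout_pct } if roll < rollout_pct.min(100) => Source::Builder,
        Mode::Enabled { .. } => Source::LocalExecutionClient,
    }
}

fn main() {
    // rollout_pct = 0 never selects the builder, matching dry run's returned block.
    let same = (0..100).all(|r| {
        block_source(Mode::DryRun, r) == block_source(Mode::Enabled { rollout_pct: 0 }, r)
    });
    // Without a rollout knob, leaving dry run means every block comes from the builder.
    let all_builder = (0..100)
        .all(|r| block_source(Mode::Enabled { rollout_pct: 100 }, r) == Source::Builder);
    println!("pct=0 matches dry run: {same}; pct=100 always builder: {all_builder}");
}
```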

Collaborator


Can this be part of the debug API so the percentage can be configured without restarts?

Author


Good point, I can make it part of the debug API
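One way the percentage could be made adjustable at runtime is to keep it in shared atomic state: the CLI flag only seeds the initial value, and a debug API handler updates it. This is a sketch under that assumption; RolloutConfig, set_pct and pct are hypothetical names, and the thread below merely simulates a debug API call.

```rust
use std::sync::atomic::{AtomicU16, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical shared rollout setting. The CLI flag would only seed the
/// initial value; a debug API handler could update it at runtime.
#[derive(Debug)]
struct RolloutConfig {
    pct: AtomicU16,
}

impl RolloutConfig {
    fn new(initial_pct: u16) -> Self {
        Self { pct: AtomicU16::new(initial_pct.min(100)) }
    }

    /// Called from the (hypothetical) debug API handler; values above 100 are clamped.
    fn set_pct(&self, pct: u16) {
        self.pct.store(pct.min(100), Ordering::Relaxed);
    }

    /// Read once per block by the payload-routing code.
    fn pct(&self) -> u16 {
        self.pct.load(Ordering::Relaxed)
    }
}

fn main() {
    let config = Arc::new(RolloutConfig::new(10));

    // Simulate a debug API call arriving on another thread.
    let handle = {
        let config = Arc::clone(&config);
        thread::spawn(move || config.set_pct(50))
    };
    handle.join().unwrap();

    println!("rollout is now {}%", config.pct());
}
```

Relaxed ordering is enough here because the percentage is a single independent value that no other data is synchronized through; each block just needs some recent value.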

Collaborator


I don't really see an issue with this, but why is this useful functionality to have within rollup-boost? Fallback execution mode would actually allow you to send FCUs and receive payloads from the builder without propagating those payloads to the CL. This would allow you to fully evaluate the health of the builder (WebSocket streams, etc.) while building 100% of blocks, without fully enabling block production on the network, which seems most useful.

Curious why an operator would want the builder to only build x% of blocks at random points in time?

Collaborator


> Not really, this PR is aimed at only using the builder to build real blocks X% of the time, which is not the same as dry run.

Right, I'm not suggesting that DryRun does the same thing as a partial rollout. I'm asking whether they seek to solve the same problem (i.e., how to verify that the builder is healthy and producing blocks correctly before fully enabling it). If partial rollout fits your deployment approach better than something like DryRun, I agree that we could make it part of the debug API to avoid restarts.

> if you want to evaluate the builder building real blocks into DB state with some production traffic first.

Just noting that if the ultimate goal of partial rollouts is to validate builder payload correctness before fully enabling the builder and propagating these blocks throughout the network, this could be achieved via DryRun (or other execution modes like Fallback) and inspecting traces/logs to evaluate builder-produced blocks without publishing them to the network.

Within the debug API there is also Fallback mode, which sends FCUs with payload attributes to the builder and validates the builder's payloads with the default execution client, but ultimately falls back on the default execution client's block. This approach could also be used to de-risk deployments, allowing you not only to inspect the block via logs but also to validate builder blocks via new_payload calls to the local execution client. It's worth noting that @ferranbt and I had been discussing simplifying DryRun and Fallback into a single execution mode, since they are quite similar. We could incorporate the problem that partial rollout is trying to solve into those changes as well.
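The Fallback behaviour described above can be sketched as: check the builder's payload, but always return the local execution client's payload. Payload, BuilderCheck and get_payload_fallback are hypothetical stand-ins, not rollup-boost's real types, and the validate closure is assumed to play the role of a new_payload call to the local execution client.

```rust
/// Hypothetical stand-in for an execution payload; not rollup-boost's real type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Payload {
    block_hash: u64,
}

/// Outcome of checking the builder's payload.
#[derive(Debug, PartialEq, Eq)]
enum BuilderCheck {
    Valid,
    Invalid(String),
    Missing,
}

/// Fallback-style handling: the builder's payload (if any) is checked via
/// `validate`, standing in for a new_payload call to the local execution client,
/// but the local execution client's payload is always the one returned to the CL.
fn get_payload_fallback<F>(
    local: Payload,
    builder: Option<Payload>,
    validate: F,
) -> (Payload, BuilderCheck)
where
    F: Fn(&Payload) -> Result<(), String>,
{
    let check = match builder.as_ref() {
        None => BuilderCheck::Missing,
        Some(p) => match validate(p) {
            Ok(()) => BuilderCheck::Valid,
            Err(reason) => BuilderCheck::Invalid(reason),
        },
    };
    (local, check)
}

fn main() {
    // Toy validator: treat a zero block hash as invalid.
    let validate = |p: &Payload| {
        if p.block_hash == 0 { Err("zero block hash".to_string()) } else { Ok(()) }
    };
    let local = Payload { block_hash: 1 };
    let builder = Some(Payload { block_hash: 0 });
    let (returned, check) = get_payload_fallback(local, builder, validate);
    println!("returned {:?}, builder check: {:?}", returned, check);
}
```

The builder check result would then feed logs or metrics, which is how this mode lets operators evaluate builder blocks without publishing them.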

Let me know if I'm overlooking something here. I won't block on this, just pointing out that we could potentially use or extend existing execution modes to handle builder block correctness and health validation during incremental rollout.

Author


> Curious why an operator would want the builder to only build x% of blocks at random points in time?

Yeah, this is the key question here; internally we are going to revisit this tomorrow to see if this assumption really makes sense.
What I was thinking of here was unrelated to correctness: whether this partial rollout could let us observe new user behaviours in response to the new blocks (e.g. users might start to increase fees, more or less spam, etc.), because the external builder inherently builds different blocks than the sequencer. But if the builder only builds randomly selected blocks, that may not really help; perhaps some kind of switchback experiment makes more sense here.
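For contrast with per-block random selection, a switchback experiment assigns whole contiguous windows of blocks to one arm, so user behaviour has time to respond within each window. This is a sketch of the simplest variant, strict alternation; many switchback designs instead randomize which windows get the builder. switchback_uses_builder is a hypothetical name, and the 1,800-block window (one hour at an assumed 2-second block time) is illustrative only.

```rust
/// Hypothetical switchback schedule: blocks are grouped into consecutive windows
/// of `window_len` blocks, and whole windows alternate between the builder and
/// the default sequencer path, starting with the builder.
fn switchback_uses_builder(block_number: u64, window_len: u64) -> bool {
    assert!(window_len > 0, "window_len must be positive");
    (block_number / window_len) % 2 == 0
}

fn main() {
    // With 1,800-block windows, blocks 0..1800 use the builder,
    // 1800..3600 do not, 3600..5400 do again, and so on.
    for block in [0u64, 1_799, 1_800, 3_599, 3_600] {
        println!("block {block}: builder = {}", switchback_uses_builder(block, 1_800));
    }
}
```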

Collaborator


Hey just bumping this to see if there are any updates after revisiting the assumptions around partial rollout or if we should close this PR.

Labels: None yet
Projects: None yet

4 participants