
Slow rollout feature #190

Open
wants to merge 1 commit into
base: flashblocks

1 change: 1 addition & 0 deletions Cargo.lock

1 change: 1 addition & 0 deletions Cargo.toml
@@ -46,6 +46,7 @@ url = "2.5"
metrics-util = "0.19.0"
eyre = "0.6.12"
paste = "1.0.15"
rand = "0.8.5"

# dev dependencies for integration tests
parking_lot = "0.12.3"
1 change: 1 addition & 0 deletions src/bin/main.rs
@@ -105,6 +105,7 @@ async fn main() -> eyre::Result<()> {
boost_sync_enabled,
args.execution_mode,
flashblocks_client,
args.rollout_pct,
);

// Spawn the debug server
4 changes: 4 additions & 0 deletions src/cli.rs
@@ -73,6 +73,10 @@ pub struct Args {
/// Enable Flashblocks client
#[clap(flatten)]
pub flashblocks: FlashblocksArgs,

/// Percentage of blocks built by the builder
#[arg(long, env, default_value = "100")]
pub rollout_pct: u16,

@0xKitsune (Collaborator), Apr 28, 2025:

With this approach, it seems rollup-boost would need to be restarted every time the rollout percentage is updated.

IIUC, the problem this PR is solving is to verify that the builder is healthy and producing blocks correctly before fully enabling it.

It seems like another approach to solve this problem could be using ExecutionMode::DryRun. DryRun forwards payload building jobs to the builder without sending get_payload requests, allowing operators to evaluate builder health/metrics before fully enabling the builder.

This allows us to validate the builder’s readiness without needing rollup-boost restarts/config changes. Curious to hear your thoughts.

@cody-wang-cb (Author), Apr 28, 2025:

> IIUC, the problem this PR is solving is to verify that the builder is healthy and producing blocks correctly before fully enabling it.

Not really, this PR is aimed at only using the builder to build real blocks X% of the time, which is not the same as dry run.
You could argue that rollout_pct = 0 is the same as dry run, but otherwise it's not. Once you turn off dry run, 100% of blocks are built by the builder, which might not be ideal if you want to evaluate the builder building real blocks into DB state with some production traffic first.

Collaborator:

Can this be part of the debug API so the % can be configured without restarts?

Author:

Good point, I can make it part of the debug API
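
A minimal sketch of what a restart-free setting could look like, assuming the percentage is held in a shared atomic that a debug endpoint can update. `RolloutConfig`, `set_pct`, and `pct` are hypothetical names for illustration and are not part of rollup-boost; the debug API wiring itself is omitted.

```rust
use std::sync::atomic::{AtomicU16, Ordering};
use std::sync::Arc;

/// Hypothetical shared handle for the rollout percentage.
/// Cloning the handle shares the same underlying value.
#[derive(Clone, Debug)]
pub struct RolloutConfig {
    pct: Arc<AtomicU16>,
}

impl RolloutConfig {
    /// Creates the handle; values above 100 are clamped to 100.
    pub fn new(initial: u16) -> Self {
        Self {
            pct: Arc::new(AtomicU16::new(initial.min(100))),
        }
    }

    /// Would be called from a (hypothetical) debug endpoint.
    /// Values above 100 are clamped to 100.
    pub fn set_pct(&self, pct: u16) {
        self.pct.store(pct.min(100), Ordering::Relaxed);
    }

    /// Read by the server each time it decides whether to use the builder.
    pub fn pct(&self) -> u16 {
        self.pct.load(Ordering::Relaxed)
    }
}
```

With this shape, the server would keep one clone and the debug handler another, and a call to `set_pct` on either would be visible to the next `pct()` read without restarting the process.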

Collaborator:

I don't really see an issue with this, but why is this useful functionality to have within rollup-boost? Fallback execution mode would actually allow you to send FCUs and receive payloads from the builder without propagating those payloads to the CL. This would allow you to fully evaluate the health of the builder (WebSocket streams, etc.) while building 100% of blocks without fully enabling block production on the network, which seems most useful.

Curious why an operator would want the builder to only build x% of blocks at random points in time?

Collaborator:

> Not really, this PR is aimed at only using the builder to build real blocks X% of the time, which is not the same as dry run.

Right, I'm not suggesting that DryRun does the same thing as a partial rollout. I'm asking whether they seek to solve the same problem (i.e. how to verify that the builder is healthy and producing blocks correctly before fully enabling the builder). If partial rollout fits your deployment approach better than something like DryRun, agreed that we could make it part of the debug API to avoid restarts.

> if you want to evaluate the builder building real blocks into DB state with some production traffic first.

Just noting that if the ultimate goal of partial rollouts is to validate builder payload correctness before fully enabling the builder and propagating these blocks throughout the network, this could be achieved via DryRun (or other execution modes like Fallback) and by inspecting traces/logs to evaluate builder-produced blocks without publishing them to the network.

Within the Debug API there is also Fallback mode which sends FCUs with payload attributes to the builder and validates payloads with the default execution client, but ultimately falls back on the default execution client's block. This approach could also be used to derisk deployments, allowing you to not only inspect the block via logs but also validate builder blocks via new_payload calls to the local execution client. It's worth noting that @ferranbt and I had been discussing simplifying DryRun and Fallback into a single execution mode, since they are quite similar. We could incorporate the problem that partial rollout is trying to solve into those changes as well.

Let me know if I'm overlooking something here. I won't block on this, just pointing out that we could potentially use or extend existing execution modes to handle builder block correctness and health validation during incremental rollout.
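
To make the comparison concrete, here is a rough sketch of the execution modes as they are described in this thread. It is not the real `ExecutionMode` type from rollup-boost: `Mode`, `Behavior`, and `behavior` are hypothetical names, and the `Enabled` row is an assumption about normal operation rather than something stated above.

```rust
/// Simplified stand-in for the execution modes discussed in this thread.
/// The actual rollup-boost enum may have different variants and semantics.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    Enabled,
    DryRun,
    Fallback,
}

/// What happens to builder traffic in a given mode (per the comments above).
#[derive(Debug, PartialEq, Eq)]
pub struct Behavior {
    /// Payload-building jobs (FCUs with payload attributes) go to the builder.
    pub forwards_jobs_to_builder: bool,
    /// `get_payload` requests are sent to the builder.
    pub requests_builder_payload: bool,
    /// The builder's block can be handed back to the consensus layer.
    pub returns_builder_block: bool,
}

pub fn behavior(mode: Mode) -> Behavior {
    match mode {
        // Assumption: normal operation uses the builder end to end.
        Mode::Enabled => Behavior {
            forwards_jobs_to_builder: true,
            requests_builder_payload: true,
            returns_builder_block: true,
        },
        // Jobs are forwarded, but no get_payload requests are sent.
        Mode::DryRun => Behavior {
            forwards_jobs_to_builder: true,
            requests_builder_payload: false,
            returns_builder_block: false,
        },
        // Builder payloads are fetched and validated locally, but the
        // default execution client's block is the one that is used.
        Mode::Fallback => Behavior {
            forwards_jobs_to_builder: true,
            requests_builder_payload: true,
            returns_builder_block: false,
        },
    }
}
```

Read this way, a partial rollout differs from both DryRun and Fallback in the last column: for the sampled fraction of blocks, the builder's block is actually returned.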

Author:

> Curious why an operator would want the builder to only build x% of blocks at random points in time?

Yeah, this is the key question here. Internally we are going to revisit this tomorrow to see if this assumption really makes sense.
What I was thinking here was unrelated to correctness: whether this partial rollout could let us observe new user behaviours in response to the new blocks (e.g. users might start to increase fees, or send more or less spam), because the external builder inherently builds different blocks than the sequencer. But if it's just random blocks, it may not really help; perhaps some kind of switchback experiment makes more sense here.
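
For illustration, a switchback schedule could be sketched as below: instead of sampling each block independently, block production alternates between the builder and the sequencer in contiguous windows, so each arm gets an uninterrupted run of blocks. `use_builder_for_block` and `window_len` are hypothetical and not part of this PR.

```rust
/// Hypothetical switchback schedule: alternate between the builder and the
/// sequencer in contiguous windows of `window_len` blocks.
/// Even-numbered windows (0, 2, 4, ...) use the builder.
pub fn use_builder_for_block(block_number: u64, window_len: u64) -> bool {
    // Treat a zero-length window as length 1 to avoid dividing by zero.
    let window_len = window_len.max(1);
    (block_number / window_len) % 2 == 0
}
```

With `window_len = 100`, blocks 0 to 99 would go to the builder, blocks 100 to 199 to the sequencer, and so on. The decision depends only on the block number, so it is reproducible when analysing results afterwards.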

Collaborator:

Hey, just bumping this to see if there are any updates after revisiting the assumptions around partial rollout, or if we should close this PR.

}

#[derive(Parser, Debug)]
52 changes: 52 additions & 0 deletions src/server.rs
@@ -124,6 +124,7 @@ pub struct RollupBoostServer {
pub payload_trace_context: Arc<PayloadTraceContext>,
pub flashblocks_client: Option<Arc<FlashblocksService>>,
pub execution_mode: Arc<Mutex<ExecutionMode>>,
pub rollout_pct: u16,
}

impl RollupBoostServer {
@@ -133,6 +134,7 @@ impl RollupBoostServer {
boost_sync: bool,
initial_execution_mode: ExecutionMode,
flashblocks_client: Option<FlashblocksService>,
rollout_pct: u16,
) -> Self {
Self {
l2_client: Arc::new(l2_client),
@@ -141,6 +143,7 @@ impl RollupBoostServer {
payload_trace_context: Arc::new(PayloadTraceContext::new()),
flashblocks_client: flashblocks_client.map(Arc::new),
execution_mode: Arc::new(Mutex::new(initial_execution_mode)),
rollout_pct,
}
}

@@ -596,6 +599,15 @@ impl RollupBoostServer {
            ));
        }

        if !should_rollout(self.rollout_pct) {
            // Skip the builder for this payload because of the slow rollout
            return Err(ErrorObject::owned(
                INVALID_REQUEST_CODE,
                format!("Skipped because of {}% slow rollout", self.rollout_pct),
                None::<String>,
            ));
        }

        if let Some(cause) = self.payload_trace_context.trace_id(&payload_id) {
            tracing::Span::current().follows_from(cause);
        }
@@ -673,11 +685,21 @@ impl RollupBoostServer {
            "number" = %block_number,
            %context,
            %payload_id,
            "rollout_pct" = self.rollout_pct,
        );
        Ok(payload)
    }
}

/// Returns `true` with probability `rollout_pct / 100`.
/// Values of 100 or more always return `true`; 0 always returns `false`.
fn should_rollout(rollout_pct: u16) -> bool {
    if rollout_pct >= 100 {
        true
    } else {
        // `rand::random::<f64>()` is uniform in [0, 1)
        rand::random::<f64>() < f64::from(rollout_pct) / 100.0
    }
}

#[cfg(test)]
#[allow(clippy::complexity)]
mod tests {
@@ -796,6 +818,7 @@ mod tests {
boost_sync,
ExecutionMode::Enabled,
None,
100,
);

let module: RpcModule<()> = rollup_boost_client.try_into().unwrap();
@@ -837,6 +860,7 @@ mod tests {
boost_sync_enabled().await;
builder_payload_err().await;
test_local_external_payload_ids_same().await;
test_should_rollout();
}

async fn engine_success() {
@@ -1100,4 +1124,32 @@ mod tests {

        test_harness.cleanup().await;
    }

    fn test_should_rollout() {
        // Test with 100% rollout - should always return true
        for _ in 0..100 {
            assert!(
                should_rollout(100),
                "100% rollout should always return true"
            );
        }

        // Test with 0% rollout - should always return false
        for _ in 0..100 {
            assert!(!should_rollout(0), "0% rollout should always return false");
        }

        // Test with 50% rollout - should be statistically around 50%
        let trials = 1000;
        let successes = (0..trials).filter(|_| should_rollout(50)).count();

        let success_rate = successes as f64 / trials as f64;

        // With 1000 trials, we should be within 5 percentage points of the expected 50%
        assert!(
            (success_rate - 0.5).abs() < 0.05,
            "Expected success rate near 50%, got {}%",
            success_rate * 100.0
        );
    }
}
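
As a side note on the 50% case in `test_should_rollout`, the tolerance can be checked against the spread of a binomial proportion. The sketch below is not part of the PR; `success_rate_std_dev` and `tolerance_in_sigmas` are hypothetical helpers that assume independent draws with a fixed success probability.

```rust
/// Standard deviation of the observed success rate over `trials`
/// independent Bernoulli(p) draws: sqrt(p * (1 - p) / trials).
pub fn success_rate_std_dev(p: f64, trials: u32) -> f64 {
    (p * (1.0 - p) / f64::from(trials)).sqrt()
}

/// How many standard deviations wide an absolute tolerance is.
pub fn tolerance_in_sigmas(p: f64, trials: u32, tolerance: f64) -> f64 {
    tolerance / success_rate_std_dev(p, trials)
}
```

For p = 0.5, 1000 trials, and a tolerance of 0.05, the standard deviation is about 0.0158, so the tolerance is roughly 3.16 standard deviations. Under a normal approximation, that suggests the assertion would fail spuriously on the order of 0.1 to 0.2% of runs; more trials or a wider tolerance would lower that rate.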