
Conversation

bw-solana commented Apr 11, 2025

Problem

We would like to move to fixed (32:32) FEC set sizes to simplify multiple areas of the code.

This PR is simply a draft to prove out the concept and is not meant to be merged.

If we decide to pursue this, we will need to:

  1. Tweak some of the batching logic to make sure we're not sending too many small data sets and generating excessive padding <-- this is now WIP
  2. Chunk this up into several smaller PRs to make it reviewable.
  3. Revisit the unit tests to make sure they still make sense and add value.

Summary of Changes

bw-solana (Author)

Currently seeing ~7% padding overhead running with 4eb25e2
[image]

This is similar to what Jump has observed using the same entry coalesce bytes target.
For comparison, ~6% overhead is seen on mainnet with the current variable FEC set size.

bw-solana (Author)

As for what is driving the need for padding, here are some logs that shed light:

[2025-04-11T20:07:37.683983999Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 83048, entry_bytes: 13593,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.686228591Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Not entering   coalesce loop, serialized_batch_byte_count: 116803,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.691491722Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 92151, entry_bytes: 693,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.700535140Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 90002, entry_bytes: 3488,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.713070269Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 92015, entry_bytes: 10368,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.716405526Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 83192, entry_bytes: 13593,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.721629706Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 90408, entry_bytes: 3488,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.736128600Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 90170, entry_bytes: 3488,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.744027342Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 90552, entry_bytes: 6928,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.746227733Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Not entering   coalesce loop, serialized_batch_byte_count: 108491,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.799141052Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, timed out, serialized_batch_byte_count: 48782,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.830889203Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 83146, entry_bytes: 13593,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.834880165Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Not entering   coalesce loop, serialized_batch_byte_count: 93368,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.843824541Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 85962, entry_bytes: 13593,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.847022015Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Breaking out of   coalesce loop, serialized_batch_byte_count: 84434, entry_bytes: 13593,   target_serialized_batch_byte_count: 92448
[2025-04-11T20:07:37.849953892Z WARN    solana_turbine::broadcast_stage::broadcast_utils] #BW: Not entering   coalesce loop, serialized_batch_byte_count: 143154,   target_serialized_batch_byte_count: 92448

There are a couple of "bad" cases that lead to more padding:

  1. Draining a large number of entries from the receiver up front, so many that the batch already exceeds the target. The portion that exceeds the target results in ~1/2 a batch of padding on average.
  2. Large entries (>10kB) make it easy to overshoot the target, leaving a large amount of empty space to pad at the end.

A few potential options to do better here:

  1. Increase the target coalesce bytes even more. The downside here is delaying pushing out shreds.
  2. Force smaller tx batching upstream. This seems complex and will introduce other perf implications.
  3. Try to buffer excess entries for "bad" case 1 (a rough sketch follows this list).
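
For reference, here's a rough sketch of what option 3 could look like. This is hypothetical code, not part of this PR; it assumes solana_entry::entry::Entry, bincode 1.x for sizing, and a made-up helper name:

use solana_entry::entry::Entry;

// Hypothetical helper: fill one batch up to `target_bytes` and carry the rest
// over to the next batch instead of padding the current one out.
fn split_at_target(entries: Vec<Entry>, target_bytes: u64) -> (Vec<Entry>, Vec<Entry>) {
    let mut batch: Vec<Entry> = Vec::new();
    let mut carry: Vec<Entry> = Vec::new();
    let mut batch_bytes = 0u64;
    for entry in entries {
        let entry_bytes = bincode::serialized_size(&entry).unwrap_or(u64::MAX);
        // Entries must stay in order, so once we start carrying, everything after
        // that point is carried as well. The first entry is always accepted, even
        // if it alone exceeds the target.
        if carry.is_empty()
            && (batch.is_empty() || batch_bytes.saturating_add(entry_bytes) <= target_bytes)
        {
            batch_bytes = batch_bytes.saturating_add(entry_bytes);
            batch.push(entry);
        } else {
            carry.push(entry);
        }
    }
    (batch, carry)
}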

codecov-commenter commented Apr 11, 2025

Codecov Report

Attention: Patch coverage is 92.45810% with 27 lines in your changes missing coverage. Please review.

Project coverage is 82.9%. Comparing base (86b229b) to head (7041ad4).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #5771     +/-   ##
=========================================
- Coverage    82.9%    82.9%   -0.1%     
=========================================
  Files         830      830             
  Lines      376347   376498    +151     
=========================================
+ Hits       312282   312386    +104     
- Misses      64065    64112     +47     

bw-solana (Author)

Running the same experiment (bench-tps spamming ~30k TPS) with this code vs. master, I observe the following:

This code:

  • ~200kB of extra data pad bytes per slot --> ~200kB of extra coding bytes
  • ~7.5k shreds per slot
  • ~33k TPS observed
  • Overall padding bytes = ~400kB per slot --> ~5% padding overhead

Master code:

  • ~51kB of extra data pad bytes per slot --> ~51kB of extra coding bytes
  • ~15kB worth of extra coding shreds per slot due to erasure batch < 32 data shreds
  • ~30k TPS
  • Overall padding bytes = ~117kB per slot --> ~2% padding overhead


bw-solana commented Apr 15, 2025

On mainnet over the last 2 weeks, the overhead from variable coding size has been around 2%:
[image]

SELECT (mean("num_merkle_coding_shreds")-mean("num_merkle_data_shreds"))/mean("num_merkle_data_shreds")*100 AS "var_coding_overhead" FROM "mainnet-beta"."autogen"."broadcast-process-shreds-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: GROUP BY time(1d) FILL(null)

On mainnet over the last 2 weeks, the overhead from padding out the last data shreds has averaged ~10 data shreds per slot, or ~1.2% overhead:
[image]

SELECT mean("data_buffer_residual")/1024*100/mean("num_merkle_data_shreds") AS "data_buffer_residual_overhead" FROM "mainnet-beta"."autogen"."broadcast-process-shreds-stats" WHERE time > :dashboardTime: AND time < :upperDashboardTime: GROUP BY time(:interval:) FILL(null)

The final source of overhead would be padding out the last FEC set to 32 shreds. We don't have metrics for this, but my assumption is that we're padding half the FEC set on average, i.e. 16 shreds, which would result in ~1.92% padding overhead.

So overall, my understanding is that mainnet has ~5% padding overhead. This lines up with some other measurements analyzing zero-content data in the 5-6% range.
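
A rough back-of-the-envelope check of how those three sources add up (assuming the ~10 shreds ≈ ~1.2% figure above implies roughly 830 data shreds per slot):

2% (variable coding) + 1.2% (data buffer residual) + 16/830 ≈ 1.9% (half FEC set) ≈ 5.1%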


bw-solana commented Apr 15, 2025

Took a sample of 100k entries on mainnet, and the workload is much more favorable for packing fixed batches than the synthetic testing. This is because most entries contain a single tx (76%) or none (10.5%) and are only hundreds of bytes in size.
[image]

Looks like most of the high-tx-count entries are a result of votes being batched together (inferred from the average tx size being right around 352B for the larger entries).


Review comment: suggest moving the shredding logic to another file; merkle.rs is already huge enough.


// Wait up to `ENTRY_COALESCE_DURATION` to try to coalesce entries into a 32 shred batch
let data_shred_bytes =
    ShredData::capacity(Some((6, true, false))).expect("Failed to get capacity") as u64;


Review comment: this is all constant, no need to recompute it every time.
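
As a rough illustration of that suggestion (hypothetical, not code from this PR; assumes the same ShredData import and a toolchain with std::sync::LazyLock, otherwise once_cell::sync::Lazy works the same way):

use std::sync::LazyLock;

// The arguments are fixed, so the capacity only needs to be computed once.
static DATA_SHRED_BYTES: LazyLock<u64> = LazyLock::new(|| {
    ShredData::capacity(Some((6, true, false))).expect("Failed to get capacity") as u64
});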

) {
    // Fetch the next entry.
    let Ok((try_bank, (entry, tick_height))) = receiver.recv_deadline(
        coalesce_start + max_coalesce_time(serialized_batch_byte_count, max_batch_byte_count),


Review comment: suggest we try to wake up some 5 ms early to avoid the OS being annoying and waking this thread 5 ms too late instead.
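
A minimal sketch of that idea (hypothetical names, not code from this PR): shave a small margin off the recv deadline so an OS scheduling delay still lands us on time.

use std::time::{Duration, Instant};

// Wake up slightly before the real deadline to absorb scheduler jitter.
fn early_deadline(deadline: Instant, margin: Duration) -> Instant {
    deadline.checked_sub(margin).unwrap_or(deadline)
}

// e.g. receiver.recv_deadline(early_deadline(deadline, Duration::from_millis(5)))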

bw-solana (Author)

Testing w/ latest entry coalescing policy is looking great so far. I'm seeing 5% padding, which is in line with our current padding (maybe slightly less?) w/ variable FEC sets.

Main changes are to:

  1. Wait up to 200ms to coalesce entries, but linearly reduce this wait the fuller the current entry batch gets, down to a minimum of 50ms (matching the current mainnet static limit; see the sketch after this list). This keeps us from sending out tick-only, heavily padded entry batches at the end of the slot. Given we fill slots in 150-200ms, we were getting a few of these mostly padded batches per slot. We also stop coalescing and shred/send if we hit the end of the slot.
  2. If we drain the channel and already exceed the target batch size, keep coalescing entries to get close to the next multiple of the erasure batch size.
  3. Reduce the max vote batch size to 16 (down from 64). This gives a smaller max entry size and makes it easier to tightly pack entry batches. I confirmed we're still packing ~1300 votes per slot.
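
For item 1, here's a minimal sketch of the linear back-off (illustrative constants and names, assumed to mirror the max_coalesce_time referenced in the review snippet above rather than the exact implementation):

use std::time::Duration;

const MAX_COALESCE_MS: f64 = 200.0;
const MIN_COALESCE_MS: f64 = 50.0;

// The fuller the batch, the less time we are willing to wait for more entries:
// an empty batch waits up to 200ms, a full one only 50ms.
fn max_coalesce_time(batch_bytes: u64, max_batch_bytes: u64) -> Duration {
    let fill = (batch_bytes as f64 / max_batch_bytes.max(1) as f64).clamp(0.0, 1.0);
    let ms = MAX_COALESCE_MS - fill * (MAX_COALESCE_MS - MIN_COALESCE_MS);
    Duration::from_millis(ms as u64)
}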

The picture shows a per-slot breakdown of padding bytes and why we decided to exit the entry coalescing routine. Lots of "tightly packed" is 👍. Exiting due to reaching max size is okay. Exiting due to receive timeout is usually not good:
[image]

Also note we appear to be maxing out CUs for the first block in a leader span, but the rest are light (this seems to be due to demand).

bw-solana (Author)

Padding bytes are relatively small compared to data bytes:
[image]

Broadcast time stays below 350ms:
[image]

Replay total elapsed time for a 12-slot sequence (334517276 to 334517287), with the middle 4 slots generated by a leader running this new code. Times are in line with current behavior:
[image]

alexpyattaev
I think hyperoptimizing this is not necessary; as blocks become larger, padding will disappear in the overall traffic. We just need to pack more TXs into the blocks in general.

bw-solana closed this Jul 2, 2025
