Description
We've received a finding from Sunnyside Lab while testing PeerDAS with high blob count:
I'm seeing much higher column reconstruction time on Lighthouse nodes compared to others especially when blob number's high. (Discord link)
The metrics and logs both shows that reconstruction is taking a significant amount of time.
I randomly picked a block that performed reconstruction from the logs and immediately noticed a few problems. This is a block with 19 blobs.
- It took 1.7s to get 64 columns from peers (via
DataColumnsByRoot
), and6.4s
to get all 128 columns (!) - ~~Reconstruction took 5+ seconds for 19 blobs. This is very high compared to some local benchmark - 16 blobs completed in ~80ms and 32 blobs in ~156ms. It may suggest that there isn't enough available threads during reconstruction~~
- It look 660ms for RPC columns to get to DA Checker (it goes through beacon processor, fork choice read lock,
observed_slashable
write lock, sig verification, batch kzg verification..etc) - Because RPC columns got processed after reconstructed columns, the lookup was dropped which result in all children dropped as well.
UPDATE: I think I am wrong with the reconstruction time above - I incorrectly assumed that it's triggered after we received 64 columns, BUT we don't actually send it to processing until we have the complete set of columns. I'm suspecting the reconstruction may be triggered from gossip data columns.
Possible improvements
- The long duration to get all 128 columns suggests it's worth performing reconstruction if it can be done quickly, otherwise requesting from peers may be quicker.
- We may want to consider either perform reconstruction after receiving 64 columns OR requesting all 128 columns, but not both - doing so may cause a strain on CPU resources and slowing down overall process - e.g. less available threads for reconstruction.
- It took 660ms to get the RPC columns to DA Checker, there's many possibilities - congested beacon processor, read locks or slow verification - we should investigate further.
- The behaviour to drop all children isn't ideal - we'd have to start a new lookup again. There's likely a race condition here - if reconstruction resulted in a block import, the block would be available in fork choice after some time - however when the RPC block arrived, the block wasn't in fork choice, and therefore the valid but already imported columns accidentally got into DA checker, and returned a
MissingComponents
causing the lookup to be dropped.
Logs
// lookup started
May 21 04:01:28.283 DEBUG Created block lookup peers: [], block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, awaiting_parent: "0x87eac93c6a5896aea3b82c12585505319fed49d1fec3c20a7e9f8afe78c32d27", id: 20145, service: "lookup_sync"
...
// lookup block receceived (19 blobs), fetch el blobs, failed
May 21 04:01:29.029 DEBUG Processing RPC block block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, proposer: 303, slot: 5698, commitments_formatted: "[0x83f6…2089, 0x8f7a…2005, 0xa95c…767d, 0x8dfc…7ff1, 0
xab25…d913, 0xb1da…87c3, 0xac4a…0359, 0xb574…499d, 0xb73b…fb0c, 0x8633…f3bb, 0x8972…d2dc, 0xa95c…767d, 0x8cf8…e582, 0x98f8…202f, 0xa95c…767d, 0xb264…7b75, 0xb93d…0b72, 0xa015…fb3e, 0x8d04…76f6]", process_type: SingleBlock { id: 20145 }
May 21 04:01:29.138 DEBUG Fetching blobs from the EL num_expected_blobs: 19, service: "fetch_engine_blobs", block_root: "0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce"
May 21 04:01:29.150 DEBUG Not all blobs fetched from the EL info: "Unable to compute data columns", num_fetched_blobs: 0, num_expected_blobs: 19, service: "fetch_engine_blobs", block_root: "0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce"
...
// 64 columns received, reconstruction started
May 21 04:01:29.932 DEBUG Component added to data availability checker component: "data_columns", block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, status: "block 1 data_columns 64/128"
...
// All RPC custody columns received (took so long?)
May 21 04:01:34.686 DEBUG Sending custody columns for processing block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, process_type: SingleCustodyColumn(20145), service: "lookup_sync"service: "network_context"
May 21 04:01:34.687 DEBUG RPC custody data columns received indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, slot: 5698
// reconstruction complete (took ~5+ seconds )
May 21 04:01:34.884 DEBUG Reconstructed columns count: 33, block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, slot: 5698
May 21 04:01:34.886 DEBUG Component added to data availability checker component: "data_columns", block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, status: "block 1 data_columns 128/128"
May 21 04:01:34.887 DEBUG Writing data columns to store block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, count: 128
May 21 04:01:34.922 DEBUG Block components available via reconstruction result: "imported block and custody columns", block_hash: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce
May 21 04:01:34.922 DEBUG Publishing data column batch publish_count: 32, block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce
May 21 04:01:34.922 DEBUG Fork choice updated head new_head_weight: Some(0), new_head: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, old_head_weight: Some(416000000000), old_head: 0x87eac93c6a5896aea3b82c12585505319fed49d1fec3c20a7e9f8afe78c32d27
May 21 04:01:34.924 DEBUG Delayed head block block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, proposer_index: 303, slot: 5698, total_delay_ms: 6924, observed_delay_ms: "279", blob_delay_ms: "unknown", consensus_time_ms: "821", execution_time_ms: "37", available_delay_ms: "1138", attestable_delay_ms: "6887", imported_time_ms: "5782", set_as_head_time_ms: "3"
# RPC custody columns getting added to DA checker (took 660ms to get there)
May 21 04:01:35.306 DEBUG Component added to data availability checker component: "data_columns", block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, status: "block 0 data_columns 128/?"
May 21 04:01:35.306 DEBUG Missing components over rpc block_hash: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce
May 21 04:01:35.306 DEBUG Received lookup processing result component: CustodyColumn, block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, id: 20145, result: Ok(MissingComponents(Slot(5698), 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce)), service: "lookup_sync"
# ☠️ Dropping lookup, this will also drop all children
May 21 04:01:35.307 DEBUG Dropping lookup id: 20145, block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, awaiting_parent: None, service: "lookup_sync"