Skip to content

Data column reconstruction taking 5+ seconds on devnets with high blob count #7526

Open
@jimmygchen

Description

@jimmygchen

We've received a finding from Sunnyside Lab while testing PeerDAS with high blob count:

I'm seeing much higher column reconstruction time on Lighthouse nodes compared to others especially when blob number's high. (Discord link)

The metrics and logs both shows that reconstruction is taking a significant amount of time.

I randomly picked a block that performed reconstruction from the logs and immediately noticed a few problems. This is a block with 19 blobs.

  1. It took 1.7s to get 64 columns from peers (via DataColumnsByRoot), and 6.4s to get all 128 columns (!)
  2. ~~Reconstruction took 5+ seconds for 19 blobs. This is very high compared to some local benchmark - 16 blobs completed in ~80ms and 32 blobs in ~156ms. It may suggest that there isn't enough available threads during reconstruction~~
  3. It look 660ms for RPC columns to get to DA Checker (it goes through beacon processor, fork choice read lock, observed_slashable write lock, sig verification, batch kzg verification..etc)
  4. Because RPC columns got processed after reconstructed columns, the lookup was dropped which result in all children dropped as well.

UPDATE: I think I am wrong with the reconstruction time above - I incorrectly assumed that it's triggered after we received 64 columns, BUT we don't actually send it to processing until we have the complete set of columns. I'm suspecting the reconstruction may be triggered from gossip data columns.

Possible improvements

  • The long duration to get all 128 columns suggests it's worth performing reconstruction if it can be done quickly, otherwise requesting from peers may be quicker.
  • We may want to consider either perform reconstruction after receiving 64 columns OR requesting all 128 columns, but not both - doing so may cause a strain on CPU resources and slowing down overall process - e.g. less available threads for reconstruction.
  • It took 660ms to get the RPC columns to DA Checker, there's many possibilities - congested beacon processor, read locks or slow verification - we should investigate further.
  • The behaviour to drop all children isn't ideal - we'd have to start a new lookup again. There's likely a race condition here - if reconstruction resulted in a block import, the block would be available in fork choice after some time - however when the RPC block arrived, the block wasn't in fork choice, and therefore the valid but already imported columns accidentally got into DA checker, and returned a MissingComponents causing the lookup to be dropped.

Logs

// lookup started
May 21 04:01:28.283 DEBUG Created block lookup                          peers: [], block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, awaiting_parent: "0x87eac93c6a5896aea3b82c12585505319fed49d1fec3c20a7e9f8afe78c32d27", id: 20145, service: "lookup_sync"
...

// lookup block receceived (19 blobs), fetch el blobs, failed
May 21 04:01:29.029 DEBUG Processing RPC block                          block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, proposer: 303, slot: 5698, commitments_formatted: "[0x83f6…2089, 0x8f7a…2005, 0xa95c…767d, 0x8dfc…7ff1, 0
xab25…d913, 0xb1da…87c3, 0xac4a…0359, 0xb574…499d, 0xb73b…fb0c, 0x8633…f3bb, 0x8972…d2dc, 0xa95c…767d, 0x8cf8…e582, 0x98f8…202f, 0xa95c…767d, 0xb264…7b75, 0xb93d…0b72, 0xa015…fb3e, 0x8d04…76f6]", process_type: SingleBlock { id: 20145 }
May 21 04:01:29.138 DEBUG Fetching blobs from the EL                    num_expected_blobs: 19, service: "fetch_engine_blobs", block_root: "0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce"
May 21 04:01:29.150 DEBUG Not all blobs fetched from the EL             info: "Unable to compute data columns", num_fetched_blobs: 0, num_expected_blobs: 19, service: "fetch_engine_blobs", block_root: "0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce"
...

// 64 columns received, reconstruction started
May 21 04:01:29.932 DEBUG Component added to data availability checker  component: "data_columns", block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, status: "block 1 data_columns 64/128"
...

// All RPC custody columns received (took so long?)
May 21 04:01:34.686 DEBUG Sending custody columns for processing        block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, process_type: SingleCustodyColumn(20145), service: "lookup_sync"service: "network_context"
May 21 04:01:34.687 DEBUG RPC custody data columns received             indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127], block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, slot: 5698

// reconstruction complete (took ~5+ seconds )
May 21 04:01:34.884 DEBUG Reconstructed columns                         count: 33, block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, slot: 5698
May 21 04:01:34.886 DEBUG Component added to data availability checker  component: "data_columns", block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, status: "block 1 data_columns 128/128"
May 21 04:01:34.887 DEBUG Writing data columns to store                 block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, count: 128
May 21 04:01:34.922 DEBUG Block components available via reconstruction  result: "imported block and custody columns", block_hash: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce
May 21 04:01:34.922 DEBUG Publishing data column batch                  publish_count: 32, block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce
May 21 04:01:34.922 DEBUG Fork choice updated head                      new_head_weight: Some(0), new_head: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, old_head_weight: Some(416000000000), old_head: 0x87eac93c6a5896aea3b82c12585505319fed49d1fec3c20a7e9f8afe78c32d27
May 21 04:01:34.924 DEBUG Delayed head block                            block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, proposer_index: 303, slot: 5698, total_delay_ms: 6924, observed_delay_ms: "279", blob_delay_ms: "unknown", consensus_time_ms: "821", execution_time_ms: "37", available_delay_ms: "1138", attestable_delay_ms: "6887", imported_time_ms: "5782", set_as_head_time_ms: "3"

# RPC custody columns getting added to DA checker (took 660ms to get there)
May 21 04:01:35.306 DEBUG Component added to data availability checker  component: "data_columns", block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, status: "block 0 data_columns 128/?"
May 21 04:01:35.306 DEBUG Missing components over rpc                   block_hash: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce
May 21 04:01:35.306 DEBUG Received lookup processing result             component: CustodyColumn, block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, id: 20145, result: Ok(MissingComponents(Slot(5698), 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce)), service: "lookup_sync"

# ☠️ Dropping lookup, this will also drop all children
May 21 04:01:35.307 DEBUG Dropping lookup                               id: 20145, block_root: 0xa25cb28249a395648cba5d2d90c35ae0ba678c29df8a0baf50a75463b2a3c2ce, awaiting_parent: None, service: "lookup_sync"

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingdasData Availability Samplingsyncing

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions