Skip to content

Add genesis sync test to CI #7561

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 25 commits into
base: unstable
Choose a base branch
from

Conversation

chong-he
Copy link
Member

@chong-he chong-he commented Jun 5, 2025

Issue Addressed

Use existing code from @jimmygchen in #7530 and modify for genesis sync test. Thanks @jimmygchen !

@chong-he chong-he added test improvement Improve tests work-in-progress PR is a work-in-progress syncing infra-ci labels Jun 5, 2025
@chong-he
Copy link
Member Author

chong-he commented Jun 6, 2025

This PR is ready for review. In this test, we use the minimal preset so that it is faster. The local testnet script is also updated to build for minimal:

The rebuild image -b true (link here) is required or Kurtosis wouldn't start

For 120s stop duration, we trigger short distance sync, the lookup sync. We see in the logs:

DEBUG Created block lookup                          peers: [PeerId("16Uiu2HAkzuo2s4zS8Ab1FVPU1GAue212NT3YhHVRgJwE9S1HQ12J")], block_root: 0x149f7747579266942795085564847f3e903de77524ddb59de3077fae5b238432, awaiting_parent: "none", id: 1, service: "lookup_sync"

For 300s stop duration, we trigger long distance sync, the range sync. We see in the logs:

DEBUG Syncing new finalized chain                   id: 1, component: "range_sync"
DEBUG Sync RPC request sent                         method: "BlocksByRange", slots: 8, epoch: 0, peer: 16Uiu2HAm2pJM6EUvzVtN5JwaKyXBx8DAyg9dSXrNPBc81UCP8WBp, id: 3/2/RangeSync/0/1, chain: 1, service: "range_sync"

All sync completed very fast, under 5s. So I reduce the polling time in the script to 0.5s so that we can see some outputs in the terminal, and set a shorter timeout of 5 minutes.

@chong-he chong-he added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Jun 6, 2025
@chong-he chong-he requested a review from jimmygchen June 6, 2025 00:58
# ------------------------------------------------------
# Interval for polling the /lighthouse/syncing endpoint for sync status
# Reduce the polling time so that some progress can be seen
POLL_INTERVAL_SECS=0.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very frequent, I think it's not necessary and we could probably live with something like 5s, or 6s (1 poll per slot)? Don't forgot HTTP API requests goes through BeaconProcessor as well and it's Priority::P0, which means it will be prioritised over most tasks.

task_spawner.blocking_json_task(Priority::P0, move || {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, the sync completed very fast, like under 2s, e.g., for sync-electra-300s, we see from the logs for cl-3-lighthouse-geth:

Jun 06 06:33:34.585 DEBUG Range request block roots retrieved           req_type: "BlocksByRange", start_slot: 8, req_count: 8, roots_count: 8, source: "store", elapsed: 89.035µs, finalized_slot: 32
Jun 06 06:33:34.774 DEBUG Range request block roots retrieved           req_type: "BlocksByRange", start_slot: 16, req_count: 8, roots_count: 8, source: "store", elapsed: 40.275µs, finalized_slot: 32 
Jun 06 06:33:34.774 DEBUG Range request block roots retrieved           req_type: "BlocksByRange", start_slot: 24, req_count: 8, roots_count: 8, source: "store", elapsed: 61.705µs, finalized_slot: 32

and the sync is completed about 0.2-0.3 second later:

Jun 06 06:33:35.000 INFO  Synced                                        peers: "3", exec_hash: "0xa95848dca51fbedfce6f84bd2d7f2df0b459a5672c8c8fb118ad44b321170f2d (unverified)", finalized_root: 0x57274ba74a8a091caa527a1af01269489c333da085503a0ae76d3534ecbe03d0, finalized_epoch: 4, epoch: 6, block: "0x2fb201cc9202c794b31212e49d8c828274fa76669db0453993bd8650a19dad10", slot: 52 

So if we poll every 5s, we may not see any output, unless we increase the sync distance (stop the node for longer periods of time

echo "${node_type} status: ${status:-Unknown}"

# The test is complete when the node is synced
if [ "$status" = "Synced" ]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the "Synced" check in the checkpoint sync script, because of the issue we talked about on the call - there's a bug and LH may switch to Synced even though it's not.

But I guess for forward sync this is fine, as it should only switch to this status if we're within certain distance. If this doesn' work well, we could consider using the standard node/syncing API, as it also returns the sync distance, which provides a bit more certainty.

Copy link
Member

@jimmygchen jimmygchen Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the sync-electra-300s logs, and realised it didn't even start syncing, i suspect LH is reporting Synced status too early. Would it be worth looking at the sync distance?

Copy link
Member Author

@chong-he chong-he Jun 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For forward sync it looks ok? This is the output for sync-electra-300s:
https://github.com/sigp/lighthouse/actions/runs/15484117320/job/43595585222?pr=7561

We can increase the poll time to 1-2s if that's better?

I have also tested using the beacon API /eth/v1/node/syncing, the output could look like this (also with 0.5s poll time):

olling sync status until nodes are synced or timeout of 5 mins
supernode status: Synced (head_slot: 51, sync_distance: 0)
supernode completed sync in 1 seconds
fullnode status: No response or null response
fullnode status: No response or null response
fullnode status: No response or null response
fullnode status: Syncing (head_slot: 23, sync_distance: 28, is_optimistic: true, el_offline: false)
fullnode status: Synced (head_slot: 51, sync_distance: 0)
fullnode completed sync in 3 seconds
Genesis sync test complete! Both supernode and fullnode have synced successfully.
Supernode time: 1 seconds
Fullnode time: 3 seconds

It has the head_slot (which I believe is the same as target_slot in the lighthouse/syncing endpoint (the beacon API also has the some other info in the response like el_offline and is_optimistic). If the beacon API is better, I can switch to that

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Jun 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infra-ci syncing test improvement Improve tests waiting-on-author The reviewer has suggested changes and awaits thier implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants