Skip to content

Network metrics per client #7445

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: unstable
Choose a base branch
from
Open

Conversation

iri031
Copy link

@iri031 iri031 commented May 12, 2025

Issue Addressed

Which issue # does this PR address?
#7389

Proposed Changes

Metrics for:

  • Blocks received and imported per client
  • Sync time per client
  • Bytes and messages received per client

@iri031 iri031 requested a review from jxs as a code owner May 12, 2025 20:18
@iri031 iri031 changed the title initial instrumentation Network metrics per client May 12, 2025
@@ -214,6 +214,19 @@ impl<T: BeaconChainTypes> NetworkBeaconProcessor<T> {
reprocess_tx: Option<mpsc::Sender<ReprocessQueueMessage>>,
seen_timestamp: Duration,
) {
let attestation_size = attestation.as_ssz_bytes().len() as u64;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This calls force us to spend cycles re-serializing a lot of SSZ objects. To track this for free we should do the tracking in the lighthouse-network layer which I'm not sure has access to peer client information. If not possible to track this for free skip tracking bytes per client for now

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to lighthouse-network


metrics::inc_counter_vec(
&metrics::MESSAGES_RECEIVED_PER_CLIENT,
&[&client]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should label this metric by [client, object_type] where object type could be

  • gossip_attestation
  • gossip_block
  • range_sync_block
  • backfill_sync_block

etc

@@ -241,6 +254,21 @@ impl<T: BeaconChainTypes> NetworkBeaconProcessor<T> {
packages: GossipAttestationBatch<T::EthSpec>,
reprocess_tx: Option<mpsc::Sender<ReprocessQueueMessage>>,
) {
for package in &packages {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move attestation metrics to process_gossip_attestation_result and only track valid objects

metrics::inc_counter_vec(
&metrics::MESSAGES_RECEIVED_PER_CLIENT,
&[&client]
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add a helper method

fn track_messages_received_per_client(&self, peer_id: PeerId, object_type: &'static str) {
    let client = self.network_globals.client(&peer_id).kind.to_string();
    metrics::inc_counter_vec(
            &metrics::MESSAGES_RECEIVED_PER_CLIENT,
            &[&client, object_type]
        );
}

to de-duplicate a lot of code

@@ -169,6 +169,11 @@ impl<E: EthSpec> PeerDB<E> {
}
}

/// Returns the sync start time of the peer if it exists
pub fn sync_start_time(&self, peer_id: &PeerId) -> Option<&Instant> {
self.peers.get(peer_id).and_then(|info| info.sync_start_time())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow the meaning/purpose of sync start time, could you detail it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to calculate sync time per client. When a sync request is received I update this parameter to record the time when the sync started here and later when the state of the peer changes to is_synced here I calculate the time taken to include in the metric.

.peer_info(peer_id)
.map(|info| info.client().version.clone())
.unwrap_or_default()
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a single metric, not one per client type and one for client version. In mainnet is version field set for most clients? Could you test this and report back? If so, to what?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can I test for the mainnet?

Copy link

@gitToki gitToki Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • One potential way to do it would be to test it directly in the mainnet with temporary logging to you're lighthouse client>

first add analysis to network_contexte:

// Add to SyncNetworkContext impl
pub fn analyze_peer_versions_sample(&self) -> (usize, usize, Vec<String>) {
    let peers: Vec<PeerId> = self.network_globals()
        .peers
        .read()
        .connected_peers()
        .take(50) // sample 50 but you can adjust that
        .cloned()
        .collect();
    
    let mut with_version = 0;
    let mut without_version = 0;
    let mut examples = Vec::new();
    
    for peer_id in peers {
        let version = self.client_version(&peer_id);
        let client_type = self.client_type(&peer_id);
        
        if version.is_empty() {
            without_version += 1;
        } else {
            with_version += 1;
            if examples.len() < 8 {
                examples.push(format!("{}: {}", client_type, version));
            }
        }
    }
    
    (with_version, without_version, examples)
    }

then call this function every X min in sync manager:

// Add this in main sync loop or network service
let mut last_version_check = Instant::now();

// Inside your main loop:
if last_version_check.elapsed() > Duration::from_secs(300) { // here I took 5 min
    let (with_ver, without_ver, examples) = 
        self.network_context.analyze_peer_versions_sample();
    
    let total = with_ver + without_ver;
    let percentage = if total > 0 { 
        (with_ver as f64 / total as f64) * 100.0 
    } else { 0.0 };
    
    info!("MAINNET VERSION TEST: {}/{} peers have versions ({:.1}%). Examples: {:?}", 
          with_ver, total, percentage, examples);
          
    last_version_check = Instant::now();
}

and then build and enable the log to get the info

cargo build --release


./target/release/lighthouse bn \
    --network mainnet \
    --datadir /your/datadir \
    --http \
    --log-level info \
    2>&1 | tee mainnet_version_test.log

you should get something like: MAINNET VERSION TEST: 42/50 peers have versions (84.0%). Examples: ["xx"]

there is around 8200 CL nodes that you can communicate with in the mainnet, so adjusting the sample might be good

  • Another way could be to use a crawler, use discv5 to discover nodes (starting from boot-nodes) and then libp2p to the discovered nodes to look at the version field (first option might be better, it depends).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi again, I modified my crawler to get the data that we needed and made it runs for the past few hours. Here is what I got @dapplion

image image

got "Unknown" client names result from parsing failures. It makes the data inaccurate, my bad for that

to get better data you can look here -> https://monitoreth.io/nodes

@dapplion
Copy link
Collaborator

FYI @iri031 there are still some unaddressed comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants