-
Notifications
You must be signed in to change notification settings - Fork 877
Network metrics per client #7445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
@@ -214,6 +214,19 @@ impl<T: BeaconChainTypes> NetworkBeaconProcessor<T> { | |||
reprocess_tx: Option<mpsc::Sender<ReprocessQueueMessage>>, | |||
seen_timestamp: Duration, | |||
) { | |||
let attestation_size = attestation.as_ssz_bytes().len() as u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This calls force us to spend cycles re-serializing a lot of SSZ objects. To track this for free we should do the tracking in the lighthouse-network layer which I'm not sure has access to peer client information. If not possible to track this for free skip tracking bytes per client for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to lighthouse-network
|
||
metrics::inc_counter_vec( | ||
&metrics::MESSAGES_RECEIVED_PER_CLIENT, | ||
&[&client] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should label this metric by [client, object_type] where object type could be
gossip_attestation
gossip_block
range_sync_block
backfill_sync_block
etc
@@ -241,6 +254,21 @@ impl<T: BeaconChainTypes> NetworkBeaconProcessor<T> { | |||
packages: GossipAttestationBatch<T::EthSpec>, | |||
reprocess_tx: Option<mpsc::Sender<ReprocessQueueMessage>>, | |||
) { | |||
for package in &packages { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can move attestation metrics to process_gossip_attestation_result
and only track valid objects
metrics::inc_counter_vec( | ||
&metrics::MESSAGES_RECEIVED_PER_CLIENT, | ||
&[&client] | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add a helper method
fn track_messages_received_per_client(&self, peer_id: PeerId, object_type: &'static str) {
let client = self.network_globals.client(&peer_id).kind.to_string();
metrics::inc_counter_vec(
&metrics::MESSAGES_RECEIVED_PER_CLIENT,
&[&client, object_type]
);
}
to de-duplicate a lot of code
@@ -169,6 +169,11 @@ impl<E: EthSpec> PeerDB<E> { | |||
} | |||
} | |||
|
|||
/// Returns the sync start time of the peer if it exists | |||
pub fn sync_start_time(&self, peer_id: &PeerId) -> Option<&Instant> { | |||
self.peers.get(peer_id).and_then(|info| info.sync_start_time()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow the meaning/purpose of sync start time, could you detail it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.peer_info(peer_id) | ||
.map(|info| info.client().version.clone()) | ||
.unwrap_or_default() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a single metric, not one per client type and one for client version. In mainnet is version
field set for most clients? Could you test this and report back? If so, to what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can I test for the mainnet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- One potential way to do it would be to test it directly in the mainnet with temporary logging to you're lighthouse client>
first add analysis to network_contexte:
// Add to SyncNetworkContext impl
pub fn analyze_peer_versions_sample(&self) -> (usize, usize, Vec<String>) {
let peers: Vec<PeerId> = self.network_globals()
.peers
.read()
.connected_peers()
.take(50) // sample 50 but you can adjust that
.cloned()
.collect();
let mut with_version = 0;
let mut without_version = 0;
let mut examples = Vec::new();
for peer_id in peers {
let version = self.client_version(&peer_id);
let client_type = self.client_type(&peer_id);
if version.is_empty() {
without_version += 1;
} else {
with_version += 1;
if examples.len() < 8 {
examples.push(format!("{}: {}", client_type, version));
}
}
}
(with_version, without_version, examples)
}
then call this function every X min in sync manager:
// Add this in main sync loop or network service
let mut last_version_check = Instant::now();
// Inside your main loop:
if last_version_check.elapsed() > Duration::from_secs(300) { // here I took 5 min
let (with_ver, without_ver, examples) =
self.network_context.analyze_peer_versions_sample();
let total = with_ver + without_ver;
let percentage = if total > 0 {
(with_ver as f64 / total as f64) * 100.0
} else { 0.0 };
info!("MAINNET VERSION TEST: {}/{} peers have versions ({:.1}%). Examples: {:?}",
with_ver, total, percentage, examples);
last_version_check = Instant::now();
}
and then build and enable the log to get the info
cargo build --release
./target/release/lighthouse bn \
--network mainnet \
--datadir /your/datadir \
--http \
--log-level info \
2>&1 | tee mainnet_version_test.log
you should get something like: MAINNET VERSION TEST: 42/50 peers have versions (84.0%). Examples: ["xx"]
there is around 8200 CL nodes that you can communicate with in the mainnet, so adjusting the sample might be good
- Another way could be to use a crawler, use discv5 to discover nodes (starting from boot-nodes) and then libp2p to the discovered nodes to look at the version field (first option might be better, it depends).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi again, I modified my crawler to get the data that we needed and made it runs for the past few hours. Here is what I got @dapplion


got "Unknown" client names result from parsing failures. It makes the data inaccurate, my bad for that
to get better data you can look here -> https://monitoreth.io/nodes
FYI @iri031 there are still some unaddressed comment |
Issue Addressed
Which issue # does this PR address?
#7389
Proposed Changes
Metrics for: