Prometheus metrics for libp2p protocols #1199
metrics-2026-02-08_19.57.51.mp4: Ping latency metrics (histogram) in Grafana.
gossipsub-metrics.mp4: Screencast of the gossipsub metrics being recorded.
@lla-dane: Hi Abhinav, this is a really strong and impactful PR, great work 👏 Love how you’ve brought Prometheus/Grafana observability directly into py-libp2p. The coverage across Ping, Gossipsub, Kad-DHT, and Swarm gives a solid, end-to-end view of protocol behavior. The metrics feel well chosen and immediately useful for debugging and performance analysis. The metrics-demo + Docker setup is a big win for DX as well: it makes it super easy to spin things up and actually see what’s happening across nodes. Overall, this is a big step toward production-grade observability for py-libp2p. Happy to help test or review further, and excited to see this land. We will discuss this in detail tomorrow. On the same note, it would be great if you could resolve the CI/CD issues.
Introduction
This pull request introduces Prometheus/Grafana metrics for core py-libp2p protocols, for real-time monitoring and analysis.
It enables developers to run a libp2p node and directly inspect internal protocol behavior—such as latency, message propagation, and DHT activity—through standard metrics pipelines.
A working demo (metrics-demo) is included in the examples directory to show how multiple services operate together and how their metrics can be visualized using Prometheus and Grafana.
What's included
The following libp2p services are currently instrumented and exposed via Prometheus metrics:
Ping
- ping: Round-trip time (RTT) measurements.
- ping_failure: Failed ping attempts.

Provides visibility into peer-to-peer latency and connectivity reliability.
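To make the histogram idea concrete, here is a minimal, dependency-free sketch of how RTT samples could be folded into cumulative buckets and rendered in the Prometheus text exposition style. It is not the code from this PR: the `RttHistogram` class and the bucket bounds are assumptions chosen for illustration, and a real node would more likely rely on a Prometheus client library.

```python
import bisect
import math


class RttHistogram:
    """Illustrative cumulative-bucket histogram (hypothetical helper, not the PR's code)."""

    def __init__(self, name, bounds):
        self.name = name
        # Upper bounds in seconds; +Inf catches every sample above the last bound.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, non-cumulative
        self.total = 0.0
        self.n = 0

    def observe(self, seconds):
        # Pick the first bucket whose upper bound is >= the sample ("le").
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.total += seconds
        self.n += 1

    def render(self):
        # Buckets are emitted cumulatively, as in the Prometheus text format.
        lines = [f"# TYPE {self.name} histogram"]
        running = 0
        for bound, count in zip(self.bounds, self.counts):
            running += count
            le = "+Inf" if math.isinf(bound) else repr(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {self.n}")
        return "\n".join(lines)


# Assumed bucket bounds: 10 ms, 100 ms, 1 s.
hist = RttHistogram("ping", [0.01, 0.1, 1.0])
for rtt in (0.005, 0.05, 0.05, 2.0):
    hist.observe(rtt)
print(hist.render())
```

Because the buckets are cumulative, a dashboard can estimate latency quantiles from them without the node storing every individual RTT sample.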
Gossipsub / Pubsub
- gossipsub_received_total: Messages received
- gossipsub_publish_total: Messages published
- gossipsub_subopts_total: Subscription updates
- gossipsub_control_total: Control messages
- gossipsub_message_bytes: Message sizes

Enables monitoring of message propagation, throughput, and pubsub activity.
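Throughput is not stored directly; it is derived from how fast a counter such as gossipsub_received_total grows between scrapes. The sketch below shows that derivation in simplified form. The `simple_rate` helper and the sample values are hypothetical, and the logic only approximates what a PromQL `rate()` query does (it omits Prometheus's extrapolation to the window edges).

```python
def simple_rate(samples):
    """Per-second increase of a counter over a list of (timestamp, value) samples.

    Simplified illustration: a drop in value is treated as a counter reset
    (for example a node restart), so the new value counts as fresh increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes of gossipsub_received_total, 15 seconds apart.
steady = [(0, 100), (15, 130), (30, 160)]
restarted = [(0, 100), (15, 130), (30, 10)]  # counter reset before the last scrape
print(simple_rate(steady))
print(simple_rate(restarted))
```

Treating a decrease as a reset is what lets a monotonically increasing counter survive node restarts without producing negative throughput.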
Kademlia (Kad-DHT)
- kad_inbound_total: Total inbound requests
- kad_inbound_find_node: FIND_NODE requests
- kad_inbound_get_value: GET_VALUE requests
- kad_inbound_put_value: PUT_VALUE requests
- kad_inbound_get_providers: GET_PROVIDERS requests
- kad_inbound_add_provider: ADD_PROVIDER requests

Swarm / Connection Lifecycle
- swarm_incoming_conn: Incoming connections
- swarm_incoming_conn_error: Incoming connection failures
- swarm_dial_attempt: Outgoing dial attempts
- swarm_dial_attempt_error: Dial failures

Tracks connection establishment behavior and network stability.
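The attempt/error counter pairs follow a common instrumentation pattern: count every attempt on entry, and count an error only if the operation raises. A minimal sketch of that pattern is shown below, using only the standard library. The `counted` context manager and the `dial` function are hypothetical stand-ins, not py-libp2p APIs; only the metric names are taken from the list above.

```python
from collections import Counter
from contextlib import contextmanager

# In-memory stand-in for a metrics registry (assumption for illustration).
counters = Counter()


@contextmanager
def counted(attempt_name, error_name):
    """Count one attempt on entry and one error if the wrapped block raises."""
    counters[attempt_name] += 1
    try:
        yield
    except Exception:
        counters[error_name] += 1
        raise  # instrumentation must not swallow the failure


def dial(peer, reachable):
    # Hypothetical dial: succeeds only for peers in the reachable set.
    with counted("swarm_dial_attempt", "swarm_dial_attempt_error"):
        if peer not in reachable:
            raise ConnectionError(f"cannot reach {peer}")
        return peer


reachable_peers = {"peer-a"}
for peer in ("peer-a", "peer-b"):
    try:
        dial(peer, reachable_peers)
    except ConnectionError:
        pass  # the failure is already reflected in the error counter

failure_ratio = counters["swarm_dial_attempt_error"] / counters["swarm_dial_attempt"]
print(dict(counters), failure_ratio)
```

Keeping attempts and errors as separate counters means a failure ratio can be computed at query time over any time window, rather than being fixed in the node.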
Demo & Observability Setup
A metrics-demo CLI is included, and a Docker-based setup is provided to launch the monitoring stack.
This allows real-time inspection of protocol-level behavior across nodes.
Necessity
Currently, diagnosing issues in py-libp2p (e.g., latency spikes, dropped messages, or DHT inconsistencies) relies heavily on logs.
This PR introduces structured, queryable metrics instead.
Reference
Inspired by the metrics design in the Rust implementation:
https://github.com/libp2p/rust-libp2p/tree/master/misc/metrics