Description
At present, we store the online and offline events of peers that we have channels open with to get a record of peer uptime. This information is lost on restart. Persisting it gives us a longer lived record of a peer's uptime over the lifetime of a single channel, and a historical record of how a peer performed in past channels.
[peer-info-bucket]
    [peer-pubkey-bucket]
        [uptime-bucket]
            [timestamp] online
            [timestamp] offline
            [timestamp] channel closed
    [peer-pubkey-bucket]
        [uptime-bucket]
            [timestamp] offline
            [timestamp] online
Uptime queries are made over a desired time period - what was uptime last month, or over the last 7 days, for example - so entries are keyed by timestamp. We also require the raw online/offline events; there is no simple way to aggregate events and still allow queries over arbitrary time ranges. Since our peer has no incentive to remain connected to us after a channel close, we also need to differentiate between disconnections due to peer flakes and those due to channel closes; this can be done by adding a channel closed record for each peer.
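As a rough sketch of what persisting one of these events could look like, assuming a bbolt-backed store (the bucket names mirror the layout above; the eventType enum and the recordEvent helper are hypothetical, not existing lnd code):

package uptime

import (
	"encoding/binary"
	"time"

	bolt "go.etcd.io/bbolt"
)

// eventType is the single byte value stored for each event, so one event
// costs 9 bytes on disk: an 8 byte timestamp key plus a 1 byte value.
type eventType uint8

const (
	eventOnline eventType = iota
	eventOffline
	eventChannelClosed
)

// Bucket names mirroring the layout above.
var (
	peerInfoBucket = []byte("peer-info-bucket")
	uptimeBucket   = []byte("uptime-bucket")
)

// recordEvent persists a single online/offline/channel-closed event for the
// given peer, keyed by big-endian unix timestamp so that a cursor can later
// seek over an arbitrary time range.
func recordEvent(db *bolt.DB, peerPubkey []byte, ts time.Time, event eventType) error {
	return db.Update(func(tx *bolt.Tx) error {
		peers, err := tx.CreateBucketIfNotExists(peerInfoBucket)
		if err != nil {
			return err
		}
		peer, err := peers.CreateBucketIfNotExists(peerPubkey)
		if err != nil {
			return err
		}
		events, err := peer.CreateBucketIfNotExists(uptimeBucket)
		if err != nil {
			return err
		}

		var key [8]byte
		binary.BigEndian.PutUint64(key[:], uint64(ts.Unix()))

		return events.Put(key[:], []byte{byte(event)})
	})
}

Keying by the big-endian timestamp keeps entries ordered, so a time-range query later becomes a single cursor seek.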
Another consideration is rate limiting the number of writes we allow each peer; we don't want a flaky peer to fill up our disk. This DoS vector is an existing issue in our in-memory uptime tracking, since we add an in-memory event for every online/offline event, so this issue should address that DoS vector as well. Rate limiting also lets us reduce the number of events we store (each event requires 9 bytes on disk: an 8 byte timestamp plus a uint8 enum).
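One very simple form this rate limiting could take, continuing the illustrative package above (the minimum-interval approach and the names here are assumptions, not a concrete proposal):

// rateLimiter drops on-disk writes for peers that flap faster than some
// minimum interval. The interval is a placeholder; the text above suggests
// deriving it from an on-disk budget rather than picking a magic number.
type rateLimiter struct {
	minInterval time.Duration
	lastWrite   map[string]time.Time
}

func newRateLimiter(minInterval time.Duration) *rateLimiter {
	return &rateLimiter{
		minInterval: minInterval,
		lastWrite:   make(map[string]time.Time),
	}
}

// shouldPersist reports whether an event for the peer occurring at ts should
// be written to disk, and records the write time if so. Events dropped here
// would still be reflected in the in-memory flap count.
func (r *rateLimiter) shouldPersist(peer string, ts time.Time) bool {
	if last, ok := r.lastWrite[peer]; ok && ts.Sub(last) < r.minInterval {
		return false
	}
	r.lastWrite[peer] = ts
	return true
}

The minimum interval could then be scaled with the flap rate described next, so that well-behaved peers keep full-resolution data while flappy peers are progressively down-sampled.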
Ideally we would like to maintain the accuracy of our uptime data, so we can scale our rate limiting with the degree to which a peer flaps, trading off accuracy against disk space as the peer consumes more of it. This flap rate can be tracked in the peer's bucket under its own key, as in the layout and sketch below. We won't write each flap to disk, but rather flush our total count periodically so that this write can't be used to DoS us.
[peer-pubkey-bucket]
    [uptime-bucket]
    [flap-count-key]
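A best-guess sketch of that flap counter, again in the same illustrative package (the flap-count-key encoding, the flush interval and all names are assumptions): flaps are only counted in memory, and the totals are flushed to disk on a timer.

// Requires "sync" in addition to the imports in the first sketch.

// flapCountKey is a placeholder name for the key holding the peer's total
// flap count within its bucket.
var flapCountKey = []byte("flap-count-key")

// flapTracker counts flaps in memory and flushes the totals to disk on a
// timer, so a flapping peer cannot trigger a disk write per flap. The map is
// keyed by the peer's serialized pubkey, used directly as the bucket key.
type flapTracker struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newFlapTracker() *flapTracker {
	return &flapTracker{counts: make(map[string]uint64)}
}

// incFlap records a flap for the peer in memory only.
func (f *flapTracker) incFlap(peer string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.counts[peer]++
}

// flushLoop periodically overwrites each peer's flap-count-key with the
// current in-memory total, until quit is closed.
func (f *flapTracker) flushLoop(db *bolt.DB, interval time.Duration, quit chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			f.mu.Lock()
			snapshot := make(map[string]uint64, len(f.counts))
			for peer, count := range f.counts {
				snapshot[peer] = count
			}
			f.mu.Unlock()

			_ = db.Update(func(tx *bolt.Tx) error {
				peers, err := tx.CreateBucketIfNotExists(peerInfoBucket)
				if err != nil {
					return err
				}
				for peer, count := range snapshot {
					bucket, err := peers.CreateBucketIfNotExists([]byte(peer))
					if err != nil {
						return err
					}
					var buf [8]byte
					binary.BigEndian.PutUint64(buf[:], count)
					if err := bucket.Put(flapCountKey, buf[:]); err != nil {
						return err
					}
				}
				return nil
			})

		case <-quit:
			return
		}
	}
}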
Thresholds tend to be magic numbers, so I would suggest that we pick a maximum on-disk size we're ok with dedicating to these events for one peer for one month and work backwards from there.
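To make that concrete, a hypothetical worked example (the 256 KiB budget is an arbitrary illustration, not a recommendation):

const (
	eventSize      = 9                              // 8 byte timestamp key + 1 byte event value.
	monthlyBudget  = 256 * 1024                     // hypothetical per-peer budget: 256 KiB per month.
	hoursPerMonth  = 30 * 24                        // ~720 hours.
	maxEventsMonth = monthlyBudget / eventSize      // ~29,000 events.
	maxEventsHour  = maxEventsMonth / hoursPerMonth // ~40 persisted events per hour per peer.
)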
We need to keep a record of our own downtime; if we do not, we will mistakenly interpret our own downtime as an online or offline period for each of our peers, depending on the state they were in when we went down. For example:
- We record Alice as online and Bob as offline
- lnd goes down for 3 days
- On restart, we interpret our on-disk data as Alice having had 3 days of online time, and Bob 3 days of offline time
This could be done with an on-shutdown timestamp written to disk, but if lnd does not shut down cleanly we will not have this event, and we will have the same issue as before. Instead, we can track our own uptime in a special bucket in peer-info on a best-effort basis. This can be done by writing a last-online timestamp to disk periodically (say every minute). On startup, we can record our own downtime by taking the last liveness timestamp we have as the time we went offline, and our current timestamp as the time we came back online.
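A minimal sketch of that best-effort liveness record, reusing the earlier illustrative layout (the local-uptime-bucket and last-online names, and the one minute interval, are placeholders):

// Placeholder names for the special bucket and key described above.
var (
	localUptimeBucket = []byte("local-uptime-bucket")
	lastOnlineKey     = []byte("last-online")
)

// markLiveness writes the current time under our own liveness key. It would
// be driven by a ticker, e.g. time.NewTicker(time.Minute), so that after an
// unclean shutdown the last written value approximates when we went down.
func markLiveness(db *bolt.DB, now time.Time) error {
	return db.Update(func(tx *bolt.Tx) error {
		peers, err := tx.CreateBucketIfNotExists(peerInfoBucket)
		if err != nil {
			return err
		}
		local, err := peers.CreateBucketIfNotExists(localUptimeBucket)
		if err != nil {
			return err
		}

		var buf [8]byte
		binary.BigEndian.PutUint64(buf[:], uint64(now.Unix()))
		return local.Put(lastOnlineKey, buf[:])
	})
}

// localDowntime is called once on startup; it reads the last liveness
// timestamp and returns the interval between it and now, which we record as
// our own downtime rather than attributing it to our peers.
func localDowntime(db *bolt.DB, now time.Time) (time.Duration, error) {
	var down time.Duration
	err := db.View(func(tx *bolt.Tx) error {
		peers := tx.Bucket(peerInfoBucket)
		if peers == nil {
			return nil
		}
		local := peers.Bucket(localUptimeBucket)
		if local == nil {
			return nil
		}
		raw := local.Get(lastOnlineKey)
		if len(raw) != 8 {
			return nil
		}

		last := time.Unix(int64(binary.BigEndian.Uint64(raw)), 0)
		down = now.Sub(last)
		return nil
	})

	return down, err
}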
As part of this change, I also suggest the addition of a dedicated uptime RPC endpoint, since adding a start/end time parameter to ListChannels, where uptime currently resides, does not make sense in the context of that call.
rpc PeerUptime (PeerUptimeRequest) returns (PeerUptimeResponse);

message PeerUptimeRequest {
    string peer_pubkey = 1;
    uint64 start_time = 2;
    uint64 end_time = 3;
}

message PeerUptimeResponse {
    // The amount of time that we monitored our peer's uptime over the
    // period queried.
    uint64 monitored_time = 1;

    // The amount of time that our peer was online for while we monitored it.
    uint64 online_time = 2;

    // The amount of time that our node was down, and thus unable to monitor
    // peer online status, over the period queried.
    uint64 local_downtime = 3;
}
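For illustration, a sketch of how the server side could aggregate the stored events into this response, continuing the same illustrative package (initial-state handling and local downtime accounting are simplified; nothing here is existing lnd code):

// Requires "bytes" in addition to the imports in the first sketch.

// peerUptime walks the peer's stored events between start and end (unix
// seconds) and tallies how long we monitored the peer and how long it was
// online. The peer is treated as offline before its first event in the
// range; a full implementation would look up the last event before start,
// and would subtract our own recorded downtime from the monitored time.
func peerUptime(db *bolt.DB, peerPubkey []byte, start, end uint64) (online, monitored time.Duration, err error) {
	err = db.View(func(tx *bolt.Tx) error {
		peers := tx.Bucket(peerInfoBucket)
		if peers == nil {
			return nil
		}
		peer := peers.Bucket(peerPubkey)
		if peer == nil {
			return nil
		}
		events := peer.Bucket(uptimeBucket)
		if events == nil {
			return nil
		}

		var startKey, endKey [8]byte
		binary.BigEndian.PutUint64(startKey[:], start)
		binary.BigEndian.PutUint64(endKey[:], end)

		lastTS, lastOnline := start, false

		c := events.Cursor()
		for k, v := c.Seek(startKey[:]); k != nil && bytes.Compare(k, endKey[:]) < 0; k, v = c.Next() {
			ts := binary.BigEndian.Uint64(k)
			period := time.Duration(ts-lastTS) * time.Second

			monitored += period
			if lastOnline {
				online += period
			}

			lastOnline = len(v) == 1 && eventType(v[0]) == eventOnline
			lastTS = ts
		}

		// Account for the tail of the range after the last event seen.
		tail := time.Duration(end-lastTS) * time.Second
		monitored += tail
		if lastOnline {
			online += tail
		}

		return nil
	})

	return online, monitored, err
}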