Attributable failures (feature 36/37) #1044
Conversation
I've started implementing it in eclair. Do you have some test vectors so we can check that we are compatible?
I don't have test vectors yet, but I can produce them. Will add them to this PR when ready. Capping the max hops at a lower number is fine by me, but do you have a scenario in mind where this would really make a difference? Or is it more generally that everything above 8 is wasteful?
@thomash-acinq added a happy-flow fat error test vector.
09-features.md
@@ -41,6 +41,7 @@ The Context column decodes as follows:
| 20/21 | `option_anchor_outputs` | Anchor outputs | IN | `option_static_remotekey` | [BOLT #3](03-transactions.md) |
| 22/23 | `option_anchors_zero_fee_htlc_tx` | Anchor commitment type with zero fee HTLC transactions | IN | `option_static_remotekey` | [BOLT #3][bolt03-htlc-tx], [lightning-dev][ml-sighash-single-harmful]|
| 26/27 | `option_shutdown_anysegwit` | Future segwit versions allowed in `shutdown` | IN | | [BOLT #2][bolt02-shutdown] |
| 28/29 | `option_fat_error` | Can generate/relay fat errors in `update_fail_htlc` | IN | | [BOLT #4][bolt04-fat-errors] |
I think this big gap in the bits has emerged here because of tentative spec changes that may or may not make it. I'm not sure why that is necessary. I thought that for unofficial extensions, the custom range is supposed to be used?
I can see that with unofficial features deployed in the wild, it is easier to keep the same bit when something becomes official. But I'm not sure that is worth creating the gap here. An alternative is to deploy unofficial features in the custom range first, and then later recognize both the official and unofficial bit. Slightly more code, but the feature list remains clean.
Added fat error signaling to the PR.
I've spent a lot of time trying to make the test vector pass and I've finally found what was wrong:
implying that we need to concatenate them in that order. But in your code you follow a different order:
I think the order `message + hop payloads + hmacs` is more intuitive, as it matches the order of the fields in the packet.
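For illustration, a minimal sketch (Python, non-normative) of the ordering agreed on here; the 4-byte truncation and the `um_key` name are assumptions taken from the surrounding discussion rather than final spec text:

```python
import hashlib
import hmac

def hop_hmac(um_key: bytes, message: bytes, hop_payloads: bytes, hmacs: bytes) -> bytes:
    # Concatenate in the same order as the fields appear in the packet:
    # failure message, then hop payloads, then hmacs.
    data = message + hop_payloads + hmacs
    # Truncated HMAC-SHA256; the truncation length is illustrative.
    return hmac.new(um_key, data, hashlib.sha256).digest()[:4]
```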
Oh, great catch! Will produce a new vector.
@thomash-acinq updated the vector.
Updated the LND implementation with sender-picked fat error structure parameters: lightningnetwork/lnd#7139
This proposal brings us two things: attribution of a failure to a specific node (or link), and per-hop hold time reporting.
The second thing is not really related to failures. If we think it is valuable, we should add it to the success case too.
The association between the forward and return packets is handled outside of
this onion routing protocol, e.g. via association with an HTLC in a payment
channel.
The field `htlc_hold_times` contains the HTLC hold time in milliseconds for each hop. The sender can use this information to score nodes on latency. Nodes along the path that lack accurate timing information may simply report a value of zero. In such cases, the sender should distribute any potential latency penalty across multiple nodes. This encourages path nodes to provide timing data so they are not held responsible for the high latency of other nodes.
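As a rough illustration of how a sender might consume this field, here is a sketch assuming one unsigned 32-bit big-endian value per hop (matching the `(y + 1) * 4`-byte slicing mentioned in the review comments below); the helper names and the penalty policy are placeholders, not part of the proposal:

```python
import struct

def parse_hold_times(htlc_hold_times: bytes, num_hops: int) -> list[int]:
    """Per-hop hold times in milliseconds, ordered from the sender outward."""
    return list(struct.unpack(f">{num_hops}I", htlc_hold_times[:4 * num_hops]))

def hops_to_penalize(hold_times: list[int], threshold_ms: int = 300) -> list[int]:
    # Hops that report zero provide no timing information, so any latency
    # penalty is spread over them as well as over the hops that reported a
    # hold time above the (illustrative) threshold.
    return [i for i, t in enumerate(hold_times) if t >= threshold_ms or t == 0]
```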
We could clarify how the hold time is computed exactly. Do we start counting when receiving `update_add_htlc`, `commitment_signed`, or `revoke_and_ack`? It should probably be `revoke_and_ack`, because that's when we can start relaying.
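A tiny sketch of that measurement point, assuming (as suggested) that the clock starts when the incoming HTLC is irrevocably committed via `revoke_and_ack` and stops when the resolution is sent back upstream; class and method names are illustrative:

```python
import time

class HtlcHoldTimer:
    def __init__(self):
        self.start_ns = None

    def on_incoming_revoke_and_ack(self):
        # The incoming HTLC is now irrevocably committed, so relaying can start.
        self.start_ns = time.monotonic_ns()

    def hold_time_ms(self):
        # Sampled when the failure (or settle) is sent back upstream.
        return (time.monotonic_ns() - self.start_ns) // 1_000_000
```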
We discussed this on multiple occasions offline, but while we're at it, it would be good to introduce the notion of an 'acceptable delay threshold' to the spec. For privacy and HTLC-batching/efficiency reasons, forwarding nodes still want to introduce some artificial forwarding delay, and scoring/penalization should really only start above a certain threshold deemed acceptable by the community.
I think it is up to implementations to decide how to compute the hold time exactly, and to make it so that it is in the best interest of their users. In the end, the rational thing to do is to look at the sender logic for penalizing latency and optimize for minimal penalty?
@thomash-acinq carried out an interop test between LDK and Eclair and it passed 🎉 A milestone. I will proceed with updating this PR and getting it ready for merge.
04-onion-routing.md
Each HMAC covers the following data:

* The return packet.
Clarify that we use the return packet before the xor.
04-onion-routing.md
currently handling the failure message) assuming that this node is `y` hops
away from the erring node.

Each HMAC covers the following data:
"covers" is a bit ambiguous, make it explicit that these things are concatenated in this order. Same below for hold times and hmacs. For hold times, you can say that we take the first (y + 1) * 4
bytes of `htlc_hold_times`.
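Spelled out as a sketch of what the spec text could say explicitly; the 4-byte hold time and HMAC sizes and the `um_key` name follow the discussion above and are not final:

```python
import hashlib
import hmac

HOLD_TIME_LEN = 4  # bytes per hop in htlc_hold_times
HMAC_LEN = 4       # truncated HMAC length

def hmac_input(return_packet: bytes, htlc_hold_times: bytes,
               downstream_hmacs: bytes, y: int) -> bytes:
    # Explicit concatenation, in this order:
    #   return packet (before the xor, per the earlier comment)
    #   || first (y + 1) * 4 bytes of htlc_hold_times
    #   || the HMACs corresponding to the downstream hops
    return (return_packet
            + htlc_hold_times[:(y + 1) * HOLD_TIME_LEN]
            + downstream_hmacs)

def position_hmac(um_key: bytes, data: bytes) -> bytes:
    return hmac.new(um_key, data, hashlib.sha256).digest()[:HMAC_LEN]
```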
shared_secret = 53eb63ea8a3fec3b3cd433b85cd62a4b145e1dda09391b348c4e1cd36a03ea66 | ||
ammag_key = 3761ba4d3e726d8abb16cba5950ee976b84937b61b7ad09e741724d7dee12eb5 | ||
stream = 3699fd352a948a05f604763c0bca2968d5eaca2b0118602e52e59121f050936c8dd90c24df7dc8cf8f1665e39a6c75e9e2c0900ea245c9ed3b0008148e0ae18bbfaea0c711d67eade980c6f5452e91a06b070bbde68b5494a92575c114660fb53cf04bf686e67ffa4a0f5ae41a59a39a8515cb686db553d25e71e7a97cc2febcac55df2711b6209c502b2f8827b13d3ad2f491c45a0cafe7b4d8d8810e805dee25d676ce92e0619b9c206f922132d806138713a8f69589c18c3fdc5acee41c1234b17ecab96b8c56a46787bba2c062468a13919afc18513835b472a79b2c35f9a91f38eb3b9e998b1000cc4a0dbd62ac1a5cc8102e373526d7e8f3c3a1b4bfb2f8a3947fe350cb89f73aa1bb054edfa9895c0fc971c2b5056dc8665902b51fced6dff80c4d247db977c15a710ce280fbd0ae3ca2a245b1c967aeb5a1a4a441c50bc9ceb33ca64b5ca93bd8b50060520f35a54a148a4112e8762f9d0b0f78a7f46a5f06c7a4b0845d020eb505c9e527aabab71009289a6919520d32af1f9f51ce4b3655c6f1aae1e26a16dc9aae55e9d4a6f91d4ba76e96fcb851161da3fc39d0d97ce30a5855c75ac2f613ff36a24801bcbd33f0ce4a3572b9a2fca21efb3b07897fc07ee71e8b1c0c6f8dbb7d2c4ed13f11249414fc94047d1a4a0be94d45db56af4c1a3bf39c9c5aa18209eaebb9e025f670d4c8cc1ee598c912db154eaa3d0c93cb3957e126c50486bf98c852ad326b5f80a19df6b2791f3d65b8586474f4c5dcb2aca0911d2257d1bb6a1e9fc1435be879e75d23290f9feb93ed40baaeca1c399fc91fb1da3e5f0f5d63e543a8d12fe6f7e654026d3a118ab58cb14bef9328d4af254215eb1f639828cc6405a3ab02d90bb70a798787a52c29b3a28fc67b0908563a65f08112abd4e9115cb01db09460c602aba3ddc375569dc3abe42c61c5ea7feb39ad8b05d8e2718e68806c0e1c34b0bc85492f985f8b3e76197a50d63982b780187078f5c59ebd814afaeffc7b2c6ee39d4f9c8c45fb5f685756c563f4b9d028fe7981b70752f5a31e44ba051ab40f3604c8596f1e95dc9b0911e7ede63d69b5eecd245fbecbcf233cf6eba842c0fec795a5adeab2100b1a1bc62c15046d48ec5709da4af64f59a2e552ddbbdcda1f543bb4b687e79f2253ff0cd9ba4e6bfae8e510e5147273d288fd4336dbd0b6617bf0ef71c0b4f1f9c1dc999c17ad32fe196b1e2b27baf4d59bba8e5193a9595bd786be00c32bae89c5dbed1e994fddffbec49d0e2d270bcc1068850e5d7e7652e274909b3cf5e3bc6bf64def0bbeac974a76d835e9a10bdd7896f27833232d907b7405260e3c986569bb8fdd65a55b020b91149f27bda9e63b4c2cc5370bcc81ef044a68c40c1b178e4265440334cc40f59ab5f82a022532805bfa659257c8d8ab9b4aef6abbd05de284c2eb165ef35737e3d387988c566f7b1ca0b1fc3e7b4ed991b77f23775e1c36a09a991384a33b78 | ||
error packet for node 0: 2dd2f49c1f5af0fcad371d96e8cddbdcd5096dc309c1d4e110f955926506b3c03b44c192896f45610741c85ed4074212537e0c118d472ff3a559ae244acd9d783c65977765c5d4e00b723d00f12475aafaafff7b31c1be5a589e6e25f8da2959107206dd42bbcb43438129ce6cce2b6b4ae63edc76b876136ca5ea6cd1c6a04ca86eca143d15e53ccdc9e23953e49dc2f87bb11e5238cd6536e57387225b8fff3bf5f3e686fd08458ffe0211b87d64770db9353500af9b122828a006da754cf979738b4374e146ea79dd93656170b89c98c5f2299d6e9c0410c826c721950c780486cd6d5b7130380d7eaff994a8503a8fef3270ce94889fe996da66ed121741987010f785494415ca991b2e8b39ef2df6bde98efd2aec7d251b2772485194c8368451ad49c2354f9d30d95367bde316fec6cbdddc7dc0d25e99d3075e13d3de0822669861dafcd29de74eac48b64411987285491f98d78584d0c2a163b7221ea796f9e8671b2bb91e38ef5e18aaf32c6c02f2fb690358872a1ed28166172631a82c2568d23238017188ebbd48944a147f6cdb3690d5f88e51371cb70adf1fa02afe4ed8b581afc8bcc5104922843a55d52acde09bc9d2b71a663e178788280f3c3eae127d21b0b95777976b3eb17be40a702c244d0e5f833ff49dae6403ff44b131e66df8b88e33ab0a58e379f2c34bf5113c66b9ea8241fc7aa2b1fa53cf4ed3cdd91d407730c66fb039ef3a36d4050dde37d34e80bcfe02a48a6b14ae28227b1627b5ad07608a7763a531f2ffc96dff850e8c583461831b19feffc783bc1beab6301f647e9617d14c92c4b1d63f5147ccda56a35df8ca4806b8884c4aa3c3cc6a174fdc2232404822569c01aba686c1df5eecc059ba97e9688c8b16b70f0d24eacfdba15db1c71f72af1b2af85bd168f0b0800483f115eeccd9b02adf03bdd4a88eab03e43ce342877af2b61f9d3d85497cd1c6b96674f3d4f07f635bb26add1e36835e321d70263b1c04234e222124dad30ffb9f2a138e3ef453442df1af7e566890aedee568093aa922dd62db188aa8361c55503f8e2c2e6ba93de744b55c15260f15ec8e69bb01048ca1fa7bbbd26975bde80930a5b95054688a0ea73af0353cc84b997626a987cc06a517e18f91e02908829d4f4efc011b9867bd9bfe04c5f94e4b9261d30cc39982eb7b250f12aee2a4cce0484ff34eebba89bc6e35bd48d3968e4ca2d77527212017e202141900152f2fd8af0ac3aa456aae13276a13b9b9492a9a636e18244654b3245f07b20eb76b8e1cea8c55e5427f08a63a16b0a633af67c8e48ef8e53519041c9138176eb14b8782c6c2ee76146b8490b97978ee73cd0104e12f483be5a4af414404618e9f6633c55dda6f22252cb793d3d16fae4f0e1431434e7acc8fa2c009d4f6e345ade172313d558a4e61b4377e31b8ed4e28f7cd13a7fe3f72a409bc3bdabfe0ba47a6d861e21f64d2fac706dab18b3e546df4 | ||
attribution data for node 0: 84986c936d26bfd3bb2d34d3ec62cfdb63e0032fdb3d9d75f3e5d456f73dffa7e35aab1db4f1bd3b98ff585caf004f656c51037a3f4e810d275f3f6aea0c8e3a125ebee5f374b6440bcb9bb2955ebf706f42be9999a62ed49c7a81fc73c0b4a16419fd6d334532f40bf179dd19afec21bd8519d5e6ebc3802501ef373bc378eee1f14a6fc5fab5b697c91ce31d5922199d1b0ad5ee12176aacafc7c81d54bc5b8fb7e63f3bfd40a3b6e21f985340cbd1c124c7f85f0369d1aa86ebc66def417107a7861131c8bcd73e8946f4fb54bfac87a2dc15bd7af642f32ae583646141e8875ef81ec9083d7e32d5f135131eab7a43803360434100ff67087762bbe3d6afe2034f5746b8c50e0c3c20dd62a4c174c38b1df7365dccebc7f24f19406649fbf48981448abe5c858bbd4bef6eb983ae7a23e9309fb33b5e7c0522554e88ca04b1d65fc190947dead8c0ccd32932976537d869b5ca53ed4945bccafab2a014ea4cbdc6b0250b25be66ba0afff2ff19c0058c68344fd1b9c472567147525b13b1bc27563e61310110935cf89fda0e34d0575e2389d57bdf2869398ca2965f64a6f04e1d1c2edf2082b97054264a47824dd1a9691c27902b39d57ae4a94dd6481954a9bd1b5cff4ab29ca221fa2bf9b28a362c9661206f896fc7cec563fb80aa5eaccb26c09fa4ef7a981e63028a9c4dac12f82ccb5bea090d56bbb1a4c431e315d9a169299224a8dbd099fb67ea61dfc604edf8a18ee742550b636836bb552dabb28820221bf8546331f32b0c143c1c89310c4fa2e1e0e895ce1a1eb0f43278fdb528131a3e32bfffe0c6de9006418f5309cba773ca38b6ad8507cc59445ccc0257506ebc16a4c01d4cd97e03fcf7a2049fea0db28447858f73b8e9fe98b391b136c9dc510288630a1f0af93b26a8891b857bfe4b818af99a1e011e6dbaa53982d29cf74ae7dffef45545279f19931708ed3eede5e82280eab908e8eb80abff3f1f023ab66869297b40da8496861dc455ac3abe1efa8a6f9e2c4eda48025d43a486a3f26f269743eaa30d6f0e1f48db6287751358a41f5b07aee0f098862e3493731fe2697acce734f004907c6f11eef189424fee52cd30ad708707eaf2e441f52bcf3d0c5440c1742458653c0c8a27b5ade784d9e09c8b47f1671901a29360e7e5e94946b9c75752a1a8d599d2a3e14ac81b84d42115cd688c8383a64fc6e7e1dc5568bb4837358ebe63207a4067af66b2027ad2ce8fb7ae3a452d40723a51fdf9f9c9913e8029a222cf81d12ad41e58860d75deb6de30ad |
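For readers reproducing the vector, a sketch of how the first two values relate under the existing BOLT 4 conventions: `ammag_key` is HMAC-SHA256 keyed with the ASCII string "ammag" over the shared secret, and the obfuscation stream is a ChaCha20 keystream with an all-zero nonce. The required stream length and its application to the attribution data are left parameterized here:

```python
import hashlib
import hmac
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

def derive_ammag_key(shared_secret: bytes) -> bytes:
    # BOLT 4 key generation: HMAC-SHA256 keyed with the key type "ammag".
    return hmac.new(b"ammag", shared_secret, hashlib.sha256).digest()

def generate_stream(key: bytes, length: int) -> bytes:
    # ChaCha20 keystream with a zero counter/nonce: encrypt `length` zero bytes.
    encryptor = Cipher(algorithms.ChaCha20(key, b"\x00" * 16), mode=None).encryptor()
    return encryptor.update(b"\x00" * length)

# Usage against the vector above (values abbreviated, length is a placeholder):
# ammag_key = derive_ammag_key(bytes.fromhex("53eb63ea...6a03ea66"))
# stream = generate_stream(ammag_key, required_stream_length)
```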
It would be nice to add a new test vector where the failing node returns random garbage and we can still attribute the failure.
@GeorgeTsagk suggested the exact same thing today. In LDK, I've added a test that mutates every byte for every node, and asserts that it is properly attributed. Basically an exhaustive test to make sure it always works.
Will also add a random garbage test vector.
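A sketch of what such an exhaustive mutation test can look like. The implementation under test is abstracted behind two callables, `run_payment` (runs a failing payment and lets a mutator tamper with the failure data each hop returns) and `attribute` (the sender-side decoder); sizes and the exact blame rule are illustrative:

```python
def exhaustive_mutation_test(run_payment, attribute, num_hops: int, data_len: int) -> None:
    for mutating_hop in range(num_hops):
        for byte_index in range(data_len):
            def mutator(hop_index: int, data: bytes) -> bytes:
                if hop_index != mutating_hop:
                    return data
                corrupted = bytearray(data)
                corrupted[byte_index] ^= 0x01  # flip one bit of this hop's reply
                return bytes(corrupted)

            blamed = attribute(run_payment(mutator))
            # The failure should be pinned on the mutating hop or on the link
            # between it and its upstream peer.
            assert blamed in (mutating_hop, mutating_hop - 1)
```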
I am not sure what that vector should look like though. The transformation of the failure packet is already covered in the happy flow test vector. Nodes can't read the data returned by their upstream peer, and the transformation isn't affected by that data being random. Testing that attribution works is just a matter of asserting that the decrypt process picks up the node that returned the random data.
Open to ideas for extending the test vectors!
I think we can make a vector where an intermediate hop simply drops all data (as if it doesn't understand it) and then the next upstream hop re-creates the attribution data.
But is this testing anything new? That next upstream hop will just recreate attribution data as if it were the failure source, and from there it will flow upstream again.
Regarding hold time reporting, selection pressure in pathfinding, and its potential negative effect on privacy: can't senders opt in to privacy-preserving random delays via the forward onion? That way there is no loss of precision, and nodes can still serve both privacy-valuing users and those that aim for the absolute fastest routes.
That could be an interesting trade-off. The sender would include a TLV in the onion to ask relaying nodes to add a random delay? We'd go from delays/batching decided by the intermediate nodes to something that is driven by the senders instead. At first glance, I think it would work. The downside would be that it's off by default (because most app developers will care more about UX than privacy), while I believe that Matt would like basic privacy features to be on by default at the network level.

I personally think using 100ms precision would be a good trade-off: batching/delays of this order of magnitude still allow for sub-second payment latency for paths that aren't too long, which I believe is good enough in terms of payment experience while offering some privacy if done at the network level. I'm more in favor of always having 100ms precision (instead of 1ms precision), but if there is a larger consensus for 1ms precision, I'd be ok with it! Not sure how to move forward though: should we just have a small survey among implementers?
Well, I don't think this is a good idea: "privacy loves company", and having senders signal "hey, look at me, I want to be more private - I have something to hide" might actually make them stand out more overall. The preferred outcome would really be that we agree on a reasonable delay threshold that we encourage for all forwarding nodes. If everyone just does a little, it might introduce sufficient noise to throw on-path and AS-level attackers off. I'm definitely +1 for 100ms buckets, which should be mostly sufficient to cover at least 1-2 RTTs of the handshake, hence introducing at least some degree of uncertainty on the attacker's side.
There are users who don't like privacy delays: they patch their nodes to remove the delay. And surely there will also be users who don't like coarse-grained hold times. Do we really want to force this trade-off onto them via the protocol, without a way to opt out, and rule out possibly undiscovered use cases that require ultra-fast payment settlement? Picking a constant also doesn't look particularly attractive to me.
If you require ultra-fast settlement, there are always cases where multi-hop payments will fail you: shouldn't you just open a channel to your destination instead?
I still believe in a future of the lightning network where failures aren't tolerated at all, and all routing nodes deliver excellent service. Then it's just a matter of measuring latency and going for the sub-10 ms ones.
Hold time filtering to avoid users doing things we think they shouldn't. OP_RETURN vibes 😂
I agree, but that's orthogonal to payment latency? I honestly don't understand why it's important to aim for minimal latency: is it for performance's sake, to boast about numbers? A payment that takes 500ms to complete is almost indistinguishable from being instantaneous for our brains, so why does it matter that it takes 50ms instead of 500ms if you have to sacrifice privacy for it?
For me it is two-fold:
Furthermore, I think that an instant payment with no observable latency at all is just cool too.
Are we talking about introducing actual time delays here, or just bucketing the reported value of the hold time? If we're talking about further delaying the actual forwarding of the HTLC, then I'm very much against it. Keep in mind this is done per payment attempt, so if I need to try out 20 different routes this will accumulate quite fast, significantly hurting UX.

Seems like the concern is that the accurate reporting of hold times does not directly ruin the privacy of the nodes reporting them, but will eventually lead to intense competition around latency, contributing to further centralization in the network? If we're talking about hold times leaking information about the routing nodes (fingerprinting?), then this is already too noisy to produce consistent results. Every node has a different hardware and internet connectivity configuration, so I'd expect even nodes running the same implementation (LND, LDK, ...) to have some entropy in their reports, and there exist better ways to fingerprint anyway.

I don't think we should pollute the protocol with this kind of safeguard/filter against more precise reports. If we assume that a user is willing to get their hands dirty to somehow signal lower hold times, they'll find a way. Also, let's not underestimate the power of what implementations default to. By having implementations default to scoring all hold times <=100ms the same, or to reporting buckets of 100ms, most of the concerns seem to be eliminated. Of course, if we assume the majority of the network to be running custom software, then what's even the point of this discussion?
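For what it's worth, the "report in 100ms buckets by default" idea is a one-liner on the reporting side; the constant and the rounding direction below are illustrative policy choices, not spec:

```python
BUCKET_MS = 100

def reported_hold_time_ms(actual_hold_time_ms: int) -> int:
    # Round the measured hold time up to the next bucket boundary so that
    # sub-bucket timing differences are not observable by the sender.
    return ((actual_hold_time_ms + BUCKET_MS - 1) // BUCKET_MS) * BUCKET_MS
```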
I'm very much against the intentional slowing down of HTLC resolution. Every millisecond an HTLC is unresolved poses a risk to the node runner. An HTLC gets stuck when a node goes offline after the HTLC passed that node, so if you intentionally slow down HTLC resolution, you also increase the number of force closures. I would consider the intentional hold of an HTLC for more than 10ms a jamming attack.
That's not at all a valid conclusion. HTLCs get stuck and channels force-close because of bugs, not because nodes go temporarily offline or introduce batching/delays. That's completely unrelated. I 100% agree with you that all bugs that lead to stuck HTLCs or force-closed channels must be fixed and should be the highest priority for all implementations. But that has absolutely nothing to do with whether or not we should introduce delays or measure relay latency in buckets of 100ms.
I do agree that it is usually a bug, or sometimes negligence, that is the root cause of most force-closes. I also totally agree that there should be an easy way to measure relay latency. In fact, I do exactly this to generate the list you can find here on the "node speeds" tab. But saying that adding a relay delay would not have an impact on the number of force-closes is, in my opinion, wrong. If a channel with an active HTLC goes offline and does not come back online before the HTLC reaches its timeout, this will lead to a force-close. I send out more than 100,000 HTLCs each day.
Failure attribution is important to properly penalize nodes after a payment failure occurs. The goal of the penalty is to give the next attempt a better chance at succeeding. In the happy failure flow, the sender is able to determine the origin of the failure and penalizes a single node or pair of nodes.
Unfortunately it is possible for nodes on the route to hide themselves. If they return random data as the failure message, the sender won't know where the failure happened.
This PR proposes a new failure message format that lets each node commit to the failure message. If one of the nodes corrupts the failure message, the sender will be able to identify that node.
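At a high level, the sender-side check described here can be sketched as follows. The per-hop unwrap and HMAC verification are injected as callables because they depend on the exact packet format, so this is an outline of the attribution idea rather than the wire-level algorithm:

```python
def attribute_failure(shared_secrets, failure_data, unwrap_hop, verify_hop_commitment):
    """shared_secrets: one per hop, ordered from the sender outward.
    unwrap_hop(secret, data): removes that hop's obfuscation layer.
    verify_hop_commitment(secret, data, position): checks the HMAC this hop
    added over the failure message and the downstream data."""
    data = failure_data
    for position, secret in enumerate(shared_secrets):
        data = unwrap_hop(secret, data)
        if not verify_hop_commitment(secret, data, position):
            # The first hop whose commitment does not verify corrupted the
            # failure data (or it was corrupted on the link leading to it).
            return position
    # All commitments verify: the failure message is intact and the sender can
    # trust the failure source indicated by the innermost layer.
    return None
```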
For more information, see https://lists.linuxfoundation.org/pipermail/lightning-dev/2022-October/003723.html.
LND implementation: lightningnetwork/lnd#7139
LDK implementation: lightningdevkit/rust-lightning#3611
Eclair implementation: ACINQ/eclair#2519