-
Notifications
You must be signed in to change notification settings - Fork 405
Rewrite InvoicePayer retry to correctly handle MPP partial failures #1141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite InvoicePayer retry to correctly handle MPP partial failures #1141
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1141 +/- ##
==========================================
- Coverage 90.21% 90.20% -0.02%
==========================================
Files 70 70
Lines 36201 36358 +157
==========================================
+ Hits 32660 32796 +136
- Misses 3541 3562 +21
Continue to review full report at Codecov.
|
e17c968
to
04e3016
Compare
04e3016
to
32d40c1
Compare
Pushed some integration tests that show the fixed issue :) |
for res in results.iter() { | ||
let mut pending_amt_unsent = 0; | ||
let mut max_unsent_cltv_delta = 0; | ||
for (res, path) in results.iter().zip(route.paths.iter()) { | ||
if res.is_ok() { has_ok = true; } | ||
if res.is_err() { has_err = true; } | ||
if let &Err(APIError::MonitorUpdateFailed) = res { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we're kind of reconceiving of TemporaryFailure
as more like AsyncPersist
, it may not make sense to have MonitorUpdateFailed
be an Err
anymore. But, fine to save this discussion for the PR renaming TemporaryFailure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think that'd be a good idea, totally agreed. Also agree lets leave it for renaming TemporaryFailure.
@@ -2226,10 +2237,25 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana | |||
has_err = true; | |||
has_ok = true; | |||
break; | |||
} else if res.is_err() { | |||
pending_amt_unsent += path.last().unwrap().fee_msat; | |||
max_unsent_cltv_delta = cmp::max(max_unsent_cltv_delta, path.last().unwrap().cltv_expiry_delta); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be more true to the original invoice parameters to send over the min_unsent_cltv_delta
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? I mean in principle if its just paying an invoice they should all be the same. I picked max because its the most conservative option from the pov of the recipient - if we include too high a CLTV they shouldn't complain, min maybe they might see as an error? dunno but I'd hope it doesn't really matter in practice.
Err(PaymentSendFailure::PartialFailure { | ||
results, | ||
payment_id, | ||
failed_paths_retry: if pending_amt_unsent != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure how annoying this would be to implement rn, but I think it'd make more sense to consider MonitorUpdateFailed
(if TemporaryFailure
) a success, and won't result in a PartialFailure
(since, iiuc, PartialFailure
is mainly for retry purposes?). Then maybe failed_paths_retry
could be a non-Option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, can we leave it for a followup? I don't want to touch any more code than strictly necessary here.
32d40c1
to
ccab5c9
Compare
let route = self.router.find_route( | ||
&payer, | ||
¶ms, | ||
Some(&first_hops.iter().collect::<Vec<_>>()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be set to None
if payer.first_hops()
is empty? Guess that way payer
s could implicitly specify using the network graph first-hop channels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think generally not. Like, if we have all of our peers disconnected and no available channels we should just outright fail all payments instead of adding HTLC updates that won't complete until the peer comes back online (ie creating stuck payments).
072586e
to
4b31eea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM mod CI and squash
4b31eea
to
f2ab405
Compare
Went ahead and squashed since there were no other reviewers yet with only a trivial change to fix build: $ git diff-tree -U1 4b31eea0d f2ab405ab
diff --git a/lightning-invoice/src/payment.rs b/lightning-invoice/src/payment.rs
index e51ccf296..489d61776 100644
--- a/lightning-invoice/src/payment.rs
+++ b/lightning-invoice/src/payment.rs
@@ -240,3 +240,3 @@ where
};
- if has_expired(params) {
+ if has_expired(¶ms) {
log_trace!(self.logger, "Invoice expired prior to first send for payment {}", log_bytes!(payment_hash.0));
$ |
47864f1
to
c8e24b7
Compare
Rebased after merge of #1144. |
c8e24b7
to
54b710f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Largely looks good. Comments are mainly about code organization to make the logic easier to follow.
lightning-invoice/src/payment.rs
Outdated
let payment_id = Some(invoice_payer.pay_invoice(&invoice).unwrap()); | ||
std::thread::sleep(Duration::from_secs(2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need to sleep in the test. Just give an expiry in the PaymentPathFailed
event that would cause the condition to be hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm? The point of this test is that, unlike fails_paying_invoice_after_expiration
, we expire while waiting to retry instead of expiring before the original pay_invoice
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but as far as InvoicePayer
is concerned, it doesn't care what the original Invoice
said the expiry was when retrying. It just cares what's in the PaymentPathFailed
event that the test is later feeding it. So that can be changed to have expired, thus simulating the passage of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that suggestion was too clever by half.
lightning-invoice/src/payment.rs
Outdated
} else if let Some(params) = retry { | ||
if self.retry_payment(!all_paths_failed, payment_id.unwrap(), *payment_hash, params).is_ok() { | ||
// We retried at least somewhat, don't provide the PaymentPathFailed event to the user. | ||
return; | ||
} | ||
|
||
// Either the payment was rejected, the maximum attempts were exceeded, or an | ||
// error occurred when attempting to retry. | ||
entry.remove(); | ||
} else { | ||
unreachable!(); | ||
log_trace!(self.logger, "Payment {} missing retry params; not retrying", log_bytes!(payment_hash.0)); | ||
if *all_paths_failed { self.payment_cache.lock().unwrap().remove(payment_hash); } | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you flip these conditions as it was before so the code is not nested and error cases come first? That way you can retain the fall-through that removes the entry at the end instead of repeating the code three times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, good point, I'd moved some of the removal logic into retry_payment
, but its always 1:1 with returning an Err
so it can just be at the callsite.
Err(PaymentSendFailure::ParameterError(e)) => | ||
return Err(PaymentError::Sending(PaymentSendFailure::ParameterError(e))), | ||
Err(PaymentSendFailure::PathParameterError(e)) => | ||
return Err(PaymentError::Sending(PaymentSendFailure::PathParameterError(e))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using a catch-all Err(e)
instead of re-forming the errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm, I kinda like doing the explicit catches. I'm pretty oldschool but I like having the verbosity because it means if we change the PaymentSendFailure
enum to contain a new variant we are compiler-required to update this code vs it silently handling a new case that we don't have currently. If you feel strongly we can change it (grep for the enum always turns it up, of course), but I vaguely prefer it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point
lightning-invoice/src/payment.rs
Outdated
let htlc_msgs = nodes[0].node.get_and_clear_pending_msg_events(); | ||
assert_eq!(htlc_msgs.len(), 2); | ||
check_added_monitors!(nodes[0], 2); | ||
assert!(!*event_handled.borrow()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how useful this check is if nothing is fed into the event handler / invoice payer's event handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's the point? The InvoicePayer
in this test shouldn't generate any user-visible events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InvoicePayer
doesn't generate events. It simply passes them on conditionally. But you aren't giving it any events from what I can see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but ChannelManager
does? I get ChannelManager
doesn't currently generate any such events, but its nice to test the full thing in the integration tests IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want something like nodes[0].node.process_pending_events(&invoice_payer)
here then. That way if ChannelManager
does generate events, the test would fail if InvoicePayer
delegated to the decorated event handler and wasn't supposed to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped the excess bool and just paniced in the function. We have to have a function, so no reason not to, but no need to test it in detail.
@@ -1246,4 +1246,50 @@ mod tests { | |||
check_added_monitors!(nodes[0], 2); | |||
assert!(!*event_handled.borrow()); | |||
} | |||
|
|||
#[test] | |||
fn immediate_retry_on_failure() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comments for this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall SGTM, though still building my understanding of InvoicePayer
.
final_cltv_expiry_delta: invoice.min_final_cltv_expiry() as u32, | ||
let failed_paths_data = loop { | ||
let mut payment_cache = self.payment_cache.lock().unwrap(); | ||
match payment_cache.entry(payment_hash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you be able to use a InvoicePayer
instance for multiple invoices ? If yes how do we deal with collisioning invoices, i.e with the same payment_hash
? Not clear to me reading InvoicePayer
high-level documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could key by payment_id
. I think we added that after I wrote InvoicePayer
but prior to merge. Can't recall if there was a reason to not use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the idea is definitely that you'd have a single InvoicePayer
for all your pending invoices. Still, it doesn't handle duplicate payment hashes explicitly - if you call it twice with the same invoice/payment_hash it will fail if the original one is still in-flight, but will go to pay if the original one is completed. I documented that the user is responsible for de-duplication.
Err(PaymentSendFailure::PathParameterError(e)) => | ||
return Err(PaymentError::Sending(PaymentSendFailure::PathParameterError(e))), | ||
Err(PaymentSendFailure::AllFailedRetrySafe(e)) => { | ||
if retry_count >= self.retry_attempts.0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, retry_attempts
doesn't dissociate MPP and simple payment, it's the number of global retries, wherever the failure is located ? If I've a retry_attempts
of value 3, and I send a MPP with 5 paths, I'm going to retry at most of 3 local path failure.
If so RetryAttempts
documentation could be a bit clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's spelled out a bit more in the module-level docs, but the docs on RetryAttempts
could be similarly worded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, your understanding is correct, updated docs.
54b710f
to
42b0cd7
Compare
This should allow us to fix lightningdevkit/ldk-garbagecollected#52
This will allow `InvoicePayer` to properly retry payments that only partially failed to send.
Users can provide anything they want as `RouteParameters` so we shouldn't assume any fields are set any particular way, including `expiry_time` set at all.
a4e15f3
to
c42ee6a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments but otherwise I think we can handle pay/retry refactors in a follow-up.
lightning-invoice/src/payment.rs
Outdated
let htlc_msgs = nodes[0].node.get_and_clear_pending_msg_events(); | ||
assert_eq!(htlc_msgs.len(), 2); | ||
check_added_monitors!(nodes[0], 2); | ||
assert!(!*event_handled.borrow()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want something like nodes[0].node.process_pending_events(&invoice_payer)
here then. That way if ChannelManager
does generate events, the test would fail if InvoicePayer
delegated to the decorated event handler and wasn't supposed to.
break None; | ||
}, | ||
Err(PaymentSendFailure::PartialFailure { results: _, failed_paths_retry, payment_id }) => { | ||
if let Some(retry_data) = failed_paths_retry { | ||
entry.insert(retry_count); | ||
break Some((retry_data, payment_id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed about being able to reuse retry_handler
, so might require a bit of re-work. Doesn't have to be done on this PR.
c42ee6a
to
79b9ee8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 79b9ee8
This rewrites a good chunk of the retry logic in `InvoicePayer` to address two issues: * it was not considering the return value of `send_payment` (and `retry_payment`) may indicate a failure on some paths but not others, * it was not considering that more failures may still come later when removing elements from the retry count map. This could result in us seeing an MPP-partial-failure, failing to retry, removing the retries count entry, and then retrying other parts, potentially forever.
This tests the multi-part-single-failure-immediately fixes in the previous commit.
79b9ee8
to
199d258
Compare
Squashed fixups without other changes. Will land after CI.
|
Based on #1059, this fixes some API holes in the return values of
ChannelManager::send/retry_payment
, and then uses those fixes to fix some MPP retry holes in InvoicePayer.