Rewrite InvoicePayer retry to correctly handle MPP partial failures #1141

TheBlueMatt
Collaborator

Based on #1059, this fixes some API holes in the return values of ChannelManager::send_payment and ChannelManager::retry_payment, and then uses those fixes to close some MPP retry holes in InvoicePayer.

codecov bot commented Oct 26, 2021

Codecov Report

Merging #1141 (79b9ee8) into main (c53048a) will decrease coverage by 0.01%.
The diff coverage is 85.46%.

❗ Current head 79b9ee8 differs from pull request most recent head 199d258. Consider uploading reports for the commit 199d258 to get more accurate results

@@            Coverage Diff             @@
##             main    #1141      +/-   ##
==========================================
- Coverage   90.21%   90.20%   -0.02%     
==========================================
  Files          70       70              
  Lines       36201    36358     +157     
==========================================
+ Hits        32660    32796     +136     
- Misses       3541     3562      +21     
| Impacted Files | Coverage Δ |
|---|---|
| lightning/src/ln/functional_test_utils.rs | 95.14% <ø> (ø) |
| lightning/src/ln/channelmanager.rs | 83.72% <66.66%> (-0.10%) ⬇️ |
| lightning-invoice/src/payment.rs | 91.13% <87.31%> (-1.66%) ⬇️ |
| lightning/src/ln/chanmon_update_fail_tests.rs | 97.65% <100.00%> (ø) |
| lightning/src/ln/functional_tests.rs | 97.27% <0.00%> (-0.04%) ⬇️ |
| lightning-invoice/src/utils.rs | 83.48% <0.00%> (+8.95%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c53048a...199d258. Read the comment docs.

@TheBlueMatt TheBlueMatt added this to the 0.0.103 milestone Oct 26, 2021
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch 3 times, most recently from e17c968 to 04e3016 Compare October 27, 2021 17:02
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch from 04e3016 to 32d40c1 Compare October 27, 2021 22:23
@TheBlueMatt
Collaborator Author

Pushed some integration tests that show the fixed issue :)

@valentinewallace valentinewallace self-requested a review October 28, 2021 16:16
for res in results.iter() {
let mut pending_amt_unsent = 0;
let mut max_unsent_cltv_delta = 0;
for (res, path) in results.iter().zip(route.paths.iter()) {
if res.is_ok() { has_ok = true; }
if res.is_err() { has_err = true; }
if let &Err(APIError::MonitorUpdateFailed) = res {
Contributor

Given we're kind of reconceiving TemporaryFailure as more like AsyncPersist, it may not make sense to have MonitorUpdateFailed be an Err anymore. But it's fine to save this discussion for the PR renaming TemporaryFailure.

Collaborator Author

Yea, I think that'd be a good idea, totally agreed. Also agree, let's leave it for renaming TemporaryFailure.

@@ -2226,10 +2237,25 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana
has_err = true;
has_ok = true;
break;
} else if res.is_err() {
pending_amt_unsent += path.last().unwrap().fee_msat;
max_unsent_cltv_delta = cmp::max(max_unsent_cltv_delta, path.last().unwrap().cltv_expiry_delta);
Contributor

Wouldn't it be more true to the original invoice parameters to send over the min_unsent_cltv_delta?

Collaborator Author

Maybe? I mean, in principle, if it's just paying an invoice they should all be the same. I picked max because it's the most conservative option from the pov of the recipient: if we include too high a CLTV they shouldn't complain, whereas with min they might see it as an error? Dunno, but I'd hope it doesn't really matter in practice.
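
A minimal sketch of the aggregation discussed in this thread, assuming a simplified `Hop` type and a hypothetical `unsent_summary` helper (neither is the crate's real API). It sums the amount carried by each failed path and keeps the largest final CLTV delta among them. It relies on the convention that the last hop's `fee_msat` holds the amount delivered to the recipient.

```rust
use std::cmp;

/// Hypothetical, simplified stand-in for a route hop. Only the two fields
/// used in the quoted diff are modelled.
#[derive(Clone, Debug)]
pub struct Hop {
    pub fee_msat: u64,
    pub cltv_expiry_delta: u32,
}

/// Returns (total amount on failed paths, largest final CLTV delta on failed paths).
/// `results[i]` is the send result for `paths[i]`.
pub fn unsent_summary(results: &[Result<(), ()>], paths: &[Vec<Hop>]) -> (u64, u32) {
    let mut pending_amt_unsent = 0;
    let mut max_unsent_cltv_delta = 0;
    for (res, path) in results.iter().zip(paths.iter()) {
        if res.is_err() {
            // For the last hop, `fee_msat` is the amount delivered to the recipient.
            let last = path.last().expect("paths are assumed non-empty");
            pending_amt_unsent += last.fee_msat;
            max_unsent_cltv_delta = cmp::max(max_unsent_cltv_delta, last.cltv_expiry_delta);
        }
    }
    (pending_amt_unsent, max_unsent_cltv_delta)
}
```

Choosing the maximum rather than the minimum only matters when the failed paths disagree on the final delta, which, as noted above, should not happen when paying a single invoice.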

Err(PaymentSendFailure::PartialFailure {
results,
payment_id,
failed_paths_retry: if pending_amt_unsent != 0 {
Contributor

Hmm, not sure how annoying this would be to implement rn, but I think it'd make more sense to consider MonitorUpdateFailed (if TemporaryFailure) a success, so that it doesn't result in a PartialFailure (since, iiuc, PartialFailure is mainly for retry purposes?). Then maybe failed_paths_retry could be a non-Option.

Collaborator Author

Hmm, can we leave it for a followup? I don't want to touch any more code than strictly necessary here.
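
To illustrate why `failed_paths_retry` is an `Option` here, the following is a rough sketch with hypothetical types (`RetryData`, `SendOutcome`, `pay_with_retries`), not the actual `InvoicePayer` code. `None` stands for a partial failure that leaves nothing to re-send, and `Some` carries the data needed to retry only the failed portion.

```rust
/// Hypothetical retry data: just the amount that still needs to be sent.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct RetryData {
    pub unsent_msat: u64,
}

/// Hypothetical, trimmed-down send result.
pub enum SendOutcome {
    AllSent,
    /// Some paths went out. `None` means there is nothing left to retry.
    PartialFailure { failed_paths_retry: Option<RetryData> },
}

/// Keeps re-sending the failed portion until everything is out or the retry
/// budget is used up. Returns the number of retries used on success, or the
/// data that could not be retried.
pub fn pay_with_retries(
    mut send: impl FnMut(Option<RetryData>) -> SendOutcome,
    max_retries: usize,
) -> Result<usize, RetryData> {
    let mut retries = 0;
    let mut pending = None;
    loop {
        match send(pending) {
            SendOutcome::AllSent => return Ok(retries),
            SendOutcome::PartialFailure { failed_paths_retry: None } => return Ok(retries),
            SendOutcome::PartialFailure { failed_paths_retry: Some(data) } => {
                if retries >= max_retries {
                    return Err(data);
                }
                retries += 1;
                pending = Some(data);
            }
        }
    }
}
```

If the monitor-update case were treated as a success, as suggested above, the `None` arm would disappear and the field could carry `RetryData` directly.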

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch from 32d40c1 to ccab5c9 Compare October 28, 2021 17:22
let route = self.router.find_route(
&payer,
&params,
Some(&first_hops.iter().collect::<Vec<_>>()),
Contributor

Should this be set to None if payer.first_hops() is empty? Guess that way payers could implicitly specify using the network graph first-hop channels

Collaborator Author

Hmm, I think generally not. Like, if we have all of our peers disconnected and no available channels, we should just outright fail all payments instead of adding HTLC updates that won't complete until the peer comes back online (i.e., creating stuck payments).
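
The distinction between `None` and an empty `Some` can be sketched as follows. This is an illustration only: `ChannelId`, `RouteError`, and `first_hop_candidates` are hypothetical names, and the real router's behaviour is assumed from the discussion above rather than copied from its code.

```rust
/// Hypothetical channel identifier for illustration only.
pub type ChannelId = u64;

#[derive(Debug, PartialEq)]
pub enum RouteError {
    NoUsableChannels,
}

/// `None` means "fall back to the channels known from the network graph".
/// `Some` with an empty slice means "we currently have no usable channels",
/// which fails immediately instead of queueing a payment that would get stuck.
pub fn first_hop_candidates(
    first_hops: Option<&[ChannelId]>,
    graph_channels: &[ChannelId],
) -> Result<Vec<ChannelId>, RouteError> {
    match first_hops {
        None => Ok(graph_channels.to_vec()),
        Some(hops) if hops.is_empty() => Err(RouteError::NoUsableChannels),
        Some(hops) => Ok(hops.to_vec()),
    }
}
```

Always passing `Some(...)`, even when empty, is what makes the "all peers disconnected" case fail fast.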

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch 2 times, most recently from 072586e to 4b31eea Compare October 28, 2021 18:48
@jkczyz jkczyz self-requested a review October 28, 2021 20:08
Contributor

@valentinewallace valentinewallace left a comment

LGTM mod CI and squash

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch from 4b31eea to f2ab405 Compare October 28, 2021 20:15
@TheBlueMatt
Collaborator Author

TheBlueMatt commented Oct 28, 2021

Went ahead and squashed since there were no other reviewers yet, with only a trivial change to fix the build:

$ git diff-tree -U1 4b31eea0d f2ab405ab
diff --git a/lightning-invoice/src/payment.rs b/lightning-invoice/src/payment.rs
index e51ccf296..489d61776 100644
--- a/lightning-invoice/src/payment.rs
+++ b/lightning-invoice/src/payment.rs
@@ -240,3 +240,3 @@ where
 					};
-					if has_expired(params) {
+					if has_expired(&params) {
 						log_trace!(self.logger, "Invoice expired prior to first send for payment {}", log_bytes!(payment_hash.0));
$

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch from 47864f1 to c8e24b7 Compare October 29, 2021 20:21
@TheBlueMatt
Collaborator Author

Rebased after merge of #1144.

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch from c8e24b7 to 54b710f Compare October 29, 2021 20:25
Contributor

@jkczyz jkczyz left a comment

Largely looks good. Comments are mainly about code organization to make the logic easier to follow.

let payment_id = Some(invoice_payer.pay_invoice(&invoice).unwrap());
std::thread::sleep(Duration::from_secs(2));
Contributor

You shouldn't need to sleep in the test. Just give an expiry in the PaymentPathFailed event that would cause the condition to be hit.

Collaborator Author

Hmm? The point of this test is that, unlike fails_paying_invoice_after_expiration, we expire while waiting to retry instead of expiring before the original pay_invoice call.

Contributor

Yeah, but as far as InvoicePayer is concerned, it doesn't care what the original Invoice said the expiry was when retrying. It just cares what's in the PaymentPathFailed event that the test is later feeding it. So that can be changed to have expired, thus simulating the passage of time.

Collaborator Author

Ah, that suggestion was too clever by half.
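
The idea of simulating the passage of time through the event's data, rather than sleeping, can be sketched like this. `RetryParams` and this two-argument `has_expired` are hypothetical simplifications. The real helper quoted elsewhere in this PR takes only the parameters and reads the clock itself.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the retry parameters carried by a failure event.
pub struct RetryParams {
    /// Expiry as a duration since the Unix epoch, if one was set.
    pub expiry_time: Option<Duration>,
}

/// Returns true when `now` (also a duration since the epoch) is past the expiry.
/// Parameters without an expiry never expire.
pub fn has_expired(params: &RetryParams, now: Duration) -> bool {
    match params.expiry_time {
        Some(expiry) => now > expiry,
        None => false,
    }
}
```

A test can then hand the retry path a `RetryParams` whose `expiry_time` is already in the past, which exercises the "expired while waiting to retry" branch without a `sleep` call.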

Comment on lines 417 to 432
} else if let Some(params) = retry {
if self.retry_payment(!all_paths_failed, payment_id.unwrap(), *payment_hash, params).is_ok() {
// We retried at least somewhat, don't provide the PaymentPathFailed event to the user.
return;
}

// Either the payment was rejected, the maximum attempts were exceeded, or an
// error occurred when attempting to retry.
entry.remove();
} else {
unreachable!();
log_trace!(self.logger, "Payment {} missing retry params; not retrying", log_bytes!(payment_hash.0));
if *all_paths_failed { self.payment_cache.lock().unwrap().remove(payment_hash); }
}
Contributor

Could you flip these conditions as it was before so the code is not nested and error cases come first? That way you can retain the fall-through that removes the entry at the end instead of repeating the code three times.

Collaborator Author

Ah, yes, good point. I'd moved some of the removal logic into retry_payment, but it's always 1:1 with returning an Err, so it can just be at the callsite.
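
A sketch of the control-flow shape being requested, with error cases first and a single fall-through that removes the cache entry. The `PathFailure` flags and `handle_failure` function are hypothetical and only loosely modelled on the snippet under review.

```rust
use std::collections::HashMap;

/// Hypothetical outcome flags for one path failure.
pub struct PathFailure {
    pub rejected_by_dest: bool,
    pub retry_available: bool,
    pub retry_succeeds: bool,
}

/// Returns true when the failure was absorbed by a retry and should not be
/// surfaced to the user. Every give-up case reaches the same removal line.
pub fn handle_failure(cache: &mut HashMap<u64, usize>, payment: u64, f: &PathFailure) -> bool {
    if f.rejected_by_dest {
        // Error case: the recipient rejected the payment, so do not retry.
    } else if !f.retry_available {
        // Error case: there are no parameters to retry with.
    } else if f.retry_succeeds {
        // Happy path: the retry went out, keep the cache entry.
        return true;
    }
    // Single fall-through: rejected, no retry data, or the retry itself failed.
    cache.remove(&payment);
    false
}
```

Because removal happens in exactly one place, it stays in step with "the retry returned an error" without repeating the cleanup in each branch.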

Comment on lines +275 to +290
Err(PaymentSendFailure::ParameterError(e)) =>
return Err(PaymentError::Sending(PaymentSendFailure::ParameterError(e))),
Err(PaymentSendFailure::PathParameterError(e)) =>
return Err(PaymentError::Sending(PaymentSendFailure::PathParameterError(e))),
Contributor

Consider using a catch-all Err(e) instead of re-forming the errors.

Collaborator Author

Hmmmm, I kinda like doing the explicit catches. I'm pretty oldschool but I like having the verbosity because it means if we change the PaymentSendFailure enum to contain a new variant we are compiler-required to update this code vs it silently handling a new case that we don't have currently. If you feel strongly we can change it (grep for the enum always turns it up, of course), but I vaguely prefer it.

Contributor

Fair point
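
The trade-off in this thread can be seen in a small sketch. The enum below is a hypothetical, trimmed-down version with string payloads. Only the variant names are taken from the quoted diff.

```rust
/// Hypothetical, simplified error enum for illustration.
#[derive(Debug, PartialEq)]
pub enum SendFailure {
    ParameterError(String),
    PathParameterError(String),
    AllFailedRetrySafe(String),
}

#[derive(Debug, PartialEq)]
pub enum Action {
    Abort(SendFailure),
    Retry,
}

pub fn classify(err: SendFailure) -> Action {
    // There is deliberately no `_ =>` arm. Adding a variant to `SendFailure`
    // makes this match non-exhaustive, so the compiler rejects the code until
    // the new case is handled explicitly.
    match err {
        SendFailure::ParameterError(e) => Action::Abort(SendFailure::ParameterError(e)),
        SendFailure::PathParameterError(e) => Action::Abort(SendFailure::PathParameterError(e)),
        SendFailure::AllFailedRetrySafe(_) => Action::Retry,
    }
}
```

A catch-all `Err(e)` arm would be shorter, but a newly added variant would then be routed through it silently.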

let htlc_msgs = nodes[0].node.get_and_clear_pending_msg_events();
assert_eq!(htlc_msgs.len(), 2);
check_added_monitors!(nodes[0], 2);
assert!(!*event_handled.borrow());
Contributor

Not sure how useful this check is if nothing is fed into the event handler / invoice payer's event handler.

Collaborator Author

Well, that's the point? The InvoicePayer in this test shouldn't generate any user-visible events?

Contributor

InvoicePayer doesn't generate events. It simply passes them on conditionally. But you aren't giving it any events from what I can see.

Collaborator Author

Sure, but ChannelManager does? I get that ChannelManager doesn't currently generate any such events, but it's nice to test the full thing in the integration tests IMO.

Contributor

I think we want something like nodes[0].node.process_pending_events(&invoice_payer) here then. That way if ChannelManager does generate events, the test would fail if InvoicePayer delegated to the decorated event handler and wasn't supposed to.

Collaborator Author

I dropped the excess bool and just panicked in the function. We have to have a function, so no reason not to, but no need to test it in detail.
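
The outcome of this thread, a decorated handler that panics if anything reaches it, can be sketched as below. The `Event` enum, `EventHandler` trait, and `Payer` decorator are hypothetical miniatures written for this illustration and are not the crate's definitions.

```rust
/// Hypothetical minimal event type.
pub enum Event {
    PaymentPathFailed,
    PaymentSent,
}

/// Hypothetical minimal handler trait.
pub trait EventHandler {
    fn handle_event(&self, event: &Event);
}

/// Inner (user) handler that must never be reached in the test.
pub struct PanicHandler;

impl EventHandler for PanicHandler {
    fn handle_event(&self, _event: &Event) {
        panic!("no event should reach the user handler");
    }
}

/// Decorator that absorbs path failures (as if retried) and forwards the rest.
pub struct Payer<H: EventHandler> {
    pub inner: H,
}

impl<H: EventHandler> EventHandler for Payer<H> {
    fn handle_event(&self, event: &Event) {
        match event {
            Event::PaymentPathFailed => { /* retried internally, not forwarded */ }
            other => self.inner.handle_event(other),
        }
    }
}
```

With this shape, feeding pending events through the decorator makes the test fail loudly if an event is forwarded when it should have been absorbed, and no separate `event_handled` flag is needed.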

@@ -1246,4 +1246,50 @@ mod tests {
check_added_monitors!(nodes[0], 2);
assert!(!*event_handled.borrow());
}

#[test]
fn immediate_retry_on_failure() {
Contributor

Similar comments for this test.

@ariard ariard left a comment

Overall SGTM, though still building my understanding of InvoicePayer.

final_cltv_expiry_delta: invoice.min_final_cltv_expiry() as u32,
let failed_paths_data = loop {
let mut payment_cache = self.payment_cache.lock().unwrap();
match payment_cache.entry(payment_hash) {

Should you be able to use an InvoicePayer instance for multiple invoices? If yes, how do we deal with colliding invoices, i.e., ones with the same payment_hash? That's not clear to me from reading the InvoicePayer high-level documentation.

Contributor

We could key by payment_id. I think we added that after I wrote InvoicePayer but prior to merge. Can't recall if there was a reason to not use it.

Collaborator Author

Right, the idea is definitely that you'd have a single InvoicePayer for all your pending invoices. Still, it doesn't handle duplicate payment hashes explicitly - if you call it twice with the same invoice/payment_hash it will fail if the original one is still in-flight, but will go to pay if the original one is completed. I documented that the user is responsible for de-duplication.
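
The duplicate-hash behaviour described here (rejected while the first payment is in flight, accepted again once it has completed) can be sketched with the standard `HashMap` entry API. `PaymentCache`, `PayError`, and the method names are hypothetical.

```rust
use std::collections::hash_map::{Entry, HashMap};

/// Hypothetical 32-byte payment hash alias.
pub type PaymentHash = [u8; 32];

#[derive(Debug, PartialEq)]
pub enum PayError {
    DuplicateInFlight,
}

/// Tracks retry counts for in-flight payments, keyed by payment hash.
#[derive(Default)]
pub struct PaymentCache {
    in_flight: HashMap<PaymentHash, usize>,
}

impl PaymentCache {
    /// Registers a payment, refusing a hash that is still in flight.
    pub fn begin(&mut self, hash: PaymentHash) -> Result<(), PayError> {
        match self.in_flight.entry(hash) {
            Entry::Occupied(_) => Err(PayError::DuplicateInFlight),
            Entry::Vacant(slot) => {
                slot.insert(0);
                Ok(())
            }
        }
    }

    /// Forgets a payment once it has succeeded or been abandoned.
    pub fn complete(&mut self, hash: &PaymentHash) {
        self.in_flight.remove(hash);
    }
}
```

Nothing in this structure remembers completed payments, which is why de-duplication across time is left to the caller.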

Err(PaymentSendFailure::PathParameterError(e)) =>
return Err(PaymentError::Sending(PaymentSendFailure::PathParameterError(e))),
Err(PaymentSendFailure::AllFailedRetrySafe(e)) => {
if retry_count >= self.retry_attempts.0 {

IIUC, retry_attempts doesn't distinguish between MPP and simple payments; it's the global number of retries, wherever the failure is located? If I have a retry_attempts value of 3 and I send an MPP with 5 paths, I'm going to retry at most 3 local path failures.

If so, the RetryAttempts documentation could be a bit clearer.

Contributor

Yeah, it's spelled out a bit more in the module-level docs, but the docs on RetryAttempts could be similarly worded.

Collaborator Author

Yep, your understanding is correct, updated docs.
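
The "one budget per payment, shared by all paths" semantics confirmed above can be sketched as follows. `RetryTracker` and its `u64` payment key are hypothetical. The sketch keeps the counter in the map after the budget is exhausted, so later path failures for the same payment are still refused.

```rust
use std::collections::HashMap;

/// Hypothetical tracker: one retry budget per payment, shared by all MPP paths.
pub struct RetryTracker {
    max_retries: usize,
    attempts: HashMap<u64, usize>,
}

impl RetryTracker {
    pub fn new(max_retries: usize) -> Self {
        Self { max_retries, attempts: HashMap::new() }
    }

    /// Called on each path failure. Returns true if a retry may be attempted.
    /// The count is per payment, not per path.
    pub fn try_retry(&mut self, payment: u64) -> bool {
        let used = self.attempts.entry(payment).or_insert(0);
        if *used >= self.max_retries {
            return false;
        }
        *used += 1;
        true
    }
}
```

With a budget of 3 and five failing paths of one payment, the first three failures are retried and the last two are not.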

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch from 54b710f to 42b0cd7 Compare October 30, 2021 01:41
This will allow `InvoicePayer` to properly retry payments that only
partially failed to send.

Users can provide anything they want as `RouteParameters` so we
shouldn't assume any fields are set any particular way, including
`expiry_time` being set at all.
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch 2 times, most recently from a4e15f3 to c42ee6a Compare October 30, 2021 02:34
Contributor

@jkczyz jkczyz left a comment

Just a few comments but otherwise I think we can handle pay/retry refactors in a follow-up.

let htlc_msgs = nodes[0].node.get_and_clear_pending_msg_events();
assert_eq!(htlc_msgs.len(), 2);
check_added_monitors!(nodes[0], 2);
assert!(!*event_handled.borrow());
Contributor

I think we want something like nodes[0].node.process_pending_events(&invoice_payer) here then. That way if ChannelManager does generate events, the test would fail if InvoicePayer delegated to the decorated event handler and wasn't supposed to.

Comment on lines +283 to +300
break None;
},
Err(PaymentSendFailure::PartialFailure { results: _, failed_paths_retry, payment_id }) => {
if let Some(retry_data) = failed_paths_retry {
entry.insert(retry_count);
break Some((retry_data, payment_id));
Contributor

Agreed about being able to reuse retry_handler, so might require a bit of re-work. Doesn't have to be done on this PR.

@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch from c42ee6a to 79b9ee8 Compare October 30, 2021 23:30
Contributor

@jkczyz jkczyz left a comment

ACK 79b9ee8

This rewrites a good chunk of the retry logic in `InvoicePayer` to
address two issues:
 * it was not considering that the return value of `send_payment` (and
   `retry_payment`) may indicate a failure on some paths but not
   others,
 * it was not considering that more failures may still come later
   when removing elements from the retry count map. This could
   result in us seeing an MPP-partial-failure, failing to retry,
   removing the retries count entry, and then retrying other parts,
   potentially forever.

This tests the multi-part-single-failure-immediately fixes in the
previous commit.
@TheBlueMatt TheBlueMatt force-pushed the 2021-10-payment-id-on-partial-failure branch from 79b9ee8 to 199d258 Compare October 31, 2021 17:43
@TheBlueMatt
Collaborator Author

TheBlueMatt commented Oct 31, 2021

Squashed fixups without other changes. Will land after CI.

$ git diff-tree -U1 79b9ee84 199d258b
$

@TheBlueMatt TheBlueMatt merged commit 61341df into lightningdevkit:main Oct 31, 2021