-
Notifications
You must be signed in to change notification settings - Fork 404
Improve Robustness of Inbound MPP Claims Across Restart #1434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Robustness of Inbound MPP Claims Across Restart #1434
Conversation
Wanted to throw out the idea of breaking this PR into the first 3 commits and the last 4 (or so)? I think people have mentioned preferring smaller PRs when feasible |
Pulled the first two commits into #1435, the rest are all really just "update claim_funds/fail_htlc and then some tests". |
721c415
to
49fe7c9
Compare
Remaining issue: (a) claim payment, (b) crash, (c) restart see PaymentReceived + PaymentClaimed events still in processing queue but payment has been claimed (on-chain), (d) sender sends again (e) call claim_funds in response to the dup PaymentRecived claiming the payment duplicatively. Not a big deal but should fix it by dropping the PaymentReceived event from the event queue on startup if we do a re-claim. |
49fe7c9
to
4e24f02
Compare
lightning/src/ln/channelmanager.rs
Outdated
payment_preimage: Some(payment_preimage), | ||
} | ||
}); | ||
} | ||
match self.claim_funds_from_hop(channel_state.as_mut().unwrap(), htlc.prev_hop, payment_preimage) { | ||
ClaimFundsFromHop::MonitorUpdateFail(pk, err, _) => { | ||
if let msgs::ErrorAction::IgnoreError = err.err.action { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we get a permanent failure here (but claimed_any_htlcs
is true), will the permfailed monitor be able to claim the htlc on-chain? Just concerned it won't have learned the preimage in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I mean there's no good answer here (and this PR doesn't change that). TBH I think we need to rethink the whole perm-failure logic a bit. That's kinda the intent behind #1106, but ultimately the "what does perm-failure mean" topic isn't super well-defined currently. Lets leave this for a followup because there's nothing changed here, but, does perm-failure mean "we updated the in-memory copy and nothing else" or "we failed to update anything" needs addressing.
if let Some(channel) = by_id.get_mut(&previous_channel_id) { | ||
channel.claim_htlc_while_disconnected_dropping_mon_update(claimable_htlc.prev_hop.htlc_id, payment_preimage, &args.logger); | ||
} | ||
if let Some(previous_hop_monitor) = args.channel_monitors.get(&claimable_htlc.prev_hop.outpoint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not worth checking that this is the same monitor as from L6781, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My line numbers aren't matching up, but I dont think there's any harm in providing extra preimages for monitors in this super-rare case. They'll get pruned anyway, but even if not it wouldn't matter.
|
||
nodes[0].node.handle_update_fulfill_htlc(&nodes[2].node.get_our_node_id(), &cs_updates.update_fulfill_htlcs[0]); | ||
commitment_signed_dance!(nodes[0], nodes[2], cs_updates.commitment_signed, false, true); | ||
expect_payment_sent!(nodes[0], payment_preimage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kinda throwing me off that nodes[3]
is not claiming the payment for the persisted monitor backwards, only the crashed monitor. To check my understanding, the only difference would be that there would be no PaymentSent
generated here, it'd only be generated for the first successful backwards claim, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, just on-chain. I didn't bother writing the on-chain eforcement checks cause they're a pain and we have lots of coverage there already.
4e24f02
to
3c8c013
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went through each commit, largely looks good.
lightning/src/util/events.rs
Outdated
/// | ||
/// # Note | ||
/// LDK will not stop an inbound payment from being paid multiple times, so multiple | ||
/// `PaymentReceived` events may be generated for the same payment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/PaymentReceived/PaymentClaimed iiuc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that's confusing, I did mean to leave it as-is but added a second sentence.
3c8c013
to
6c6d939
Compare
lightning/src/ln/channelmanager.rs
Outdated
for (_, monitor) in args.channel_monitors.iter() { | ||
for (payment_hash, payment_preimage) in monitor.get_stored_preimages() { | ||
if let Some(claimable_htlcs) = claimable_htlcs.remove(&payment_hash) { | ||
log_info!(args.logger, "Re-claimaing HTLCs with payment hash {} due to partial-claim.", log_bytes!(payment_hash.0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Re-claiming
It took me a second to figure out how we know this is a re-claim, and definitely not a new claim. Is it because the monitor knows the preimage yet the htlc is still in the claimable_htlcs
map, so the monitor must be ahead of the ChannelManager?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, yea. Technically its possible its a duplicate payment and the previous payment was claimed but not fully revoked before restart, but its not really worth checking for that case IMO. I updated the log line a bit.
return; | ||
} | ||
if claimable_amt_msat != expected_amt_msat.unwrap() { | ||
log_info!(self.logger, "Attempted to claim an incomplete payment, expected {} msat, had {} available to claim.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the API be more ergonomic if we returned a Result<(), Error>
from this method, for cases like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like that sends the wrong impression - an Ok(())
does not imply that the payment is fully claimed, which is ultimately why the bool was removed from the return value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we expect these errors to be super rare? They do seem useful. Returning an Option<Error>
is weird though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, basically an error should only occur in a strange restart case of if a user took several blocks to try to claim a payment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like that sends the wrong impression - an
Ok(())
does not imply that the payment is fully claimed, which is ultimately why the bool was removed from the return value.
Wouldn't we want to return the Err
variant in this case? User attempts to claim a payment that is no longer claimable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I feel like that's confusing - you call claim_funds
and get an Err
, a user would reasonably assume that the payment was not claimed/is still pending, but in fact it could have been claimed!
b8c99f5
to
bf9ec6d
Compare
Rebased after merge of #1435. |
bf9ec6d
to
763add0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK after CI and another reviewer
return; | ||
} | ||
if claimable_amt_msat != expected_amt_msat.unwrap() { | ||
log_info!(self.logger, "Attempted to claim an incomplete payment, expected {} msat, had {} available to claim.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we expect these errors to be super rare? They do seem useful. Returning an Option<Error>
is weird though
43907a8
to
48e5842
Compare
Ended up having to rewrite this to some extent as we're moving towards removing the |
48e5842
to
b4cf258
Compare
Codecov Report
@@ Coverage Diff @@
## main #1434 +/- ##
==========================================
+ Coverage 90.90% 92.10% +1.19%
==========================================
Files 77 80 +3
Lines 42201 53330 +11129
Branches 42201 53330 +11129
==========================================
+ Hits 38364 49120 +10756
- Misses 3837 4210 +373
Continue to review full report at Codecov.
|
f9515c4
to
7d4faf6
Compare
0fdabf5
to
fb7f2b6
Compare
While the HTLC-claim process happens across all MPP parts under one lock, this doesn't imply that they are claimed fully atomically on disk. Ultimately, an application can crash after persisting one `ChannelMonitorUpdate` out of multiple monitor updates needed for the full claim. Previously, this would leave us in a very bad state - because of the all-channels-available check in `claim_funds` we'd refuse to claim the payment again on restart (even though the `PaymentReceived` event will be passed to the user again), and we'd end up having partially claimed the payment! The fix for the consistency part of this issue is pretty straightforward - just check for this condition on startup and complete the claim across all channels/`ChannelMonitor`s if we detect it. This still leaves us in a confused state from the perspective of the user, however - we've actually claimed a payment but when they call `claim_funds` we return `false` indicating it could not be claimed.
fb7f2b6
to
319df3a
Compare
Squashed without changes:
|
@@ -6827,7 +6839,7 @@ impl<'a, Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> | |||
for (_, monitor) in args.channel_monitors.iter() { | |||
for (payment_hash, payment_preimage) in monitor.get_stored_preimages() { | |||
if let Some(claimable_htlcs) = claimable_htlcs.remove(&payment_hash) { | |||
log_info!(args.logger, "Re-claimaing HTLCs with payment hash {} due to partial-claim.", log_bytes!(payment_hash.0)); | |||
log_info!(args.logger, "Re-claiming HTLCs with payment hash {} as we've released the preimage to a ChannelMonitor!", log_bytes!(payment_hash.0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is in the wrong commit but wtv
319df3a
to
cc25213
Compare
This replaces the return value of `claim_funds` with an event. It does not yet change behavior in any material way.
If we crashed during a payment claim and then detected a partial claim on restart, we should ensure the user is aware that the payment has been claimed. We do so here by using the new partial-claim detection logic to create a `PaymentClaimed` event.
As additional sanity checks, before claiming a payment, we check that we have the full amount available in `claimable_htlcs` that the payment should be for. Concretely, this prevents one somewhat-absurd edge case where a user may receive an MPP payment, wait many *blocks* before claiming it, allowing us to fail the pending HTLCs and the sender to retry some subset of the payment before we go to claim. More generally, this is just good belt-and-suspenders against any edge cases we may have missed.
`ChannelManager::fail_htlc_backwards`' bool return value is quite confusing - just because it returns false doesn't mean the payment wasn't (already) failed. Worse, in some race cases around shutdown where a payment was claimed before an unclean shutdown and then retried on startup, `fail_htlc_backwards` could return true even though (a duplicate copy of the same payment) was claimed, but the claim event has not been seen by the user yet. While its possible to use it correctly, its somewhat confusing to have a return value at all, and definitely lends itself to misuse. Instead, we should push users towards a model where they don't care if `fail_htlc_backwards` succeeds - either they've locally marked the payment as failed (prior to seeing any `PaymentReceived` events) and will fail any attempts to pay it, or they have not and the payment is still receivable until its timeout time is reached. We can revisit this decision based on user feedback, but will need to very carefully document the potential failure modes here if we do.
cc25213
to
8a56702
Compare
Squashed, only change since yesterday was taking @jkczyz's suggested wording from #1434 (comment) |
6bc4ad0
to
531d6c8
Compare
Oops lol, duh, force-pushed the last commit to actually change the references to |
This rewrites our claim pipeline to be substantially more robust against particularly poorly timed crashes, including the potentially partial-MPP-claim-causing bug described in
Ensure all HTLCs for a claimed payment are claimed on startup
!It also adds some belt-and-suspenders checks when claiming and removes some API bits that I find somewhat confusing and likely hard to use (without replacement, so hopefully users can live without it, otherwise we'll have to revert the last commit and do some really careful documentation).