Bugfix/htlc flush shutdown #8145
Conversation
This commit introduces many of the most common functions you will want to use with the Option type. Not all of them are used immediately in this PR.
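For context, here is a minimal sketch of the kind of Option helpers such a commit typically adds. It is illustrative only: the package name `fn` and the helper names below are placeholders, not necessarily what this PR introduces.

```go
package fn

// Option is an illustrative optional value: either it holds an A or it is
// empty. This is a sketch, not necessarily the exact type added in this PR.
type Option[A any] struct {
	value A
	isSet bool
}

// Some wraps a value in a populated Option.
func Some[A any](a A) Option[A] {
	return Option[A]{value: a, isSet: true}
}

// None returns an empty Option.
func None[A any]() Option[A] {
	return Option[A]{}
}

// IsSome reports whether the Option holds a value.
func (o Option[A]) IsSome() bool {
	return o.isSet
}

// UnwrapOr returns the contained value, or the supplied default when empty.
func (o Option[A]) UnwrapOr(def A) A {
	if o.isSet {
		return o.value
	}
	return def
}

// MapOption lifts a plain function so it operates on Options, preserving
// emptiness. It is a free function because Go methods cannot introduce new
// type parameters.
func MapOption[A, B any](f func(A) B) func(Option[A]) Option[B] {
	return func(o Option[A]) Option[B] {
		if o.isSet {
			return Some(f(o.value))
		}
		return None[B]()
	}
}
```

A caller would write, for example, `MapOption(strconv.Itoa)(Some(42)).UnwrapOr("none")`.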
Force-pushed from db1d08a to 4154113
htlcswitch/link_test.go (outdated)
TODO: Figure out an alternative test to this, since just deleting this test is unacceptable.
Note for Reviewers: If you have opinions on how I should go about testing this new asynchronous shutdown procedure, please reply to this thread with your ideas.
Test cases I want to add:
- Ensure that the flush operation itself works: it must block update_adds while flushing (a minimal sketch of this case follows the list).
- Ensure that the switch doesn't forward to the link while it is flushing.
- Ensure that an inbound shutdown results in an immediate outbound shutdown and then puts the link into a flushing state.
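A minimal, self-contained sketch of how the first case might be shaped, using a toy link instead of the real htlcswitch test harness; every name below is hypothetical.

```go
package flushtest

import (
	"errors"
	"testing"
)

// errLinkFlushing is what the toy link returns for adds that arrive while it
// is flushing.
var errLinkFlushing = errors.New("link is flushing")

// fakeLink is a stand-in for a channel link that only models the flushing
// flag; it is not the real htlcswitch.ChannelLink.
type fakeLink struct {
	flushing bool
}

func (f *fakeLink) Flush()           { f.flushing = true }
func (f *fakeLink) IsFlushing() bool { return f.flushing }

// AddHTLC rejects update_adds once the link has started flushing.
func (f *fakeLink) AddHTLC() error {
	if f.flushing {
		return errLinkFlushing
	}
	return nil
}

// TestFlushBlocksAdds covers the first case above: before Flush an add is
// accepted, after Flush it is rejected.
func TestFlushBlocksAdds(t *testing.T) {
	l := &fakeLink{}
	if err := l.AddHTLC(); err != nil {
		t.Fatalf("add before flush should succeed: %v", err)
	}

	l.Flush()
	if err := l.AddHTLC(); !errors.Is(err, errLinkFlushing) {
		t.Fatalf("expected errLinkFlushing, got %v", err)
	}
}
```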
Force-pushed from 7601fef to 5401d4e
htlcswitch/interfaces.go (outdated)
Note for reviewers: I am having second thoughts about this, as I think that having this kind of "unsafe" method on the interface isn't a good idea. It may be better to replace it with a "true" shutdown method that implicitly calls the flush before invoking an internal, unsafe shutdownHtlcManager.
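To make that alternative concrete, here is a hedged, self-contained sketch of such a wrapper. The type and method names (`link`, `flush`, `shutdownHtlcManager`, `InitShutdown`) are hypothetical stand-ins, not lnd's actual API.

```go
package main

import "fmt"

// link is a toy stand-in for the htlcswitch link, modelling only the pieces
// needed to show the wrapper.
type link struct {
	postFlushHook func()
}

// flush records the continuation that should run once the channel is clean.
func (l *link) flush(hook func()) {
	l.postFlushHook = hook
}

// shutdownHtlcManager stands in for the internal, "unsafe" teardown that the
// interface would no longer expose directly.
func (l *link) shutdownHtlcManager() {
	fmt.Println("htlcManager stopped")
}

// InitShutdown is the "true" shutdown method sketched in the note above: it
// always flushes first and performs the unsafe teardown only from inside the
// post-flush hook, so callers never touch the unsafe internals directly.
func (l *link) InitShutdown() {
	l.flush(func() {
		l.shutdownHtlcManager()
	})
}

func main() {
	l := &link{}
	l.InitShutdown()

	// In the real link the hook would fire once the channel is clean; here
	// we invoke it directly just to show the ordering.
	l.postFlushHook()
}
```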
Force-pushed from 7a5d9ba to 62dc6b6
Force-pushed from 62dc6b6 to 75dd860
This commit removes the requirement that the channel state be clean prior to shutdown. Now we invoke the new flush API, make the htlcManager quit, and remove the link from the switch when the flush is complete.
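The control flow that commit message describes can be pictured roughly as below. The types are toy stand-ins (the real lnd signatures differ); the sketch only shows the shape: flush first, and remove the link from the switch from within the post-flush continuation.

```go
package main

import "fmt"

// htlcSwitch is a toy switch that only knows how to remove links.
type htlcSwitch struct {
	links map[string]*flushableLink
}

func (s *htlcSwitch) RemoveLink(cid string) {
	delete(s.links, cid)
	fmt.Println("link removed from switch:", cid)
}

// flushableLink models just the new flush API: register a continuation to
// run once the channel state is clean.
type flushableLink struct {
	cid  string
	hook func()
}

func (l *flushableLink) Flush(onClean func()) error {
	l.hook = onClean
	return nil
}

func main() {
	sw := &htlcSwitch{links: map[string]*flushableLink{}}
	lnk := &flushableLink{cid: "chan-1"}
	sw.links[lnk.cid] = lnk

	// Shutdown handling: flush, and only remove the link once the flush
	// completes.
	_ = lnk.Flush(func() {
		sw.RemoveLink(lnk.cid)
	})

	// Simulate the link's htlcManager reaching a clean state.
	lnk.hook()
}
```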
Force-pushed from 75dd860 to e6a55b4
bhandras left a comment
Only reviewed the MVar for now as I think it needs some changes.
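For readers unfamiliar with the primitive under review: an MVar is a box that is either empty or holds exactly one value, where takes block on empty and puts block on full. A minimal Go sketch of those semantics (not the implementation in this PR), built on a one-slot buffered channel:

```go
package main

import "fmt"

// MVar is a box that is either empty or holds one value of type A.
type MVar[A any] struct {
	box chan A
}

// NewEmptyMVar returns an MVar with nothing inside.
func NewEmptyMVar[A any]() *MVar[A] {
	return &MVar[A]{box: make(chan A, 1)}
}

// Put blocks until the MVar is empty, then stores the value.
func (m *MVar[A]) Put(a A) {
	m.box <- a
}

// Take blocks until a value is present, then removes and returns it.
func (m *MVar[A]) Take() A {
	return <-m.box
}

// TryTake returns the value and true if one is present, without blocking.
func (m *MVar[A]) TryTake() (A, bool) {
	select {
	case a := <-m.box:
		return a, true
	default:
		var zero A
		return zero, false
	}
}

func main() {
	m := NewEmptyMVar[int]()
	m.Put(42)
	fmt.Println(m.Take())

	_, ok := m.TryTake()
	fmt.Println(ok) // false: empty again
}
```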
Crypt-iQ left a comment
I don't think the MVar is necessary here; the same thing can be accomplished in a simpler way.
One thing that we'll want to add is retransmission of Shutdown, which will likely require a database flag to let us know whether we've ever sent Shutdown before.
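A rough sketch of that retransmission flag. In lnd this would presumably be persisted in channeldb; the interface and names below are hypothetical and the store here is in-memory only.

```go
package main

import "fmt"

// flagStore is a stand-in for the database flag referred to above.
type flagStore interface {
	MarkShutdownSent(chanID string) error
	ShutdownSent(chanID string) (bool, error)
}

// memStore is an in-memory implementation, good enough to show the flow.
type memStore map[string]bool

func (m memStore) MarkShutdownSent(chanID string) error { m[chanID] = true; return nil }

func (m memStore) ShutdownSent(chanID string) (bool, error) { return m[chanID], nil }

// maybeRetransmitShutdown shows the reconnect-time check: if we have ever
// sent Shutdown on this channel, we must send it again.
func maybeRetransmitShutdown(db flagStore, chanID string, send func()) error {
	sent, err := db.ShutdownSent(chanID)
	if err != nil {
		return err
	}
	if sent {
		send()
	}
	return nil
}

func main() {
	db := memStore{}
	_ = db.MarkShutdownSent("chan-1")

	_ = maybeRetransmitShutdown(db, "chan-1", func() {
		fmt.Println("retransmitting Shutdown for chan-1")
	})
}
```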
    p.cfg.Switch.RemoveLink(cid)
    return nil
    return chanLink.Flush(func() {
Some awareness/signal needs to be added to the chancloser or we'll run into spec violations. Consider the case where we're the responder in the coop close flow:
- We receive a `Shutdown` and call `fetchActiveChanCloser`.
- `fetchActiveChanCloser` calls `tryLinkShutdown`, which calls `chanLink.Flush`. `Flush` is non-blocking, so this returns immediately. We can't make this blocking as this would prevent other coop close requests from continuing.
- `ProcessCloseMsg` is called on the received `Shutdown` message. This will make us send a `Shutdown` to the peer.
- It's possible that the `ChannelLink` sends an htlc after this in `handleDownstreamUpdateAdd`. This would be a spec violation.
The case where we are the coop close initiator leads to the same scenario above, but the initiator will actually send a premature ClosingSigned when the chancloser receives the peer's Shutdown.
Shutdown should only be sent if it's not possible for the link to send a htlc after Shutdown due to concurrency.
OK, if I understand this correctly, you're saying:
It's not sufficient to handle this in EligibleToForward because the add may have already been placed in the downstream mailbox but not yet processed by the htlcManager. It must be handled in handleDownstreamPkt itself; otherwise something may get into the link's process queue, then we flush and send shutdown to our peer, and when we process that downstream add we still send it out because it was already in the queue.
I don't think that automatically implies we need to add something to the ChanCloser, but I do think it means that something needs to change here.
> It's not sufficient to handle this in EligibleToForward because it may have already been added to the downstream mailbox but not processed by the htlcManager
My understanding is that the change in EligibleToForward is necessary but not sufficient. We also need a way to cancel back anything un-acked that might be sitting in the mailbox. We can do that with a similar flush check in handleDownstreamUpdateAdd, which'll then allow us to cancel stuff within the mailbox (mailbox.FailAdd) back to the switch.
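A toy sketch of that downstream check, with stand-in types (the real lnd mailbox, packet, and link signatures differ): while the link is flushing, a downstream add is failed back toward the switch instead of being added to the commitment.

```go
package main

import "fmt"

// addPacket is a toy downstream add packet.
type addPacket struct{ htlcID uint64 }

type mailbox struct{}

// FailAdd mirrors the idea of mailbox.FailAdd: cancel the un-acked add back
// toward the switch so the upstream HTLC can be failed.
func (m *mailbox) FailAdd(pkt addPacket) {
	fmt.Println("failing add back to switch:", pkt.htlcID)
}

type link struct {
	flushing bool
	mailbox  *mailbox
}

func (l *link) IsFlushing() bool { return l.flushing }

// handleDownstreamUpdateAdd sketches the extra check discussed above: a
// flushing link never puts the add on the commitment.
func (l *link) handleDownstreamUpdateAdd(pkt addPacket) {
	if l.IsFlushing() {
		l.mailbox.FailAdd(pkt)
		return
	}
	fmt.Println("adding HTLC to commitment:", pkt.htlcID)
}

func main() {
	l := &link{flushing: true, mailbox: &mailbox{}}
	l.handleDownstreamUpdateAdd(addPacket{htlcID: 7})
}
```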
    // net new HTLCs rather than forwarding them. This is the first
    // opportunity we have to bounce invalid HTLC adds without
    // doing a force-close.
    if l.IsFlushing() {
IMO this doesn't need to be caught here; it can be caught earlier in handleUpstreamMsg when the add is first sent across. It should also only apply if we've received the peer's Shutdown.
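A sketch of catching it in the upstream path under that condition. The types and handler below are hypothetical stand-ins; the point is only that the check is gated on having received the peer's Shutdown.

```go
package main

import (
	"errors"
	"fmt"
)

// updateAddHTLC is a toy message type; the real handler switches on lnwire
// messages.
type updateAddHTLC struct{ id uint64 }

// upstreamState tracks only what this sketch needs: whether the peer has
// already sent us Shutdown.
type upstreamState struct {
	peerShutdownReceived bool
}

var errAddAfterShutdown = errors.New("peer sent update_add_htlc after shutdown")

// handleUpstreamAdd shows the earlier check suggested above: once the peer's
// Shutdown has been received, any further add from them is a protocol
// violation and can be rejected immediately.
func (s *upstreamState) handleUpstreamAdd(msg updateAddHTLC) error {
	if s.peerShutdownReceived {
		return errAddAfterShutdown
	}
	fmt.Println("processing add", msg.id)
	return nil
}

func main() {
	s := &upstreamState{peerShutdownReceived: true}
	if err := s.handleUpstreamAdd(updateAddHTLC{id: 1}); err != nil {
		fmt.Println("reject:", err)
	}
}
```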
Additionally, this should also be caught in handleDownstreamMsg or in handleDownstreamUpdateAdd
I talked to @Roasbeef about this. Earlier versions of this did exactly what you suggest, but the issue is how you'd respond in the case where they issued the add. From a code-cleanliness perspective I would love to do this in handleUpstreamMsg, but the issue is that it will result in more aggressive behavior than is necessary.
I don't think it's safe to fail HTLCs here. If we've previously forwarded the HTLC, it may exist on the outgoing link's commitment tx. If we then start up and call resolveFwdPkgs, it will sequentially call processRemoteAdds for each non-full forwarding package. If we then call Flush at some point during this, we'll cancel one of these HTLCs on the incoming side. Since the outgoing side may still have the HTLC, we'd lose the value of the HTLC.
My understanding from my conversation with @Roasbeef yesterday was that this is the function responsible for actually locking in upstream HTLCs and then deciding to forward them after they have been locked in.
My understanding of my change here is that if we were in this flushing state when these irrevocable commitments were made to net-new HTLCs, then we haven't forwarded them yet.
What I am getting from your comment here is that if we are mid-way through this process of dealing with the remote adds, then we shouldn't be able to put things into a flushing state, which I agree with.
That said, I'd like to understand what you mean by "if we've previously forwarded the HTLC..." because as I understand things that shouldn't be possible.
On link startup, resolveFwdPkgs is called, which will re-forward adds, settles, and fails to ensure that these messages reach the intended link (this link will be different from the one doing the re-forwarding). This is done in processRemoteSettleFails and in processRemoteAdds. We do this because it ensures that if our node crashed or shut down before these messages reached the other link, we're able to recover on startup. So my point above was that if the add reached the outgoing link and then we end up canceling here due to flushing, we're in a bad state. One way to get around this might be to not cancel if the state is FwdStateProcessed (see several lines below), but I'd have to see if there is any funny business with exit hops.
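A sketch of the workaround floated at the end of that comment, with toy types (the real forwarding-package states live in lnd's channeldb and may be named differently): only cancel back adds that were never processed, so a replayed add that might already sit on an outgoing commitment is left alone.

```go
package main

import "fmt"

// fwdState is a toy version of the forwarding-package state mentioned above.
type fwdState int

const (
	fwdStateLockedIn fwdState = iota
	fwdStateProcessed
)

type remoteAdd struct {
	htlcID uint64
	state  fwdState
}

// shouldCancelWhileFlushing only cancels back adds that have never been
// processed/forwarded; a retransmitted add that may already exist on the
// outgoing link's commitment is left alone.
func shouldCancelWhileFlushing(add remoteAdd, flushing bool) bool {
	if !flushing {
		return false
	}
	return add.state != fwdStateProcessed
}

func main() {
	fresh := remoteAdd{htlcID: 1, state: fwdStateLockedIn}
	replayed := remoteAdd{htlcID: 2, state: fwdStateProcessed}

	fmt.Println(shouldCancelWhileFlushing(fresh, true))    // true: safe to cancel
	fmt.Println(shouldCancelWhileFlushing(replayed, true)) // false: may be outgoing
}
```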
> So my point above was that if the add reached the outgoing link and then we end up canceling here due to flushing, we're in a bad state.
If we've reached this point after an internal retransmission, then my understanding is that we haven't written anything to the forwarding packages for that entry, so it's safe to cancel back. Otherwise, the calls to sendHTLCError above would introduce the same issue (cancel back the HTLC when we have something locked in on the outgoing link).
The forwarding package is only written once the response to the add is signed for on the incoming link, so it's possible that an outgoing HTLC already exists. The calls to sendHTLCError above and below should be reproducible, such that if this is a retransmission and it fails, any earlier invocation should have also failed (otherwise we have a bug).
This isn't the case with the flush logic, because an earlier processRemoteAdds may have forwarded the HTLC, and now, if we've come back up and are flushing, we could hit this error while the HTLC is outgoing.
    // After we are finished processing the event, if the link is
    // flushing, we check if the channel is clean and invoke the
    // post-flush hook if it is.
    if l.IsFlushing() && l.channel.IsChannelClean() {
We can only shut down here if we've also received their Shutdown. IsFlushing may be set when we're the initiator, meaning we haven't received their Shutdown. The counterparty is still allowed to add HTLCs in this case according to the spec.
I think what we'll need is one variable to know when we've received their Shutdown and another variable to know when we've sent our Shutdown.
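A sketch of those two variables as simple atomic flags; the names are hypothetical, and a real implementation would also need persistence for retransmission.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// shutdownFlags holds the two variables suggested above: one for "we sent
// Shutdown" and one for "we received their Shutdown".
type shutdownFlags struct {
	sent     atomic.Bool
	received atomic.Bool
}

func (f *shutdownFlags) MarkSent()     { f.sent.Store(true) }
func (f *shutdownFlags) MarkReceived() { f.received.Store(true) }

// ReadyToFlushAndQuit reports whether both sides have exchanged Shutdown,
// i.e. no further update_add_htlc is allowed in either direction.
func (f *shutdownFlags) ReadyToFlushAndQuit() bool {
	return f.sent.Load() && f.received.Load()
}

func main() {
	var f shutdownFlags
	f.MarkSent()
	fmt.Println(f.ReadyToFlushAndQuit()) // false: peer may still add HTLCs

	f.MarkReceived()
	fmt.Println(f.ReadyToFlushAndQuit()) // true
}
```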
I'll take a closer look at the spec and see what we can do. I was trying to avoid duplicating the ChanCloser state logic inside of the link, since that was a barrier in the PR that precedes this one. As far as I know, the ChanCloser is responsible for the state machine from the "initial state" where no shutdowns have been sent, all the way through the closing_signed exchange. Is there a reason you think this has to be duplicated in the link itself?
Also keep in mind that if we had a "flush" operation that wasn't a shutdown (dyncomm execution), then this wouldn't translate properly. Conceptually, all this does is execute the continuation at the first clean state after the flush has been initiated. The shutdown use case puts the shutdown logic into that continuation.
Currently flushing means that Flush has been called once (either we've sent Shutdown or the peer has), but I think there could be a new interface call that means we've sent and received Shutdown. The issue is that if we've sent Shutdown, the peer can still send HTLCs or update_fee and if we've exited, the peer will still be waiting for that HTLC or update_fee to get resolved. I'm not sure whether this translates well to dynamic commitments, but ideally we'd find a way to do that as well
So the Flush can only begin once we've both received and sent shutdown, then.
Deleted the staging branch, but this should go into master. I can't re-open this for some reason.
@Crypt-iQ can you hint at what the easier way might be? Maybe it's an …
Change Description
This change fixes an issue where we would fail a channel when a peer sent us a `shutdown` message while there were still active HTLCs: see #6039.
Steps to Test
Steps for reviewers to follow to test the change.