Bugfix/htlc flush shutdown #8145
Conversation
This commit introduces many of the most common functions you will want to use with the Option type. Not all of them are used immediately in this PR.
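For context, here is a minimal sketch of the kind of Option helpers such a commit typically adds. It is illustrative only: the package name `fn` and the helper names below are placeholders, not necessarily what this PR introduces.

```go
package fn

// Option is an illustrative optional value: either it holds an A or it is
// empty. This is a sketch, not necessarily the exact type added in this PR.
type Option[A any] struct {
	value A
	isSet bool
}

// Some wraps a value in a populated Option.
func Some[A any](a A) Option[A] {
	return Option[A]{value: a, isSet: true}
}

// None returns an empty Option.
func None[A any]() Option[A] {
	return Option[A]{}
}

// IsSome reports whether the Option holds a value.
func (o Option[A]) IsSome() bool {
	return o.isSet
}

// UnwrapOr returns the contained value, or the supplied default when empty.
func (o Option[A]) UnwrapOr(def A) A {
	if o.isSet {
		return o.value
	}
	return def
}

// MapOption lifts a plain function so it operates on Options, preserving
// emptiness. It is a free function because Go methods cannot introduce new
// type parameters.
func MapOption[A, B any](f func(A) B) func(Option[A]) Option[B] {
	return func(o Option[A]) Option[B] {
		if o.isSet {
			return Some(f(o.value))
		}
		return None[B]()
	}
}
```

A caller would write, for example, `MapOption(strconv.Itoa)(Some(42)).UnwrapOr("none")`.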
Force-pushed from db1d08a to 4154113
htlcswitch/link_test.go (outdated)
TODO: Figure out an alternative test to this, since just deleting this test is unacceptable.
Note for Reviewers: If you have opinions on how I should go about testing this new asynchronous shutdown procedure, please reply to this thread with your ideas.
Test cases I want to add:
- Ensure that the flush operation itself works: it must block update_adds while flushing (a minimal sketch of this case follows the list).
- Ensure that the switch doesn't forward to the link while it is flushing.
- Ensure that an inbound shutdown results in an immediate outbound shutdown and then puts the link into a flushing state.
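A minimal, self-contained sketch of how the first case might be shaped, using a toy link instead of the real htlcswitch test harness; every name below is hypothetical.

```go
package flushtest

import (
	"errors"
	"testing"
)

// errLinkFlushing is what the toy link returns for adds that arrive while it
// is flushing.
var errLinkFlushing = errors.New("link is flushing")

// fakeLink is a stand-in for a channel link that only models the flushing
// flag; it is not the real htlcswitch.ChannelLink.
type fakeLink struct {
	flushing bool
}

func (f *fakeLink) Flush()           { f.flushing = true }
func (f *fakeLink) IsFlushing() bool { return f.flushing }

// AddHTLC rejects update_adds once the link has started flushing.
func (f *fakeLink) AddHTLC() error {
	if f.flushing {
		return errLinkFlushing
	}
	return nil
}

// TestFlushBlocksAdds covers the first case above: before Flush an add is
// accepted, after Flush it is rejected.
func TestFlushBlocksAdds(t *testing.T) {
	l := &fakeLink{}
	if err := l.AddHTLC(); err != nil {
		t.Fatalf("add before flush should succeed: %v", err)
	}

	l.Flush()
	if err := l.AddHTLC(); !errors.Is(err, errLinkFlushing) {
		t.Fatalf("expected errLinkFlushing, got %v", err)
	}
}
```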
Force-pushed from 7601fef to 5401d4e
htlcswitch/interfaces.go (outdated)
Note for reviewers: I am having second thoughts about this, as I think that having this kind of "unsafe" method on the interface isn't a good idea. It may be better to replace it with a "true" shutdown method that implicitly calls the flush before invoking an internal, unsafe shutdownHtlcManager.
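To make that alternative concrete, here is a hedged, self-contained sketch of such a wrapper. The type and method names (`link`, `flush`, `shutdownHtlcManager`, `InitShutdown`) are hypothetical stand-ins, not lnd's actual API.

```go
package main

import "fmt"

// link is a toy stand-in for the htlcswitch link, modelling only the pieces
// needed to show the wrapper.
type link struct {
	postFlushHook func()
}

// flush records the continuation that should run once the channel is clean.
func (l *link) flush(hook func()) {
	l.postFlushHook = hook
}

// shutdownHtlcManager stands in for the internal, "unsafe" teardown that the
// interface would no longer expose directly.
func (l *link) shutdownHtlcManager() {
	fmt.Println("htlcManager stopped")
}

// InitShutdown is the "true" shutdown method sketched in the note above: it
// always flushes first and performs the unsafe teardown only from inside the
// post-flush hook, so callers never touch the unsafe internals directly.
func (l *link) InitShutdown() {
	l.flush(func() {
		l.shutdownHtlcManager()
	})
}

func main() {
	l := &link{}
	l.InitShutdown()

	// In the real link the hook would fire once the channel is clean; here
	// we invoke it directly just to show the ordering.
	l.postFlushHook()
}
```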
Force-pushed from 7a5d9ba to 62dc6b6
Force-pushed from 62dc6b6 to 75dd860
This commit removes the requirement that the channel state be clean prior to shutdown. Now we invoke the new flush API, make the htlcManager quit, and remove the link from the switch when the flush is complete.
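The control flow that commit message describes can be pictured roughly as below. The types are toy stand-ins (the real lnd signatures differ); the sketch only shows the shape: flush first, and remove the link from the switch from within the post-flush continuation.

```go
package main

import "fmt"

// htlcSwitch is a toy switch that only knows how to remove links.
type htlcSwitch struct {
	links map[string]*flushableLink
}

func (s *htlcSwitch) RemoveLink(cid string) {
	delete(s.links, cid)
	fmt.Println("link removed from switch:", cid)
}

// flushableLink models just the new flush API: register a continuation to
// run once the channel state is clean.
type flushableLink struct {
	cid  string
	hook func()
}

func (l *flushableLink) Flush(onClean func()) error {
	l.hook = onClean
	return nil
}

func main() {
	sw := &htlcSwitch{links: map[string]*flushableLink{}}
	lnk := &flushableLink{cid: "chan-1"}
	sw.links[lnk.cid] = lnk

	// Shutdown handling: flush, and only remove the link once the flush
	// completes.
	_ = lnk.Flush(func() {
		sw.RemoveLink(lnk.cid)
	})

	// Simulate the link's htlcManager reaching a clean state.
	lnk.hook()
}
```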
Force-pushed from 75dd860 to e6a55b4
bhandras left a comment
Only reviewed the MVar for now as I think it needs some changes.
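For readers unfamiliar with the primitive under review: an MVar is a box that is either empty or holds exactly one value, where takes block on empty and puts block on full. A minimal Go sketch of those semantics (not the implementation in this PR), built on a one-slot buffered channel:

```go
package main

import "fmt"

// MVar is a box that is either empty or holds one value of type A.
type MVar[A any] struct {
	box chan A
}

// NewEmptyMVar returns an MVar with nothing inside.
func NewEmptyMVar[A any]() *MVar[A] {
	return &MVar[A]{box: make(chan A, 1)}
}

// Put blocks until the MVar is empty, then stores the value.
func (m *MVar[A]) Put(a A) {
	m.box <- a
}

// Take blocks until a value is present, then removes and returns it.
func (m *MVar[A]) Take() A {
	return <-m.box
}

// TryTake returns the value and true if one is present, without blocking.
func (m *MVar[A]) TryTake() (A, bool) {
	select {
	case a := <-m.box:
		return a, true
	default:
		var zero A
		return zero, false
	}
}

func main() {
	m := NewEmptyMVar[int]()
	m.Put(42)
	fmt.Println(m.Take())

	_, ok := m.TryTake()
	fmt.Println(ok) // false: empty again
}
```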
Crypt-iQ left a comment
I don't think the MVar is necessary here; the same thing can be accomplished in a simpler way.
One thing that we'll want to add is retransmission of Shutdown, which will likely require a database flag to let us know whether we've ever sent Shutdown before.
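A rough sketch of that retransmission flag. In lnd this would presumably be persisted in channeldb; the interface and names below are hypothetical and the store here is in-memory only.

```go
package main

import "fmt"

// flagStore is a stand-in for the database flag referred to above.
type flagStore interface {
	MarkShutdownSent(chanID string) error
	ShutdownSent(chanID string) (bool, error)
}

// memStore is an in-memory implementation, good enough to show the flow.
type memStore map[string]bool

func (m memStore) MarkShutdownSent(chanID string) error { m[chanID] = true; return nil }

func (m memStore) ShutdownSent(chanID string) (bool, error) { return m[chanID], nil }

// maybeRetransmitShutdown shows the reconnect-time check: if we have ever
// sent Shutdown on this channel, we must send it again.
func maybeRetransmitShutdown(db flagStore, chanID string, send func()) error {
	sent, err := db.ShutdownSent(chanID)
	if err != nil {
		return err
	}
	if sent {
		send()
	}
	return nil
}

func main() {
	db := memStore{}
	_ = db.MarkShutdownSent("chan-1")

	_ = maybeRetransmitShutdown(db, "chan-1", func() {
		fmt.Println("retransmitting Shutdown for chan-1")
	})
}
```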
    p.cfg.Switch.RemoveLink(cid)
    return nil
    return chanLink.Flush(func() {
Some awareness/signal needs to be added to the chancloser or we'll run into spec violations. Consider the case where we're the responder in the coop close flow:
- We receive a `Shutdown` and call `fetchActiveChanCloser`.
- `fetchActiveChanCloser` calls `tryLinkShutdown`, which calls `chanLink.Flush`. `Flush` is non-blocking, so this returns immediately. We can't make this blocking as this would prevent other coop close requests from continuing.
- `ProcessCloseMsg` is called on the received `Shutdown` message. This will make us send a `Shutdown` to the peer.
- It's possible that the `ChannelLink` sends an htlc after this in `handleDownstreamUpdateAdd`. This would be a spec violation.
The case where we are the coop close initiator leads to the same scenario above, but the initiator will actually send a premature ClosingSigned when the chancloser receives the peer's Shutdown.
Shutdown should only be sent if it's not possible for the link to send a htlc after Shutdown due to concurrency.
OK, if I understand this correctly, you're saying:
It's not sufficient to handle this in EligibleToForward because the add may have already been placed in the downstream mailbox but not yet processed by the htlcManager. It must be handled in handleDownstreamPkt itself; otherwise something may get into the link's process queue, then we flush and send shutdown to our peer, and when we process that downstream add we still send it out because it was already in the queue.
I don't think that automatically implies we need to add something to the ChanCloser, but I do think it means that something needs to change here.
> It's not sufficient to handle this in EligibleToForward because it may have already been added to the downstream mailbox but not processed by the htlcManager
My understanding is that the change in EligibleToForward is necessary but not sufficient. We also need a way to cancel back anything un-acked that might be sitting in the mailbox. We can do that with a similar flush check in handleDownstreamUpdateAdd, which'll then allow us to cancel stuff within the mailbox (mailbox.FailAdd) back to the switch.
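A toy sketch of that downstream check, with stand-in types (the real lnd mailbox, packet, and link signatures differ): while the link is flushing, a downstream add is failed back toward the switch instead of being added to the commitment.

```go
package main

import "fmt"

// addPacket is a toy downstream add packet.
type addPacket struct{ htlcID uint64 }

type mailbox struct{}

// FailAdd mirrors the idea of mailbox.FailAdd: cancel the un-acked add back
// toward the switch so the upstream HTLC can be failed.
func (m *mailbox) FailAdd(pkt addPacket) {
	fmt.Println("failing add back to switch:", pkt.htlcID)
}

type link struct {
	flushing bool
	mailbox  *mailbox
}

func (l *link) IsFlushing() bool { return l.flushing }

// handleDownstreamUpdateAdd sketches the extra check discussed above: a
// flushing link never puts the add on the commitment.
func (l *link) handleDownstreamUpdateAdd(pkt addPacket) {
	if l.IsFlushing() {
		l.mailbox.FailAdd(pkt)
		return
	}
	fmt.Println("adding HTLC to commitment:", pkt.htlcID)
}

func main() {
	l := &link{flushing: true, mailbox: &mailbox{}}
	l.handleDownstreamUpdateAdd(addPacket{htlcID: 7})
}
```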
    // net new HTLCs rather than forwarding them. This is the first
    // opportunity we have to bounce invalid HTLC adds without
    // doing a force-close.
    if l.IsFlushing() {
IMO this doesn't need to be caught here; it can be caught earlier in handleUpstreamMsg when the add is first sent across. It should also only apply if we've received the peer's Shutdown.
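A sketch of catching it in the upstream path under that condition. The types and handler below are hypothetical stand-ins; the point is only that the check is gated on having received the peer's Shutdown.

```go
package main

import (
	"errors"
	"fmt"
)

// updateAddHTLC is a toy message type; the real handler switches on lnwire
// messages.
type updateAddHTLC struct{ id uint64 }

// upstreamState tracks only what this sketch needs: whether the peer has
// already sent us Shutdown.
type upstreamState struct {
	peerShutdownReceived bool
}

var errAddAfterShutdown = errors.New("peer sent update_add_htlc after shutdown")

// handleUpstreamAdd shows the earlier check suggested above: once the peer's
// Shutdown has been received, any further add from them is a protocol
// violation and can be rejected immediately.
func (s *upstreamState) handleUpstreamAdd(msg updateAddHTLC) error {
	if s.peerShutdownReceived {
		return errAddAfterShutdown
	}
	fmt.Println("processing add", msg.id)
	return nil
}

func main() {
	s := &upstreamState{peerShutdownReceived: true}
	if err := s.handleUpstreamAdd(updateAddHTLC{id: 1}); err != nil {
		fmt.Println("reject:", err)
	}
}
```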
Additionally, this should also be caught in handleDownstreamMsg or in handleDownstreamUpdateAdd
I talked to @Roasbeef about this. Earlier versions of this did exactly what you suggest, but the issue is how you'd respond in the case where they issued the add. From a code-cleanliness perspective I would love to do this in handleUpstreamMsg, but the issue is that it will result in more aggressive behavior than is necessary.
I don't think it's safe to fail HTLCs here. If we've previously forwarded the HTLC, it may exist on the outgoing link's commitment tx. If we then start up and call resolveFwdPkgs, it will sequentially call processRemoteAdds for each non-full forwarding package. If we then call Flush at some point during this, we'll cancel one of these HTLCs on the incoming side. Since the outgoing side may still have the HTLC, we'd lose the value of the HTLC.
My understanding from my conversation with @Roasbeef yesterday was that this is the function responsible for actually locking in upstream HTLCs and then deciding to forward them after they have been locked in.
My understanding of my change here is that if we were in this flushing state when these irrevocable commitments were made to net-new HTLCs, then we haven't forwarded them yet.
What I am getting from your comment here is that if we are mid-way through this process of dealing with the remote adds, then we shouldn't be able to put things into a flushing state, which I agree with.
That said, I'd like to understand what you mean by "if we've previously forwarded the HTLC..." because as I understand things that shouldn't be possible.
On link startup, resolveFwdPkgs is called, which will re-forward adds, settles, and fails to ensure that these messages reach the intended link (this link will be different from the one doing the re-forwarding). This is done in processRemoteSettleFails and in processRemoteAdds. We do this because it ensures that if our node crashed or shut down before these messages reached the other link, we're able to recover on startup. So my point above was that if the add reached the outgoing link and then we end up canceling here due to flushing, we're in a bad state. One way to get around this might be to not cancel if the state is FwdStateProcessed (see several lines below), but I'd have to see if there is any funny business with exit hops.
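A sketch of the workaround floated at the end of that comment, with toy types (the real forwarding-package states live in lnd's channeldb and may be named differently): only cancel back adds that were never processed, so a replayed add that might already sit on an outgoing commitment is left alone.

```go
package main

import "fmt"

// fwdState is a toy version of the forwarding-package state mentioned above.
type fwdState int

const (
	fwdStateLockedIn fwdState = iota
	fwdStateProcessed
)

type remoteAdd struct {
	htlcID uint64
	state  fwdState
}

// shouldCancelWhileFlushing only cancels back adds that have never been
// processed/forwarded; a retransmitted add that may already exist on the
// outgoing link's commitment is left alone.
func shouldCancelWhileFlushing(add remoteAdd, flushing bool) bool {
	if !flushing {
		return false
	}
	return add.state != fwdStateProcessed
}

func main() {
	fresh := remoteAdd{htlcID: 1, state: fwdStateLockedIn}
	replayed := remoteAdd{htlcID: 2, state: fwdStateProcessed}

	fmt.Println(shouldCancelWhileFlushing(fresh, true))    // true: safe to cancel
	fmt.Println(shouldCancelWhileFlushing(replayed, true)) // false: may be outgoing
}
```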
> So my point above was that if the add reached the outgoing link and then we end up canceling here due to flushing, we're in a bad state.
If we've reached this point after an internal retransmission, then my understanding is that we haven't written anything to the forwarding packages for that entry, so it's safe to cancel back. Otherwise, the calls to sendHTLCError above would introduce the same issue (cancel back the HTLC when we have something locked in on the outgoing link).
The forwarding package is only written once the response to the add is signed for on the incoming link, so it's possible that an outgoing HTLC already exists. The calls to sendHTLCError above and below should be reproducible, such that if this is a retransmission and it fails, any earlier invocation should have also failed (otherwise we have a bug).
This isn't the case with the flush logic, because an earlier processRemoteAdds may have forwarded the HTLC, and now, if we've come back up and are flushing, we could hit this error while the HTLC is outgoing.
    // After we are finished processing the event, if the link is
    // flushing, we check if the channel is clean and invoke the
    // post-flush hook if it is.
    if l.IsFlushing() && l.channel.IsChannelClean() {
We can only shut down here if we've also received their Shutdown. IsFlushing may be set when we're the initiator, meaning we haven't received their Shutdown. The counterparty is still allowed to add HTLCs in this case according to the spec.
I think what we'll need is one variable to know when we've received their Shutdown and another variable to know when we've sent our Shutdown.
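A sketch of those two variables as simple atomic flags; the names are hypothetical, and a real implementation would also need persistence for retransmission.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// shutdownFlags holds the two variables suggested above: one for "we sent
// Shutdown" and one for "we received their Shutdown".
type shutdownFlags struct {
	sent     atomic.Bool
	received atomic.Bool
}

func (f *shutdownFlags) MarkSent()     { f.sent.Store(true) }
func (f *shutdownFlags) MarkReceived() { f.received.Store(true) }

// ReadyToFlushAndQuit reports whether both sides have exchanged Shutdown,
// i.e. no further update_add_htlc is allowed in either direction.
func (f *shutdownFlags) ReadyToFlushAndQuit() bool {
	return f.sent.Load() && f.received.Load()
}

func main() {
	var f shutdownFlags
	f.MarkSent()
	fmt.Println(f.ReadyToFlushAndQuit()) // false: peer may still add HTLCs

	f.MarkReceived()
	fmt.Println(f.ReadyToFlushAndQuit()) // true
}
```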
I'll take a closer look at the spec and see what we can do. I was trying to avoid duplicating the ChanCloser state logic inside of the link, since that was a barrier in the PR that precedes this one. As far as I know, the ChanCloser is responsible for the state machine from the "initial state" where no shutdowns have been sent, all the way through the closing_signed exchange. Is there a reason you think this has to be duplicated in the link itself?
Also keep in mind that if we had a "flush" operation that wasn't a shutdown (dyncomm execution), then this wouldn't translate properly. Conceptually, all this does is execute the continuation at the first clean state after the flush has been initiated. The shutdown use case puts the shutdown logic into that continuation.
Currently flushing means that Flush has been called once (either we've sent Shutdown or the peer has), but I think there could be a new interface call that means we've sent and received Shutdown. The issue is that if we've sent Shutdown, the peer can still send HTLCs or update_fee and if we've exited, the peer will still be waiting for that HTLC or update_fee to get resolved. I'm not sure whether this translates well to dynamic commitments, but ideally we'd find a way to do that as well
So the Flush can only begin once we've both received and sent shutdown, then.
Deleted the staging branch, but this should go into master. I can't re-open this for some reason.
@Crypt-iQ can you hint at what the easier way might be? Maybe it's an …
Change Description
This change fixes an issue where we would fail a channel when a peer sent us a `shutdown` message while there were still active HTLCs: see #6039.
Steps to Test
Steps for reviewers to follow to test the change.