contractcourt: make sure the startup process can be finished #8519

yyforyongyu · 2024-03-05T12:05:38Z

This commit makes sure the startup process won't fail due to errors returned from reloading unfinished sweeps. It's likely to happen as by the time the node is restarted, the sweep txns may already be confirmed.

Fixes #8383

coderabbitai · 2024-03-05T12:05:44Z

Important

Auto Review Skipped

Auto reviews are limited to the following labels: llm-review. Please add one of these labels to enable auto reviews.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit-tests for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit tests for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit tests.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
The JSON schema for the configuration file is available here.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

This commit makes sure the startup process won't fail due to errors returned from reloading unfinished sweeps. IT's likely to happen as by the time the node is restarted, the sweep txns may already be confirmed.

morehouse · 2024-03-05T21:46:08Z

contractcourt/utxonursery.go

@@ -284,15 +284,13 @@ func (u *UtxoNursery) Start() error {
 	// point forward, we must close the nursery's quit channel if we detect
 	// any failures during startup to ensure they terminate.
 	if err := u.reloadPreschool(); err != nil {
-		close(u.quit)
-		return err
+		log.Errorf("UtxoNursery unable to reload preschool: %v", err)


This seems more like a bandaid than a good fix. I think this will enable bad half-initialized states in the UTXO nursery (e.g., reloadPreschool could fail halfway through and the nursery would quietly forget about any UTXOs that hadn't been processed yet).

I think what we really want is to detect the "already confirmed" case and update the nursery to the correct state.

IIRC the UtxoNursery has been deprecated, so there is a question of how defensive we need to be here. My main question is that assuming we discovered this by accident, how did this manifest in the first place? Did we run into this? If so, I'm inclined to agree with @morehouse in that we probably can't just quietly forget unless we can justify to ourselves why this won't create an irrecoverable state.

I agree that this is a bandaid. @ProofOfKeags the utxonursery is still used for non-anchor channels so I think we still need to be defensive here. Would be better to understand the root cause of the issue before proceeding w/ this pr

ProofOfKeags

I agree with @morehouse. At a bare minimum we need to justify to ourselves why this is a safe thing to do because if it is, it is non-obvious. I don't think the commit message gives me enough context as it stands for me to convince myself this is OK.

ProofOfKeags · 2024-03-06T00:49:00Z

contractcourt/utxonursery.go

@@ -284,15 +284,13 @@ func (u *UtxoNursery) Start() error {
 	// point forward, we must close the nursery's quit channel if we detect
 	// any failures during startup to ensure they terminate.
 	if err := u.reloadPreschool(); err != nil {
-		close(u.quit)
-		return err
+		log.Errorf("UtxoNursery unable to reload preschool: %v", err)


IIRC the UtxoNursery has been deprecated, so there is a question of how defensive we need to be here. My main question is that assuming we discovered this by accident, how did this manifest in the first place? Did we run into this? If so, I'm inclined to agree with @morehouse in that we probably can't just quietly forget unless we can justify to ourselves why this won't create an irrecoverable state.

ziggie1984 · 2024-03-06T14:22:00Z

Let's keep this PR on hold, until we can pinpoint the problem why the crib transaction was saved using a 0 expiry height.

ziggie1984 · 2024-03-12T14:03:32Z

So looking through older LND versions to find out how the software could allow a Locktime of 0 when forwarding an HTLC, I analysed that LND-0.5-beta would be able to do so, but starting from LND-0.5.1, that would have been caught and no HTLC with this locktime would have been forwarded.

LND-0.5.0-beta:

lnd/htlcswitch/link.go

Lines 1953 to 1965 in 3b2c807

    
           // Finally, we'll ensure that the time-lock on the outgoing HTLC meets 
        
           // the following constraint: the incoming time-lock minus our time-lock 
        
           // delta should equal the outgoing time lock. Otherwise, whether the 
        
           // sender messed up, or an intermediate node tampered with the HTLC. 
        
           if incomingTimeout-timeDelta < outgoingTimeout { 
        
           	l.errorf("Incoming htlc(%x) has incorrect time-lock value: "+ 
        
           		"expected at least %v block delta, got %v block delta", 
        
           		payHash[:], timeDelta, incomingTimeout-outgoingTimeout) 
        
           	// Grab the latest routing policy so the sending node is up to 
        
           	// date with our current policy. 
        
           	var failure lnwire.FailureMessage 
        
           	update, err := l.cfg.FetchLastChannelUpdate(

LND-0.5.1-beta Check:

lnd/htlcswitch/link.go

Lines 1961 to 1967 in b07499f

    
           if outgoingTimeout-heightNow > maxCltvExpiry { 
        
           	l.errorf("outgoing htlc(%x) has a time lock too far in the "+ 
        
           		"future: got %v, but maximum is %v", payHash[:], 
        
           		outgoingTimeout-heightNow, maxCltvExpiry) 
        
           	return &lnwire.FailExpiryTooFar{} 
        
           }

All the channels which suffered from this bug are very old ~2019ish (see logs #8383) so I think we don't need this fix wdyt, but just let the node runner who suffered from this behaviour run a patch to resolve his issues ?

ProofOfKeags · 2024-03-13T02:17:43Z

I don't think it's controversial to say gracefully supporting something caused by LND (checks nots) 13 versions ago is not a priority. :)

So I think that means we can ship this, no? I guess there is also a question of how @yyforyongyu discovered the need for it in the first place.

yyforyongyu · 2024-03-13T09:45:28Z

The way I tackle issue #8383 (or any issues) is to understand its root cause and its effects - this PR is to limit its effect, also was planning to find out the root cause but was caught up with something else, and thanks @ziggie1984 for the discovery and analysis!

I don't think a single error state in one channel should bring down the whole node, it just puts other channels at risk. In fact, I think the startup processes should not be interrupted at all, except if an error from the lower layer is returned, like from bitcoind, database, files, etc., which are checked in healthcheck.

On the other hand, there's a lot of work to do to improve our logging system, so that users can use tools like prometheus to monitor error or warning logs and take action. We are not there yet because some of the error loggings are not erroneous, like the rebroadcasting error, while there are other places missing loggings.

So imo this PR is needed, we should not shut down the node to bring the user's attention about a previous bug from old channels.

morehouse

NACK on the current approach, for the reasons I mentioned previously.

If an error occurs during reloadPreschool, this PR would silently forget about all remaining preschool outputs (not just the bad one). If an error occurs during reloadClasses, this PR will AFAICT forever fail to sweep any remaining outputs that timed out while the node was offline.

One bad output should not put all other channel funds at risk.

ProofOfKeags · 2024-03-13T20:33:41Z

One bad output should not put all other channel funds at risk.

Sounds like we need to rework this to make both reloadPreSchool and reloadClasses incapable of producing errors. Or any errors it produces need to be truly fatal since it will be treated that way by the surrounding context.

ziggie1984 · 2024-03-14T15:12:26Z

I will take this PR over as discussed with yong offline.

lightninglabs-deploy · 2024-04-04T17:11:00Z

@Crypt-iQ: review reminder
@yyforyongyu, remember to re-request review from reviewers when ready

yyforyongyu self-assigned this Mar 5, 2024

yyforyongyu requested review from Roasbeef and Crypt-iQ as code owners March 5, 2024 12:05

contractcourt: make sure the startup process can be finished

a800751

This commit makes sure the startup process won't fail due to errors returned from reloading unfinished sweeps. IT's likely to happen as by the time the node is restarted, the sweep txns may already be confirmed.

yyforyongyu force-pushed the nursey-continue-startup branch from bc93b0c to a800751 Compare March 5, 2024 12:10

saubyk added this to the v0.18.0 milestone Mar 5, 2024

saubyk requested review from ProofOfKeags and removed request for Roasbeef March 5, 2024 17:34

morehouse reviewed Mar 5, 2024

View reviewed changes

ProofOfKeags reviewed Mar 6, 2024

View reviewed changes

morehouse requested changes Mar 13, 2024

View reviewed changes

ziggie1984 assigned ziggie1984 and unassigned yyforyongyu Mar 14, 2024

saubyk modified the milestones: v0.18.0, v0.18.1 Mar 14, 2024

yyforyongyu closed this Apr 8, 2024

yyforyongyu deleted the nursey-continue-startup branch April 8, 2024 08:39

saubyk removed this from the v0.18.1 milestone Apr 8, 2024

contractcourt: make sure the startup process can be finished #8519

contractcourt: make sure the startup process can be finished #8519

Uh oh!

Conversation

yyforyongyu commented Mar 5, 2024

Uh oh!

coderabbitai bot commented Mar 5, 2024

Auto Review Skipped

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (.coderabbit.yaml)

CodeRabbit Discord Community

Uh oh!

morehouse Mar 5, 2024

Choose a reason for hiding this comment

Uh oh!

ProofOfKeags Mar 6, 2024

Choose a reason for hiding this comment

Uh oh!

Crypt-iQ Mar 7, 2024

Choose a reason for hiding this comment

Uh oh!

ProofOfKeags left a comment

Choose a reason for hiding this comment

Uh oh!

ProofOfKeags Mar 6, 2024

Choose a reason for hiding this comment

Uh oh!

ziggie1984 commented Mar 6, 2024

Uh oh!

ziggie1984 commented Mar 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ProofOfKeags commented Mar 13, 2024

Uh oh!

yyforyongyu commented Mar 13, 2024

Uh oh!

morehouse left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ProofOfKeags commented Mar 13, 2024

Uh oh!

ziggie1984 commented Mar 14, 2024

Uh oh!

lightninglabs-deploy commented Apr 4, 2024

Uh oh!

Uh oh!

CodeRabbit Configration File (`.coderabbit.yaml`)

ziggie1984 commented Mar 12, 2024 •

edited

Loading

morehouse left a comment •

edited

Loading