Skip to content

contractcourt: make sure the startup process can be finished #8519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

yyforyongyu
Copy link
Member

This commit makes sure the startup process won't fail due to errors returned from reloading unfinished sweeps. It's likely to happen as by the time the node is restarted, the sweep txns may already be confirmed.

Fixes #8383

@yyforyongyu yyforyongyu self-assigned this Mar 5, 2024
Copy link
Contributor

coderabbitai bot commented Mar 5, 2024

Important

Auto Review Skipped

Auto reviews are limited to the following labels: llm-review. Please add one of these labels to enable auto reviews.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit-tests for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit tests for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit tests.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

This commit makes sure the startup process won't fail due to errors
returned from reloading unfinished sweeps. IT's likely to happen as by
the time the node is restarted, the sweep txns may already be confirmed.
@yyforyongyu yyforyongyu force-pushed the nursey-continue-startup branch from bc93b0c to a800751 Compare March 5, 2024 12:10
@saubyk saubyk added this to the v0.18.0 milestone Mar 5, 2024
@saubyk saubyk requested review from ProofOfKeags and removed request for Roasbeef March 5, 2024 17:34
@@ -284,15 +284,13 @@ func (u *UtxoNursery) Start() error {
// point forward, we must close the nursery's quit channel if we detect
// any failures during startup to ensure they terminate.
if err := u.reloadPreschool(); err != nil {
close(u.quit)
return err
log.Errorf("UtxoNursery unable to reload preschool: %v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems more like a bandaid than a good fix. I think this will enable bad half-initialized states in the UTXO nursery (e.g., reloadPreschool could fail halfway through and the nursery would quietly forget about any UTXOs that hadn't been processed yet).

I think what we really want is to detect the "already confirmed" case and update the nursery to the correct state.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the UtxoNursery has been deprecated, so there is a question of how defensive we need to be here. My main question is that assuming we discovered this by accident, how did this manifest in the first place? Did we run into this? If so, I'm inclined to agree with @morehouse in that we probably can't just quietly forget unless we can justify to ourselves why this won't create an irrecoverable state.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a bandaid. @ProofOfKeags the utxonursery is still used for non-anchor channels so I think we still need to be defensive here. Would be better to understand the root cause of the issue before proceeding w/ this pr

Copy link
Collaborator

@ProofOfKeags ProofOfKeags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @morehouse. At a bare minimum we need to justify to ourselves why this is a safe thing to do because if it is, it is non-obvious. I don't think the commit message gives me enough context as it stands for me to convince myself this is OK.

@@ -284,15 +284,13 @@ func (u *UtxoNursery) Start() error {
// point forward, we must close the nursery's quit channel if we detect
// any failures during startup to ensure they terminate.
if err := u.reloadPreschool(); err != nil {
close(u.quit)
return err
log.Errorf("UtxoNursery unable to reload preschool: %v", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the UtxoNursery has been deprecated, so there is a question of how defensive we need to be here. My main question is that assuming we discovered this by accident, how did this manifest in the first place? Did we run into this? If so, I'm inclined to agree with @morehouse in that we probably can't just quietly forget unless we can justify to ourselves why this won't create an irrecoverable state.

@ziggie1984
Copy link
Collaborator

Let's keep this PR on hold, until we can pinpoint the problem why the crib transaction was saved using a 0 expiry height.

@ziggie1984
Copy link
Collaborator

ziggie1984 commented Mar 12, 2024

So looking through older LND versions to find out how the software could allow a Locktime of 0 when forwarding an HTLC, I analysed that LND-0.5-beta would be able to do so, but starting from LND-0.5.1, that would have been caught and no HTLC with this locktime would have been forwarded.

LND-0.5.0-beta:

lnd/htlcswitch/link.go

Lines 1953 to 1965 in 3b2c807

// Finally, we'll ensure that the time-lock on the outgoing HTLC meets
// the following constraint: the incoming time-lock minus our time-lock
// delta should equal the outgoing time lock. Otherwise, whether the
// sender messed up, or an intermediate node tampered with the HTLC.
if incomingTimeout-timeDelta < outgoingTimeout {
l.errorf("Incoming htlc(%x) has incorrect time-lock value: "+
"expected at least %v block delta, got %v block delta",
payHash[:], timeDelta, incomingTimeout-outgoingTimeout)
// Grab the latest routing policy so the sending node is up to
// date with our current policy.
var failure lnwire.FailureMessage
update, err := l.cfg.FetchLastChannelUpdate(

LND-0.5.1-beta Check:

lnd/htlcswitch/link.go

Lines 1961 to 1967 in b07499f

if outgoingTimeout-heightNow > maxCltvExpiry {
l.errorf("outgoing htlc(%x) has a time lock too far in the "+
"future: got %v, but maximum is %v", payHash[:],
outgoingTimeout-heightNow, maxCltvExpiry)
return &lnwire.FailExpiryTooFar{}
}

All the channels which suffered from this bug are very old ~2019ish (see logs #8383) so I think we don't need this fix wdyt, but just let the node runner who suffered from this behaviour run a patch to resolve his issues ?

@ProofOfKeags
Copy link
Collaborator

I don't think it's controversial to say gracefully supporting something caused by LND (checks nots) 13 versions ago is not a priority. :)

So I think that means we can ship this, no? I guess there is also a question of how @yyforyongyu discovered the need for it in the first place.

@yyforyongyu
Copy link
Member Author

The way I tackle issue #8383 (or any issues) is to understand its root cause and its effects - this PR is to limit its effect, also was planning to find out the root cause but was caught up with something else, and thanks @ziggie1984 for the discovery and analysis!

I don't think a single error state in one channel should bring down the whole node, it just puts other channels at risk. In fact, I think the startup processes should not be interrupted at all, except if an error from the lower layer is returned, like from bitcoind, database, files, etc., which are checked in healthcheck.

On the other hand, there's a lot of work to do to improve our logging system, so that users can use tools like prometheus to monitor error or warning logs and take action. We are not there yet because some of the error loggings are not erroneous, like the rebroadcasting error, while there are other places missing loggings.

So imo this PR is needed, we should not shut down the node to bring the user's attention about a previous bug from old channels.

Copy link
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NACK on the current approach, for the reasons I mentioned previously.

If an error occurs during reloadPreschool, this PR would silently forget about all remaining preschool outputs (not just the bad one). If an error occurs during reloadClasses, this PR will AFAICT forever fail to sweep any remaining outputs that timed out while the node was offline.

One bad output should not put all other channel funds at risk.

@ProofOfKeags
Copy link
Collaborator

One bad output should not put all other channel funds at risk.

Sounds like we need to rework this to make both reloadPreSchool and reloadClasses incapable of producing errors. Or any errors it produces need to be truly fatal since it will be treated that way by the surrounding context.

@ziggie1984 ziggie1984 assigned ziggie1984 and unassigned yyforyongyu Mar 14, 2024
@ziggie1984
Copy link
Collaborator

I will take this PR over as discussed with yong offline.

@saubyk saubyk modified the milestones: v0.18.0, v0.18.1 Mar 14, 2024
@lightninglabs-deploy
Copy link

@Crypt-iQ: review reminder
@yyforyongyu, remember to re-request review from reviewers when ready

@yyforyongyu yyforyongyu closed this Apr 8, 2024
@yyforyongyu yyforyongyu deleted the nursey-continue-startup branch April 8, 2024 08:39
@saubyk saubyk removed this from the v0.18.1 milestone Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug]: a height hint greater than 0 must be provided
7 participants