-
Notifications
You must be signed in to change notification settings - Fork 2.2k
contractcourt: make sure the startup process can be finished #8519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
contractcourt: make sure the startup process can be finished #8519
Conversation
Important Auto Review SkippedAuto reviews are limited to the following labels: llm-review. Please add one of these labels to enable auto reviews. Please check the settings in the CodeRabbit UI or the To trigger a single review, invoke the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
This commit makes sure the startup process won't fail due to errors returned from reloading unfinished sweeps. IT's likely to happen as by the time the node is restarted, the sweep txns may already be confirmed.
bc93b0c
to
a800751
Compare
@@ -284,15 +284,13 @@ func (u *UtxoNursery) Start() error { | |||
// point forward, we must close the nursery's quit channel if we detect | |||
// any failures during startup to ensure they terminate. | |||
if err := u.reloadPreschool(); err != nil { | |||
close(u.quit) | |||
return err | |||
log.Errorf("UtxoNursery unable to reload preschool: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems more like a bandaid than a good fix. I think this will enable bad half-initialized states in the UTXO nursery (e.g., reloadPreschool
could fail halfway through and the nursery would quietly forget about any UTXOs that hadn't been processed yet).
I think what we really want is to detect the "already confirmed" case and update the nursery to the correct state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC the UtxoNursery has been deprecated, so there is a question of how defensive we need to be here. My main question is that assuming we discovered this by accident, how did this manifest in the first place? Did we run into this? If so, I'm inclined to agree with @morehouse in that we probably can't just quietly forget unless we can justify to ourselves why this won't create an irrecoverable state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is a bandaid. @ProofOfKeags the utxonursery is still used for non-anchor channels so I think we still need to be defensive here. Would be better to understand the root cause of the issue before proceeding w/ this pr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @morehouse. At a bare minimum we need to justify to ourselves why this is a safe thing to do because if it is, it is non-obvious. I don't think the commit message gives me enough context as it stands for me to convince myself this is OK.
@@ -284,15 +284,13 @@ func (u *UtxoNursery) Start() error { | |||
// point forward, we must close the nursery's quit channel if we detect | |||
// any failures during startup to ensure they terminate. | |||
if err := u.reloadPreschool(); err != nil { | |||
close(u.quit) | |||
return err | |||
log.Errorf("UtxoNursery unable to reload preschool: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC the UtxoNursery has been deprecated, so there is a question of how defensive we need to be here. My main question is that assuming we discovered this by accident, how did this manifest in the first place? Did we run into this? If so, I'm inclined to agree with @morehouse in that we probably can't just quietly forget unless we can justify to ourselves why this won't create an irrecoverable state.
Let's keep this PR on hold, until we can pinpoint the problem why the crib transaction was saved using a 0 expiry height. |
So looking through older LND versions to find out how the software could allow a Locktime of 0 when forwarding an HTLC, I analysed that LND-0.5-beta would be able to do so, but starting from LND-0.5.1, that would have been caught and no HTLC with this locktime would have been forwarded. LND-0.5.0-beta: Lines 1953 to 1965 in 3b2c807
LND-0.5.1-beta Check: Lines 1961 to 1967 in b07499f
All the channels which suffered from this bug are very old ~2019ish (see logs #8383) so I think we don't need this fix wdyt, but just let the node runner who suffered from this behaviour run a patch to resolve his issues ? |
I don't think it's controversial to say gracefully supporting something caused by LND (checks nots) 13 versions ago is not a priority. :) So I think that means we can ship this, no? I guess there is also a question of how @yyforyongyu discovered the need for it in the first place. |
The way I tackle issue #8383 (or any issues) is to understand its root cause and its effects - this PR is to limit its effect, also was planning to find out the root cause but was caught up with something else, and thanks @ziggie1984 for the discovery and analysis! I don't think a single error state in one channel should bring down the whole node, it just puts other channels at risk. In fact, I think the startup processes should not be interrupted at all, except if an error from the lower layer is returned, like from bitcoind, database, files, etc., which are checked in On the other hand, there's a lot of work to do to improve our logging system, so that users can use tools like So imo this PR is needed, we should not shut down the node to bring the user's attention about a previous bug from old channels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NACK on the current approach, for the reasons I mentioned previously.
If an error occurs during reloadPreschool
, this PR would silently forget about all remaining preschool outputs (not just the bad one). If an error occurs during reloadClasses
, this PR will AFAICT forever fail to sweep any remaining outputs that timed out while the node was offline.
One bad output should not put all other channel funds at risk.
Sounds like we need to rework this to make both |
I will take this PR over as discussed with yong offline. |
@Crypt-iQ: review reminder |
This commit makes sure the startup process won't fail due to errors returned from reloading unfinished sweeps. It's likely to happen as by the time the node is restarted, the sweep txns may already be confirmed.
Fixes #8383