Skip to content

chore(abci): Add FATAL errors for app#2581

Open
shotes wants to merge 7 commits into
mainfrom
fatal-app-errors
Open

chore(abci): Add FATAL errors for app#2581
shotes wants to merge 7 commits into
mainfrom
fatal-app-errors

Conversation

@shotes
Copy link
Copy Markdown
Contributor

@shotes shotes commented Mar 10, 2025

This adds FATAL errors to the ABCI app. This makes the response to Fatal Engine API errors consistent across PrepareProposal, ProcessProposal, and FinalizeBlock. This is particularly important when handling fatal errors such as connection refused from the EL.

Currently, if the CL receives a connection refused:

  • PrepareProposal will submit an empty proposal and return no error.
  • ProcessProposal will REJECT the block and return no error.
  • FinalizeBlock will return error and panic.

This needs to be made consistent, and IMO we should not be voting for blocks or responding with null proposals if we have a FATAL error.

@shotes shotes requested a review from a team as a code owner March 10, 2025 16:20
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 10, 2025

Codecov Report

❌ Patch coverage is 72.72727% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.97%. Comparing base (eead674) to head (96ef030).
⚠️ Report is 248 commits behind head on main.

Files with missing lines Patch % Lines
consensus/cometbft/service/prepare_proposal.go 50.00% 3 Missing ⚠️
errors/errors.go 66.66% 2 Missing ⚠️
execution/engine/engine.go 50.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2581      +/-   ##
==========================================
+ Coverage   58.61%   58.97%   +0.36%     
==========================================
  Files         340      337       -3     
  Lines       15272    15178      -94     
  Branches       20        0      -20     
==========================================
  Hits         8951     8951              
+ Misses       5652     5558      -94     
  Partials      669      669              
Files with missing lines Coverage Δ
consensus/cometbft/service/process_proposal.go 84.44% <100.00%> (+0.72%) ⬆️
execution/engine/engine.go 53.21% <50.00%> (ø)
errors/errors.go 73.33% <66.66%> (ø)
consensus/cometbft/service/prepare_proposal.go 64.58% <50.00%> (-1.38%) ⬇️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread errors/mod.go
Copy link
Copy Markdown
Contributor

@rezzmah rezzmah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll look bad if validators have a higher occurrance of their beacon node crashing with the new release (something we faced with the lockstep EL/CL changes).

Is there anything we can do to put the validators at ease as to why their beacon node crashed - e.g. excessive logging?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think payload builders should still be able to send an empty block if their EL went fatal. Something to discuss further

Copy link
Copy Markdown
Contributor Author

@shotes shotes Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty proposed block would get broadcasted, rejected by validators, and then prevoted nil. Whereas a missed proposed would timeout and get prevoted nil**. The behavior at consensus level is the same, but we're adding extra network congestion by broadcasting empty proposal. Why would we do that if we don't have to?

**From comet spec:

Else, if the proposal is invalid or wasn’t received on time, it prevotes <nil>.

Comment thread consensus/cometbft/service/prepare_proposal.go Outdated
Comment thread consensus/cometbft/service/process_proposal.go Outdated
@shotes
Copy link
Copy Markdown
Contributor Author

shotes commented Mar 19, 2025

It'll look bad if validators have a higher occurrance of their beacon node crashing with the new release (something we faced with the lockstep EL/CL changes).

Is there anything we can do to put the validators at ease as to why their beacon node crashed - e.g. excessive logging?

I'm not sure this results in a higher occurrence of beacon node crashing. If a Fatal error happens during ProcessProposal (and gets glossed over because we don't crash immediately in process proposal), then it is extremely likely that the same Fatal error happens again during FinalizeBlock. For example, if my EL crashes, the engine api is going to keep returning the Fatal connection refused error until it reaches FinalizeBlock, where the CL then crashes.

Definitely can do with more logging and I see your point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants