Skip to content

Cleaner node shutdown logic #584

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: optimism
Choose a base branch
from

Conversation

eugene-uniswap
Copy link

@eugene-uniswap eugene-uniswap commented Apr 16, 2025

Implement functionality to

  • fail /readyz health check when node shutdown is initiated.
  • allow 7 second for existing requests to drain before shutting down RPC servers and before database is closed.

Tests

Testing was done on a local WS by launching op-geth, hitting /readyz on RPC port with fortio, interrupting op-geth with Ctrl-C and watching for readyz probe to fail while op-geth is waiting for 7 seconds before shutting down the server.

fortio load -c 1 -qps 1 -allow-initial-errors -n 10000 http://127.0.0.1:8551/readyz

17:50:26.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:27.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:28.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:29.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:30.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:31.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:32.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:33.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:34.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:35.383 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0
17:50:36.385 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0

op-geth output:

^CINFO [04-22|17:50:28.036] Got interrupt, shutting down...
INFO [04-22|17:50:35.037] HTTP server stopped                      endpoint=127.0.0.1:8551
INFO [04-22|17:50:35.038] IPC endpoint closed                      url=/Users/eugene.aleynikov/Library/Ethereum/geth.ipc
INFO [04-22|17:50:35.155] Ethereum protocol stopped
INFO [04-22|17:50:35.155] Transaction pool stopped
INFO [04-22|17:50:35.204] Persisting dirty state to disk           root=d7f897..0f0544 layers=0
INFO [04-22|17:50:35.205] Persisted dirty state to disk            size=74.00B elapsed="264.834µs"
INFO [04-22|17:50:35.236] Blockchain stopped

Additional context

  • This PR is needed for the work to add backend healthcheck into proxyd.
  • When launching op-geth via Docker Compose, the default stop_grace_period is 10 seconds. This should be increased to at least 15 seconds—or ideally 30 seconds—to align with the default values used by Kubernetes and ECS. Otherwise, the container risks being forcefully terminated with a SIGKILL before the drain logic has a chance to complete, which will cause unclean shutdown and possible on-disk data corruption.

Metadata

@eugene-uniswap eugene-uniswap requested a review from a team as a code owner April 16, 2025 01:54
@eugene-uniswap eugene-uniswap marked this pull request as draft April 16, 2025 01:54
Copy link

@paranoiacblack paranoiacblack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good but I think we should avoid adding channels if we don't have to, it adds an extra layer of management to a codebase littered with many haphazard channels. And, as you know, poorly managed channels lead to deadlocks.

Copy link

@paranoiacblack paranoiacblack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, overall, you should go ahead and move it to review by the optimism team to see what they think makes sense.

@eugene-uniswap eugene-uniswap marked this pull request as ready for review April 22, 2025 03:23
@sebastianst sebastianst self-requested a review May 6, 2025 18:30
@sebastianst sebastianst self-assigned this May 6, 2025
@sebastianst sebastianst moved this to In Review in Protocol Team May 6, 2025
Copy link
Member

@sebastianst sebastianst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a general improvement of go-ethereum and not related to the OP Stack diff. Would be great if you could open this PR upstream and it will then be available in op-geth soon after. If there's resistance from upstream maintainers, we can take a closer look. Feel also free to refer back to this PR.

@eugene-uniswap
Copy link
Author

This seems like a general improvement of go-ethereum and not related to the OP Stack diff. Would be great if you could open this PR upstream and it will then be available in op-geth soon after. If there's resistance from upstream maintainers, we can take a closer look. Feel also free to refer back to this PR.

ethereum/go-ethereum#31866

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

3 participants