fix(sync): increase keepalive timeout and resync nonce on retry by wcatz · Pull Request #66 · wcatz/goduckbot

wcatz · 2026-02-11T19:38:30Z

Summary

- Bumps the NtN keepalive from 60s/10s to 120s/30s (period / pong timeout) to reduce spurious disconnects during historical sync
- Adds NonceTracker.ResyncFromDB() to reload evolving nonce state from the DB between retry attempts
- Drains the 10,000-block channel buffer before retry to prevent old/new block overlap from corrupting nonce evolution

Problem

The default keepalive pong timeout (10s) fires when cardano-node is busy validating blocks. Each retry creates a new ChainSyncer but reuses the same buffered channel. Blocks from the dead connection overlap with blocks from the new connection, and because ProcessBatch does not deduplicate, the overlap corrupts the NonceTracker state. This caused 143 nonce mismatches on a clean sync (vs. 4 on a sync with fewer timeouts).
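As a rough illustration of the "drain before retry" step, here is a minimal sketch of waiting for a consumer to empty a buffered channel before a new connection starts producing into it. The `Block` type, `waitDrained`, and the two duration constants are hypothetical names chosen for this example; the actual code in `main.go` may be structured differently.

```go
package main

import "time"

// Block is a hypothetical stand-in for a decoded chain-sync block.
type Block struct {
	Slot uint64
}

// Illustrative defaults only; these are not the values used by the PR.
const (
	defaultPoll    = 10 * time.Millisecond
	defaultTimeout = 2 * time.Second
)

// waitDrained polls until the consumer has emptied ch or the timeout elapses.
// It returns true if the buffer reached zero length in time. It does not
// receive from ch itself, so blocks already queued are still processed by
// whatever goroutine normally consumes them.
func waitDrained(ch <-chan Block, poll, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for len(ch) > 0 {
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(poll)
	}
	return true
}
```

A retry loop could call `waitDrained(blockCh, defaultPoll, defaultTimeout)` after the failed connection is torn down and only then reload nonce state and reconnect. An empty buffer does not by itself prove the last dequeued batch has been written.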

Test plan

- Build passes (go build ./...)
- Deploy, wipe DB, fresh sync from genesis
- Verify zero (or near-zero) backfill mismatches against Koios
- Confirm keepalive timeouts are reduced in logs

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Strengthened synchronization retry robustness by improving state consistency between in-memory data and persistent storage, preventing divergence during recovery operations and node restarts.
- Enhanced connection resilience with refined keepalive configuration, reducing spurious disconnects and improving stability during intensive node operations and network stress conditions.

The default gouroboros keepalive (60s period / 10s pong timeout) causes spurious disconnects when the node is busy validating blocks. On retry, buffered blocks from the dead connection overlap with blocks from the new connection, corrupting the NonceTracker's evolving nonce state.

- Bump keepalive to 120s period / 30s timeout via WithKeepAliveConfig
- Add NonceTracker.ResyncFromDB() to reload state from database
- Drain channel buffer and resync nonce between retry attempts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-02-11T19:38:46Z

📝 Walkthrough


This PR enhances synchronization robustness by introducing a nonce state resynchronization mechanism after failed historical syncs and adding keepalive configuration for network connections. The changes ensure in-memory state aligns with persisted database state across sync retries.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Historical Sync Safeguard: `main.go`, `nonce.go` | Introduces synchronization recovery in failed sync retries: drains block channel, flushes writer, and resynchronizes in-memory nonce state from database via new `ResyncFromDB()` method. Prevents state divergence when sync restarts. |
| Connection Keepalive Configuration: `sync.go` | Adds keepalive configuration (120-second period, 30-second timeout) to NtN Ouroboros connection to reduce spurious disconnects during high node load. Imports `protocol/keepalive` package. |

Sequence Diagram(s)

sequenceDiagram
    participant HSH as Historical Sync Handler
    participant BC as Block Channel
    participant W as Writer
    participant NT as NonceTracker
    participant DB as Database

    HSH->>BC: Drain remaining blocks after sync fails
    HSH->>W: Wait for writer flush completion
    W-->>HSH: Flush complete
    HSH->>NT: ResyncFromDB()
    NT->>NT: Acquire lock
    NT->>DB: Query last synced slot
    DB-->>NT: Last slot
    NT->>DB: Fetch evolving nonce & block count for epoch
    DB-->>NT: Nonce state
    NT->>NT: Restore evolvingNonce, currentEpoch, blockCount
    NT->>NT: Release lock
    NT-->>HSH: Resync complete
    HSH->>HSH: Proceed with backoff wait
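The resync steps in the diagram can be sketched roughly as follows. This is an assumption-laden illustration, not the method from `nonce.go`: the `Store` interface, its method names, the `epochOf` mapping, and the in-memory `memStore` are all hypothetical. Only the field names `evolvingNonce`, `currentEpoch`, and `blockCount` are taken from the diagram.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical persistence interface; the project's real DB API may differ.
type Store interface {
	LastSyncedSlot() (uint64, error)
	NonceState(epoch uint64) (nonce []byte, blockCount int, err error)
}

// NonceTracker holds the in-memory nonce state named in the diagram.
type NonceTracker struct {
	mu            sync.Mutex
	store         Store
	epochOf       func(slot uint64) uint64 // hypothetical slot-to-epoch mapping
	evolvingNonce []byte
	currentEpoch  uint64
	blockCount    int
}

// ResyncFromDB replaces the in-memory state with what was last persisted,
// holding the lock for the whole operation so readers never see a mix.
func (nt *NonceTracker) ResyncFromDB() error {
	nt.mu.Lock()
	defer nt.mu.Unlock()

	slot, err := nt.store.LastSyncedSlot()
	if err != nil {
		return fmt.Errorf("query last synced slot: %w", err)
	}
	epoch := nt.epochOf(slot)
	nonce, count, err := nt.store.NonceState(epoch)
	if err != nil {
		return fmt.Errorf("load nonce state for epoch %d: %w", epoch, err)
	}
	if len(nonce) == 0 {
		return fmt.Errorf("no persisted nonce for epoch %d", epoch)
	}
	// Copy so later mutations of the tracker never alias the store's slice.
	nt.evolvingNonce = append([]byte(nil), nonce...)
	nt.currentEpoch = epoch
	nt.blockCount = count
	return nil
}

// memStore is an in-memory Store used only to exercise the sketch.
type memStore struct {
	lastSlot uint64
	nonces   map[uint64][]byte
	counts   map[uint64]int
}

func (m memStore) LastSyncedSlot() (uint64, error) { return m.lastSlot, nil }

func (m memStore) NonceState(epoch uint64) ([]byte, int, error) {
	return m.nonces[epoch], m.counts[epoch], nil
}
```

Returning an error when nothing is persisted, rather than silently zeroing the state, is a design choice of this sketch; the source does not say how the real method handles that case.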

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

wcatz/goduckbot#21: Modifies historical sync/retry behavior and nonce state resynchronization with block channel draining and keepalive configuration.
wcatz/goduckbot#58: Related through nonce computation/database persistence and block batch insertion, complementing the ResyncFromDB nonce-sync mechanism.

Poem

🐰 A sync that stumbles now stands tall,
With nonces synced from database's call,
The keeper wakes from troubled sleep,
State aligned, connections deep,
Resilience blooms in every hop! 🌿✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: increasing keepalive timeout and resyncing nonce on retry, which directly corresponds to the changeset modifications in sync.go and nonce.go. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

main.go (1)

564-575: The 3-second sleep is a best-effort heuristic — consider a writer-flush signal for stronger guarantees.

The drain loop (len(blockCh) > 0) correctly waits for the writer to dequeue all buffered blocks, and the 3s sleep accounts for the in-flight batch. However, if a flushBlockBatch call takes longer than 3s (e.g., slow DB under load), ResyncFromDB could read stale nonce state.

A more robust alternative would be to add a flush-and-ack mechanism (e.g., send a sentinel value or use a separate sync.WaitGroup per flush cycle) so the retry path can deterministically wait for the writer to finish. That said, the current approach is a significant improvement over the previous behavior and is reasonable for now.
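One way the suggested flush-and-ack mechanism might look is sketched below, using a sentinel request on the same channel as the blocks. Everything here (`writeReq`, `runWriter`, `flushAndWait`, the `flush` callback) is hypothetical and is not taken from the repository; it only illustrates the idea that channel FIFO ordering lets the retry path wait deterministically instead of sleeping.

```go
package main

// Block is a hypothetical stand-in for a decoded chain-sync block.
type Block struct {
	Slot uint64
}

// writeReq is either a block to persist or, when ack is non-nil, a flush request.
type writeReq struct {
	block Block
	ack   chan struct{}
}

// runWriter batches blocks and calls flush when the batch is full, when a
// flush request arrives, or when reqs is closed. The flush callback must not
// retain the slice it receives, because the backing array is reused.
func runWriter(reqs <-chan writeReq, batchSize int, flush func([]Block)) {
	batch := make([]Block, 0, batchSize)
	emit := func() {
		if len(batch) > 0 {
			flush(batch)
			batch = batch[:0]
		}
	}
	for r := range reqs {
		if r.ack != nil {
			// Every block sent before this sentinel has already been
			// received, so flushing now covers all of them.
			emit()
			close(r.ack)
			continue
		}
		batch = append(batch, r.block)
		if len(batch) >= batchSize {
			emit()
		}
	}
	emit()
}

// flushAndWait enqueues a sentinel and blocks until the writer acknowledges it.
func flushAndWait(reqs chan<- writeReq) {
	ack := make(chan struct{})
	reqs <- writeReq{ack: ack}
	<-ack
}
```

With this shape, the retry path would call `flushAndWait(reqs)` and then `ResyncFromDB()`, so the database read cannot race an in-flight batch regardless of how long the flush takes.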

wcatz merged commit eb531c1 into master Feb 11, 2026
2 checks passed

wcatz deleted the fix/keepalive-resync branch February 11, 2026 20:01
