fix(sync): increase keepalive timeout and resync nonce on retry #66

Merged
wcatz merged 1 commit into master from fix/keepalive-resync on Feb 11, 2026
@wcatz (Owner) commented Feb 11, 2026

Summary

  • Bumps NtN keepalive from 60s/10s to 120s/30s to reduce spurious disconnects during historical sync
  • Adds NonceTracker.ResyncFromDB() to reload evolving nonce state from DB between retry attempts
  • Drains the 10,000-block channel buffer before retry to prevent old/new block overlap corrupting nonce evolution

Problem

The default keepalive pong timeout (10s) fires when cardano-node is busy validating blocks. Each retry creates a new ChainSyncer but reuses the same buffered channel, so blocks from the dead connection overlap with blocks from the new connection. ProcessBatch doesn't deduplicate, so the overlap corrupts the NonceTracker state. This caused 143 nonce mismatches on a clean sync (vs. 4 on a sync with fewer timeouts).
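To make the failure mode concrete, here is a minimal sketch of why replayed blocks break an order-dependent rolling state. It is not the project's code: `Block`, `Evolve`, and `DedupBySlot` are hypothetical names, and SHA-256 stands in for whatever hash the real nonce evolution uses. The PR itself fixes the problem by draining the channel and resyncing from the DB rather than by deduplicating; the filter below only shows that removing the overlap restores the expected state.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Block is a hypothetical stand-in for a synced block.
type Block struct {
	Slot uint64
	VRF  []byte // placeholder for the per-block nonce contribution
}

// Evolve folds each block into a rolling hash, in order.
// SHA-256 is an assumption; the point is only that the result
// depends on the exact sequence of blocks that were processed.
func Evolve(seed [32]byte, blocks []Block) [32]byte {
	state := seed
	for _, b := range blocks {
		h := sha256.New()
		h.Write(state[:])
		h.Write(b.VRF)
		copy(state[:], h.Sum(nil))
	}
	return state
}

// DedupBySlot keeps a block only if its slot is strictly greater
// than the slot of the last block that was kept.
func DedupBySlot(blocks []Block) []Block {
	out := make([]Block, 0, len(blocks))
	var last uint64
	seen := false
	for _, b := range blocks {
		if seen && b.Slot <= last {
			continue // replayed block from the overlap
		}
		out = append(out, b)
		last, seen = b.Slot, true
	}
	return out
}

func main() {
	a := Block{Slot: 1, VRF: []byte("a")}
	b := Block{Slot: 2, VRF: []byte("b")}
	c := Block{Slot: 3, VRF: []byte("c")}

	clean := []Block{a, b, c}
	// Blocks 2 and 3 were still buffered from the dead connection,
	// then the new connection delivered them again.
	overlapped := []Block{a, b, c, b, c}

	var seed [32]byte
	fmt.Println("overlap matches clean:", Evolve(seed, overlapped) == Evolve(seed, clean))
	fmt.Println("dedup matches clean:  ", Evolve(seed, DedupBySlot(overlapped)) == Evolve(seed, clean))
}
```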

Test plan

  • Build passes (go build ./...)
  • Deploy, wipe DB, fresh sync from genesis
  • Verify zero (or near-zero) backfill mismatches against Koios
  • Confirm keepalive timeouts are reduced in logs

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Strengthened synchronization retry robustness by improving state consistency between in-memory data and persistent storage, preventing divergence during recovery operations and node restarts.
    • Enhanced connection resilience with refined keepalive configuration, reducing spurious disconnects and improving stability during intensive node operations and network stress conditions.

The default gouroboros keepalive (60s period / 10s pong timeout) causes
spurious disconnects when the node is busy validating blocks. On retry,
buffered blocks from the dead connection overlap with blocks from the
new connection, corrupting the NonceTracker's evolving nonce state.

- Bump keepalive to 120s period / 30s timeout via WithKeepAliveConfig
- Add NonceTracker.ResyncFromDB() to reload state from database
- Drain channel buffer and resync nonce between retry attempts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai (bot) commented Feb 11, 2026

📝 Walkthrough

This PR enhances synchronization robustness by introducing a nonce state resynchronization mechanism after failed historical syncs and adding keepalive configuration for network connections. The changes ensure in-memory state aligns with persisted database state across sync retries.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Historical Sync Safeguard: main.go, nonce.go | Introduces synchronization recovery in failed sync retries: drains block channel, flushes writer, and resynchronizes in-memory nonce state from database via new ResyncFromDB() method. Prevents state divergence when sync restarts. |
| Connection Keepalive Configuration: sync.go | Adds keepalive configuration (120-second period, 30-second timeout) to NtN Ouroboros connection to reduce spurious disconnects during high node load. Imports protocol/keepalive package. |

Sequence Diagram(s)

sequenceDiagram
    participant HSH as Historical Sync Handler
    participant BC as Block Channel
    participant W as Writer
    participant NT as NonceTracker
    participant DB as Database

    HSH->>BC: Drain remaining blocks after sync fails
    HSH->>W: Wait for writer flush completion
    W-->>HSH: Flush complete
    HSH->>NT: ResyncFromDB()
    NT->>NT: Acquire lock
    NT->>DB: Query last synced slot
    DB-->>NT: Last slot
    NT->>DB: Fetch evolving nonce & block count for epoch
    DB-->>NT: Nonce state
    NT->>NT: Restore evolvingNonce, currentEpoch, blockCount
    NT->>NT: Release lock
    NT-->>HSH: Resync complete
    HSH->>HSH: Proceed with backoff wait
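
The resync steps in the diagram can be sketched roughly as follows. This is not the code from nonce.go: the `NonceStore` interface, the `epochOf` mapping, and the in-memory `memStore` are assumptions made so the example runs on its own. Only the method name `ResyncFromDB` and the restored fields (`evolvingNonce`, `currentEpoch`, `blockCount`) come from the PR discussion.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// NonceStore is a hypothetical view of the persistence layer.
type NonceStore interface {
	LastSyncedSlot() (uint64, error)
	NonceState(epoch uint64) (nonce []byte, blockCount int, err error)
}

// NonceTracker holds the in-memory nonce state guarded by a mutex.
type NonceTracker struct {
	mu            sync.Mutex
	store         NonceStore
	epochOf       func(slot uint64) uint64 // assumption: slot-to-epoch mapping is injected
	evolvingNonce []byte
	currentEpoch  uint64
	blockCount    int
}

// ResyncFromDB replaces the in-memory state with what the store has persisted.
// On error the in-memory state is left untouched.
func (nt *NonceTracker) ResyncFromDB() error {
	nt.mu.Lock()
	defer nt.mu.Unlock()

	slot, err := nt.store.LastSyncedSlot()
	if err != nil {
		return fmt.Errorf("query last synced slot: %w", err)
	}
	epoch := nt.epochOf(slot)
	nonce, count, err := nt.store.NonceState(epoch)
	if err != nil {
		return fmt.Errorf("fetch nonce state for epoch %d: %w", epoch, err)
	}
	nt.evolvingNonce = append([]byte(nil), nonce...) // copy so the store's slice is not aliased
	nt.currentEpoch = epoch
	nt.blockCount = count
	return nil
}

// memStore is an in-memory stand-in for the database.
type memStore struct {
	lastSlot uint64
	nonces   map[uint64][]byte
	counts   map[uint64]int
}

func (m *memStore) LastSyncedSlot() (uint64, error) { return m.lastSlot, nil }

func (m *memStore) NonceState(epoch uint64) ([]byte, int, error) {
	n, ok := m.nonces[epoch]
	if !ok {
		return nil, 0, errors.New("no nonce state for epoch")
	}
	return n, m.counts[epoch], nil
}

func main() {
	store := &memStore{
		lastSlot: 250,
		nonces:   map[uint64][]byte{2: []byte("persisted")},
		counts:   map[uint64]int{2: 7},
	}
	nt := &NonceTracker{
		store:         store,
		epochOf:       func(slot uint64) uint64 { return slot / 100 }, // toy epoch length
		evolvingNonce: []byte("corrupted"),
		currentEpoch:  3,
		blockCount:    42,
	}
	if err := nt.ResyncFromDB(); err != nil {
		fmt.Println("resync failed:", err)
		return
	}
	fmt.Println(string(nt.evolvingNonce), nt.currentEpoch, nt.blockCount)
}
```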

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

  • wcatz/goduckbot#21: Modifies historical sync/retry behavior and nonce state resynchronization with block channel draining and keepalive configuration.
  • wcatz/goduckbot#58: Related through nonce computation/database persistence and block batch insertion, complementing the ResyncFromDB nonce-sync mechanism.

Poem

🐰 A sync that stumbles now stands tall,
With nonces synced from database's call,
The keeper wakes from troubled sleep,
State aligned, connections deep,
Resilience blooms in every hop! 🌿✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: increasing keepalive timeout and resyncing nonce on retry, which directly corresponds to the changeset modifications in sync.go and nonce.go. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
main.go (1)

564-575: The 3-second sleep is a best-effort heuristic — consider a writer-flush signal for stronger guarantees.

The drain loop (len(blockCh) > 0) correctly waits for the writer to dequeue all buffered blocks, and the 3s sleep accounts for the in-flight batch. However, if a flushBlockBatch call takes longer than 3s (e.g., slow DB under load), ResyncFromDB could read stale nonce state.

A more robust alternative would be to add a flush-and-ack mechanism (e.g., send a sentinel value or use a separate sync.WaitGroup per flush cycle) so the retry path can deterministically wait for the writer to finish. That said, the current approach is a significant improvement over the previous behavior and is reasonable for now.
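
One way the suggested flush-and-ack mechanism could look is sketched below. It is an illustration under assumptions, not the project's writer: `Writer`, `item`, `Submit`, and `Flush` are hypothetical names, and "persisting" is modeled as appending to a slice. The flush request travels through the same channel as the blocks, so by the time the writer acknowledges it, everything queued ahead of it has been written.

```go
package main

import (
	"fmt"
	"sync"
)

// Block is a hypothetical stand-in for a synced block.
type Block struct{ Slot uint64 }

// item is either a block or a flush request; exactly one field is set.
type item struct {
	block *Block
	flush chan struct{} // closed by the writer once prior blocks are persisted
}

// Writer batches blocks and "persists" them to an in-memory slice.
type Writer struct {
	in        chan item
	batchSize int
	done      chan struct{}

	mu        sync.Mutex
	persisted []uint64
}

func NewWriter(buf, batchSize int) *Writer {
	w := &Writer{in: make(chan item, buf), batchSize: batchSize, done: make(chan struct{})}
	go w.run()
	return w
}

func (w *Writer) run() {
	defer close(w.done)
	var batch []Block
	flush := func() {
		w.mu.Lock()
		for _, b := range batch {
			w.persisted = append(w.persisted, b.Slot)
		}
		w.mu.Unlock()
		batch = batch[:0]
	}
	for it := range w.in {
		if it.flush != nil {
			flush()         // write the partial batch first
			close(it.flush) // then acknowledge
			continue
		}
		batch = append(batch, *it.block)
		if len(batch) >= w.batchSize {
			flush()
		}
	}
	flush() // channel closed: write whatever is left
}

// Submit queues a block for writing.
func (w *Writer) Submit(b Block) { w.in <- item{block: &b} }

// Flush blocks until every item queued before it has been persisted.
func (w *Writer) Flush() {
	ack := make(chan struct{})
	w.in <- item{flush: ack}
	<-ack
}

// Persisted returns a copy of the slots written so far.
func (w *Writer) Persisted() []uint64 {
	w.mu.Lock()
	defer w.mu.Unlock()
	return append([]uint64(nil), w.persisted...)
}

// Close stops the writer after it has drained its queue.
func (w *Writer) Close() { close(w.in); <-w.done }

func main() {
	w := NewWriter(16, 4)
	for s := uint64(1); s <= 6; s++ {
		w.Submit(Block{Slot: s})
	}
	w.Flush() // deterministic wait, no sleep needed
	fmt.Println("persisted before resync:", w.Persisted())
	w.Close()
}
```

With something like this in place, the retry path could call `Flush()` and then `ResyncFromDB()` without relying on a fixed sleep, at the cost of threading a second message type through the block channel.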


@wcatz wcatz merged commit eb531c1 into master Feb 11, 2026
2 checks passed
@wcatz wcatz deleted the fix/keepalive-resync branch February 11, 2026 20:01
wcatz added a commit that referenced this pull request Feb 18, 2026