fix(sync): increase keepalive timeout and resync nonce on retry #66
The default gouroboros keepalive (60s period / 10s pong timeout) causes spurious disconnects when the node is busy validating blocks. On retry, buffered blocks from the dead connection overlap with blocks from the new connection, corrupting the NonceTracker's evolving nonce state.

- Bump keepalive to 120s period / 30s timeout via WithKeepAliveConfig
- Add NonceTracker.ResyncFromDB() to reload state from database
- Drain channel buffer and resync nonce between retry attempts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough
This PR enhances synchronization robustness by introducing a nonce state resynchronization mechanism after failed historical syncs and adding keepalive configuration for network connections. The changes ensure in-memory state aligns with persisted database state across sync retries.
Sequence Diagram(s)
sequenceDiagram
participant HSH as Historical Sync Handler
participant BC as Block Channel
participant W as Writer
participant NT as NonceTracker
participant DB as Database
HSH->>BC: Drain remaining blocks after sync fails
HSH->>W: Wait for writer flush completion
W-->>HSH: Flush complete
HSH->>NT: ResyncFromDB()
NT->>NT: Acquire lock
NT->>DB: Query last synced slot
DB-->>NT: Last slot
NT->>DB: Fetch evolving nonce & block count for epoch
DB-->>NT: Nonce state
NT->>NT: Restore evolvingNonce, currentEpoch, blockCount
NT->>NT: Release lock
NT-->>HSH: Resync complete
HSH->>HSH: Proceed with backoff wait
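The NonceTracker portion of the diagram above (lock, query last slot, fetch nonce state, restore fields) could look roughly like the following. It is a sketch, not the PR's implementation: the `Store` interface, `epochState`, `memStore`, and the `epochForSlot` mapping are hypothetical, and only the field names `evolvingNonce`, `currentEpoch`, and `blockCount` come from the diagram.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// epochState is a hypothetical record of persisted nonce state for one epoch.
type epochState struct {
	nonce      []byte
	blockCount int
}

// Store is a hypothetical view of the database used by the tracker.
type Store interface {
	LastSyncedSlot() (uint64, error)
	NonceState(epoch uint64) (epochState, error)
}

type NonceTracker struct {
	mu            sync.Mutex
	store         Store
	epochForSlot  func(slot uint64) uint64 // assumed slot-to-epoch mapping
	evolvingNonce []byte
	currentEpoch  uint64
	blockCount    int
}

// ResyncFromDB replaces the in-memory state with what the store persisted.
func (nt *NonceTracker) ResyncFromDB() error {
	nt.mu.Lock()
	defer nt.mu.Unlock()

	slot, err := nt.store.LastSyncedSlot()
	if err != nil {
		return fmt.Errorf("query last synced slot: %w", err)
	}
	epoch := nt.epochForSlot(slot)
	st, err := nt.store.NonceState(epoch)
	if err != nil {
		return fmt.Errorf("fetch nonce state for epoch %d: %w", epoch, err)
	}
	// Copy the nonce so later mutations do not alias the store's slice.
	nt.evolvingNonce = append([]byte(nil), st.nonce...)
	nt.currentEpoch = epoch
	nt.blockCount = st.blockCount
	return nil
}

// memStore is an in-memory Store used only for this illustration.
type memStore struct {
	lastSlot uint64
	nonces   map[uint64]epochState
}

func (m *memStore) LastSyncedSlot() (uint64, error) { return m.lastSlot, nil }

func (m *memStore) NonceState(epoch uint64) (epochState, error) {
	st, ok := m.nonces[epoch]
	if !ok {
		return epochState{}, errors.New("no nonce state for epoch")
	}
	return st, nil
}

func main() {
	store := &memStore{
		lastSlot: 250,
		nonces:   map[uint64]epochState{2: {nonce: []byte{0xaa, 0xbb}, blockCount: 7}},
	}
	nt := &NonceTracker{
		store:         store,
		epochForSlot:  func(slot uint64) uint64 { return slot / 100 }, // toy epoch length
		evolvingNonce: []byte{0xff},                                   // stale in-memory state
		currentEpoch:  9,
		blockCount:    99,
	}
	if err := nt.ResyncFromDB(); err != nil {
		fmt.Println("resync failed:", err)
		return
	}
	fmt.Printf("epoch=%d blocks=%d nonce=%x\n", nt.currentEpoch, nt.blockCount, nt.evolvingNonce)
}
```

Holding the lock for the whole reload keeps other readers of the tracker from observing a half-restored state.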
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Summary
- Add NonceTracker.ResyncFromDB() to reload evolving nonce state from DB between retry attempts

Problem
The default keepalive pong timeout (10s) fires when cardano-node is busy validating blocks. Each retry creates a new ChainSyncer but shares the same buffered channel. Blocks from the dead connection overlap with blocks from the new connection, and ProcessBatch doesn't dedup, which corrupts the NonceTracker state. This caused 143 nonce mismatches on a clean sync (vs 4 on a sync with fewer timeouts).

Test plan
- go build ./...

🤖 Generated with Claude Code