fix(sync): overlap window on reconnect to prevent dropped blocks #97
When the cardano-node kills the connection (ExceededTimeLimit), blocks in the gouroboros muxer buffer are lost. Previously, reconnect resumed from GetLastSyncedSlot exactly, leaving gaps if later blocks were already flushed past the lost ones. Now getIntersectPoints backs up 100 blocks (~2000 slots) on reconnect. The node re-delivers blocks from the overlap point, filling any gaps. ON CONFLICT DO NOTHING handles duplicates in the DB, and InsertBlockBatch now returns the inserted count so flushBlockBatch can skip duplicate blocks during nonce evolution.
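The resume-point selection described above can be sketched as follows. This is a minimal illustration, not the PR's code: `Point`, `resumePoint`, and the in-memory slice are hypothetical stand-ins for the real `getIntersectPoints` and its store lookup, and the 20-slot spacing only matches the "~2000 slots" estimate.

```go
package main

import "fmt"

// Point is a hypothetical stand-in for a chain point (slot + block hash).
type Point struct {
	Slot uint64
	Hash string
}

// overlapWindowBlocks mirrors the 100-block window from the PR description.
const overlapWindowBlocks = 100

// resumePoint picks the intersect point for a reconnect. synced holds the
// stored blocks in ascending slot order. Instead of resuming from the tip,
// it backs up to the oldest block among the last `window` blocks, so the
// node re-delivers everything after that point.
func resumePoint(synced []Point, window int) (Point, bool) {
	if len(synced) == 0 || window <= 0 {
		return Point{}, false // nothing to back up to; the caller needs another start point
	}
	idx := len(synced) - window
	if idx < 0 {
		idx = 0 // fewer stored blocks than the window: back up to the first one
	}
	return synced[idx], true
}

func main() {
	// Build 250 stored blocks spaced 20 slots apart (assumed spacing).
	var synced []Point
	for i := 0; i < 250; i++ {
		synced = append(synced, Point{Slot: uint64(i * 20), Hash: fmt.Sprintf("h%03d", i)})
	}
	p, ok := resumePoint(synced, overlapWindowBlocks)
	fmt.Println(p.Slot, p.Hash, ok)
}
```

With the tip at slot 4980, the sketch resumes from slot 3000, roughly 2000 slots behind the tip.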
Sequence Diagram(s)
sequenceDiagram
participant Sync as Sync Engine
participant DB as InsertBlockBatch
participant Batcher as Block Batcher
participant Nonce as Nonce Evolve
Sync->>DB: Fetch last 100 blocks (overlap window)
DB-->>Sync: Resume point from oldest block
Sync->>Batcher: flushBlockBatch(batch)
Batcher->>DB: InsertBlockBatch(blocks)
DB-->>Batcher: (inserted count, error)
alt All blocks new (inserted == len(batch))
Batcher->>Nonce: Evolve nonce for entire batch
else Some blocks duplicate (inserted < len(batch))
Batcher->>Batcher: Slice batch to new blocks only
Batcher->>Nonce: Evolve nonce for new blocks
end
Nonce-->>Batcher: Nonce updated
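The alt/else branch in the diagram can be sketched in Go as below. Everything here is hypothetical: `memStore` imitates `ON CONFLICT DO NOTHING` in memory, and `newBlocks` assumes the duplicates form a prefix of the batch (the re-delivered overlap), so that the last `inserted` blocks are the new ones. The PR text does not say how `flushBlockBatch` chooses its slice, and this prefix assumption would not hold if the inserted rows filled a gap in the middle of the batch.

```go
package main

import "fmt"

// Block is a hypothetical minimal block record.
type Block struct {
	Slot uint64
	Hash string
}

// memStore is an in-memory stand-in for the database. insertBatch mimics
// INSERT ... ON CONFLICT DO NOTHING: known hashes are skipped and only the
// number of newly inserted rows is returned.
type memStore struct{ seen map[string]bool }

func (m *memStore) insertBatch(batch []Block) int {
	inserted := 0
	for _, b := range batch {
		if !m.seen[b.Hash] {
			m.seen[b.Hash] = true
			inserted++
		}
	}
	return inserted
}

// newBlocks returns the part of the batch that still needs nonce evolution.
// Assumption (not confirmed by the PR): duplicates are a prefix of the batch,
// so the trailing `inserted` blocks are the new ones.
func newBlocks(batch []Block, inserted int) []Block {
	if inserted <= 0 {
		return nil // whole batch was duplicates: nothing to evolve
	}
	if inserted >= len(batch) {
		return batch // all blocks new
	}
	return batch[len(batch)-inserted:]
}

func main() {
	store := &memStore{seen: map[string]bool{"a": true, "b": true}}
	batch := []Block{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}}
	inserted := store.insertBatch(batch)
	for _, b := range newBlocks(batch, inserted) {
		fmt.Println("evolve nonce with", b.Hash)
	}
}
```

A count alone cannot identify which rows were skipped, so a variant that has to handle mid-batch gaps would need per-row results from the insert rather than a total.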
Estimated code review effort: 4 (Complex), ~75 minutes
🧹 Nitpick comments (1)
sync.go (1)
140-162: LGTM. Overlap-based resume correctly prevents dropped blocks.
The 100-block overlap window addresses the root cause well: blocks buffered in the gouroboros muxer but not dispatched before connection death are now re-delivered on reconnect. The `ON CONFLICT DO NOTHING` in `InsertBlockBatch` handles the resulting duplicates gracefully.
Consider extracting the overlap window size to a named constant for clarity and easier tuning:
💡 Suggested constant extraction
+// overlapWindowBlocks is the number of blocks to back up on reconnect
+// to recover any blocks lost in the muxer buffer (~2000 slots at ~20 slots/block).
+const overlapWindowBlocks = 100
+
 func (s *ChainSyncer) getIntersectPoints(ctx context.Context) ([]pcommon.Point, error) {
 	// ...
-	blocks, err := s.store.GetLastNBlocks(ctx, 100)
+	blocks, err := s.store.GetLastNBlocks(ctx, overlapWindowBlocks)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sync.go` around lines 140 - 162, The hard-coded overlap window "100" in getIntersectPoints should be extracted to a named constant (e.g., blockOverlapWindow or overlapBlocks) so the value is clear and easier to tune; update the call to s.store.GetLastNBlocks(ctx, 100) to use the constant, update the log.Printf that mentions the overlap to reference the constant (or its value) and adjust the surrounding comment accordingly; touch getIntersectPoints, the GetLastNBlocks invocation, and the log.Printf message to use the new constant.
📒 Files selected for processing (5)
- comprehensive_test.go
- db.go
- main.go
- store.go
- sync.go
Summary
- `InsertBlockBatch` now returns inserted count to detect overlap duplicates
- `flushBlockBatch` slices overlap duplicates out before nonce evolution
Root cause: 4 blocks dropped during epoch 407 sync stall in test instance. The gouroboros muxer had blocks in its internal buffer when the connection died; they were never dispatched to the callback, and the reconnect resumed past them.
Test plan
- `go vet` clean