fix(sync): overlap window on reconnect to prevent dropped blocks #97

Merged
wcatz merged 1 commit into master from fix/sync-overlap-reconnect
Mar 9, 2026

Conversation

@wcatz
Owner

@wcatz wcatz commented Mar 9, 2026

Summary

  • Back up 100 blocks (~2000 slots) on reconnect instead of resuming from exact last slot
  • InsertBlockBatch now returns inserted count to detect overlap duplicates
  • flushBlockBatch slices overlap duplicates out before nonce evolution

Root cause: 4 blocks were dropped during a sync stall in epoch 407 on the test instance. The gouroboros muxer still had blocks in its internal buffer when the connection died; they were never dispatched to the callback, and the reconnect resumed past them.

Test plan

  • go vet clean
  • All 63 tests pass
  • Deploy to test, verify clean historical sync with no block count gaps
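
One way to check the last item is to scan the stored block numbers for holes. The sketch below is only an illustration: `findGaps` is a hypothetical helper, not part of this PR, and it assumes the block numbers have already been read from the store in ascending order.

```go
package main

import "fmt"

// findGaps returns every block number missing from an ascending list of
// stored block numbers. Hypothetical helper for illustration only.
func findGaps(blockNumbers []uint64) []uint64 {
	var missing []uint64
	for i := 1; i < len(blockNumbers); i++ {
		// Every number strictly between two neighbours is a dropped block.
		for n := blockNumbers[i-1] + 1; n < blockNumbers[i]; n++ {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	stored := []uint64{100, 101, 102, 104, 105, 108}
	fmt.Println(findGaps(stored)) // prints [103 106 107]
}
```

An empty result over the full synced range would correspond to "no block count gaps".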

Summary by CodeRabbit

  • Bug Fixes

    • Improved batch insert operations to detect and handle duplicate blocks more efficiently
    • Enhanced sync resumption logic for more reliable recovery during interrupted syncs
  • New Features

    • Bulk insert operations now report the count of successfully inserted blocks, enabling better visibility into batch processing results

When the cardano-node kills the connection (ExceededTimeLimit), blocks
in the gouroboros muxer buffer are lost. Previously, reconnect resumed
from GetLastSyncedSlot exactly, leaving gaps if later blocks were
already flushed past the lost ones.

Now getIntersectPoints backs up 100 blocks (~2000 slots) on reconnect.
The node re-delivers blocks from the overlap point, filling any gaps.
ON CONFLICT DO NOTHING handles duplicates in the DB, and
InsertBlockBatch now returns the inserted count so flushBlockBatch
can skip duplicate blocks during nonce evolution.
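
The resume-point choice described above can be sketched roughly as follows. This is a simplified illustration, not the code in sync.go: `Point`, `resumePoint`, and the `fallback` parameter are hypothetical names, and the real function works with gouroboros point types and a store lookup.

```go
package main

import "fmt"

// Point is a hypothetical stand-in for a chain point (slot plus block hash).
type Point struct {
	Slot uint64
	Hash string
}

// overlapWindow is the number of blocks to back up on reconnect.
const overlapWindow = 100

// resumePoint picks the intersect point for a reconnect. recent holds the
// stored blocks in ascending slot order. The oldest block inside the overlap
// window is returned so the node re-delivers everything after it. With no
// stored blocks, the caller-supplied fallback is used (an assumption here).
func resumePoint(recent []Point, fallback Point) Point {
	if len(recent) == 0 {
		return fallback
	}
	if len(recent) > overlapWindow {
		recent = recent[len(recent)-overlapWindow:]
	}
	return recent[0]
}

func main() {
	var stored []Point
	for i := 0; i < 250; i++ {
		stored = append(stored, Point{Slot: uint64(i) * 20, Hash: fmt.Sprintf("h%d", i)})
	}
	p := resumePoint(stored, Point{})
	fmt.Println(p.Slot) // prints 3000: block 150 of 250, i.e. 100 blocks back
}
```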
@coderabbitai

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

This PR updates InsertBlockBatch to return the count of inserted rows alongside error handling, enabling duplicate detection and selective processing. The sync logic switches from querying exact last-synced slots to an overlap-based resume strategy using the last 100 blocks, with nonce evolution adjusted based on insertion counts.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Return Type Signature Updates: store.go, db.go | Updated InsertBlockBatch interface and PostgreSQL implementation to return (int, error) instead of error, enabling callers to determine inserted row counts for duplicate handling. |
| Duplicate Handling Logic: main.go | Modified flushBlockBatch to capture inserted row counts and conditionally process overlapping blocks; nonce evolution now depends on whether all blocks are new or if duplicates were detected. |
| Overlap-Based Sync Strategy: sync.go | Replaced single-slot resume logic with overlap strategy: fetches last 100 blocks, uses oldest as intersection point, logs overlap size, and handles decoding failures with Shelley genesis fallback. |
| Test Pattern Alignment: comprehensive_test.go | Updated TestStoreBlockBatch calls to explicitly discard return values using if _, err := ... syntax, matching updated function signature. |
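
To make the (int, error) return concrete, here is a minimal in-memory sketch of the same contract. It is not the PostgreSQL implementation: `memStore` and `Block` are hypothetical, the context argument is omitted, and keying duplicates on the slot is an assumption, since the actual conflict target is not shown in this PR page.

```go
package main

import "fmt"

// Block is a hypothetical, trimmed-down block record.
type Block struct {
	Slot uint64
	Hash string
}

// memStore is an in-memory stand-in for the real store.
type memStore struct {
	bySlot map[uint64]Block
}

func newMemStore() *memStore {
	return &memStore{bySlot: make(map[uint64]Block)}
}

// InsertBlockBatch stores every block whose slot is not already present and
// returns how many were actually inserted, mimicking the effect of
// ON CONFLICT DO NOTHING (conflict key assumed to be the slot).
func (s *memStore) InsertBlockBatch(blocks []Block) (int, error) {
	inserted := 0
	for _, b := range blocks {
		if _, exists := s.bySlot[b.Slot]; exists {
			continue // duplicate from the overlap window, skip silently
		}
		s.bySlot[b.Slot] = b
		inserted++
	}
	return inserted, nil
}

func main() {
	s := newMemStore()
	first, _ := s.InsertBlockBatch([]Block{{Slot: 10}, {Slot: 20}, {Slot: 30}})
	// Two of these four blocks were already stored by the first call.
	second, _ := s.InsertBlockBatch([]Block{{Slot: 20}, {Slot: 30}, {Slot: 40}, {Slot: 50}})
	fmt.Println(first, second) // prints 3 2
}
```

A caller that only cares about failure can discard the count with `if _, err := s.InsertBlockBatch(...); err != nil`, which is the pattern the updated tests use.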

Sequence Diagram(s)

sequenceDiagram
    participant Sync as Sync Engine
    participant DB as InsertBlockBatch
    participant Batcher as Block Batcher
    participant Nonce as Nonce Evolve

    Sync->>DB: Fetch last 100 blocks (overlap window)
    DB-->>Sync: Resume point from oldest block
    Sync->>Batcher: flushBlockBatch(batch)
    Batcher->>DB: InsertBlockBatch(blocks)
    DB-->>Batcher: (inserted count, error)
    
    alt All blocks new (inserted == len(batch))
        Batcher->>Nonce: Evolve nonce for entire batch
    else Some blocks duplicate (inserted < len(batch))
        Batcher->>Batcher: Slice batch to new blocks only
        Batcher->>Nonce: Evolve nonce for new blocks
    end
    
    Nonce-->>Batcher: Nonce updated
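
The branch in the diagram can be sketched as a small slicing helper. This is an illustration under a stated assumption, not the code in main.go: `newBlocksAfterInsert` is a hypothetical name, and it assumes the duplicates form the leading part of the batch, which is what an in-order replay from an older intersect point would usually produce. A block that fills a gap in the middle of already-stored blocks would break that assumption.

```go
package main

import "fmt"

// Block is a hypothetical, trimmed-down block record.
type Block struct {
	Slot uint64
}

// newBlocksAfterInsert returns the part of batch that still needs nonce
// evolution, given how many rows the insert reported as new.
// Assumption: duplicates are the leading blocks of the batch.
func newBlocksAfterInsert(batch []Block, inserted int) []Block {
	if inserted <= 0 {
		return nil // the whole batch was overlap, nothing to evolve
	}
	if inserted >= len(batch) {
		return batch // all blocks are new
	}
	return batch[len(batch)-inserted:]
}

func main() {
	batch := []Block{{10}, {20}, {30}, {40}, {50}}
	// Suppose the store reported that only 2 of the 5 rows were new.
	fmt.Println(newBlocksAfterInsert(batch, 2)) // prints [{40} {50}]
}
```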

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 Hops with glee through overlapping blocks,
No more fretting o'er duplicate shocks!
A hundred-block window, so clever and wide,
Nonce evolution skips what's already tried,
Staging tables sync with PostgreSQL's grace,
While SQLite keeps pace in this duplicates race!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: implementing an overlap window strategy on reconnect to prevent dropped blocks, which aligns with the primary objective of backing up 100 blocks to avoid gaps from buffered blocks lost on connection failure. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
sync.go (1)

140-162: LGTM — Overlap-based resume correctly prevents dropped blocks.

The 100-block overlap window addresses the root cause well: blocks buffered in the gouroboros muxer but not dispatched before connection death are now re-delivered on reconnect. The ON CONFLICT DO NOTHING in InsertBlockBatch handles the resulting duplicates gracefully.

Consider extracting the overlap window size to a named constant for clarity and easier tuning:

💡 Suggested constant extraction
+// overlapWindowBlocks is the number of blocks to back up on reconnect
+// to recover any blocks lost in the muxer buffer (~2000 slots at ~20 slots/block).
+const overlapWindowBlocks = 100
+
 func (s *ChainSyncer) getIntersectPoints(ctx context.Context) ([]pcommon.Point, error) {
     // ...
-    blocks, err := s.store.GetLastNBlocks(ctx, 100)
+    blocks, err := s.store.GetLastNBlocks(ctx, overlapWindowBlocks)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sync.go` around lines 140 - 162, The hard-coded overlap window "100" in
getIntersectPoints should be extracted to a named constant (e.g.,
blockOverlapWindow or overlapBlocks) so the value is clear and easier to tune;
update the call to s.store.GetLastNBlocks(ctx, 100) to use the constant, update
the log.Printf that mentions the overlap to reference the constant (or its
value) and adjust the surrounding comment accordingly; touch getIntersectPoints,
the GetLastNBlocks invocation, and the log.Printf message to use the new
constant.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 44813f85-673b-4137-94c7-2d57cc003d8e

📥 Commits

Reviewing files that changed from the base of the PR and between 8d5c3f5 and 322be97.

📒 Files selected for processing (5)
  • comprehensive_test.go
  • db.go
  • main.go
  • store.go
  • sync.go

@wcatz wcatz merged commit 51e6f93 into master Mar 9, 2026
5 checks passed
@wcatz wcatz deleted the fix/sync-overlap-reconnect branch March 9, 2026 14:43
