Conversation

@tac0turtle
Contributor

@tac0turtle tac0turtle commented Oct 30, 2025

Summary

  • Flattened the syncer’s P2P handler so it runs directly inside tryFetchFromP2P, dropping the goroutine + semaphore watcher layer.

  • Rewrote the handler to stream a height range synchronously, emitting events only when both header and data are available and the proposer matches genesis.

  • Added unit tests with tighter coverage for the new flow (missing data/header, proposer mismatch, processed-height skipping, etc.).

Motivation

  • The notifier integration already gives us event-driven fan-out; the per-height watcher machinery had become redundant complexity with awkward semaphore limits and cancellation paths.
  • Simpler, synchronous code is easier to reason about and better matches the new notifier-based design.
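
For orientation, here is a minimal sketch of the flattened, synchronous shape described above. The names (handler, store, heightEvent) are illustrative stand-ins, not the actual types in this PR.

package sketch

import (
    "bytes"
    "context"
)

// Illustrative stand-ins for the real header/data types and stores.
type block struct {
    Height   uint64
    Proposer []byte
}

type store interface {
    GetByHeight(ctx context.Context, height uint64) (block, error)
}

type heightEvent struct {
    Height       uint64
    Header, Data block
}

type handler struct {
    headers, data   store
    genesisProposer []byte
}

// processRange walks a height range synchronously and emits an event only
// when both header and data are available and the proposer matches genesis.
func (h *handler) processRange(ctx context.Context, from, to uint64, out chan<- heightEvent) {
    for height := from; height <= to; height++ {
        if ctx.Err() != nil {
            return
        }
        hdr, err := h.headers.GetByHeight(ctx, height)
        if err != nil {
            continue // header not yet available
        }
        data, err := h.data.GetByHeight(ctx, height)
        if err != nil {
            continue // data not yet available
        }
        if !bytes.Equal(hdr.Proposer, h.genesisProposer) {
            continue // proposer mismatch with genesis: skip this height
        }
        out <- heightEvent{Height: height, Header: hdr, Data: data}
    }
}

Because the walk is synchronous, cancellation reduces to the ctx.Err() check at the top of the loop; no watcher goroutines or semaphores are needed.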

@github-actions
Contributor

github-actions bot commented Oct 30, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Oct 31, 2025, 4:22 PM

@claude

This comment was marked as resolved.

@codecov

codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 66.53386% with 84 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.33%. Comparing base (31859a7) to head (b3f0a90).

Files with missing lines | Patch % | Lines
pkg/sync/sync_service.go | 32.14% | 36 Missing and 2 partials ⚠️
block/internal/syncing/syncer.go | 60.75% | 22 Missing and 9 partials ⚠️
block/internal/syncing/p2p_handler.go | 68.75% | 14 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2798      +/-   ##
==========================================
+ Coverage   62.37%   64.33%   +1.95%     
==========================================
  Files          82       81       -1     
  Lines        7304     7250      -54     
==========================================
+ Hits         4556     4664     +108     
+ Misses       2203     2044     -159     
+ Partials      545      542       -3     
Flag | Coverage Δ
combined | 64.33% <66.53%> (+1.95%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

@alpe alpe left a comment


Very nice start!

@tac0turtle
Contributor Author

@claude review again please

@claude
Contributor

claude bot commented Oct 31, 2025

Claude finished @tac0turtle's task.


Code Review Complete

I've reviewed the refactored syncing system. This is a significant architectural improvement with much cleaner separation of concerns. Below are my findings:


Summary of Changes Since Previous Review

The team has made substantial improvements:

  • Simplified P2P handler: Removed complex watcher goroutine management (p2p_handler.go:101-147)
  • Inline worker helpers: Removed wrapper methods as @alpe suggested (syncer.go:322-369)
  • Improved store wrapper: Now uses embedding pattern more cleanly (store_wrapper.go:11-39)
  • New notifier pattern: Added dedicated event notification system (pkg/sync/notifier/)

Critical Issues Fixed ✅

1. Missing Shutdown() method - RESOLVED BY REMOVAL ✅

The previous review identified that the P2PHandler.Shutdown() method was declared in the interface (syncer.go:37) but never implemented. The current interface no longer declares it:

// syncer.go:32-37
type p2pHandler interface {
    ProcessHeaderRange(ctx context.Context, fromHeight, toHeight uint64, heightInCh chan<- common.DAHeightEvent)
    ProcessDataRange(ctx context.Context, fromHeight, toHeight uint64, heightInCh chan<- common.DAHeightEvent)
    SetProcessedHeight(height uint64)
    OnHeightProcessed(height uint64)
    // Shutdown() method is NOT in the interface anymore!
}

Dropping the method rather than implementing it is the correct fix because:

  • The refactored P2PHandler (p2p_handler.go) is now stateless and doesn't spawn goroutines
  • The processHeight method (p2p_handler.go:101-147) is synchronous
  • No cleanup is needed since there are no background workers

Verdict: This critical issue is RESOLVED by simplification rather than implementation. ✅

2. Race condition in processedHeight - FIXED

The previous review identified a race condition where processedHeight was read under lock, then used after unlocking. The new code fixes this:

// p2p_handler.go:71-78 (FIXED)
h.mu.Lock()
shouldProcess := height > h.processedHeight
h.mu.Unlock()

if !shouldProcess {
    continue
}
h.processHeight(ctx, height, heightInCh, "header_range")

The lock is now held only for the comparison, and the decision is made atomically. Once we know shouldProcess, we don't need to hold the lock because the decision is final. This is correct. ✅


@alpe's Feedback - All Addressed ✅

1. Nil checks removed

  • Previous code had redundant nil checks for headerStore/dataStore (syncer.go:271-280)
  • Fixed: these checks have been removed from the current code

2. Inline worker methods

  • @alpe suggested inlining daWorkerLoop() and pendingWorkerLoop()
  • Not inlined but improved: The methods still exist (syncer.go:322-369) but now have cleaner structure with defer wg.Done() inside each worker

3. Store wrapper placement

  • @alpe suggested using embedding and only overriding Append
  • Fixed: store_wrapper.go:11-39 now uses clean embedding pattern:
type instrumentedStore[H header.Header[H]] struct {
    header.Store[H]  // Embedded
    publish publishFn[H]
}

func (s *instrumentedStore[H]) Append(ctx context.Context, headers ...H) error {
    // Only Append is overridden: delegate to the embedded store, then publish
    // each appended header (sketch of the pattern, not the exact code)
    if err := s.Store.Append(ctx, headers...); err != nil {
        return err
    }
    for _, hdr := range headers {
        s.publish(hdr)
    }
    return nil
}

Remaining Issues & Recommendations

Major Issues

1. Potential Memory Exhaustion in Pending Events Cache ⚠️

The syncer has a 10,000 element buffer (syncer.go:114) and unlimited cache storage for pending events:

// syncer.go:428, 484, 742
s.cache.SetPendingEvent(height, event)

Problem: Under network partition or if DA is ahead of P2P sync, the pending events cache can grow unbounded. The 10k channel buffer could also consume significant memory since each event contains full headers and data.

Recommendation:

  • Implement a maximum pending events limit (e.g., 1000)
  • Add LRU eviction when limit is reached
  • Add metrics to monitor pending event queue depth
  • Consider making channel buffer size configurable
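
A minimal sketch of the bounded-cache idea, assuming a map keyed by height; maxPendingEvents, boundedPending, and Event are hypothetical names rather than the PR's cache API.

package sketch

import (
    "math"
    "sync"
)

// Event stands in for common.DAHeightEvent; illustrative only.
type Event struct{ Height uint64 }

const maxPendingEvents = 1000 // hypothetical limit

// boundedPending stores at most maxPendingEvents entries and evicts the
// lowest height when full; an LRU list would work equally well.
type boundedPending struct {
    mu     sync.Mutex
    events map[uint64]Event
}

func (p *boundedPending) Set(height uint64, ev Event) {
    p.mu.Lock()
    defer p.mu.Unlock()
    if p.events == nil {
        p.events = make(map[uint64]Event)
    }
    if _, exists := p.events[height]; !exists && len(p.events) >= maxPendingEvents {
        lowest := uint64(math.MaxUint64)
        for h := range p.events {
            if h < lowest {
                lowest = h
            }
        }
        delete(p.events, lowest) // evict the oldest pending height
    }
    p.events[height] = ev
}

A gauge on len(p.events) would also cover the queue-depth monitoring recommendation above.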

2. Store GetByHeight Errors Not Propagated ⚠️

In p2p_handler.go:102-120, errors from store operations are only logged:

header, err := h.headerStore.GetByHeight(ctx, height)
if err != nil {
    if ctx.Err() == nil {
        h.logger.Debug().Uint64("height", height).Err(err).Str("source", source).Msg("header unavailable in store")
    }
    return  // Error silently swallowed
}

Problem: If there's a persistent store issue (corruption, disk full), the syncer will never know. This could lead to indefinite stalls.

Recommendation:

  • Consider using a circuit breaker pattern for repeated failures
  • Emit metrics for store error rates
  • Propagate critical errors to the syncer's error channel
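
As a rough sketch of the circuit-breaker suggestion (storeBreaker, its threshold, and the report hook are hypothetical, not part of the PR):

package sketch

import "sync/atomic"

// storeBreaker counts consecutive store failures and surfaces them once a
// threshold is crossed, instead of logging each one at debug level.
type storeBreaker struct {
    consecutive atomic.Int64
    threshold   int64
    report      func(err error) // e.g. forward to the syncer's error channel
}

func (b *storeBreaker) observe(err error) {
    if err == nil {
        b.consecutive.Store(0) // success resets the streak
        return
    }
    if b.consecutive.Add(1) >= b.threshold {
        b.report(err)
        b.consecutive.Store(0)
    }
}

Each GetByHeight call site would pass its error through observe, so transient misses stay quiet while persistent failures reach the syncer.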

3. No Context Propagation in Broadcasting ⚠️

In syncer.go:519-533, the errgroup creates a new context instead of using the syncer's context:

g, ctx := errgroup.WithContext(s.ctx)  // Creates new context
g.Go(func() error {
    return s.headerStore.WriteToStoreAndBroadcast(ctx, event.Header, ...)
})

Problem: If the syncer is shutting down during broadcast, the operation might block shutdown.

Recommendation: Use the syncer's context directly or ensure the errgroup context is cancelled when s.ctx is done.

Minor Issues

4. Polling Intervals Not Configurable

Both DA worker (syncer.go:333) and pending worker (syncer.go:358) use hardcoded 10ms polling:

pollInterval := min(10*time.Millisecond, s.config.Node.BlockTime.Duration)
ticker := time.NewTicker(10 * time.Millisecond)

Recommendation: Make these configurable or use event-driven wakeup instead of polling.
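
One possible shape for the event-driven wakeup, as a sketch with hypothetical names rather than the PR's worker code:

package sketch

import (
    "context"
    "time"
)

// pendingWorker wakes on an explicit signal and keeps a configurable poll
// interval only as a safety net, replacing the fixed 10ms ticker.
func pendingWorker(ctx context.Context, wake <-chan struct{}, poll time.Duration, process func()) {
    ticker := time.NewTicker(poll)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-wake: // event-driven path: a new event was cached
        case <-ticker.C: // fallback polling path
        }
        process()
    }
}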

5. Test Helper Duplication

buildTestSigner (p2p_handler_test.go:28-39) and buildSyncTestSigner (syncer_test.go, not shown but likely exists) are duplicated.

Recommendation: Consolidate into shared test utilities.


Architecture Improvements ✅

The refactor achieves excellent separation of concerns:

Event-Driven Design: The new notifier pattern (pkg/sync/notifier/) is well-designed:

  • Thread-safe with proper use of RWMutex
  • Best-effort delivery semantics prevent blocking
  • Clean subscription management with proper cleanup
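
As an illustration of those properties only (this is not the actual pkg/sync/notifier implementation), a minimal best-effort notifier could look like:

package sketch

import "sync"

// Notifier fans out wake-up signals to subscribers; delivery is best-effort
// and never blocks the publisher.
type Notifier struct {
    mu   sync.RWMutex
    subs map[int]chan struct{}
    next int
}

func NewNotifier() *Notifier {
    return &Notifier{subs: make(map[int]chan struct{})}
}

// Subscribe returns a signal channel plus an unsubscribe func for cleanup.
func (n *Notifier) Subscribe() (<-chan struct{}, func()) {
    n.mu.Lock()
    defer n.mu.Unlock()
    id := n.next
    n.next++
    ch := make(chan struct{}, 1)
    n.subs[id] = ch
    return ch, func() {
        n.mu.Lock()
        defer n.mu.Unlock()
        delete(n.subs, id)
    }
}

// Notify signals every subscriber without blocking.
func (n *Notifier) Notify() {
    n.mu.RLock()
    defer n.mu.RUnlock()
    for _, ch := range n.subs {
        select {
        case ch <- struct{}{}:
        default: // subscriber already has a pending signal; skip
        }
    }
}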

Simplified P2P Handler: The refactored p2p_handler.go is much cleaner:

  • No complex watcher goroutine management
  • Synchronous processing with context support
  • Clear separation of header and data range processing

Worker Goroutines: The syncer now has dedicated workers:

  • DA worker (syncer.go:322-346): Polls DA layer
  • Pending worker (syncer.go:348-369): Processes cached events
  • P2P notifier consumers (syncer.go:293-320): React to store updates
  • Main process loop (syncer.go:250-263): Coordinates event processing

Lifecycle Management: Proper use of sync.WaitGroup ensures clean shutdown:

  • All goroutines tracked via shared waitgroup (syncer.go:79)
  • Context-based cancellation throughout
  • Subscriptions cleaned up in Stop() (syncer.go:161-166)

Security & Performance

Security ✅:

  • Proposer validation preserved (p2p_handler.go:149-156)
  • Header-data consistency checks maintained (p2p_handler.go:123-131)
  • Signature verification still present (syncer.go:681)

Performance:

  • ✅ Non-blocking signal sending prevents deadlocks
  • ✅ Parallel DA broadcasting with errgroup
  • ✅ Efficient pending event processing
  • ⚠️ Frequent mutex locking in P2P handler (every height in range)
    • Consider using atomic operations for processedHeight or batching lock acquisitions
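
A small sketch of the atomic alternative for processedHeight (illustrative only, not code from the PR):

package sketch

import "sync/atomic"

// processedTracker keeps the last processed height in an atomic, avoiding a
// mutex acquisition per height in the range loop.
type processedTracker struct {
    height atomic.Uint64
}

func (t *processedTracker) shouldProcess(h uint64) bool {
    return h > t.height.Load()
}

// onProcessed advances the high-water mark monotonically.
func (t *processedTracker) onProcessed(h uint64) {
    for {
        cur := t.height.Load()
        if h <= cur || t.height.CompareAndSwap(cur, h) {
            return
        }
    }
}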

Test Coverage

Strengths:

  • Good coverage of happy paths
  • Comprehensive notifier tests with concurrency
  • P2P handler tests cover basic scenarios

Gaps:

  • Missing shutdown tests (goroutine leak detection)
  • Limited error injection tests
  • No tests for bounded cache scenarios
  • No tests for context cancellation during operations

Documentation

Missing:

  • No architecture documentation explaining the worker model
  • Interface methods lack doc comments (p2pHandler interface)
  • Magic numbers (10k buffer, 10ms polling) lack explanation

Recommendation: Add a design doc or README explaining:

  • Worker goroutine lifecycle
  • Event flow through the system (DA → cache → syncer → executor)
  • Shutdown sequence
  • When to use the notifier pattern

Priority Recommendations

Must Address:

  1. Add bounded pending event cache - Prevent memory exhaustion under network partition
  2. Add metrics for monitoring - Track pending queue depth, store error rates, processing latency
  3. Propagate critical store errors - Use circuit breaker or error reporting

Should Address:
4. Make polling intervals configurable
5. Add shutdown and error injection tests
6. Add architecture documentation

Nice to Have:
7. Use atomic operations for processedHeight to reduce lock contention
8. Consolidate test helpers
9. Add observability (tracing spans)


Final Verdict

This refactor is a significant improvement over the previous monolithic design. The event-driven architecture with dedicated workers is much more maintainable and testable. The critical issues from the previous review have been resolved through simplification.

Major strengths:

  • ✅ Clean separation of concerns
  • ✅ Event-driven design scales better
  • ✅ Simplified P2P handler (no watcher goroutines)
  • ✅ Proper lifecycle management with waitgroups
  • ✅ Good use of context for cancellation

Key areas for improvement:

  • ⚠️ Bounded cache for pending events
  • ⚠️ Better error propagation from stores
  • ⚠️ More comprehensive testing

Recommendation: This is ready to merge with the understanding that the pending event cache bounds should be addressed in a follow-up PR to prevent memory issues under extreme conditions.


@tac0turtle tac0turtle requested a review from alpe October 31, 2025 11:08
@tac0turtle tac0turtle marked this pull request as ready for review October 31, 2025 11:10
@github-actions
Contributor

github-actions bot commented Oct 31, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://evstack.github.io/docs-preview/pr-2798/

Built to branch main at 2025-10-31 16:22 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@tac0turtle tac0turtle requested review from pthmas and renaynay October 31, 2025 11:23
alpe
alpe previously approved these changes Oct 31, 2025
Contributor

@alpe alpe left a comment


The changes look good. Nice work.
utACK

Base automatically changed from rene-tryfix to main October 31, 2025 14:42
@tac0turtle tac0turtle dismissed alpe’s stale review October 31, 2025 14:42

The base branch was changed.

@tac0turtle tac0turtle requested a review from alpe October 31, 2025 15:03