Skip to content

Conversation

@GheisMohammadi
Copy link
Collaborator

@GheisMohammadi GheisMohammadi commented Sep 9, 2025

This PR introduces a comprehensive error classification system for stream read/write operations, replacing the previous behavior where streams were immediately closed on any error. The new system categorizes errors into recoverable and critical types, allowing streams to continue operating for transient issues while only closing streams for critical errors that require termination.

Problem

Previously, the stream implementation would close streams immediately upon encountering any error during read or write operations. This aggressive approach led to unnecessary stream closures for transient network issues, timeouts, and other recoverable errors, reducing system resilience and efficiency.

Solution

Error Classification System

  • New Error Types: Introduced a comprehensive error classification system with 11 distinct error types:
    • ErrorTypeRemoteDisconnect - Remote peer disconnected
    • ErrorTypeConnectionReset - Connection reset by peer
    • ErrorTypeBrokenPipe - Broken pipe errors
    • ErrorTypeTimeout - Network timeouts
    • ErrorTypeProgressTimeout - Progress timeout (application-level)
    • ErrorTypeReadDeadline / ErrorTypeWriteDeadline - Deadline errors
    • ErrorTypeLocalNetwork - Local network issues
    • ErrorTypeResourceExhaustion - System resource exhaustion
    • ErrorTypeProtocol - Protocol errors
    • ErrorTypeUnknown - Unclassified errors

Key Features

  1. Intelligent Error Classification: Uses type assertions first (for reliability) followed by string matching for application-specific errors
  2. Recoverable vs Critical Errors: Distinguishes between errors that allow stream continuation vs those requiring closure
  3. Retry Logic with Limits: Implements retry limits (MaxRecoverableRetries) to prevent infinite loops while allowing recovery from transient errors
  4. Centralized Decision Making: Introduces ShouldCloseStream() function for consistent error handling across the codebase
  5. Metrics Integration: Adds metrics to track recoverable and critical error counts for monitoring and observability
  6. Comprehensive Testing: Includes 509 lines of unit tests covering edge cases and error classification scenarios

Error Handling Behavior

  • Recoverable Errors (stream continues):

    • Timeouts (network, progress, read/write deadlines)
    • Local network issues
    • These errors are logged but don't trigger stream closure
  • Critical Errors (stream closes):

    • Remote peer disconnects
    • Connection resets
    • Broken pipes
    • Resource exhaustion
    • Protocol errors
    • Unknown errors (defaults to critical)

Changes

  • Added p2p/stream/types/error_handler.go with error classification logic
  • Updated p2p/stream/protocols/sync/sync_stream.go to use new error handling
  • Updated p2p/stream/types/stream.go with improved error handling
  • Added comprehensive unit tests in p2p/stream/types/error_handler_test.go
  • Added metrics in p2p/stream/types/metric.go
  • Added retry limit constant in p2p/stream/protocols/sync/const.go

Testing

  • Added 509 lines of unit tests covering:
    • Nil error handling
    • EOF detection
    • Syscall error classification
    • Network error handling
    • String matching fallbacks
    • Edge cases and error combinations
    • ShouldCloseStream() decision logic

Metrics

New metrics added to track:

  • Recoverable error counts by error type
  • Critical error counts by error type
  • Stream closures due to excessive recoverable errors

Benefits

  1. Improved Resilience: Streams no longer close unnecessarily for transient errors
  2. Better Resource Utilization: Reduces connection churn and reconnection overhead
  3. Enhanced Observability: Metrics provide visibility into error patterns
  4. Consistent Behavior: Centralized error handling ensures consistent decisions across the codebase
  5. Maintainability: Clear error classification makes debugging and monitoring easier

Breaking Changes

None - this is a backward-compatible improvement to error handling.

@GheisMohammadi GheisMohammadi self-assigned this Sep 9, 2025
@GheisMohammadi GheisMohammadi marked this pull request as draft September 9, 2025 11:55
@GheisMohammadi GheisMohammadi force-pushed the improve/stream_error_classification branch 2 times, most recently from 0bb6aaf to b4026c8 Compare September 12, 2025 09:57
@mur-me
Copy link
Collaborator

mur-me commented Sep 23, 2025

I personally like the direction of this PR, but, can we also have this logic as Prometheus metrics? 🙏

@GheisMohammadi GheisMohammadi force-pushed the improve/stream_error_classification branch from b4026c8 to b9c5100 Compare November 18, 2025 22:49
@GheisMohammadi GheisMohammadi force-pushed the improve/stream_error_classification branch from b9c5100 to e6d8776 Compare November 19, 2025 22:20
@GheisMohammadi GheisMohammadi marked this pull request as ready for review November 27, 2025 09:50
@GheisMohammadi GheisMohammadi merged commit 25e7674 into dev Nov 27, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants