Overall difficulty: HIGH. The lava repo already has a mature, battle-tested resilience stack deeply woven into its session management, relay state machines, and provider lifecycle. failsafe-go would provide cleaner abstractions and composability, but the current patterns carry significant domain-specific logic (blockchain reporting, epoch management, provider scoring, archive upgrades, cross-validation consensus) that failsafe-go cannot express natively. The integration is feasible but would be a multi-sprint refactoring effort with substantial regression risk.
Recommendation: Incremental, bottom-up adoption — start with the simplest, most isolated patterns (Timeout, WebSocket backoff) and work upward. Do not attempt a big-bang replacement.
Current: ConsumerRelayStateMachine / SmartRouterRelayStateMachine, RelayRetriesManager, RelayProcessor.shouldRetryRelay()
Difficulty: HIGH
| Aspect | Detail |
|---|---|
| What it replaces | The ticker-based retry loops in both state machines, the RelayRetryLimit constant, RelayRetryBackoffDuration, and the retry decision logic in RelayProcessor.shouldRetryRelay() |
| What it does NOT replace | RelayRetriesManager (hash-based 6h deduplication cache), provider selection/rotation via UsedProviders, archive extension upgrade on retry #1, epoch mismatch special handling (always retry), Solana/unsupported-method abort conditions |
| How | retrypolicy.NewBuilder[*RelayResult]().WithMaxRetries(2).WithDelay(2*time.Millisecond).AbortIf(isNonRetryableError).HandleIf(isRetryableError).OnRetry(updateProviderSelection).Build() |
| Why it's hard | The retry loop is not a simple "call → fail → retry same call." Each retry iteration involves: (a) selecting a different provider via UsedProviders, (b) potentially upgrading to archive extension on retry #1, (c) checking the hash dedup cache, (d) respecting different rules per selection mode (Stateless=retry, Stateful/CrossValidation=no retry). These are side-effectful state transitions between retries, not just backoff-and-replay. failsafe-go's OnRetry hook can handle some of this, but the provider-rotation-as-retry-mechanism is a fundamental architectural mismatch. |
Current: Session blocking (BlockListed), provider blocking (_blockProvider), consecutivePairingErrors (SmartRouter), MaximumNumberOfFailuresAllowedPerConsumerSession
Difficulty: HIGH
| Aspect | Detail |
|---|---|
| What it replaces | The consecutive error tracking and blocking decision matrix in OnSessionFailure, the consecutivePairingErrors circuit breaker in SmartRouter, session BlockListed flag |
| What it does NOT replace | Blockchain provider reporting (side effect of blocking), epoch-aware provider carry-over, retrySecondChanceAfter (3 min grace period), per-endpoint blocklisting |
| How | One CircuitBreaker per provider: circuitbreaker.NewBuilder[*RelayResult]().WithFailureThreshold(MaxFailures).WithDelay(3*time.Minute).OnOpen(reportToBlockchain).Build() |
| Why it's hard | The current circuit breaking is multi-level (session → endpoint → provider → epoch) and tightly coupled to ConsumerSessionManager's internal maps. A failsafe-go circuit breaker is a standalone object per "resource" — you'd need to manage a map of CircuitBreaker instances keyed by provider address, and synchronize their state with the existing session lifecycle. The "second chance" mechanism maps conceptually to half-open state, but the recovery probe logic (checking if provider is back) is custom. The provider blocking also triggers blockchain reports, which is a side effect that goes beyond what a circuit breaker normally does. |
Current: common/timeout.go (GetTimePerCu, LocalNodeTimePerCu), context.WithTimeout calls, three-level timeout hierarchy
Difficulty: LOW-MEDIUM
| Aspect | Detail |
|---|---|
| What it replaces | The context.WithTimeout(processingCtx, processingTimeout) calls in both servers, the per-relay timeout logic in state machines |
| What it does NOT replace | The adaptive timeout calculation (GetTimePerCu based on CU, hanging API flag, stateful flag), subscription first-reply timeout (10s) |
| How | timeout.New[*RelayResult](calculatedTimeout) composed inside a retry policy: failsafe.With(retryPolicy, timeoutPolicy).Get(sendRelay) — this gives per-attempt timeouts with retry on timeout |
| Why it's approachable | The timeout calculation is cleanly separated from its enforcement. You keep GetTimePerCu() to compute the duration, then wrap it in a failsafe-go Timeout instead of raw context.WithTimeout. The main benefit: composing timeout inside retry gives you per-attempt timeouts automatically, which the current code achieves manually. |
Current: websocket_backoff.go (SmartRouter), ExponentialBackoff struct
Difficulty: LOW
| Aspect | Detail |
|---|---|
| What it replaces | The entire ExponentialBackoff struct and NextBackoff()/Reset()/Clone() methods, reconnection retry loops in upstream_ws_pool.go |
| How | retrypolicy.NewBuilder[*ws.Conn]().WithBackoff(100*time.Millisecond, 30*time.Second).WithJitterFactor(0.3).WithMaxRetries(10).Build() — drop-in replacement |
| Why it's easy | This is the cleanest integration point. The WebSocket backoff is self-contained, well-isolated, and maps 1:1 to failsafe-go's retry with exponential backoff. The subscription restoration after reconnect can go in an OnSuccess listener. |
Current: ProviderOptimizer, ConsumerSessionManager.GetSessions(), UsedProviders, QoS scoring
Difficulty: N/A (not replaceable)
| Aspect | Detail |
|---|---|
| Why | failsafe-go has no concept of "choose a different backend on each attempt." Provider selection is domain logic (QoS scoring, stake weighting, T-Digest percentile normalization, strategy-based selection). This must remain as-is. failsafe-go's retry can trigger re-selection, but the selection logic itself stays. |
Current: Stateful mode (send to all), CrossValidation (send to N, check consensus)
Difficulty: MEDIUM
| Aspect | Detail |
|---|---|
| What it replaces | The parallel goroutine fan-out in Stateful mode that sends to all top providers and returns the first result |
| What it does NOT replace | CrossValidation mode — failsafe-go's hedge cancels on first success, but cross-validation needs agreement threshold (quorum). This is fundamentally different from hedging. |
| How | For Stateful: hedgepolicy.NewBuilderWithDelay[*RelayResult](0).WithMaxHedges(numProviders-1).Build() — immediate parallel execution, first success wins |
| Why it's medium | Stateful mode maps well to hedging. CrossValidation does not — it requires consensus logic that failsafe-go doesn't support. You'd keep CrossValidation as custom code. |
Current: WebsocketConnectionLimiter, per-client subscription limits, ClientRateLimiter
Difficulty: LOW-MEDIUM
| Aspect | Detail |
|---|---|
| What it replaces | The WebSocketRateLimit enforcement, subscription-per-minute limits |
| What it does NOT replace | Per-IP connection tracking (stateful, needs IP-keyed map), ban duration logic |
| How | ratelimiter.NewSmooth[any](maxRequests, time.Second) per client or per IP |
| Why it's medium | The current rate limiting is per-IP with banning, which requires maintaining state keyed by IP. You'd need a map of RateLimiter instances per IP, plus custom ban logic on top. |
Current: MaxCallsPerRelay = 50, max-concurrent-providers flag
Difficulty: LOW
| Aspect | Detail |
|---|---|
| What it replaces | The implicit concurrency limits on parallel provider calls |
| How | bulkhead.New[*RelayResult](maxConcurrentProviders) |
| Why it's easy | Simple concurrency cap, maps directly. |
Current: Cache lookup before relay, backup provider tier, archive extension upgrade
Difficulty: LOW-MEDIUM
| Aspect | Detail |
|---|---|
| What it replaces | The cache-hit-returns-early pattern, backup provider fallback |
| How | fallback.NewWithFunc(func(exec failsafe.Execution[*RelayResult]) (*RelayResult, error) { return cache.Lookup(key) }) |
| Why | Cache fallback is clean. Backup provider tier is more complex (requires session manager interaction). |
| Pattern | failsafe-go Policy | Difficulty | Risk |
|---|---|---|---|
| WebSocket backoff | RetryPolicy | LOW | Low |
| Timeout enforcement | Timeout | LOW-MEDIUM | Low |
| Bulkhead | Bulkhead | LOW | Low |
| Cache fallback | Fallback | LOW-MEDIUM | Low |
| Rate limiting | RateLimiter | LOW-MEDIUM | Medium |
| Stateful hedging | HedgePolicy | MEDIUM | Medium |
| Relay retry logic | RetryPolicy | HIGH | High |
| Circuit breaker | CircuitBreaker | HIGH | High |
| Provider selection | N/A | N/A | — |
| CrossValidation | N/A | N/A | — |
- Replace
ExponentialBackoffin SmartRouter WebSocket with failsafe-goRetryPolicy - Replace raw
context.WithTimeoutwith failsafe-goTimeout(composability benefit) - Add
Bulkheadfor concurrent provider calls
- Introduce failsafe-go
Fallbackfor cache-miss-to-relay pattern - Replace Stateful mode parallel fan-out with
HedgePolicy - Add
RateLimiterper-client (keep per-IP custom logic)
- Refactor relay state machines to use failsafe-go
RetryPolicywithOnRetryhooks for provider rotation - Introduce per-provider
CircuitBreakerinstances managed byConsumerSessionManager - Compose full policy stacks:
Fallback → Retry → CircuitBreaker → Timeout
Phase 3 is where the real complexity lives — the relay state machines (consumer_relay_state_machine.go ~400 lines, smartrouter_relay_state_machine.go ~650 lines) are the heart of the resilience logic and carry significant domain-specific state transitions that will need careful decomposition.
- Regression risk: Both components are critical path for all RPC traffic. Any behavioral change in retry/timeout/failover semantics could cause outages.
- Semantic mismatch: failsafe-go assumes "retry = call the same function again." Lava's retry means "call a different provider with possibly different parameters" (archive upgrade, extension changes). This impedance mismatch requires wrapping the provider selection inside the retried function.
- Testing gap: The current resilience behavior is likely validated by integration tests and production experience, not unit tests of the resilience logic itself. Replacing it requires building a comprehensive test harness first.
- Two systems running in parallel: During migration, you'll have some patterns using failsafe-go and others using the legacy approach, increasing cognitive load.
| File | Purpose |
|---|---|
protocol/rpcconsumer/rpcconsumer_server.go |
HTTP/WebSocket handling, relay processing |
protocol/rpcconsumer/consumer_relay_state_machine.go |
Retry orchestration, selection mode logic |
protocol/lavasession/consumer_session_manager.go |
Provider pairing, session allocation, blocking |
protocol/lavasession/used_providers.go |
Provider rotation tracking |
protocol/provideroptimizer/provider_optimizer.go |
QoS scoring, weighted provider selection |
protocol/relaycore/relay_processor.go |
Response collection, consensus, error aggregation |
protocol/relaycore/relay_state.go |
Archive extension detection, retry hash caching |
protocol/common/timeout.go |
Adaptive timeout calculation |
protocol/lavaprotocol/relay_retries_manager.go |
Hash-based retry deduplication (6h TTL) |
| File | Purpose |
|---|---|
protocol/rpcsmartrouter/rpcsmartrouter_server.go |
Request handling, relay sending, health checks |
protocol/rpcsmartrouter/smartrouter_relay_state_machine.go |
Retry logic, circuit breaker, state transitions |
protocol/rpcsmartrouter/direct_ws_subscription_manager.go |
WebSocket subscriptions, rate limiting |
protocol/rpcsmartrouter/upstream_ws_pool.go |
Connection pooling, auto-scaling |
protocol/rpcsmartrouter/websocket_backoff.go |
Exponential backoff for WS reconnection |
protocol/rpcsmartrouter/error_mapper.go |
Error classification |