Integrating failsafe-go into Lava: Architecture Analysis

Executive Summary

Overall difficulty: HIGH. The lava repo already has a mature, battle-tested resilience stack deeply woven into its session management, relay state machines, and provider lifecycle. failsafe-go would provide cleaner abstractions and composability, but the current patterns carry significant domain-specific logic (blockchain reporting, epoch management, provider scoring, archive upgrades, cross-validation consensus) that failsafe-go cannot express natively. The integration is feasible but would be a multi-sprint refactoring effort with substantial regression risk.

Recommendation: Incremental, bottom-up adoption — start with the simplest, most isolated patterns (Timeout, WebSocket backoff) and work upward. Do not attempt a big-bang replacement.

Pattern-by-Pattern Breakdown

1. Retry Logic → `retrypolicy.RetryPolicy`

Current: ConsumerRelayStateMachine / SmartRouterRelayStateMachine, RelayRetriesManager, RelayProcessor.shouldRetryRelay()

Difficulty: HIGH

| Aspect | Detail |
| --- | --- |
| What it replaces | The ticker-based retry loops in both state machines, the `RelayRetryLimit` constant, `RelayRetryBackoffDuration`, and the retry decision logic in `RelayProcessor.shouldRetryRelay()` |
| What it does NOT replace | `RelayRetriesManager` (hash-based 6h deduplication cache), provider selection/rotation via `UsedProviders`, archive extension upgrade on retry #1, epoch mismatch special handling (always retry), Solana/unsupported-method abort conditions |
| How | `retrypolicy.NewBuilder[RelayResult]().WithMaxRetries(2).WithDelay(2*time.Millisecond).AbortIf(isNonRetryableError).HandleIf(isRetryableError).OnRetry(updateProviderSelection).Build()` |
| Why it's hard | The retry loop is not a simple "call → fail → retry same call." Each retry iteration involves: (a) selecting a different provider via `UsedProviders`, (b) potentially upgrading to archive extension on retry #1, (c) checking the hash dedup cache, (d) respecting different rules per selection mode (Stateless=retry, Stateful/CrossValidation=no retry). These are side-effectful state transitions between retries, not just backoff-and-replay. failsafe-go's `OnRetry` hook can handle some of this, but the provider-rotation-as-retry-mechanism is a fundamental architectural mismatch. |
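
The mismatch described above can be made concrete with a small standard-library sketch. It does not use failsafe-go; `usedProviders`, `relayWithRotation`, and the rule "add the archive extension from the first retry onward" are hypothetical stand-ins for Lava's `UsedProviders` and state-machine logic. The point is that provider selection happens inside every attempt, so a generic retry wrapper would re-run "the same function" while the target changes underneath it.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoProviders is returned when every candidate has already been tried.
var errNoProviders = errors.New("no unused providers left")

// usedProviders is a hypothetical stand-in for Lava's UsedProviders: it
// remembers which providers were already attempted for this relay.
type usedProviders struct {
	candidates []string
	tried      map[string]bool
}

// next returns the first candidate not tried yet and marks it as used.
func (u *usedProviders) next() (string, error) {
	for _, p := range u.candidates {
		if !u.tried[p] {
			u.tried[p] = true
			return p, nil
		}
	}
	return "", errNoProviders
}

// relayWithRotation makes up to maxRetries+1 attempts, picking a fresh
// provider for each one.
func relayWithRotation(u *usedProviders, maxRetries int,
	send func(provider string, archive bool) (string, error)) (string, error) {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		provider, err := u.next()
		if err != nil {
			if lastErr != nil {
				return "", fmt.Errorf("%w (last failure: %v)", err, lastErr)
			}
			return "", err
		}
		// Assumption for illustration: retries carry the archive extension.
		archive := attempt >= 1
		reply, err := send(provider, archive)
		if err == nil {
			return reply, nil
		}
		lastErr = fmt.Errorf("provider %s: %w", provider, err)
	}
	return "", lastErr
}

func main() {
	u := &usedProviders{candidates: []string{"p1", "p2", "p3"}, tried: map[string]bool{}}
	reply, err := relayWithRotation(u, 2, func(p string, archive bool) (string, error) {
		if p == "p1" {
			return "", errors.New("timeout")
		}
		return fmt.Sprintf("ok from %s (archive=%v)", p, archive), nil
	})
	fmt.Println(reply, err)
}
```

With failsafe-go, the body of the loop (select provider, decide on archive, send) would likely become the function handed to the executor, with the policy supplying only the attempt count and delay.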

2. Circuit Breaker → `circuitbreaker.CircuitBreaker`

Current: Session blocking (BlockListed), provider blocking (_blockProvider), consecutivePairingErrors (SmartRouter), MaximumNumberOfFailuresAllowedPerConsumerSession

Difficulty: HIGH

| Aspect | Detail |
| --- | --- |
| What it replaces | The consecutive error tracking and blocking decision matrix in `OnSessionFailure`, the `consecutivePairingErrors` circuit breaker in SmartRouter, session `BlockListed` flag |
| What it does NOT replace | Blockchain provider reporting (side effect of blocking), epoch-aware provider carry-over, `retrySecondChanceAfter` (3 min grace period), per-endpoint blocklisting |
| How | One `CircuitBreaker` per provider: `circuitbreaker.NewBuilder[RelayResult]().WithFailureThreshold(MaxFailures).WithDelay(3*time.Minute).OnOpen(reportToBlockchain).Build()` |
| Why it's hard | The current circuit breaking is multi-level (session → endpoint → provider → epoch) and tightly coupled to `ConsumerSessionManager`'s internal maps. A failsafe-go circuit breaker is a standalone object per "resource", so you'd need to manage a map of `CircuitBreaker` instances keyed by provider address and synchronize their state with the existing session lifecycle. The "second chance" mechanism maps conceptually to half-open state, but the recovery probe logic (checking if provider is back) is custom. The provider blocking also triggers blockchain reports, which is a side effect that goes beyond what a circuit breaker normally does. |
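
To illustrate the bookkeeping this implies, here is a minimal sketch of a registry holding one breaker per provider address, with an `onOpen` hook standing in for the blockchain report. It is plain standard-library Go rather than failsafe-go, and `registry`, `breaker`, `manualClock`, and `secondChanceAfter` are illustrative names, not Lava identifiers. The half-open state is approximated by closing the breaker after the grace period while leaving it one failure away from re-opening.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// secondChanceAfter mirrors the 3 minute grace period mentioned above.
const secondChanceAfter = 3 * time.Minute

// manualClock only moves when advanced, keeping the example deterministic.
type manualClock struct{ t time.Time }

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

type breaker struct {
	failures int // consecutive failures since the last success
	open     bool
	openedAt time.Time
}

// registry keeps one breaker per provider address behind a single mutex.
type registry struct {
	mu        sync.Mutex
	breakers  map[string]*breaker
	threshold int
	now       func() time.Time
	onOpen    func(addr string) // runs with the lock held; must not re-enter
}

func newRegistry(threshold int, now func() time.Time, onOpen func(string)) *registry {
	return &registry{breakers: map[string]*breaker{}, threshold: threshold, now: now, onOpen: onOpen}
}

func (r *registry) get(addr string) *breaker {
	b, ok := r.breakers[addr]
	if !ok {
		b = &breaker{}
		r.breakers[addr] = b
	}
	return b
}

// Allow reports whether addr may be used right now.
func (r *registry) Allow(addr string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	b := r.get(addr)
	if !b.open {
		return true
	}
	if r.now().Sub(b.openedAt) < secondChanceAfter {
		return false
	}
	// Grace period over: give a second chance, one failure from re-opening.
	b.open = false
	b.failures = r.threshold - 1
	return true
}

// Record feeds the outcome of a relay back into the provider's breaker.
func (r *registry) Record(addr string, success bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	b := r.get(addr)
	if success {
		b.failures = 0
		return
	}
	b.failures++
	if !b.open && b.failures >= r.threshold {
		b.open, b.openedAt = true, r.now()
		if r.onOpen != nil {
			r.onOpen(addr)
		}
	}
}

func main() {
	clk := &manualClock{}
	r := newRegistry(2, clk.Now, func(a string) { fmt.Println("report provider", a) })
	r.Record("lava@provider1", false)
	r.Record("lava@provider1", false)
	fmt.Println("allowed while open:", r.Allow("lava@provider1"))
	clk.Advance(secondChanceAfter)
	fmt.Println("allowed after grace:", r.Allow("lava@provider1"))
}
```

Even in this reduced form the registry has to be kept in step with epoch changes and per-endpoint state, which is where most of the integration cost would sit.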

3. Timeout → `timeout.Timeout`

Current: common/timeout.go (GetTimePerCu, LocalNodeTimePerCu), context.WithTimeout calls, three-level timeout hierarchy

Difficulty: LOW-MEDIUM

| Aspect | Detail |
| --- | --- |
| What it replaces | The `context.WithTimeout(processingCtx, processingTimeout)` calls in both servers, the per-relay timeout logic in state machines |
| What it does NOT replace | The adaptive timeout calculation (`GetTimePerCu` based on CU, hanging API flag, stateful flag), subscription first-reply timeout (10s) |
| How | `timeout.New[*RelayResult](calculatedTimeout)` composed inside a retry policy: `failsafe.With(retryPolicy, timeoutPolicy).Get(sendRelay)`. Policies compose outermost first, so the timeout applies to each attempt and a timed-out attempt is retried |
| Why it's approachable | The timeout calculation is cleanly separated from its enforcement. You keep `GetTimePerCu()` to compute the duration, then wrap it in a failsafe-go `Timeout` instead of raw `context.WithTimeout`. The main benefit: composing timeout inside retry gives you per-attempt timeouts automatically, which the current code achieves manually. |
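
As a point of reference, the manual version of "per-attempt timeout inside retry" looks roughly like the sketch below. It uses only the standard library; `timeoutFor` is a made-up stand-in for `GetTimePerCu` (the real formula is not reproduced here), and `slowFirstAttempt` simulates a provider that hangs once.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// timeoutFor is a hypothetical stand-in for GetTimePerCu: the per-attempt
// budget grows with the request's compute units.
func timeoutFor(cu uint64) time.Duration {
	return 20*time.Millisecond + time.Duration(cu)*5*time.Millisecond
}

// withAttemptTimeouts gives every attempt a fresh deadline derived from the
// parent context and retries attempts that hit that deadline.
func withAttemptTimeouts(parent context.Context, perAttempt time.Duration, maxRetries int,
	call func(ctx context.Context, attempt int) (string, error)) (string, error) {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		ctx, cancel := context.WithTimeout(parent, perAttempt)
		reply, err := call(ctx, attempt)
		cancel()
		if err == nil {
			return reply, nil
		}
		lastErr = err
		// Retry only if this attempt timed out and the caller still waits.
		if !errors.Is(err, context.DeadlineExceeded) || parent.Err() != nil {
			break
		}
	}
	return "", lastErr
}

// slowFirstAttempt hangs on the first attempt and answers immediately after.
func slowFirstAttempt(ctx context.Context, attempt int) (string, error) {
	if attempt == 0 {
		<-ctx.Done() // block until the per-attempt deadline fires
		return "", ctx.Err()
	}
	return fmt.Sprintf("reply on attempt %d", attempt), nil
}

// relay wires the pieces together for a request costing cu compute units.
func relay(cu uint64) (string, error) {
	return withAttemptTimeouts(context.Background(), timeoutFor(cu), 2, slowFirstAttempt)
}

func main() {
	fmt.Println(relay(10))
}
```

Composing a failsafe-go `Timeout` inside a `RetryPolicy` would replace `withAttemptTimeouts`, while `timeoutFor` (that is, `GetTimePerCu`) stays as the source of the duration.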

4. WebSocket Reconnection Backoff → `retrypolicy.RetryPolicy`

Current: websocket_backoff.go (SmartRouter), ExponentialBackoff struct

Difficulty: LOW

| Aspect | Detail |
| --- | --- |
| What it replaces | The entire `ExponentialBackoff` struct and `NextBackoff()`/`Reset()`/`Clone()` methods, reconnection retry loops in `upstream_ws_pool.go` |
| How | `retrypolicy.NewBuilder[ws.Conn]().WithBackoff(100*time.Millisecond, 30*time.Second).WithJitterFactor(0.3).WithMaxRetries(10).Build()` as a drop-in replacement |
| Why it's easy | This is the cleanest integration point. The WebSocket backoff is self-contained, well-isolated, and maps 1:1 to failsafe-go's retry with exponential backoff. The subscription restoration after reconnect can go in an `OnSuccess` listener. |
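
For comparison when validating the migration, the delay schedule implied by those parameters can be computed by hand. The sketch below assumes a doubling factor and a symmetric jitter of plus or minus 30 percent; `backoffDelay` and `withJitter` are illustrative helpers, not functions from Lava or failsafe-go, and the existing `ExponentialBackoff` struct may use different constants.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns the delay before retry n (0-based): the base delay
// doubled n times and capped at max.
func backoffDelay(base, max time.Duration, n int) time.Duration {
	d := base
	for i := 0; i < n && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// withJitter maps u in [0,1) onto [d*(1-factor), d*(1+factor)).
func withJitter(d time.Duration, factor, u float64) time.Duration {
	offset := (u*2 - 1) * factor * float64(d)
	return d + time.Duration(offset)
}

func main() {
	for n := 0; n < 10; n++ {
		d := backoffDelay(100*time.Millisecond, 30*time.Second, n)
		fmt.Printf("retry %d: nominal %v, jittered %v\n", n, d, withJitter(d, 0.3, rand.Float64()))
	}
}
```

Under these assumptions the nominal delays run 100ms, 200ms, 400ms and so on, reaching the 30s cap on the tenth retry.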

5. Provider Failover/Selection → No direct failsafe-go equivalent

Current: ProviderOptimizer, ConsumerSessionManager.GetSessions(), UsedProviders, QoS scoring

Difficulty: N/A (not replaceable)

| Aspect | Detail |
| --- | --- |
| Why | failsafe-go has no concept of "choose a different backend on each attempt." Provider selection is domain logic (QoS scoring, stake weighting, T-Digest percentile normalization, strategy-based selection). This must remain as-is. failsafe-go's retry can trigger re-selection, but the selection logic itself stays. |

6. Hedging/Parallel Sends → `hedgepolicy.HedgePolicy`

Current: Stateful mode (send to all), CrossValidation (send to N, check consensus)

Difficulty: MEDIUM

| Aspect | Detail |
| --- | --- |
| What it replaces | The parallel goroutine fan-out in Stateful mode that sends to all top providers and returns the first result |
| What it does NOT replace | CrossValidation mode. A hedge policy stops as soon as one attempt produces an acceptable result, whereas cross-validation has to collect replies until an agreement threshold (quorum) is met. This is fundamentally different from hedging. |
| How | For Stateful: `hedgepolicy.NewBuilderWithDelay[*RelayResult](0).WithMaxHedges(numProviders-1).Build()` for immediate parallel execution. By default the first attempt to complete, whether with a result or an error, cancels the outstanding hedges, so "first success wins" needs a cancel condition (for example `CancelIf`) that accepts only successful replies |
| Why it's medium | Stateful mode maps well to hedging. CrossValidation does not: it requires consensus logic that failsafe-go doesn't support. You'd keep CrossValidation as custom code. |
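
The difference between the two modes is easiest to see side by side. The sketch below is standard-library Go with hypothetical names (`fanOut`, `firstResult`, `quorum`): both modes share the same fan-out, but hedging stops at the first reply while cross-validation keeps counting until some value has been seen `threshold` times.

```go
package main

import (
	"fmt"
	"sync"
)

// fanOut sends to every provider concurrently and streams the replies back.
// The channel is buffered so senders never block if the reader stops early.
func fanOut(providers []string, send func(provider string) string) <-chan string {
	out := make(chan string, len(providers))
	var wg sync.WaitGroup
	for _, p := range providers {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			out <- send(p)
		}(p)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// firstResult models hedging: whichever reply arrives first wins.
func firstResult(replies <-chan string) (string, bool) {
	r, ok := <-replies
	return r, ok
}

// quorum models cross-validation: read replies until one value has been seen
// threshold times; ok is false if the replies run out first.
func quorum(replies <-chan string, threshold int) (value string, ok bool) {
	counts := map[string]int{}
	for r := range replies {
		counts[r]++
		if counts[r] >= threshold {
			return r, true
		}
	}
	return "", false
}

func main() {
	send := func(p string) string {
		if p == "faulty" {
			return "0xbad"
		}
		return "0xabc"
	}
	providers := []string{"p1", "faulty", "p2"}

	first, _ := firstResult(fanOut(providers, send))
	fmt.Println("hedging returns whichever arrived first:", first)

	agreed, ok := quorum(fanOut(providers, send), 2)
	fmt.Println("cross-validation result:", agreed, ok)
}
```

In this example hedging may return the faulty provider's reply if it happens to arrive first, whereas the quorum check cannot, which is why the two modes should not share a policy.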

7. Rate Limiting → `ratelimiter.RateLimiter`

Current: WebsocketConnectionLimiter, per-client subscription limits, ClientRateLimiter

Difficulty: LOW-MEDIUM

| Aspect | Detail |
| --- | --- |
| What it replaces | The `WebSocketRateLimit` enforcement, subscription-per-minute limits |
| What it does NOT replace | Per-IP connection tracking (stateful, needs IP-keyed map), ban duration logic |
| How | `ratelimiter.NewSmooth[any](maxRequests, time.Second)` per client or per IP |
| Why it's medium | The current rate limiting is per-IP with banning, which requires maintaining state keyed by IP. You'd need a map of `RateLimiter` instances per IP, plus custom ban logic on top. |

8. Bulkhead/Concurrency → `bulkhead.Bulkhead`

Current: MaxCallsPerRelay = 50, max-concurrent-providers flag

Difficulty: LOW

| Aspect | Detail |
| --- | --- |
| What it replaces | The implicit concurrency limits on parallel provider calls |
| How | `bulkhead.New[*RelayResult](maxConcurrentProviders)` |
| Why it's easy | Simple concurrency cap, maps directly. |
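
Conceptually a bulkhead is a counting semaphore. The following sketch shows the idea with a buffered channel; `bulkhead` and `TryRun` here are illustrative local names, and the reject-when-full behaviour is a choice made for this example.

```go
package main

import "fmt"

// bulkhead caps concurrent executions using a buffered channel as semaphore.
type bulkhead struct{ slots chan struct{} }

func newBulkhead(max int) *bulkhead {
	return &bulkhead{slots: make(chan struct{}, max)}
}

// TryRun runs fn if a slot is free and reports whether it ran. When the
// bulkhead is full the call is rejected instead of being queued.
func (b *bulkhead) TryRun(fn func()) bool {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }() // release the slot when fn returns
		fn()
		return true
	default:
		return false
	}
}

func main() {
	b := newBulkhead(1)
	b.TryRun(func() {
		// While the only slot is taken, a second call is turned away.
		fmt.Println("nested call ran:", b.TryRun(func() {}))
	})
	fmt.Println("call after release ran:", b.TryRun(func() {}))
}
```

Running it prints `false` for the nested call and `true` once the slot has been released.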

9. Fallback → `fallback.Fallback`

Current: Cache lookup before relay, backup provider tier, archive extension upgrade

Difficulty: LOW-MEDIUM

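| Aspect | Detail |
| --- | --- |
| What it replaces | The cache-hit-returns-early pattern, backup provider fallback |
| How | `fallback.NewWithFunc(func(exec failsafe.Execution[RelayResult]) (RelayResult, error) { return cache.Lookup(key) })` |
| Why | Cache fallback is clean, with one caveat: a `Fallback` runs only after the wrapped execution fails, so the expression above serves a cached value when the relay fails. To express cache-before-relay, the cache lookup has to be the primary execution and the relay the fallback. Backup provider tier is more complex (requires session manager interaction). |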
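
Because a fallback only runs after the primary execution fails, the order of the two functions decides which pattern you get. A small standard-library sketch, with `withFallback`, `cacheLookup`, and `errCacheMiss` as hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
)

var errCacheMiss = errors.New("cache miss")

// fetch is the shape shared by the cache lookup and the relay call.
type fetch func(key string) (string, error)

// withFallback runs primary and consults fallback only if primary fails.
func withFallback(primary, fallback fetch) fetch {
	return func(key string) (string, error) {
		if v, err := primary(key); err == nil {
			return v, nil
		}
		return fallback(key)
	}
}

// cacheLookup treats a missing key as an error so it can act as a primary.
func cacheLookup(cache map[string]string) fetch {
	return func(key string) (string, error) {
		if v, ok := cache[key]; ok {
			return v, nil
		}
		return "", errCacheMiss
	}
}

func main() {
	cache := map[string]string{"block:1": "cached reply"}
	relay := func(key string) (string, error) { return "fresh reply for " + key, nil }
	failingRelay := func(key string) (string, error) { return "", errors.New("all providers failed") }

	// Cache-before-relay: the lookup is primary, the relay is the fallback.
	cacheFirst := withFallback(cacheLookup(cache), relay)
	fmt.Println(cacheFirst("block:1"))
	fmt.Println(cacheFirst("block:2"))

	// Stale-on-error: the relay is primary, the cache is the fallback.
	staleOnError := withFallback(failingRelay, cacheLookup(cache))
	fmt.Println(staleOnError("block:1"))
}
```

Which of the two arrangements matches the current cache behaviour would need to be confirmed against the relay path before picking one.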

Difficulty Summary

| Pattern | failsafe-go Policy | Difficulty | Risk |
| --- | --- | --- | --- |
| WebSocket backoff | RetryPolicy | LOW | Low |
| Timeout enforcement | Timeout | LOW-MEDIUM | Low |
| Bulkhead | Bulkhead | LOW | Low |
| Cache fallback | Fallback | LOW-MEDIUM | Low |
| Rate limiting | RateLimiter | LOW-MEDIUM | Medium |
| Stateful hedging | HedgePolicy | MEDIUM | Medium |
| Relay retry logic | RetryPolicy | HIGH | High |
| Circuit breaker | CircuitBreaker | HIGH | High |
| Provider selection | N/A | N/A | N/A |
| CrossValidation | N/A | N/A | N/A |

Recommended Adoption Strategy

Phase 1 (Low risk, high value)

- Replace `ExponentialBackoff` in SmartRouter WebSocket with failsafe-go `RetryPolicy`
- Replace raw `context.WithTimeout` with failsafe-go `Timeout` (composability benefit)
- Add `Bulkhead` for concurrent provider calls

Phase 2 (Medium risk)

- Introduce failsafe-go `Fallback` for cache-miss-to-relay pattern
- Replace Stateful mode parallel fan-out with `HedgePolicy`
- Add `RateLimiter` per-client (keep per-IP custom logic)

Phase 3 (High risk, requires careful refactoring)

- Refactor relay state machines to use failsafe-go `RetryPolicy` with `OnRetry` hooks for provider rotation
- Introduce per-provider `CircuitBreaker` instances managed by `ConsumerSessionManager`
- Compose full policy stacks: Fallback → Retry → CircuitBreaker → Timeout

Phase 3 is where the real complexity lives — the relay state machines (consumer_relay_state_machine.go ~400 lines, smartrouter_relay_state_machine.go ~650 lines) are the heart of the resilience logic and carry significant domain-specific state transitions that will need careful decomposition.
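
The reading order of a stack such as Fallback → Retry → CircuitBreaker → Timeout matters: the first policy is the outermost wrapper and the last one sits closest to the call. The sketch below demonstrates that nesting with plain function wrappers; `policy`, `compose`, and `traced` are illustrative names, and the "policies" only log when they are entered and left.

```go
package main

import (
	"fmt"
	"strings"
)

type call func() error

// policy wraps a call with additional behaviour.
type policy func(next call) call

// compose wraps inner with the given policies; the first policy listed ends
// up outermost and the last one sits directly around inner.
func compose(inner call, policies ...policy) call {
	for i := len(policies) - 1; i >= 0; i-- {
		inner = policies[i](inner)
	}
	return inner
}

// traced returns a policy that records when it is entered and left.
func traced(name string, log *[]string) policy {
	return func(next call) call {
		return func() error {
			*log = append(*log, "enter "+name)
			err := next()
			*log = append(*log, "leave "+name)
			return err
		}
	}
}

func main() {
	var log []string
	run := compose(
		func() error { log = append(log, "send relay"); return nil },
		traced("fallback", &log),
		traced("retry", &log),
		traced("circuitbreaker", &log),
		traced("timeout", &log),
	)
	if err := run(); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(strings.Join(log, " > "))
}
```

In this arrangement the timeout bounds each individual attempt, the circuit breaker sees every attempt's outcome, and the fallback only fires once retries are exhausted.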

Key Risks

- Regression risk: Both components are critical path for all RPC traffic. Any behavioral change in retry/timeout/failover semantics could cause outages.
- Semantic mismatch: failsafe-go assumes "retry = call the same function again." Lava's retry means "call a different provider with possibly different parameters" (archive upgrade, extension changes). This impedance mismatch requires wrapping the provider selection inside the retried function.
- Testing gap: The current resilience behavior is likely validated by integration tests and production experience, not unit tests of the resilience logic itself. Replacing it requires building a comprehensive test harness first.
- Two systems running in parallel: During migration, you'll have some patterns using failsafe-go and others using the legacy approach, increasing cognitive load.

Key Files Reference

rpcconsumer

| File | Purpose |
| --- | --- |
| `protocol/rpcconsumer/rpcconsumer_server.go` | HTTP/WebSocket handling, relay processing |
| `protocol/rpcconsumer/consumer_relay_state_machine.go` | Retry orchestration, selection mode logic |
| `protocol/lavasession/consumer_session_manager.go` | Provider pairing, session allocation, blocking |
| `protocol/lavasession/used_providers.go` | Provider rotation tracking |
| `protocol/provideroptimizer/provider_optimizer.go` | QoS scoring, weighted provider selection |
| `protocol/relaycore/relay_processor.go` | Response collection, consensus, error aggregation |
| `protocol/relaycore/relay_state.go` | Archive extension detection, retry hash caching |
| `protocol/common/timeout.go` | Adaptive timeout calculation |
| `protocol/lavaprotocol/relay_retries_manager.go` | Hash-based retry deduplication (6h TTL) |

rpcsmartrouter

| File | Purpose |
| --- | --- |
| `protocol/rpcsmartrouter/rpcsmartrouter_server.go` | Request handling, relay sending, health checks |
| `protocol/rpcsmartrouter/smartrouter_relay_state_machine.go` | Retry logic, circuit breaker, state transitions |
| `protocol/rpcsmartrouter/direct_ws_subscription_manager.go` | WebSocket subscriptions, rate limiting |
| `protocol/rpcsmartrouter/upstream_ws_pool.go` | Connection pooling, auto-scaling |
| `protocol/rpcsmartrouter/websocket_backoff.go` | Exponential backoff for WS reconnection |
| `protocol/rpcsmartrouter/error_mapper.go` | Error classification |
