Skip to content

feat(signer): add health monitoring and automatic failover for signer pool#399

Open
raushan728 wants to merge 3 commits intosolana-foundation:mainfrom
raushan728:feat/signer-health-monitoring
Open

feat(signer): add health monitoring and automatic failover for signer pool#399
raushan728 wants to merge 3 commits intosolana-foundation:mainfrom
raushan728:feat/signer-health-monitoring

Conversation

@raushan728
Copy link
Copy Markdown
Contributor

@raushan728 raushan728 commented Mar 23, 2026

Remote signers communicate over HTTP and can fail intermittently. Previously,
SignerPool had no mechanism to detect degraded signers — it would keep routing
requests to a failing signer indefinitely.

This PR adds per-signer health tracking with automatic failover:

  • After 3 consecutive failures, a signer is marked unhealthy and excluded from selection
  • Unhealthy signers get a recovery probe chance after 30 seconds
  • Pinned signers (get_signer_by_pubkey) now also respect the recovery probe (fail only if cooldown not elapsed)
  • Health is automatically restored on successful signing (or after successful probe)

Open with Devin

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR adds per-signer health tracking to SignerPool so that remote signers experiencing repeated failures are automatically excluded from selection. After 3 consecutive signing failures a signer's is_healthy flag is cleared; health is restored on the next successful sign. The changes touch three files: pool.rs (the core logic), versioned_transaction.rs (the telemetry hook), and error.rs (a new error variant).

Key observations:

  • Health bypass via get_signer_by_pubkey — the client-pinned signer path (signer_key parameter) calls get_signer_by_pubkey, which never checks is_healthy. An unhealthy signer can still be returned and used on this path, defeating the feature's goal for pinned clients.
  • Race condition in record_failure — the health state update is a non-atomic two-step (fetch_add then a separate store). A concurrent record_success can write is_healthy = true in between, only to be overwritten by the trailing store(false), incorrectly leaving a recovered signer marked unhealthy.
  • KoraError::NoHealthySigners is dead code — the variant is added in error.rs but never returned; healthy_signers() silently falls back to the full pool instead of surfacing it.
  • No time-based recovery — once unhealthy a signer can only recover through the all-unhealthy fallback path; if one signer stays healthy, unhealthy ones are permanently shadowed with no probing mechanism.

Confidence Score: 3/5

  • Two concrete bugs remain: unhealthy signers are still reachable via the pubkey-pinned path, and a TOCTOU race can incorrectly re-mark a recovered signer as unhealthy.
  • The primary selection paths (round-robin, random, weighted) are correctly gated behind healthy_signers() and the tests cover those cases well. However, get_signer_by_pubkey — the code path for client-pinned signer consistency — never checks is_healthy, meaning a signer that has been excluded for 3+ failures can still be returned on that path. Combined with the TOCTOU race in record_failure (the separate fetch_add + is_healthy.store can be interleaved with a concurrent record_success), there are two P1 correctness issues that need targeted fixes before the health-monitoring guarantee can be considered reliable.
  • Pay close attention to crates/lib/src/signer/pool.rs — specifically get_signer_by_pubkey (missing health check) and record_failure (non-atomic state transition).

Important Files Changed

Filename Overview
crates/lib/src/signer/pool.rs Core of the PR — adds consecutive_failures/is_healthy atomics to SignerWithMetadata, health-aware selection via healthy_signers(), and record_signing_success/failure public API. Two bugs: get_signer_by_pubkey bypasses health filtering (P1), and a TOCTOU race in record_failure between the fetch_add and is_healthy.store (P1). The total_weight field removal and refactoring to *_from selection helpers are clean.
crates/lib/src/error.rs Adds NoHealthySigners(String) error variant, but this variant is never returned anywhere in the PR — healthy_signers() falls back to the full pool rather than returning this error. Dead code.
crates/lib/src/transaction/versioned_transaction.rs Wires up health telemetry: wraps sign_message in a match and calls pool.record_signing_success/failure via the global state accessor. Logic is correct and non-intrusive; the if let Ok guards ensure signing itself still works even if the pool is not initialised.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant versioned_transaction
    participant SignerPool
    participant SignerWithMetadata
    participant Signer

    Caller->>versioned_transaction: sign_transaction(signer)
    versioned_transaction->>Signer: sign_message(message_bytes)

    alt Signing succeeds
        Signer-->>versioned_transaction: Ok(signature)
        versioned_transaction->>SignerPool: record_signing_success(signer)
        SignerPool->>SignerWithMetadata: record_success()
        Note over SignerWithMetadata: consecutive_failures = 0\nis_healthy = true
        versioned_transaction-->>Caller: Ok(transaction, encoded)
    else Signing fails
        Signer-->>versioned_transaction: Err(e)
        versioned_transaction->>SignerPool: record_signing_failure(signer)
        SignerPool->>SignerWithMetadata: record_failure()
        Note over SignerWithMetadata: consecutive_failures += 1\nif >= 3: is_healthy = false
        versioned_transaction-->>Caller: Err(SigningError)
    end

    Caller->>SignerPool: get_next_signer()
    SignerPool->>SignerPool: healthy_signers()
    Note over SignerPool: Filters out is_healthy=false\nFallback to full pool if all unhealthy
    SignerPool-->>Caller: Arc<Signer>
Loading

Comments Outside Diff (1)

  1. crates/lib/src/signer/pool.rs, line 286-300 (link)

    P1 get_signer_by_pubkey bypasses health filtering

    get_signer_by_pubkey is called from get_request_signer_with_signer_key when a client pins to a specific signer via a signer_key. That path entirely skips the healthy_signers() filter, so a signer that has been marked unhealthy (3+ consecutive failures) will still be returned and used, directly defeating the health-monitoring goal of this PR for the client-consistency code path.

    After resolving target_pubkey and finding signer_meta, add a health check before returning — e.g. return Err(KoraError::NoHealthySigners(...)) if !signer_meta.is_healthy(). This also gives the newly-added NoHealthySigners variant its first real use site.

Reviews (1): Last reviewed commit: "feat(signer): add health monitoring and ..." | Re-trigger Greptile

greptile-apps[bot]

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

devin-ai-integration[bot]

This comment was marked as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant