A gossip-based high availability (HA) manager for Solana validators. This tool helps automate unexpected failovers due to <insert one of endless reasons>. To automate planned failovers, see solana-validator-failover.
validator-1 (active) loses network connectivity. A passive peer detects the leaderless cluster and takes over automatically.
[Demo: passive node (validator-2) monitoring the cluster and promoting itself to active]
[Demo: active node (validator-1) detecting it has dropped from gossip and stepping down]
solana-validator-ha provides a simple, low-dependency HA solution for running 2 or more Solana validators together, where one is active (voting) and the rest are passive (non-voting). All peers share the same active keypair identity and each has its own unique passive keypair identity.
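To illustrate the identity scheme, the Go sketch below classifies configured peers from a getClusterNodes result: a peer whose gossip entry carries the shared active pubkey is treated as active, and any other entry at a peer's IP as passive. This is not the project's implementation. The names peerRoles, clusterNode and sampleResult, the placeholder pubkeys, and the "absent" label are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// clusterNode holds the two getClusterNodes fields this sketch needs.
type clusterNode struct {
	Pubkey string  `json:"pubkey"`
	Gossip *string `json:"gossip"` // "ip:port"; may be null
}

// sampleResult is a made-up "result" array with placeholder pubkeys.
const sampleResult = `[
  {"pubkey": "ActivePubkeyPlaceholder", "gossip": "192.168.1.11:8001"},
  {"pubkey": "PassivePubkeyPlaceholder", "gossip": "192.168.1.12:8001"},
  {"pubkey": "UnrelatedValidator", "gossip": null}
]`

// peerRoles maps each configured peer name to "active", "passive" or "absent".
// peers maps a vanity name to that peer's IPv4 address (hypothetical shape).
func peerRoles(result []byte, activePubkey string, peers map[string]string) (map[string]string, error) {
	var nodes []clusterNode
	if err := json.Unmarshal(result, &nodes); err != nil {
		return nil, fmt.Errorf("decode getClusterNodes result: %w", err)
	}
	roleByIP := make(map[string]string)
	for _, n := range nodes {
		if n.Gossip == nil {
			continue // node has no gossip address
		}
		host, _, err := net.SplitHostPort(*n.Gossip)
		if err != nil {
			continue // skip malformed addresses
		}
		if n.Pubkey == activePubkey {
			roleByIP[host] = "active"
		} else if roleByIP[host] == "" {
			roleByIP[host] = "passive"
		}
	}
	roles := make(map[string]string, len(peers))
	for name, ip := range peers {
		role, ok := roleByIP[ip]
		if !ok {
			role = "absent" // peer not visible in gossip
		}
		roles[name] = role
	}
	return roles, nil
}

func main() {
	peers := map[string]string{
		"backup-validator-1": "192.168.1.11",
		"backup-validator-2": "192.168.1.12",
	}
	roles, err := peerRoles([]byte(sampleResult), "ActivePubkeyPlaceholder", peers)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for name, role := range roles {
		fmt.Printf("%s: %s\n", name, role)
	}
}
```

Because every peer can take the same active identity, the pubkey alone says which role a node holds, while the IP says which peer it is.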
Each peer runs solana-validator-ha independently. It monitors the Solana gossip network to detect whether any peer is currently active and voting. When no active peer has been seen for a configurable number of consecutive samples (the leaderless threshold), a failover is triggered. Each peer makes this decision independently using the same gossip data, with a rank-based delay to prevent multiple peers from racing to become active simultaneously.
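The decision loop described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: the names leaderlessTracker and rankDelay, the numeric ordering of IPv4 addresses, and the fixed delay step per rank are all hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
	"sort"
	"time"
)

// leaderlessTracker counts consecutive gossip samples with no active peer.
type leaderlessTracker struct {
	threshold   int // number of consecutive leaderless samples required
	consecutive int
}

// observe records one gossip sample and reports whether the leaderless
// threshold has been reached. Seeing an active peer resets the count.
func (t *leaderlessTracker) observe(activeSeen bool) bool {
	if activeSeen {
		t.consecutive = 0
		return false
	}
	t.consecutive++
	return t.consecutive >= t.threshold
}

// rankDelay sorts the candidate IPv4 addresses numerically and returns a
// delay proportional to the position of self in that order, so that every
// peer derives the same ranking from the same data.
func rankDelay(self string, candidates []string, step time.Duration) (time.Duration, error) {
	ips := make([]net.IP, 0, len(candidates))
	for _, c := range candidates {
		ip := net.ParseIP(c).To4()
		if ip == nil {
			return 0, fmt.Errorf("invalid IPv4 address %q", c)
		}
		ips = append(ips, ip)
	}
	sort.Slice(ips, func(i, j int) bool { return bytes.Compare(ips[i], ips[j]) < 0 })
	selfIP := net.ParseIP(self).To4()
	for rank, ip := range ips {
		if ip.Equal(selfIP) {
			return time.Duration(rank) * step, nil
		}
	}
	return 0, fmt.Errorf("self IP %s is not among the candidates", self)
}

func main() {
	t := &leaderlessTracker{threshold: 3}
	for i, activeSeen := range []bool{true, false, false, false} {
		fmt.Printf("sample %d: leaderless=%v\n", i+1, t.observe(activeSeen))
	}
	candidates := []string{"192.168.1.12", "192.168.1.11"}
	delay, err := rankDelay("192.168.1.12", candidates, 2*time.Second)
	fmt.Println(delay, err)
}
```

The lowest-ranked peer waits zero time and the others wait progressively longer, which gives the first peer a chance to appear as active in gossip before the next one acts.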
A node will only become active in a failover if:
- It appears in gossip (the validator process is running and reachable on the network);
- Its local RPC reports healthy; and
- It has been continuously healthy for at least failover.self_healthy.minimum_duration (guards against startup health flaps).
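The three conditions above can be sketched in Go as a healthy-streak tracker plus an eligibility check. This is an illustrative sketch only: healthStreak, eligible and simulate are hypothetical names, and the sampling interval used in the demo is arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// healthStreak tracks how long the local RPC has been continuously healthy.
type healthStreak struct {
	healthy bool
	since   time.Time // start of the current healthy streak
}

// record updates the streak with one health sample taken at time now.
func (s *healthStreak) record(healthy bool, now time.Time) {
	if healthy && !s.healthy {
		s.since = now // a new streak starts here
	}
	s.healthy = healthy
}

// length returns the duration of the current healthy streak (0 if unhealthy).
func (s *healthStreak) length(now time.Time) time.Duration {
	if !s.healthy {
		return 0
	}
	return now.Sub(s.since)
}

// eligible combines the three conditions: visible in gossip, currently
// healthy, and healthy for at least the configured minimum duration.
func eligible(inGossip bool, s *healthStreak, now time.Time, minimum time.Duration) bool {
	return inGossip && s.healthy && s.length(now) >= minimum
}

// simulate feeds one health sample per interval into a fresh streak and
// returns the eligibility after each sample (the node is assumed in gossip).
func simulate(samples []bool, interval, minimum time.Duration) []bool {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	s := &healthStreak{}
	out := make([]bool, 0, len(samples))
	for i, healthy := range samples {
		now := start.Add(time.Duration(i) * interval)
		s.record(healthy, now)
		out = append(out, eligible(true, s, now, minimum))
	}
	return out
}

func main() {
	// Healthy at startup, a flap to unhealthy, then a sustained healthy run.
	samples := []bool{true, false, true, true, true, true}
	fmt.Println(simulate(samples, 10*time.Second, 30*time.Second))
}
```

The flap in the second sample resets the streak, so the node only becomes eligible once 30 s of uninterrupted health have accumulated after it.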
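To make this work, two commands must be configured: an active command and a passive command.
The active command (failover.active.command) is called when this node should assume the active role. See example-scripts/ha-set-role.sh for inspiration.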
failover:
  active:
    command: "set-identity-with-rollback.sh"
    args: [
      "--active-identity-file", "{{ .ActiveIdentityKeypairFile }}",
      "--passive-identity-file", "{{ .PassiveIdentityKeypairFile }}",
    ]

The passive command (failover.passive.command) is called when this node should assume a passive (non-voting) role, a.k.a. Seppuku. This command must be idempotent: it may be called any time the node detects it should not be active (e.g. when it drops out of gossip). The safest pattern is to configure validators to always start with a passive identity, so this command can simply restart the validator service and wait for it to come back passive. See example-scripts/ha-set-role.sh for inspiration.
failover:
  passive:
    # ⚠️ This must make absolutely sure the validator goes passive.
    # ⚠️ If set-identity fails, restart/stop the service, pull the plug,
    # ⚠️ or call your mum crying for help. Do WHATEVER is necessary.
    command: "seppuku.sh"
    args: [
      "--passive-identity-file", "{{ .PassiveIdentityKeypairFile }}",
    ]

Note: post-passive hooks only run if the passive command succeeds, as a safeguard against false positives.
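The {{ ... }} placeholders in the command arguments above are Go template strings. The sketch below shows one way such arguments could be rendered with the standard text/template package. The field names match the template variables used in the configuration, but the templateData struct and the renderArgs helper are assumptions for illustration, not the tool's internals.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// templateData carries the values exposed to command, args and env templates.
type templateData struct {
	ActiveIdentityKeypairFile  string
	PassiveIdentityKeypairFile string
	ActiveIdentityPubkey       string
	PassiveIdentityPubkey      string
	SelfName                   string
}

// renderArgs renders every argument as a Go template against data.
// Referencing a field that does not exist on templateData is an error.
func renderArgs(args []string, data templateData) ([]string, error) {
	out := make([]string, 0, len(args))
	for _, a := range args {
		tmpl, err := template.New("arg").Parse(a)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", a, err)
		}
		var sb strings.Builder
		if err := tmpl.Execute(&sb, data); err != nil {
			return nil, fmt.Errorf("render %q: %w", a, err)
		}
		out = append(out, sb.String())
	}
	return out, nil
}

func main() {
	data := templateData{
		ActiveIdentityKeypairFile:  "/path/to/active-identity.json",
		PassiveIdentityKeypairFile: "/path/to/passive-identity.json",
		SelfName:                   "primary-validator",
	}
	args, err := renderArgs([]string{
		"--passive-identity-file", "{{ .PassiveIdentityKeypairFile }}",
	}, data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(args, " "))
}
```

Rendering each argument separately keeps the argument boundaries intact, so a path containing spaces still reaches the command as a single argument.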
- 🔍 Intelligent Peer Detection: Automatically detects validator roles based on network gossip and RPC identity
- 🛡️ Startup Health Protection: Requires a configurable minimum continuous healthy streak before a node can become a failover candidate
- 🪝 Hooks: Pre/Post failover hook support for role transitions
- 📊 Prometheus Metrics: Rich metrics collection for monitoring and alerting
- 🏁 First-Responder Failover: Race-based failover with IP-rank delay so the fastest eligible passive validator assumes the active role
Download and install the latest release binary for your system.
- Clone the repository:
    git clone https://github.com/sol-strategies/solana-validator-ha.git
    cd solana-validator-ha
- Build the application:
    make build
    # or manually:
    go build -o bin/solana-validator-ha ./cmd/solana-validator-ha
- Copy the binary to where you need it:
    cp ./bin/solana-validator-ha /usr/local/bin/solana-validator-ha
The application uses a YAML configuration file with the following root sections:
log:
  # required: false | default: info
  # Minimum log level. One of: debug, info, warn, error, fatal
  level: info
  # required: false | default: text
  # Log format. One of: text, logfmt, json
  format: text

validator:
  # required: true
  # Vanity name for this validator, used in logging and metrics
  name: "primary-validator"
  # required: false | default: http://localhost:8899
  # Local RPC URL for querying health and identity status
  rpc_url: "http://localhost:8899"
  # required: false | default: see internal/config/validator.go
  # List of URLs used to determine this node's public IPv4 address.
  # Each URL should return the IP as a plain string on the first line of the response.
  public_ip_service_urls: []
  identities:
    # required: true (or set active_pubkey)
    # Path to the active keypair file, shared across all HA peers.
    # Takes precedence over active_pubkey if both are set.
    active: "/path/to/active-identity.json"
    # required: true (or set active)
    # Base58-encoded active pubkey. Used when the keypair file is not available on this node.
    active_pubkey: 111111ActivePubkey1111111111111111111111111
    # required: true (or set passive_pubkey)
    # Path to the passive keypair file, unique per peer.
    # Takes precedence over passive_pubkey if both are set.
    passive: "/path/to/passive-identity.json"
    # required: true (or set passive)
    # Base58-encoded passive pubkey. Used when the keypair file is not available on this node.
    passive_pubkey: 111111PassivePubkey1111111111111111111111111

prometheus:
  # required: false | default: 9090
  # Port to serve Prometheus metrics on /metrics
  port: 9090
  # required: false | default: 9091
  # Port to serve the health check endpoint on /health
  health_check_port: 9091
  # required: false
  # Static key:value labels attached to all exposed metrics
  static_labels:
    brand: ha-validators
    cluster: mainnet-beta
    region: ha-region-1

cluster:
  # required: true
  # Solana cluster this validator is running on. One of: mainnet-beta, devnet, testnet
  name: "mainnet-beta"
  # required: false | default: cluster default RPC URL for cluster.name
  # RPC URLs used to query gossip state. Supplying multiple URLs provides resilience
  # against individual RPC drop-outs. URLs that return 403/429/503 are automatically
  # deprioritised for 60 s before being retried.
  # The local validator RPC URL may be included here (logs a warning) and acts as a
  # rate-limit-immune fallback. For sub-second peer detection add it alongside at least
  # one remote URL and configure HA peers as mutual --entrypoint flags in agave.
  # See "Using Local RPC as a Failsafe Fallback" below for details.
  rpc_urls: []

Public RPC endpoints can rate-limit or block requests (403 Forbidden, 429 Too Many Requests), which prevents getClusterNodes from returning gossip data and causes log noise like:
ERRO [gossip_state]: failed to get cluster nodes
error= method call failed on all RPC endpoints method: GetClusterNodes,
attempted_urls: [https://api.mainnet-beta.solana.com],
errors: [403 "Access forbidden"]
Adding the local validator RPC alongside your remote URLs provides a fallback that is always available and never rate-limited:
cluster:
  name: "mainnet-beta"
  rpc_urls:
    - "https://api.mainnet-beta.solana.com"  # remote: may rate-limit
    - "http://localhost:8899"                # local: immune to rate limits

A warning is logged at startup to remind you of the trade-off:
WARN config: cluster.rpc_urls contains the local validator RPC URL (http://localhost:8899)
— ensure HA peers are configured as mutual --entrypoint flags for direct gossip,
otherwise peer state may be stale
When each HA peer is listed as an --entrypoint in the other peers' agave config, they exchange gossip directly over UDP (CRDS). The local RPC then has sub-second fresh data for every peer — better than remote RPCs that see gossip after network propagation.
# On validator-1 (185.26.11.91)
--entrypoint validator-2.example.com:8001
# On validator-2
--entrypoint validator-1.example.com:8001
Without mutual --entrypoint flags, the local validator learns about peers through the wider gossip network. Peer data may be 30–60 s stale. This is still better than a complete getClusterNodes failure: the system keeps working with slightly older data rather than losing all visibility. The multi-URL retry logic means the remote URLs are always tried first; the local RPC is only reached if all remote calls fail.
When any URL (local or remote) returns a 403, 429, or 503 response, solana-validator-ha automatically deprioritises that URL for 60 s and logs a warning:
WARN [rpc_client]: RPC endpoint rate-limited or access forbidden, cooling down
method=GetClusterNodes url=https://api.mainnet-beta.solana.com cooldown=1m0s
During the cooldown the URL is still retried as a last resort, but healthy URLs are always attempted first.
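The cooldown behaviour described above can be sketched in Go as an endpoint pool that moves cooling-down URLs to the back of the attempt order instead of dropping them. This is a sketch under assumptions, not the project's rpc_client: endpointPool, reportStatus and ordered are hypothetical names, and the injectable clock exists only so the demo can advance time.

```go
package main

import (
	"fmt"
	"time"
)

// endpointPool keeps RPC URLs in configured order and tracks cooldowns.
type endpointPool struct {
	urls     []string
	cooldown time.Duration
	until    map[string]time.Time // URL -> end of its cooldown
	now      func() time.Time     // injectable clock; defaults to time.Now
}

func newEndpointPool(cooldown time.Duration, urls ...string) *endpointPool {
	return &endpointPool{
		urls:     urls,
		cooldown: cooldown,
		until:    make(map[string]time.Time),
		now:      time.Now,
	}
}

// reportStatus starts a cooldown when the HTTP status is 403, 429 or 503.
func (p *endpointPool) reportStatus(url string, status int) {
	switch status {
	case 403, 429, 503:
		p.until[url] = p.now().Add(p.cooldown)
	}
}

// ordered returns the URLs to try: healthy ones first in configured order,
// then cooling-down ones as a last resort.
func (p *endpointPool) ordered() []string {
	now := p.now()
	var ready, cooling []string
	for _, u := range p.urls {
		if now.Before(p.until[u]) {
			cooling = append(cooling, u)
		} else {
			ready = append(ready, u)
		}
	}
	return append(ready, cooling...)
}

func main() {
	clock := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	pool := newEndpointPool(60*time.Second,
		"https://api.mainnet-beta.solana.com", "http://localhost:8899")
	pool.now = func() time.Time { return clock } // fake clock for the demo

	pool.reportStatus("https://api.mainnet-beta.solana.com", 429)
	fmt.Println(pool.ordered()) // the rate-limited URL is tried last

	clock = clock.Add(61 * time.Second)
	fmt.Println(pool.ordered()) // cooldown over: configured order is restored
}
```

Keeping cooled-down URLs in the list means a call can still succeed when every endpoint has recently been rate-limited.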
See example-scripts/ha-set-role.sh for an example failover script.
failover:
  # required: false | default: false
  # When true, log failover commands instead of running them. Useful for testing config.
  dry_run: false
  # required: false | default: 5s
  # How often to refresh gossip state and evaluate failover decisions.
  poll_interval_duration: 5s
  # required: false | default: 3 (~15s at the default 5s poll interval)
  # How many consecutive gossip samples without an active, non-delinquent voting peer
  # before the cluster is considered leaderless and a failover is triggered.
  leaderless_samples_threshold: 3
  # Overrides the number of slots a peer must be behind the tip to be considered delinquent.
  # ⚠️ Set with caution: too low a value causes false positives on transient hiccups.
  # The Agave default is 128 slots (~51s), defined as DELINQUENT_VALIDATOR_SLOT_DISTANCE:
  #   https://github.com/anza-xyz/agave/blob/master/rpc-client-types/src/request.rs
  # Since Agave v2.0, --health-check-slot-distance also defaults to 128 via the same constant:
  #   https://github.com/anza-xyz/agave/blob/master/validator/src/commands/run/args/json_rpc_config.rs
  # Both thresholds agree, so there is no gap between delinquency detection and health status.
  # If you set value below 128, add --health-check-slot-distance <value> to your validator
  # startup flags to keep the thresholds aligned; a startup warning will remind you if not.
  # Values <= 1 are clamped to 2 on startup.
  delinquent_slot_distance_override:
    # required: false | default: false
    enabled: false
    # required: false | default: 128
    # Slots behind the tip at which a peer is considered delinquent (when enabled: true).
    value: 128
  # Guards against startup health flaps: a validator that briefly reports healthy during
  # startup before falling behind and going unhealthy again.
  self_healthy:
    # required: false | default: 30s
    # How long the local validator RPC must continuously report healthy before this
    # node is eligible to become active in a failover.
    minimum_duration: 30s
    # required: false | default: 2s
    # How often to sample local RPC health. Runs independently of poll_interval_duration
    # so the healthy streak timer is not skewed by gossip refresh latency.
    poll_interval_duration: 2s
  # required: true | min: 1
  # Map of HA peers, excluding this node (it is added automatically at startup).
  # Keys are vanity names used in logging and metrics. IPs must be valid, unique IPv4 addresses.
  peers:
    backup-validator-1:
      ip: 192.168.1.11
    backup-validator-2:
      ip: 192.168.1.12
  # required: true
  # Commands and hooks to run when this node should become active.
  # command, args, and env values support Go template strings:
  #   {{ .ActiveIdentityKeypairFile }}: absolute path to validator.identities.active
  #   {{ .PassiveIdentityKeypairFile }}: absolute path to validator.identities.passive
  #   {{ .ActiveIdentityPubkey }}: active pubkey string
  #   {{ .PassiveIdentityPubkey }}: passive pubkey string
  #   {{ .SelfName }}: value of validator.name
  active:
    # required: true
    command: set-identity-with-rollback.sh
    # required: false
    env:
      CUSTOM_ENV_VAR: "{{ .ActiveIdentityPubkey }}"
    # required: false
    args: [
      "active",
      "--active-identity-file", "{{ .ActiveIdentityKeypairFile }}",
      "--passive-identity-file", "{{ .PassiveIdentityKeypairFile }}",
    ]
    # required: false
    # Hooks run in declaration order. A pre-hook with must_succeed: true aborts
    # subsequent hooks and skips the active command if it fails.
    hooks:
      pre:
        - name: notify-slack-promoting
          command: /home/solana/solana-validator-ha/hooks/pre-active/send-slack-alert.sh
          must_succeed: false
          env: {}
          args: [
            "--channel", "#save-my-bacon",
            "--message", "solana-validator-ha promoting {{ .SelfName }} to active ({{ .PassiveIdentityPubkey }} -> {{ .ActiveIdentityPubkey }})"
          ]
      post:
        - name: notify-slack-promoted
          command: /home/solana/solana-validator-ha/hooks/post-active/send-slack-alert.sh
          env: {}
          args: [
            "--channel", "#saved-my-bacon",
            "--message", "solana-validator-ha promoted {{ .SelfName }} to active with identity {{ .ActiveIdentityPubkey }}"
          ]
  # required: true
  # Commands and hooks to run when this node should become passive.
  # Supports the same template variables as active above.
  # This command must be idempotent: it may be called multiple times in succession.
  # post-passive hooks only run if the passive command succeeds.
  passive:
    # required: true
    command: seppuku.sh
    # required: false
    args: [
      "--passive-identity-file", "{{ .PassiveIdentityKeypairFile }}",
      "--stop-service-on-identity-set-failure",
      "--wait-for-and-force-identity-on-service-starting-up",
    ]
    # required: false
    hooks:
      pre:
        - name: notify-slack-demoting
          command: /home/solana/solana-validator-ha/hooks/pre-passive/send-slack-alert.sh
          must_succeed: false
          args: [
            "--channel", "#oh-shit-wake-people-up",
            "--message", "solana-validator-ha demoting {{ .SelfName }} to passive ({{ .ActiveIdentityPubkey }} -> {{ .PassiveIdentityPubkey }})"
          ]
      post:
        - name: notify-slack-demoted
          command: /home/solana/solana-validator-ha/hooks/post-passive/send-slack-alert.sh
          args: [
            "--channel", "#postmortem-shelf",
            "--message", "solana-validator-ha demoted {{ .SelfName }} to passive with identity {{ .PassiveIdentityPubkey }}"
          ]

make dev
make test

The application exposes Prometheus metrics on the configured port (default: 9090) and a health check endpoint on a separate configurable port (default: 9091):
Metrics:
- solana_validator_ha_metadata: Validator metadata with role and status labels
- solana_validator_ha_peer_count: Number of peers visible in gossip
- solana_validator_ha_self_in_gossip: Whether this validator appears in gossip (1=yes, 0=no)
- solana_validator_ha_failover_status: Current failover status

Labels:
- validator_name: Configured validator name
- public_ip: Validator's public IP address
- validator_role: Current role (active/passive/unknown)
- validator_status: Health status (healthy/unhealthy)
- Plus any configured static labels

Endpoints:
- /metrics: Prometheus metrics (on prometheus.port, default: 9090)
- /health: Basic health check (on prometheus.health_check_port, default: 9091)
This project is licensed under the MIT License - see the LICENSE file for details.


