This document describes how the server handles a failed initial NATS connection and the loss of an established connection at runtime.
The server will panic if it cannot establish an initial connection to the NATS cluster. This is intentional because:
- Core Dependency: NATS is a core infrastructure component that the server cannot function without
- Fail Fast: Better to fail immediately than to start in a broken state
- Operational Clarity: Clear indication that the server environment is not properly configured
if err != nil {
	// Panic on initial connection failure - server cannot function without NATS
	l.Fatal("Failed to connect to NATS cluster: %v", err)
	panic(fmt.Sprintf("Failed to connect to NATS cluster: %v", err))
}
// Verify connection is established
if nc.Status() != nats.CONNECTED {
	l.Fatal("NATS connection not in CONNECTED state: %s", nc.Status())
	panic(fmt.Sprintf("NATS connection not in CONNECTED state: %s", nc.Status()))
}
After the connection is established, the server verifies that it is in the CONNECTED state before proceeding. This ensures the NATS client is fully ready for operations.
Once the server is running and the initial connection is established, the server handles connection loss gracefully:
- No Panic: The server continues running even if the NATS connection is lost
- Automatic Reconnection: NATS client automatically attempts to reconnect
- Status Monitoring: Connection status is tracked by the NATS client and surfaced through the registered event handlers
- Graceful Degradation: Services become unavailable but the server remains operational
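The "no panic, degrade gracefully" behavior above can be illustrated with a minimal sketch. It is not the server's actual code: `connState`, `stubConn`, `handleCommand`, and `ErrMessagingUnavailable` are hypothetical names introduced here. The only assumption about the real client is that `*nats.Conn` exposes `IsConnected() bool`, so it would satisfy the small interface below.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMessagingUnavailable is a hypothetical sentinel error returned while
// the NATS connection is down.
var ErrMessagingUnavailable = errors.New("messaging unavailable: NATS not connected")

// connState is the small part of the connection a handler needs.
// *nats.Conn has an IsConnected method, so it would satisfy this interface.
type connState interface {
	IsConnected() bool
}

// stubConn stands in for a real connection so the sketch runs on its own.
type stubConn struct{ up bool }

func (s stubConn) IsConnected() bool { return s.up }

// handleCommand returns an error instead of panicking when the connection
// is down. send is whatever function actually publishes the command.
func handleCommand(c connState, cmd string, send func(string) error) error {
	if !c.IsConnected() {
		return fmt.Errorf("command %q rejected: %w", cmd, ErrMessagingUnavailable)
	}
	return send(cmd)
}

func main() {
	send := func(string) error { return nil }
	fmt.Println(handleCommand(stubConn{up: false}, "subscribe", send))
	fmt.Println(handleCommand(stubConn{up: true}, "subscribe", send))
}
```

Callers can test for the sentinel with `errors.Is` and report a clear error to the client while the process keeps running.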
The server registers several connection event handlers:
nats.DisconnectHandler(func(nc *nats.Conn) {
	l.Error("NATS connection disconnected")
}),
nats.ReconnectHandler(func(nc *nats.Conn) {
	l.Info("NATS connection reconnected")
}),
nats.ClosedHandler(func(nc *nats.Conn) {
	l.Error("NATS connection closed")
}),
nats.ErrorHandler(func(nc *nats.Conn, sub *nats.Subscription, err error) {
	l.Error("NATS error: %v", err)
}),
The server configures NATS with resilience options:
nats.MaxReconnects(-1),                                    // Unlimited reconnection attempts
nats.ReconnectWait(1*time.Second),                         // Wait 1 second between attempts
nats.ReconnectJitter(100*time.Millisecond, 1*time.Second), // Add up to 100ms of jitter (up to 1s for TLS connections)
nats.Timeout(10*time.Second),                              // Connection timeout
nats.PingInterval(30*time.Second),                         // Send ping every 30 seconds
nats.MaxPingsOutstanding(3),                               // Allow up to 3 unanswered pings
NATS provides built-in connection monitoring and event handling. The server relies on these mechanisms:
- Automatic Status Tracking: NATS client internally tracks connection status
- Event Handlers: Connection events are automatically triggered and logged
- Reconnection Logic: NATS handles reconnection attempts automatically
- Health Checks: Built-in ping/pong mechanism detects connection issues
No additional monitoring goroutine is needed as NATS handles all connection state management internally.
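To make the reconnection timing concrete, here is a rough sketch of how a fixed wait and a jitter bound combine into the pause before a reconnect attempt. It is an approximation for illustration, not the NATS client's implementation. `reconnectDelay`, `newRNG`, `exampleWait`, and `exampleJitter` are hypothetical names, and the values mirror the non-TLS settings shown earlier (1 second wait, 100 ms jitter bound).

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Values taken from the options in this document (non-TLS case).
const (
	exampleWait   = 1 * time.Second
	exampleJitter = 100 * time.Millisecond
)

// newRNG returns a seeded random source; a fixed seed makes runs repeatable.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// reconnectDelay approximates the pause before one reconnect attempt:
// the fixed wait plus a random jitter in [0, maxJitter).
func reconnectDelay(wait, maxJitter time.Duration, rng *rand.Rand) time.Duration {
	if maxJitter <= 0 {
		return wait
	}
	return wait + time.Duration(rng.Int63n(int64(maxJitter)))
}

func main() {
	rng := newRNG(time.Now().UnixNano())
	for i := 1; i <= 3; i++ {
		fmt.Printf("attempt %d: wait %v\n", i, reconnectDelay(exampleWait, exampleJitter, rng))
	}
}
```

With these settings each delay falls between 1s and just under 1.1s. The jitter keeps many clients from reconnecting at the same instant after a shared outage.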
- The who, urlShort, and ntfy services check the connection status on startup
- If connection is lost, these services will log errors but continue running
- NATS microservice endpoints become unavailable until reconnection
- Existing WebSocket connections remain open
- New subscriptions and commands will fail until NATS reconnects
- Clients receive error messages for failed operations
- Logger continues to function (messages are queued when NATS is unavailable)
- Queued messages are published once NATS reconnects
- Network Issues: NATS client automatically reconnects when network is restored
- Server Restart: NATS client reconnects when NATS server comes back online
- Temporary Outages: Services resume normal operation after reconnection
- Configuration Issues: Fix NATS configuration and restart server
- Authentication Issues: Verify JWT and NKEY credentials
- Network Configuration: Check firewall and DNS settings
- Monitor NATS connection status in logs
- Set up alerts for connection loss events
- Track reconnection frequency and success rate
- Ensure NATS cluster is available before deploying the server
- Use health checks to verify NATS connectivity
- Consider using embedded NATS for development/testing
- Check NATS server logs for connection issues
- Verify network connectivity to NATS endpoints
- Validate authentication credentials
- Review NATS server configuration
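The recommendation to use health checks to verify NATS connectivity could look something like the sketch below. It is a hypothetical example, not the server's actual endpoint: `connChecker`, `fakeConn`, `healthHandler`, `probe`, and the `/healthz` path are names assumed here. It relies only on `*nats.Conn` providing `IsConnected() bool`, and uses a stub in its place so the example runs without a NATS server.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// connChecker is the one method the health check needs.
// *nats.Conn has an IsConnected method, so it would satisfy this interface.
type connChecker interface {
	IsConnected() bool
}

// fakeConn stands in for a real connection in this sketch.
type fakeConn struct{ up bool }

func (f fakeConn) IsConnected() bool { return f.up }

// healthHandler reports 200 while connected and 503 while disconnected.
func healthHandler(c connChecker) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if !c.IsConnected() {
			http.Error(w, "nats: disconnected", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "nats: connected")
	}
}

// probe calls the handler in-process and returns the HTTP status code.
func probe(c connChecker) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	healthHandler(c)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("connected:", probe(fakeConn{up: true}))
	fmt.Println("disconnected:", probe(fakeConn{up: false}))
}
```

In a real server the handler would be registered with something like `http.Handle("/healthz", healthHandler(nc))`. A 503 response lets a load balancer or orchestrator see the degraded state even though the process is still running.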
FATAL Failed to connect to NATS cluster: dial tcp: lookup connect.ngs.global: i/o timeout
panic: Failed to connect to NATS cluster: dial tcp: lookup connect.ngs.global: i/o timeout
ERROR NATS connection disconnected
ERROR NATS connection status changed to: DISCONNECTED
ERROR NATS connection permanently closed - server will continue but messaging will be unavailable
INFO NATS connection reconnected
INFO Successfully connected to NATS cluster
- Initial Failure: Server panics if it cannot connect to NATS (fail-fast behavior)
- Connection Loss: Server continues running, NATS client attempts automatic reconnection
- Graceful Degradation: Services become unavailable but server remains operational
- Automatic Recovery: Full functionality restored when NATS reconnects
- Built-in Monitoring: NATS handles all connection state management and event handling internally
This approach ensures the server fails fast on configuration issues while providing resilience during operational network problems, leveraging NATS's robust built-in connection management capabilities.