Description
Context:
We observed elevated Redis GET/SET failures that correlated strongly with increased DNS latency. Scaling up the CoreDNS pods reduced DNS latency, and subsequently restarting the application service pods mitigated the Redis errors. While the temporal correlation is clear, the exact mechanism linking DNS latency to client-side MOVED errors is not yet fully understood.
One plausible explanation is that increased DNS latency caused Redis redirect handling to fail at the client. When a MOVED response was returned, DNS resolution or connection attempts to the redirected node may have exceeded client timeouts, causing the redirect to fail and the original MOVED error to surface. Repeated failures may have left clients with stale topology until the service pods were restarted. Is this understanding accurate? As a follow-up, can any client-side settings, such as retries on MOVED, redirect-specific timeouts, and topology refresh on failure, be tuned to enable self-healing from transient DNS or topology issues?
Error message:
failed to execute SETEX: MOVED 7555 cache-usw2-001-cell-003-0019-001.cache-usw2-001-cell-003.f6zuq8.usw2.cache.amazonaws.com:6379
We were seeing these MOVED errors across multiple slots and Valkey nodes; the above is one such example.
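For triage across many such errors, the MOVED reply has a fixed shape ("MOVED <slot> <host:port>") that can be parsed to group failures by slot and target node, e.g. to check whether redirects to particular hosts correlate with DNS latency. This is a hypothetical logging helper, not part of rueidis:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMoved extracts the slot and target address from a MOVED reply such as
// "MOVED 7555 host:6379". Grouping errors by the returned address makes it
// easy to see whether redirects to specific nodes are the ones failing.
func parseMoved(msg string) (slot int, addr string, ok bool) {
	parts := strings.Fields(msg)
	if len(parts) != 3 || parts[0] != "MOVED" {
		return 0, "", false
	}
	n, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, "", false
	}
	return n, parts[2], true
}

func main() {
	slot, addr, ok := parseMoved("MOVED 7555 cache-usw2-001-cell-003-0019-001.cache-usw2-001-cell-003.f6zuq8.usw2.cache.amazonaws.com:6379")
	fmt.Println(ok, slot, strings.HasSuffix(addr, ":6379"))
	// → true 7555 true
}
```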
Environment
Rueidis v1.0.66
Go 1.25.3
AWS ElastiCache Valkey Cluster v8.1
Auto-Pipelining enabled
Client Side Tracking enabled
Cluster Configuration
Node instance type: cache.r7g.4xlarge
Total nodes per cluster: 40 primary + 40 replicas