Elevated MOVED errors while doing SETEX operations #937

@sriramtallapragada

Description

Context:
We observed elevated Redis GET/SET failures that correlated strongly with increased DNS latency. Scaling up the CoreDNS pods reduced DNS latency, and subsequently restarting the application service pods mitigated the Redis errors. While the temporal correlation is clear, we do not yet fully understand the exact mechanism linking DNS latency to client-side MOVED errors.

One plausible explanation is that increased DNS latency caused the client's MOVED redirect handling to fail. When a node returned MOVED, DNS resolution or the connection attempt to the redirected node may have exceeded the client timeouts, so the redirect failed and the original MOVED error surfaced to the caller. Repeated failures may then have left clients with a stale topology until the service pods were restarted. Is this understanding accurate? As a follow-up, are there client-side settings, such as retries on MOVED, redirect-specific timeouts, or topology refresh on failure, that can be tuned so that clients self-heal from transient DNS or topology issues?
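
To make the follow-up question concrete, here is a rough sketch of the kind of application-level retry we could add ourselves as a stopgap. It is an illustration under assumptions, not Rueidis behavior: `isMoved`, `retryOnMoved`, and `simulate` are hypothetical names, errors are matched by string, and the `op` callback stands in for the real SETEX call. Nothing here refreshes the client's topology, which is why built-in settings would be preferable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// isMoved reports whether err looks like a cluster MOVED redirect that
// surfaced to the caller. String matching is an assumption for illustration.
func isMoved(err error) bool {
	return err != nil && strings.Contains(err.Error(), "MOVED ")
}

// retryOnMoved is a hypothetical application-level wrapper. It runs op up to
// attempts times while op keeps failing with a MOVED error, doubling the wait
// between attempts. Any other result is returned immediately.
func retryOnMoved(ctx context.Context, attempts int, backoff time.Duration, op func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(ctx); !isMoved(err) {
			return err // success, or an error that is not a redirect
		}
		if i == attempts-1 {
			break // out of attempts: surface the last MOVED error
		}
		select {
		case <-ctx.Done():
			return errors.Join(ctx.Err(), err)
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return err
}

// simulate runs retryOnMoved against a stand-in operation that fails with a
// MOVED error `failures` times before succeeding. It returns how many times
// the operation was called and the final error.
func simulate(failures, attempts int) (int, error) {
	calls := 0
	op := func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("MOVED 7555 example.internal:6379")
		}
		return nil
	}
	err := retryOnMoved(context.Background(), attempts, time.Millisecond, op)
	return calls, err
}

func main() {
	calls, err := simulate(2, 5)
	fmt.Println("calls:", calls, "err:", err)
}
```

A wrapper like this only helps if the client has corrected its slot mapping by the time of the next attempt, so it would not have fixed the stale-topology state we suspect.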

Error message:
failed to execute SETEX: MOVED 7555 cache-usw2-001-cell-003-0019-001.cache-usw2-001-cell-003.f6zuq8.usw2.cache.amazonaws.com:6379

We saw these MOVED errors across multiple slots and Valkey nodes; the message above is one example.
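
For anyone aggregating these errors, the redirect payload has the form `MOVED <slot> <host>:<port>`, with slots in the range 0 to 16383. Below is a minimal sketch of how the message could be split into its parts for logging or per-node metrics. `parseMoved` and `movedRedirect` are hypothetical helpers written for this issue, not part of Rueidis, and they assume the redirect appears verbatim somewhere in the error string.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// movedRedirect holds the fields of a "MOVED <slot> <host>:<port>" reply.
type movedRedirect struct {
	Slot int
	Host string
	Port string
}

// parseMoved is a hypothetical helper (not part of Rueidis). It looks for a
// "MOVED <slot> <host>:<port>" fragment anywhere in msg, so it also works on
// wrapped messages such as "failed to execute SETEX: MOVED ...".
func parseMoved(msg string) (movedRedirect, bool) {
	idx := strings.Index(msg, "MOVED ")
	if idx < 0 {
		return movedRedirect{}, false
	}
	fields := strings.Fields(msg[idx:])
	if len(fields) < 3 {
		return movedRedirect{}, false
	}
	// Cluster hash slots are numbered 0 through 16383.
	slot, err := strconv.Atoi(fields[1])
	if err != nil || slot < 0 || slot > 16383 {
		return movedRedirect{}, false
	}
	host, port, err := net.SplitHostPort(fields[2])
	if err != nil {
		return movedRedirect{}, false
	}
	return movedRedirect{Slot: slot, Host: host, Port: port}, true
}

func main() {
	msg := "failed to execute SETEX: MOVED 7555 cache-usw2-001-cell-003-0019-001.cache-usw2-001-cell-003.f6zuq8.usw2.cache.amazonaws.com:6379"
	if r, ok := parseMoved(msg); ok {
		fmt.Printf("slot=%d host=%s port=%s\n", r.Slot, r.Host, r.Port)
	}
}
```

Grouping the parsed `Host` values is how one could confirm that the redirects point at many different nodes rather than a single unhealthy shard.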

Environment
Rueidis v1.0.66
Go 1.25.3
AWS ElastiCache Valkey Cluster v8.1
Auto-pipelining enabled
Client-side tracking enabled

Cluster Configuration
Node instance type: cache.r7g.4xlarge
Total nodes per cluster: 40 primary + 40 replicas

Labels: bug, help wanted