Description
Context:
We observed elevated Redis GET/SET failures that correlated strongly with increased DNS latency. Scaling up the CoreDNS pods reduced DNS latency, and subsequently restarting the application service pods mitigated the Redis errors. While the temporal correlation is clear, the exact mechanism linking DNS latency to client-side MOVED errors is not yet fully understood.
One plausible explanation is that increased DNS latency caused Redis redirect handling to fail at the client. When a MOVED response was returned, DNS resolution or connection attempts to the redirected node may have exceeded client timeouts, causing the redirect to fail and the original MOVED error to surface. Repeated failures may have left clients with stale topology until the service pods were restarted. Is this understanding accurate? As a follow-up, can any client-side settings, such as retries on MOVED, redirect-specific timeouts, and topology refresh on failure, be tuned to enable self-healing from transient DNS or topology issues?
Error message:
failed to execute SETEX: MOVED 7555 cache-usw2-001-cell-003-0019-001.cache-usw2-001-cell-003.f6zuq8.usw2.cache.amazonaws.com:6379
We were seeing these MOVED errors across multiple slots and Valkey nodes; the above is one such example.
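For triage across many such errors, the MOVED reply has a fixed shape ("MOVED <slot> <host:port>") that can be parsed to group failures by slot and target node, e.g. to check whether redirects to particular hosts correlate with DNS latency. This is a hypothetical logging helper, not part of rueidis:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMoved extracts the slot and target address from a MOVED reply such as
// "MOVED 7555 host:6379". Grouping errors by the returned address makes it
// easy to see whether redirects to specific nodes are the ones failing.
func parseMoved(msg string) (slot int, addr string, ok bool) {
	parts := strings.Fields(msg)
	if len(parts) != 3 || parts[0] != "MOVED" {
		return 0, "", false
	}
	n, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, "", false
	}
	return n, parts[2], true
}

func main() {
	slot, addr, ok := parseMoved("MOVED 7555 cache-usw2-001-cell-003-0019-001.cache-usw2-001-cell-003.f6zuq8.usw2.cache.amazonaws.com:6379")
	fmt.Println(ok, slot, strings.HasSuffix(addr, ":6379"))
	// → true 7555 true
}
```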
Environment
Rueidis v1.0.66
Go 1.25.3
AWS ElastiCache Valkey Cluster v8.1
Auto-Pipelining enabled
Client Side Tracking enabled
Cluster Configuration
Node instance type: cache.r7g.4xlarge
Total nodes per cluster: 40 primary + 40 replicas