Errors and Panic in Rueidis during Redis Cluster Shard Failover #929

@ashmeet-zepto

Description

We observed a set of issues when using rueidis v1.0.69 with a Redis Cluster (v7.4).

During a failover in one shard of the cluster, the application experienced an increase in Redis request latency, followed by multiple client-side errors. No application code changes were deployed during this time.

The following issues were observed:

  • Intermittent transaction errors:
    EXEC was aborted by redis or connection closed
  • Redis protocol parse errors:
    rueidis: parse error: redis message type simple string is not a array
    rueidis: parse error: redis message type array is not a string
  • Connection-level errors:
    EOF
    dial tcp <ip>:<port>: operation was canceled

Once the parse errors started, the client did not recover on its own; we had to restart the application pods to restore service.

Panic observed during cluster refresh

In the same time window, we observed a panic originating from the cluster topology refresh path.

panic: runtime error: index out of range [2] with length 0

goroutine 258842186 [running]:
github.com/redis/rueidis.parseSlots(...)
    /go/pkg/mod/github.com/redis/rueidis@v1.0.69/cluster.go:378
github.com/redis/rueidis.clusterslots.parse(...)
    /go/pkg/mod/github.com/redis/rueidis@v1.0.69/cluster.go:378
github.com/redis/rueidis.(*clusterClient)._refresh(...)
    /go/pkg/mod/github.com/redis/rueidis@v1.0.69/cluster.go:167
github.com/redis/rueidis.(*call).do(...)
    /go/pkg/mod/github.com/redis/rueidis@v1.0.69/cluster.go:217
github.com/redis/rueidis.(*call).LazyDo.func1(...)
    /go/pkg/mod/github.com/redis/rueidis@v1.0.69/singleflight.go:58
github.com/redis/rueidis.(*call).LazyDo(...)
    /go/pkg/mod/github.com/redis/rueidis@v1.0.69/singleflight.go:53

Client Code

  • Connection dial timeout: default
  • Connection write timeout: default

Client Initialization

client, err := rueidisotel.NewClient(
    rueidis.ClientOption{
        InitAddress:       conf.NodeAddresses,
        Username:          conf.UserName,
        Password:          conf.Password,
        DisableCache:      conf.DisableCache,
        CacheSizeEachConn: cacheSizeEachConnection,
    },
)
if err != nil {
    return err
}

if err := client.Do(
    context.Background(),
    client.B().Ping().Build(),
).Error(); err != nil {
    return err
}

Client Usage

func (c *Cache) MGetWithClientSideCache(
    ctx context.Context,
    keys []string,
) (map[string]string, error) {

    results, err := rueidis.MGetCache(
        c.client,
        ctx,
        30*time.Second,
        keys,
    )
    if err != nil {
        return nil, err
    }

    finalResult := make(map[string]string, len(results))
    for key, result := range results {
        value, err := result.ToString()
        if err != nil && !rueidis.IsRedisNil(err) {
            return nil, err
        }
        finalResult[key] = value
    }

    return finalResult, nil
}

Environment

  • Rueidis v1.0.69
  • Go 1.23.0
  • Redis Cluster v7.4
  • Auto-Pipelining enabled

Questions

  1. When does rueidis refresh the cluster topology?
    If no refresh interval is configured, what situations cause a topology refresh to happen automatically?

  2. Is it normal to see EXEC abort errors during a shard failover?
    Should applications expect these errors during cluster changes?

  3. What situations can lead to Redis protocol parse errors in rueidis?
    For example, can they happen due to connection interruptions or cluster changes?

  4. After a protocol parse error, should the client recover on its own or be restarted?
