Synopsis
In some customer sites, we have experienced:
- Excessively high CPU usage of the Redis service
- Excessive disk usage by AOF files, which may be a Redis bug or some subtle config issue:
  - [BUG] AOF Compaction stopped redis/redis#10806
  - [QUESTION] Shrink AOF file that is being saved for a long time redis/redis#10742

To mitigate these issues, i.e., to prevent them from impacting the SLA and reduce the chance of them happening, let's split the Redis instances in our "halfstack" so that we can apply different levels of HA configuration depending on the purpose.
Since we already split the connection pools for each Redis database by purpose as follows, it is easy to change the initialization routine of those connection pools to use different connection parameters.
Redis databases

```python
# src/ai/backend/common/defs.py
REDIS_STAT_DB: Final = 0
REDIS_RLIM_DB: Final = 1
REDIS_LIVE_DB: Final = 2
REDIS_IMAGE_DB: Final = 3
REDIS_STREAM_DB: Final = 4
```
I do not plan to pre-define how to group these Redis databases but leave it to our solution architect team to determine the desired setup per site.
Still, here are some references to understand the background:
Data persistency requirements

- Needed:
  - `STAT` (statistics): if lost, it impacts the usage accounting critical for billing, etc.
  - `LIVE` (agent liveness, idle checkers): if lost, it impacts the scheduler performance and may mis-terminate running sessions depending on the idle-checker config.
- Not much needed:
  - `RLIM` (rate limit): volatile information, effective only for up to 15 minutes; no need to enforce it strictly on failover.
  - `STREAM` (event bus): events are transient and are often retried periodically or have another means of synchronization.
  - `IMAGE` (per-agent image availability map): if lost, it is automatically reconstructed from agent heartbeats.
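These requirements map naturally to standard Redis persistence directives per instance group. A minimal sketch of one possible mapping (the grouping and profile names are illustrative; `appendonly`, `appendfsync`, and `save` are standard redis.conf options):

```python
# Illustrative mapping from each logical DB to a persistence profile.
# The grouping is an assumption for illustration, not a prescribed setup.
PERSISTENCE_PROFILES = {
    # AOF persistence for data that must survive restarts/failovers
    "durable": {"appendonly": "yes", "appendfsync": "everysec"},
    # No persistence for transient/reconstructible data
    "volatile": {"appendonly": "no", "save": ""},
}

DB_PROFILE = {
    "STAT": "durable",
    "LIVE": "durable",
    "RLIM": "volatile",
    "STREAM": "volatile",
    "IMAGE": "volatile",
}

def persistence_for(db_name: str) -> dict[str, str]:
    # Look up the redis.conf directives for the instance serving this DB.
    return PERSISTENCE_PROFILES[DB_PROFILE[db_name]]
```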
Main load patterns

- `STAT`, `LIVE`: proportional to the number of sessions
- `RLIM`: proportional to the volume of client API requests
- `IMAGE`: proportional to the number of cluster nodes
- `STREAM`: proportional to the number of cluster nodes & the number of sessions

Current Implementation
The configuration of Redis connection parameters is stored in etcd and shared across all cluster nodes, including Manager, Agent, Storage Proxy, and App Proxy.
Currently, the Redis configuration can specify only one Redis instance (either a single `host:port` address or a list of sentinel `host:port` addresses).
```python
redis_helper_config_iv = t.Dict({
    t.Key("socket_timeout", default=5.0): t.ToFloat,
    t.Key("socket_connect_timeout", default=2.0): t.ToFloat,
    t.Key("reconnect_poll_timeout", default=0.3): t.ToFloat,
}).allow_extra("*")

redis_config_iv = t.Dict({
    t.Key("addr", default=redis_default_config["addr"]): t.Null | tx.HostPortPair,
    t.Key(  # if present, addr is ignored and service_name becomes mandatory.
        "sentinel", default=redis_default_config["sentinel"]
    ): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
    t.Key("service_name", default=redis_default_config["service_name"]): t.Null | t.String,
    t.Key("password", default=redis_default_config["password"]): t.Null | t.String,
    t.Key(
        "redis_helper_config",
        default=redis_helper_default_config,
    ): redis_helper_config_iv,
}).allow_extra("*")
```
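For context, configuration instances accepted by this schema look roughly like the following (the concrete addresses, the comma-delimited sentinel string, and the service name are illustrative values, not taken from any real deployment):

```python
# Illustrative values for the two modes accepted by redis_config_iv;
# the actual values live in etcd per deployment.
example_single = {
    "addr": "127.0.0.1:6379",  # single-instance mode
    "sentinel": None,
    "service_name": None,
    "password": None,
}

example_sentinel = {
    "addr": None,  # ignored when sentinel is present
    # sentinel mode: service_name becomes mandatory
    "sentinel": "node1:26379,node2:26379,node3:26379",
    "service_name": "mymaster",
    "password": None,
}
```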
Proposed Addition
Let's extend the configuration format to additionally specify per-DB instance addresses (either a single `host:port` address or a list of sentinel `host:port` addresses).
We could allow configuring optional, additional mappings from DB index to address settings (either single or sentinels), with non-specified DBs falling back to the base address settings. This way, we keep backward compatibility with existing setups.
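The fallback behavior could be sketched as follows, assuming a hypothetical `override` key that maps DB indices to per-DB address settings (the key name and config layout are assumptions, not the final schema):

```python
from typing import Any

def resolve_redis_target(config: dict[str, Any], db: int) -> dict[str, Any]:
    """Return the effective address settings for a given Redis DB index.

    Entries under the hypothetical "override" key take precedence; DBs
    without an entry fall back to the base settings, so existing
    single-instance configurations keep working unchanged.
    """
    base = {k: v for k, v in config.items() if k != "override"}
    override = config.get("override", {}).get(db)
    if override:
        # Replace the address-related keys wholesale so that a single-address
        # override does not accidentally inherit the base sentinel list.
        for key in ("addr", "sentinel", "service_name"):
            base.pop(key, None)
        base.update(override)
    return base
```

Non-address settings such as the password stay shared unless explicitly overridden, which keeps per-site override blocks short.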
An example split:

- Instance 1 for `STAT`, `LIVE`: persistence with AOF, for HA
- Instance 2 for `STREAM`, `IMAGE`: no persistence, with HA
- Instance 3 for `RLIM`: no persistence, with HA
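This split could be rendered in the proposed format roughly as follows (the instance addresses and the `override` key are illustrative assumptions; `STAT` and `LIVE` are simply left unlisted so they fall back to the base address):

```python
# Hypothetical rendering of the example split; addresses are placeholders.
REDIS_STAT_DB, REDIS_RLIM_DB, REDIS_LIVE_DB, REDIS_IMAGE_DB, REDIS_STREAM_DB = 0, 1, 2, 3, 4

example_redis_config = {
    # Base settings double as Instance 1 (STAT, LIVE): AOF persistence + HA.
    "addr": "redis-durable:6379",
    "override": {
        # Instance 2 (STREAM, IMAGE): no persistence, with HA.
        REDIS_STREAM_DB: {"addr": "redis-transient:6379"},
        REDIS_IMAGE_DB: {"addr": "redis-transient:6379"},
        # Instance 3 (RLIM): no persistence, with HA.
        REDIS_RLIM_DB: {"addr": "redis-ratelimit:6379"},
    },
}

def addr_for(config: dict, db: int) -> str:
    # Unlisted DBs (STAT, LIVE) fall back to the base address.
    return config["override"].get(db, config)["addr"]
```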