Split Redis configuration for each connection pool by database #3195

@achimnol

Description

Synopsis

In some customer sites, we have experienced:

To mitigate these issues, that is, to keep them from affecting the SLA and to make them less likely to occur, let's split the Redis instances in our "halfstack" so that we can apply a different level of HA configuration to each purpose.

Since we already split the connection pools by Redis database, one per purpose as listed below, it is easy to change the initialization routine of those pools to use different connection parameters.

Redis databases

# src/ai/backend/common/defs.py
from typing import Final

REDIS_STAT_DB: Final = 0
REDIS_RLIM_DB: Final = 1
REDIS_LIVE_DB: Final = 2
REDIS_IMAGE_DB: Final = 3
REDIS_STREAM_DB: Final = 4

I do not plan to pre-define how to group these Redis databases; I will leave it to our solution architect team to determine the desired setup per site.

That said, here is some background for reference:

Data persistency requirements

  • Needed:
      • STAT (statistics): if lost, usage accounting is affected, which is critical for billing and similar purposes.
      • LIVE (agent liveness, idle checkers): if lost, scheduler performance degrades and running sessions may be terminated by mistake, depending on the idle-checker config.

  • Not much needed:
      • RLIM (rate limit): volatile information, effective only for up to 15 minutes; no need to enforce it strictly on failover.
      • STREAM (event bus): events are transient, and they are often retried periodically or have another means of synchronization.
      • IMAGE (per-agent image availability map): if lost, it is automatically reconstructed from agent heartbeats.

Main load patterns

  • STAT, LIVE: proportional to the number of sessions
  • RLIM: proportional to the volume of client API requests
  • IMAGE: proportional to the number of cluster nodes
  • STREAM: proportional to the number of cluster nodes & the number of sessions

Current Implementation

The Redis connection parameters are stored in etcd and shared across all cluster nodes, including Manager, Agent, Storage Proxy, and App Proxy.

Currently, the Redis configuration can specify only one Redis instance (either a single host:port address or a list of sentinel host:port addresses).

redis_helper_config_iv = t.Dict({
    t.Key("socket_timeout", default=5.0): t.ToFloat,
    t.Key("socket_connect_timeout", default=2.0): t.ToFloat,
    t.Key("reconnect_poll_timeout", default=0.3): t.ToFloat,
}).allow_extra("*")

redis_config_iv = t.Dict({
    t.Key("addr", default=redis_default_config["addr"]): t.Null | tx.HostPortPair,
    t.Key(  # if present, addr is ignored and service_name becomes mandatory.
        "sentinel", default=redis_default_config["sentinel"]
    ): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
    t.Key("service_name", default=redis_default_config["service_name"]): t.Null | t.String,
    t.Key("password", default=redis_default_config["password"]): t.Null | t.String,
    t.Key(
        "redis_helper_config",
        default=redis_helper_default_config,
    ): redis_helper_config_iv,
}).allow_extra("*")
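
The inline comment on the "sentinel" key says that, when it is present, "addr" is ignored and "service_name" becomes mandatory. A minimal sketch of that rule in plain Python follows. The function name `select_connection_mode` and the error messages are hypothetical and are not part of the existing codebase; the real validation is performed by the trafaret schema above.

```python
from typing import Any, Mapping


def select_connection_mode(config: Mapping[str, Any]) -> str:
    """Decide how to connect, given an already-validated config mapping.

    Hypothetical helper: returns "sentinel" or "single".
    """
    sentinel = config.get("sentinel")
    if sentinel:
        # When sentinel addresses are given, addr is ignored
        # and service_name must be set.
        if not config.get("service_name"):
            raise ValueError("service_name is required when sentinel is set")
        return "sentinel"
    if config.get("addr") is None:
        raise ValueError("either addr or sentinel must be configured")
    return "single"


single = {"addr": ("127.0.0.1", 6379), "sentinel": None, "service_name": None}
ha = {
    "addr": ("127.0.0.1", 6379),  # ignored because sentinel is present
    "sentinel": [("10.0.0.1", 26379), ("10.0.0.2", 26379)],
    "service_name": "mymaster",
}
single_mode = select_connection_mode(single)
ha_mode = select_connection_mode(ha)
```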

Proposed Addition

Let's extend the configuration format to additionally specify per-db instance addresses (either a single host:port address or a list of sentinel host:port addresses).

We could simply allow optional, additional mappings from DB index to address settings (either a single address or sentinels), with unspecified DBs falling back to the base address settings. This way, we keep backward compatibility with existing setups.
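
The fallback behavior described above could look roughly like the following sketch. It uses plain dataclasses rather than the trafaret schema. `RedisTarget`, `resolve_redis_target`, and the host names are hypothetical, chosen only for illustration; the actual configuration format is yet to be decided.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple

HostPort = Tuple[str, int]

# DB indices as defined in src/ai/backend/common/defs.py
REDIS_STAT_DB = 0
REDIS_RLIM_DB = 1


@dataclass(frozen=True)
class RedisTarget:
    """Hypothetical connection target: a single address or a sentinel group."""

    addr: Optional[HostPort] = None
    sentinel: Optional[Sequence[HostPort]] = None
    service_name: Optional[str] = None
    password: Optional[str] = None


def resolve_redis_target(
    base: RedisTarget,
    overrides: Mapping[int, RedisTarget],
    db: int,
) -> RedisTarget:
    """Return the per-DB target if one is configured, else the base target."""
    return overrides.get(db, base)


base = RedisTarget(addr=("redis-main", 6379))
overrides = {REDIS_RLIM_DB: RedisTarget(addr=("redis-rlim", 6379))}

rlim_target = resolve_redis_target(base, overrides, REDIS_RLIM_DB)
stat_target = resolve_redis_target(base, overrides, REDIS_STAT_DB)
```

With an empty `overrides` mapping, every DB resolves to `base`, which is how an existing single-instance setup would keep working unchanged.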

An example split:

  • Instance 1 for STAT, LIVE: persistence with AOF for HA
  • Instance 2 for STREAM, IMAGE: no persistence with HA
  • Instance 3 for RLIM: no persistence with HA
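
As an illustration only, the example split above could be written as a grouping of DB indices per instance and then flattened into a per-DB override mapping. The addresses and the `build_db_overrides` helper are hypothetical; persistence and HA settings would be configured on the Redis side and are not represented here.

```python
from __future__ import annotations

# DB indices as defined in src/ai/backend/common/defs.py
REDIS_STAT_DB = 0
REDIS_RLIM_DB = 1
REDIS_LIVE_DB = 2
REDIS_IMAGE_DB = 3
REDIS_STREAM_DB = 4

# Hypothetical instance addresses mapped to the DBs they serve.
instance_groups = {
    "redis-persistent:6379": [REDIS_STAT_DB, REDIS_LIVE_DB],
    "redis-events:6379": [REDIS_STREAM_DB, REDIS_IMAGE_DB],
    "redis-rlim:6379": [REDIS_RLIM_DB],
}


def build_db_overrides(groups: dict[str, list[int]]) -> dict[int, str]:
    """Flatten instance -> DBs into DB -> instance, rejecting duplicates."""
    overrides: dict[int, str] = {}
    for address, dbs in groups.items():
        for db in dbs:
            if db in overrides:
                raise ValueError(f"DB {db} is assigned to more than one instance")
            overrides[db] = address
    return overrides


db_overrides = build_db_overrides(instance_groups)
```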
