Synopsis
In some customer sites, we have experienced:
- Excessively high CPU usage of the Redis service
- Excessive disk usage by AOF files, which may be a Redis bug or some subtle config issue:
  - [BUG] AOF Compaction stopped redis/redis#10806
  - [QUESTION] Shrink AOF file that is being saved for a long time redis/redis#10742

To mitigate these issues, i.e., to prevent them from impacting the SLA and reduce the chance of them happening, let's split the Redis instances in our "halfstack" so that we can apply different levels of HA configuration depending on the purpose.
Since we already split the connection pools for each Redis database by purpose as follows, it is easy to change the initialization routine of those connection pools to use different connection parameters.
Redis databases

```python
# src/ai/backend/common/defs.py
REDIS_STAT_DB: Final = 0
REDIS_RLIM_DB: Final = 1
REDIS_LIVE_DB: Final = 2
REDIS_IMAGE_DB: Final = 3
REDIS_STREAM_DB: Final = 4
```
I do not plan to pre-define how to group these Redis databases but leave it to our solution architect team to determine the desired setup per site.
Still, here are some references to understand the background:
Data persistency requirements

- Needed:
  - `STAT` (statistics): if lost, it impacts the usage accounting critical for billing, etc.
  - `LIVE` (agent liveness, idle checkers): if lost, it impacts the scheduler performance and may mis-terminate running sessions depending on the idle-checker config.
- Not much needed:
  - `RLIM` (rate limit): volatile information, effective only for up to 15 minutes; no need to enforce it strictly on failover.
  - `STREAM` (event bus): events are transient and are often retried periodically or have another means of synchronization.
  - `IMAGE` (per-agent image availability map): if lost, it is automatically reconstructed from agent heartbeats.
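These requirements map naturally to standard Redis persistence directives per instance group. A minimal sketch of one possible mapping (the grouping and profile names are illustrative; `appendonly`, `appendfsync`, and `save` are standard redis.conf options):

```python
# Illustrative mapping from each logical DB to a persistence profile.
# The grouping is an assumption for illustration, not a prescribed setup.
PERSISTENCE_PROFILES = {
    # AOF persistence for data that must survive restarts/failovers
    "durable": {"appendonly": "yes", "appendfsync": "everysec"},
    # No persistence for transient/reconstructible data
    "volatile": {"appendonly": "no", "save": ""},
}

DB_PROFILE = {
    "STAT": "durable",
    "LIVE": "durable",
    "RLIM": "volatile",
    "STREAM": "volatile",
    "IMAGE": "volatile",
}

def persistence_for(db_name: str) -> dict[str, str]:
    # Look up the redis.conf directives for the instance serving this DB.
    return PERSISTENCE_PROFILES[DB_PROFILE[db_name]]
```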
Main load patterns

- `STAT`, `LIVE`: proportional to the number of sessions
- `RLIM`: proportional to the volume of client API requests
- `IMAGE`: proportional to the number of cluster nodes
- `STREAM`: proportional to the number of cluster nodes & the number of sessions

Current Implementation
The configuration of Redis connection parameters is stored in etcd and shared across all cluster nodes, including Manager, Agent, Storage Proxy, and App Proxy.
Currently, the Redis configuration can specify only one Redis instance (either a single `host:port` address or a list of sentinel `host:port` addresses).
```python
redis_helper_config_iv = t.Dict({
    t.Key("socket_timeout", default=5.0): t.ToFloat,
    t.Key("socket_connect_timeout", default=2.0): t.ToFloat,
    t.Key("reconnect_poll_timeout", default=0.3): t.ToFloat,
}).allow_extra("*")

redis_config_iv = t.Dict({
    t.Key("addr", default=redis_default_config["addr"]): t.Null | tx.HostPortPair,
    t.Key(  # if present, addr is ignored and service_name becomes mandatory.
        "sentinel", default=redis_default_config["sentinel"]
    ): t.Null | tx.DelimiterSeperatedList(tx.HostPortPair),
    t.Key("service_name", default=redis_default_config["service_name"]): t.Null | t.String,
    t.Key("password", default=redis_default_config["password"]): t.Null | t.String,
    t.Key(
        "redis_helper_config",
        default=redis_helper_default_config,
    ): redis_helper_config_iv,
}).allow_extra("*")
```
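For context, configuration instances accepted by this schema look roughly like the following (the concrete addresses, the comma-delimited sentinel string, and the service name are illustrative values, not taken from any real deployment):

```python
# Illustrative values for the two modes accepted by redis_config_iv;
# the actual values live in etcd per deployment.
example_single = {
    "addr": "127.0.0.1:6379",  # single-instance mode
    "sentinel": None,
    "service_name": None,
    "password": None,
}

example_sentinel = {
    "addr": None,  # ignored when sentinel is present
    # sentinel mode: service_name becomes mandatory
    "sentinel": "node1:26379,node2:26379,node3:26379",
    "service_name": "mymaster",
    "password": None,
}
```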
Proposed Addition
Let's extend the configuration format to additionally specify per-DB instance addresses (either a single `host:port` address or a list of sentinel `host:port` addresses).
We could allow configuring optional, additional mappings from DB index to address settings (either single or sentinels), with non-specified DBs falling back to the base address settings. This way, we keep backward compatibility with existing setups.
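The fallback behavior could be sketched as follows, assuming a hypothetical `override` key that maps DB indices to per-DB address settings (the key name and config layout are assumptions, not the final schema):

```python
from typing import Any

def resolve_redis_target(config: dict[str, Any], db: int) -> dict[str, Any]:
    """Return the effective address settings for a given Redis DB index.

    Entries under the hypothetical "override" key take precedence; DBs
    without an entry fall back to the base settings, so existing
    single-instance configurations keep working unchanged.
    """
    base = {k: v for k, v in config.items() if k != "override"}
    override = config.get("override", {}).get(db)
    if override:
        # Replace the address-related keys wholesale so that a single-address
        # override does not accidentally inherit the base sentinel list.
        for key in ("addr", "sentinel", "service_name"):
            base.pop(key, None)
        base.update(override)
    return base
```

Non-address settings such as the password stay shared unless explicitly overridden, which keeps per-site override blocks short.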
An example split:

- Instance 1 for `STAT`, `LIVE`: persistence with AOF, for HA
- Instance 2 for `STREAM`, `IMAGE`: no persistence, with HA
- Instance 3 for `RLIM`: no persistence, with HA
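This split could be rendered in the proposed format roughly as follows (the instance addresses and the `override` key are illustrative assumptions; `STAT` and `LIVE` are simply left unlisted so they fall back to the base address):

```python
# Hypothetical rendering of the example split; addresses are placeholders.
REDIS_STAT_DB, REDIS_RLIM_DB, REDIS_LIVE_DB, REDIS_IMAGE_DB, REDIS_STREAM_DB = 0, 1, 2, 3, 4

example_redis_config = {
    # Base settings double as Instance 1 (STAT, LIVE): AOF persistence + HA.
    "addr": "redis-durable:6379",
    "override": {
        # Instance 2 (STREAM, IMAGE): no persistence, with HA.
        REDIS_STREAM_DB: {"addr": "redis-transient:6379"},
        REDIS_IMAGE_DB: {"addr": "redis-transient:6379"},
        # Instance 3 (RLIM): no persistence, with HA.
        REDIS_RLIM_DB: {"addr": "redis-ratelimit:6379"},
    },
}

def addr_for(config: dict, db: int) -> str:
    # Unlisted DBs (STAT, LIVE) fall back to the base address.
    return config["override"].get(db, config)["addr"]
```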