Introduce Pool Topology#222
Conversation
4d73b0a to
d362216
Compare
7a72081 to
b3a6f49
Compare
evacchi
left a comment
There was a problem hiding this comment.
left a few comments, especially around config
| if hasQueueConfig != hasPoolConfig { | ||
| setupLog.Error(fmt.Errorf("both pool-config-file and queues/topics config file must be specified together, or both must be omitted"), "Configuration error") | ||
| os.Exit(1) | ||
| } |
There was a problem hiding this comment.
this is bit too restrictive, and also I am wondering why we need 2 separate config files instead of one with two sections; for instance (even though it's not necessarily a deal breaker) it would break this PR llm-d/llm-d-batch-gateway#458
maybe let's split the config file into 2 sections, and let's make sure that it's clear that "pool" here means worker pool (maybe we should use that name in structs/interfaces as well), because otherwise it can be easily mistaken for inference pool!
There was a problem hiding this comment.
pool in this case is actually closer to inference pool than worker pool. If IGW is being used then the pool config will be roughly mapped with inference pools.
There was a problem hiding this comment.
right, I'm just saying that the term "pool" is overloaded here even if there was a 1:1 mapping; so I'd advise being explicit
| type PoolConfig struct { | ||
| ID string `json:"id"` | ||
| IGWBaseURL string `json:"igw_base_url"` | ||
| RequestPathURL string `json:"request_path_url"` | ||
| Workers int `json:"workers"` | ||
| HTTPHeaders map[string]string `json:"http_headers,omitempty"` | ||
| } |
There was a problem hiding this comment.
I am not 100% convinced about bundling routing fields to the worker config.
If it's a "worker pool", should the routing config stay on the queue side (where it was before) and the worker pool only define concurrency + grouping?
it would make sense if the worker was tied to a specific endpoint, but routing is handled at the message level (each message carries its own RequestURL and HttpHeaders)
Otherwise you could keep the queue config:
{
"queue_name": "critical-queue",
"worker_pool_id": "fast-pool",
"igw_base_url": "http://fast-model:8000",
"request_path_url": "/v1/completions",
"gate_type": "constant"
}
and then have an optional worker_pools section:
{
"id": "fast-pool",
"workers": 4
}
the default fallback policy could be either 1 single pool for all queues (possibly suboptimal) or one pool per queue (maybe wasteful).
There was a problem hiding this comment.
The routing config should be in the pool config. This is intentional because when multiple queues are specified and they are using the same endpoint they should specify the same pool because the queues are saturating the same endpoint.
the default fallback policy could be either 1 single pool for all queues (possibly suboptimal) or one pool per queue (maybe wasteful).
When specifying a single queue via command line then we fall back to creating a single pool. This is backwards compatible as that queue will get all workers.
If multiple queues are used we should force pools to be used otherwise the fallback logic becomes complicated and we can't make assumptions about what the user wants. The current logic will error if this is the case.
it would make sense if the worker was tied to a specific endpoint, but routing is handled at the message level (each message carries its own RequestURL and HttpHeaders)
This makes more sense than putting the endpoint in the queue itself. Because saturation (and per-pool gating in the future) happens independently at the endpoint level. Different endpoints should always be in different pools because saturating one endpoint does not impact a different endpoint.
There was a problem hiding this comment.
The pool's purpose is worker isolation and concurrency control (don't let one saturated group block another). I think that routing (where to send) is a separate concern.
The Worker function is a generic HTTP dispatcher - it sends to whatever msg.RequestURL it receives, regardless of pool. The routing default (base URL, path) logically belongs on the queue, where the request enters the system, with the request's own endpoint always taking precedence.
The argument that - two queues sharing a pool must share the same endpoint - looks to conflate worker grouping with endpoint identity. You could have two queues with different default endpoints sharing the same worker pool - the workers don't care, they just dispatch whatever URL the merge policy resolved.
Keeping the pool config to {id, workers} and routing on the queue side also avoids the need for a second config file, which addresses the concern about breaking batch-gateway PR #458.
There was a problem hiding this comment.
Workers should be roughly grouped by endpoint identity. If a backend endpoint hangs, it should only block workers in the same pool.
You could have two queues with different default endpoints sharing the same worker pool - the workers don't care, they just dispatch whatever URL the merge policy resolved.
Is there a use case for this that cannot be achieved via queue gates and (future) pool gates? Why not just create multiple pools and use gates to ensure some workers remain idle? Or implement global concurrency limit to be shared between all workers? IMO The global concurrency limit is a bad idea because it introduces back the problems pool topology is trying to fix.
In the future we could add support for overflow pools could be configured to pull extra capacity from queues if they have extra capacity.
There was a problem hiding this comment.
@jtechapps Interesting points.
I think that in frequent deployments there would be multiple queues dedicated per model (e.g. each queue is associated with a different workload class), and each request will carry its own router URL and API path. The pool's or queue's base+path url defaults would be typically overridden by the request-level routing.
If we couple routing to the pool, it means either:
- One pool per endpoint combination — which requires arbitrarily many pools, and forces reconfig and redeploy of async-processor whenever a backend adds or changes endpoints. In other words, this adds sensitivity in async-processor - to an aspect that it shouldn't be sensitive to.
- Or the pool's routing fields are dead config that's always overridden.
The pool should be a generic worker isolation primitive — admins decide the granularity of pools (per-model, per-queue, per-workload-class, etc.). Bundling routing into the pool config forces a specific topology (one endpoint per isolation group), making a decision that should be the admin's decision.
I think that default routing belongs on the queue, which is typically associated with a specific model and workload class. The pool should be pure worker isolation.
// pool config
{"id": "priority-pool", "workers": 8}
// queue config
{"queue_name": "model-a-critical-workload", "pool_id": "priority-pool",
"igw_base_url": "http://router:8080", "request_path_url": "/v1/chat/completions"}
This also avoids needing a second config file (which adds complexity). We can use a single config file — pool definitions ({id, workers}) can be an optional worker_pools section in the existing queue config. And since the pool is purely about worker isolation, naming it "worker pool" would avoid confusion with the K8s InferencePool CRD.
add a new pool object for grouping and isolating backends each pool configures an inference backend and has its own workers. when one pool is saturated it does not block processing other requests on different pools. if queue config file is specified, pool config must also be specified. each queue in the config must now specify the pool id that should handle requests from the queue. if a queue config is not specified and only the command line parameters are provided, a single "default" pool will be created. multiple queues can specify the same pool id. the merge policy has been changed to merge channels per pool. round robin merge policy will perform round robin balancing for all the queues sharing the same pool id. TESTED=unit tests + e2e pub/sub queue with llm-d simulator Signed-off-by: Jacob Murry <jacobmurry@google.com>
|
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation. |
add metric field to existing metrics for grouping by pool name. Signed-off-by: Jacob Murry <jacobmurry@google.com>
worker pools only defines concurrency and grouping. Move endpoint and request fields back to queues. make pool config optional. if it's not specified create a "default" worker pool with the global concurrency level. This makes configurations backwards compatible. Signed-off-by: Jacob Murry <jacobmurry@google.com>
Signed-off-by: Jacob Murry <jacobmurry@google.com>
update helm chart to support: 1. pub/sub topic queues config for specifying multiple queues 2. worker pools config for specifying worker pools Signed-off-by: Jacob Murry <jacobmurry@google.com>
Signed-off-by: Jacob Murry <jacobmurry@google.com>
|
I kept the worker pool config and queue configs as separate files for now to preserve backwards compatibility because topics/queues are stored as top level JSON array. The worker pool config would also have to be loaded in each queue implementation. |
instead of silently overriding the pool id in queue config when worker pool config is omitted. Signed-off-by: Jacob Murry <jacobmurry@google.com>
What does this PR do?
Inference pool isolation feature from #186
This PR adds a new pool object for grouping and isolating backends.
If queue config file is specified, pool config must also be specified.
Each queue in the queue config must now specify pool id that should handle requests from the queue.
If a queue config is not specified and only the command line parameters are provided, a single "default" pool will be created.
Multiple queues can specify the same pool id.
The merge policy has been changed to merge channels per pool.
Round robin merge policy will perform round robin balancing for all the queues sharing the same pool id.
Why is this change needed?
Each pool configures an inference backend and has its own workers. when one pool is saturated it does not block processing other requests on different pools.
How was this tested?
I tested by running async-processor locally against llm-d-simulator instances running on the same host.
I ran a slow and fast llm-d-simulator:
./bin/llm-d-inference-sim --model fake-model --port 8000 --time-to-first-token=1s./bin/llm-d-inference-sim --model fake-model --port 8001 --time-to-first-token=120sI configured two pools with one worker each and two pub/sub queues.
Pool config:
Pub/sub queue config:
Run async-processor with minimal configuration:
I added requests to each queue and verified that fast pool was not blocked by stuck workers of the slow pool.
Add requests to queue:
Pull results:
Checklist
git commit -s) per DCOmake test)make lint)Related Issues
#205