Introduce Pool Topology by jtechapps · Pull Request #222 · llm-d-incubation/llm-d-async

jtechapps · 2026-06-04T06:29:24Z

What does this PR do?

Inference pool isolation feature from #186

This PR adds a new pool object for grouping and isolating backends.

If queue config file is specified, pool config must also be specified.
Each queue in the queue config must now specify pool id that should handle requests from the queue.

If a queue config is not specified and only the command line parameters are provided, a single "default" pool will be created.

Multiple queues can specify the same pool id.
The merge policy has been changed to merge channels per pool.
Round robin merge policy will perform round robin balancing for all the queues sharing the same pool id.

Why is this change needed?

Each pool configures an inference backend and has its own workers. when one pool is saturated it does not block processing other requests on different pools.

How was this tested?

Unit tests added/updated
Integration/e2e tests added/updated
Manual testing performed

I tested by running async-processor locally against llm-d-simulator instances running on the same host.

I ran a slow and fast llm-d-simulator:

fast ./bin/llm-d-inference-sim --model fake-model --port 8000 --time-to-first-token=1s
slow ./bin/llm-d-inference-sim --model fake-model --port 8001 --time-to-first-token=120s

I configured two pools with one worker each and two pub/sub queues.

Pool config:

[
  {
    "id": "fast-pool",
    "igw_base_url": "http://localhost:8000/",
    "request_path_url": "/v1/completions",
    "workers": 1
  },
  {
    "id": "slow-pool",
    "igw_base_url": "http://localhost:8000/",
    "request_path_url": "/v1/completions"
    "workers": 1
  }
]

Pub/sub queue config:

[
  {
    "pool_id": "fast-pool",
    "subscriber_id": "projects/${PROJECT}/subscriptions/inference-requests1",
    "gate_type": "constant",
    "gate_params": {}
  },
  {
    "pool_id": "slow-pool",
    "subscriber_id": "projects/${PROJECT}/subscriptions/inference-requests2",
    "gate_type": "constant",
    "gate_params": {}
  }
]

Run async-processor with minimal configuration:

go run cmd/main.go \
  --message-queue-impl=gcp-pubsub \
  --pubsub.project-id=${PROJECT} \
  --pubsub.result-topic-id=llm-d-async-1-results-1 \
  --pubsub.topics-config-file=pubsub1-config.json \
  --pool-config-file=pool-config1.json \
  --metrics-endpoint-auth=false \
  --metrics-port=19090

I added requests to each queue and verified that fast pool was not blocked by stuck workers of the slow pool.

Add requests to queue:

# fast queue
gcloud pubsub topics publish llm-d-async-1 \
 --project=${PROJECT} \
 --message='{"id":"q1-req1","created":1700000000,"deadline":2000000000,"endpoint":"/v1/completions","payload":{"model":"fake-model","prompt":"How are you today?"}}'

# slow queue
gcloud pubsub topics publish llm-d-async-2 \
 --project=${PROJECT} \
 --message='{"id":"q2-req1","created":1700000000,"deadline":2000000000,"endpoint":"/v1/completions","payload":{"model":"fake-model","prompt":"What is a dog?"}}'

Pull results:

gcloud pubsub subscriptions pull projects/${PROJECT}/subscriptions/llm-d-async-1-results-1-sub \
  --project=${PROJECT} \
  --limit=1  --auto-ack

Checklist

Commits are signed off (git commit -s) per DCO
Code follows project contributing guidelines
Tests pass locally (make test)
Linters pass (make lint)
Documentation updated (if applicable)

Related Issues

#205

evacchi

left a few comments, especially around config

evacchi · 2026-06-09T08:15:47Z

+	if hasQueueConfig != hasPoolConfig {
+		setupLog.Error(fmt.Errorf("both pool-config-file and queues/topics config file must be specified together, or both must be omitted"), "Configuration error")
+		os.Exit(1)
+	}


this is bit too restrictive, and also I am wondering why we need 2 separate config files instead of one with two sections; for instance (even though it's not necessarily a deal breaker) it would break this PR llm-d/llm-d-batch-gateway#458

maybe let's split the config file into 2 sections, and let's make sure that it's clear that "pool" here means worker pool (maybe we should use that name in structs/interfaces as well), because otherwise it can be easily mistaken for inference pool!

pool in this case is actually closer to inference pool than worker pool. If IGW is being used then the pool config will be roughly mapped with inference pools.

right, I'm just saying that the term "pool" is overloaded here even if there was a 1:1 mapping; so I'd advise being explicit

evacchi · 2026-06-09T08:27:21Z

+type PoolConfig struct {
+	ID             string            `json:"id"`
+	IGWBaseURL     string            `json:"igw_base_url"`
+	RequestPathURL string            `json:"request_path_url"`
+	Workers        int               `json:"workers"`
+	HTTPHeaders    map[string]string `json:"http_headers,omitempty"`
+}


I am not 100% convinced about bundling routing fields to the worker config.

If it's a "worker pool", should the routing config stay on the queue side (where it was before) and the worker pool only define concurrency + grouping?

it would make sense if the worker was tied to a specific endpoint, but routing is handled at the message level (each message carries its own RequestURL and HttpHeaders)

Otherwise you could keep the queue config:

{ "queue_name": "critical-queue", "worker_pool_id": "fast-pool", "igw_base_url": "http://fast-model:8000", "request_path_url": "/v1/completions", "gate_type": "constant" }

and then have an optional worker_pools section:

{ "id": "fast-pool", "workers": 4 }

the default fallback policy could be either 1 single pool for all queues (possibly suboptimal) or one pool per queue (maybe wasteful).

The routing config should be in the pool config. This is intentional because when multiple queues are specified and they are using the same endpoint they should specify the same pool because the queues are saturating the same endpoint.

the default fallback policy could be either 1 single pool for all queues (possibly suboptimal) or one pool per queue (maybe wasteful).

When specifying a single queue via command line then we fall back to creating a single pool. This is backwards compatible as that queue will get all workers.

If multiple queues are used we should force pools to be used otherwise the fallback logic becomes complicated and we can't make assumptions about what the user wants. The current logic will error if this is the case.

it would make sense if the worker was tied to a specific endpoint, but routing is handled at the message level (each message carries its own RequestURL and HttpHeaders)

This makes more sense than putting the endpoint in the queue itself. Because saturation (and per-pool gating in the future) happens independently at the endpoint level. Different endpoints should always be in different pools because saturating one endpoint does not impact a different endpoint.

The pool's purpose is worker isolation and concurrency control (don't let one saturated group block another). I think that routing (where to send) is a separate concern.

The Worker function is a generic HTTP dispatcher - it sends to whatever msg.RequestURL it receives, regardless of pool. The routing default (base URL, path) logically belongs on the queue, where the request enters the system, with the request's own endpoint always taking precedence.

The argument that - two queues sharing a pool must share the same endpoint - looks to conflate worker grouping with endpoint identity. You could have two queues with different default endpoints sharing the same worker pool - the workers don't care, they just dispatch whatever URL the merge policy resolved.

Keeping the pool config to {id, workers} and routing on the queue side also avoids the need for a second config file, which addresses the concern about breaking batch-gateway PR #458.

Workers should be roughly grouped by endpoint identity. If a backend endpoint hangs, it should only block workers in the same pool.

You could have two queues with different default endpoints sharing the same worker pool - the workers don't care, they just dispatch whatever URL the merge policy resolved.

Is there a use case for this that cannot be achieved via queue gates and (future) pool gates? Why not just create multiple pools and use gates to ensure some workers remain idle? Or implement global concurrency limit to be shared between all workers? IMO The global concurrency limit is a bad idea because it introduces back the problems pool topology is trying to fix.

In the future we could add support for overflow pools could be configured to pull extra capacity from queues if they have extra capacity.

@jtechapps Interesting points.

I think that in frequent deployments there would be multiple queues dedicated per model (e.g. each queue is associated with a different workload class), and each request will carry its own router URL and API path. The pool's or queue's base+path url defaults would be typically overridden by the request-level routing.

If we couple routing to the pool, it means either:

One pool per endpoint combination — which requires arbitrarily many pools, and forces reconfig and redeploy of async-processor whenever a backend adds or changes endpoints. In other words, this adds sensitivity in async-processor - to an aspect that it shouldn't be sensitive to.

Or the pool's routing fields are dead config that's always overridden.

The pool should be a generic worker isolation primitive — admins decide the granularity of pools (per-model, per-queue, per-workload-class, etc.). Bundling routing into the pool config forces a specific topology (one endpoint per isolation group), making a decision that should be the admin's decision.

I think that default routing belongs on the queue, which is typically associated with a specific model and workload class. The pool should be pure worker isolation.

// pool config {"id": "priority-pool", "workers": 8} // queue config {"queue_name": "model-a-critical-workload", "pool_id": "priority-pool", "igw_base_url": "http://router:8080", "request_path_url": "/v1/chat/completions"}

This also avoids needing a second config file (which adds complexity). We can use a single config file — pool definitions ({id, workers}) can be an optional worker_pools section in the existing queue config. And since the pool is purely about worker isolation, naming it "worker pool" would avoid confusion with the K8s InferencePool CRD.

add a new pool object for grouping and isolating backends each pool configures an inference backend and has its own workers. when one pool is saturated it does not block processing other requests on different pools. if queue config file is specified, pool config must also be specified. each queue in the config must now specify the pool id that should handle requests from the queue. if a queue config is not specified and only the command line parameters are provided, a single "default" pool will be created. multiple queues can specify the same pool id. the merge policy has been changed to merge channels per pool. round robin merge policy will perform round robin balancing for all the queues sharing the same pool id. TESTED=unit tests + e2e pub/sub queue with llm-d simulator Signed-off-by: Jacob Murry <jacobmurry@google.com>

github-actions · 2026-06-11T02:56:07Z

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

add metric field to existing metrics for grouping by pool name. Signed-off-by: Jacob Murry <jacobmurry@google.com>

worker pools only defines concurrency and grouping. Move endpoint and request fields back to queues. make pool config optional. if it's not specified create a "default" worker pool with the global concurrency level. This makes configurations backwards compatible. Signed-off-by: Jacob Murry <jacobmurry@google.com>

Signed-off-by: Jacob Murry <jacobmurry@google.com>

update helm chart to support: 1. pub/sub topic queues config for specifying multiple queues 2. worker pools config for specifying worker pools Signed-off-by: Jacob Murry <jacobmurry@google.com>

Signed-off-by: Jacob Murry <jacobmurry@google.com>

jtechapps · 2026-06-15T23:35:48Z

I kept the worker pool config and queue configs as separate files for now to preserve backwards compatibility because topics/queues are stored as top level JSON array. The worker pool config would also have to be loaded in each queue implementation.

jtechapps · 2026-06-15T23:36:02Z

This also fixes #250

Fixes: #250

instead of silently overriding the pool id in queue config when worker pool config is omitted. Signed-off-by: Jacob Murry <jacobmurry@google.com>

jtechapps force-pushed the pooltopology branch 2 times, most recently from 4d73b0a to d362216 Compare June 4, 2026 07:30

shimib reviewed Jun 4, 2026

View reviewed changes

Comment thread pkg/async/random_robin_policy.go

jtechapps force-pushed the pooltopology branch 2 times, most recently from 7a72081 to b3a6f49 Compare June 5, 2026 18:54

jtechapps marked this pull request as ready for review June 5, 2026 19:10

jtechapps requested review from RishabhSaini and ahg-g as code owners June 5, 2026 19:10

evacchi reviewed Jun 9, 2026

View reviewed changes

lioraron reviewed Jun 9, 2026

View reviewed changes

Comment thread pipeline/pool.go Outdated

Comment thread pkg/redis/redisimpl.go Outdated

Comment thread cmd/main.go Outdated

lioraron reviewed Jun 9, 2026

View reviewed changes

Comment thread pkg/pubsub/pubsubimpl_test.go Outdated

jtechapps force-pushed the pooltopology branch from b3a6f49 to 20d4daa Compare June 11, 2026 02:55

jtechapps added 2 commits June 15, 2026 17:06

Add metric field for pool name

575dfc6

add metric field to existing metrics for grouping by pool name. Signed-off-by: Jacob Murry <jacobmurry@google.com>

jtechapps force-pushed the pooltopology branch from 20d4daa to 1f7f50f Compare June 15, 2026 18:41

Merge branch 'main' into pooltopology

39dd854

Signed-off-by: Jacob Murry <jacobmurry@google.com>

jtechapps force-pushed the pooltopology branch from 86ba652 to 39dd854 Compare June 15, 2026 20:55

[Worker Pools] Update helm chart

71fff3c

update helm chart to support: 1. pub/sub topic queues config for specifying multiple queues 2. worker pools config for specifying worker pools Signed-off-by: Jacob Murry <jacobmurry@google.com>

jtechapps force-pushed the pooltopology branch from dd1bebd to 71fff3c Compare June 15, 2026 23:21

[Worker Pools] Fix queue stats

9e17677

Signed-off-by: Jacob Murry <jacobmurry@google.com>

[Worker Pool] Fix worker pool queue validation

7264588

instead of silently overriding the pool id in queue config when worker pool config is omitted. Signed-off-by: Jacob Murry <jacobmurry@google.com>

shimib approved these changes Jun 16, 2026

View reviewed changes

shimib merged commit 6b8783c into llm-d-incubation:main Jun 16, 2026
10 checks passed

jtechapps mentioned this pull request Jun 16, 2026

[Bug]: Add pub/sub multi-queue config to helm #250

Closed

Conversation

jtechapps commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Why is this change needed?

How was this tested?

Checklist

Related Issues

Uh oh!

Uh oh!

evacchi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

jtechapps commented Jun 15, 2026

Uh oh!

jtechapps commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jtechapps commented Jun 4, 2026 •

edited

Loading