Skip to content

Introduce Pool Topology#222

Merged
shimib merged 7 commits into
llm-d-incubation:mainfrom
jtechapps:pooltopology
Jun 16, 2026
Merged

Introduce Pool Topology#222
shimib merged 7 commits into
llm-d-incubation:mainfrom
jtechapps:pooltopology

Conversation

@jtechapps

@jtechapps jtechapps commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Inference pool isolation feature from #186

This PR adds a new pool object for grouping and isolating backends.

If queue config file is specified, pool config must also be specified.
Each queue in the queue config must now specify pool id that should handle requests from the queue.

If a queue config is not specified and only the command line parameters are provided, a single "default" pool will be created.

Multiple queues can specify the same pool id.
The merge policy has been changed to merge channels per pool.
Round robin merge policy will perform round robin balancing for all the queues sharing the same pool id.

Why is this change needed?

Each pool configures an inference backend and has its own workers. when one pool is saturated it does not block processing other requests on different pools.

How was this tested?

  • Unit tests added/updated
  • Integration/e2e tests added/updated
  • Manual testing performed

I tested by running async-processor locally against llm-d-simulator instances running on the same host.

I ran a slow and fast llm-d-simulator:

  1. fast ./bin/llm-d-inference-sim --model fake-model --port 8000 --time-to-first-token=1s
  2. slow ./bin/llm-d-inference-sim --model fake-model --port 8001 --time-to-first-token=120s

I configured two pools with one worker each and two pub/sub queues.

Pool config:

[
  {
    "id": "fast-pool",
    "igw_base_url": "http://localhost:8000/",
    "request_path_url": "/v1/completions",
    "workers": 1
  },
  {
    "id": "slow-pool",
    "igw_base_url": "http://localhost:8000/",
    "request_path_url": "/v1/completions"
    "workers": 1
  }
]

Pub/sub queue config:

[
  {
    "pool_id": "fast-pool",
    "subscriber_id": "projects/${PROJECT}/subscriptions/inference-requests1",
    "gate_type": "constant",
    "gate_params": {}
  },
  {
    "pool_id": "slow-pool",
    "subscriber_id": "projects/${PROJECT}/subscriptions/inference-requests2",
    "gate_type": "constant",
    "gate_params": {}
  }
]

Run async-processor with minimal configuration:

go run cmd/main.go \
  --message-queue-impl=gcp-pubsub \
  --pubsub.project-id=${PROJECT} \
  --pubsub.result-topic-id=llm-d-async-1-results-1 \
  --pubsub.topics-config-file=pubsub1-config.json \
  --pool-config-file=pool-config1.json \
  --metrics-endpoint-auth=false \
  --metrics-port=19090

I added requests to each queue and verified that fast pool was not blocked by stuck workers of the slow pool.

Add requests to queue:

# fast queue
gcloud pubsub topics publish llm-d-async-1 \
 --project=${PROJECT} \
 --message='{"id":"q1-req1","created":1700000000,"deadline":2000000000,"endpoint":"/v1/completions","payload":{"model":"fake-model","prompt":"How are you today?"}}'

# slow queue
gcloud pubsub topics publish llm-d-async-2 \
 --project=${PROJECT} \
 --message='{"id":"q2-req1","created":1700000000,"deadline":2000000000,"endpoint":"/v1/completions","payload":{"model":"fake-model","prompt":"What is a dog?"}}'

Pull results:

gcloud pubsub subscriptions pull projects/${PROJECT}/subscriptions/llm-d-async-1-results-1-sub \
  --project=${PROJECT} \
  --limit=1  --auto-ack

Checklist

  • Commits are signed off (git commit -s) per DCO
  • Code follows project contributing guidelines
  • Tests pass locally (make test)
  • Linters pass (make lint)
  • Documentation updated (if applicable)

Related Issues

#205

@jtechapps jtechapps force-pushed the pooltopology branch 2 times, most recently from 4d73b0a to d362216 Compare June 4, 2026 07:30
Comment thread pkg/async/random_robin_policy.go
@jtechapps jtechapps force-pushed the pooltopology branch 2 times, most recently from 7a72081 to b3a6f49 Compare June 5, 2026 18:54
@jtechapps jtechapps marked this pull request as ready for review June 5, 2026 19:10

@evacchi evacchi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a few comments, especially around config

Comment thread cmd/main.go
Comment thread cmd/main.go Outdated
Comment on lines +111 to +114
if hasQueueConfig != hasPoolConfig {
setupLog.Error(fmt.Errorf("both pool-config-file and queues/topics config file must be specified together, or both must be omitted"), "Configuration error")
os.Exit(1)
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is bit too restrictive, and also I am wondering why we need 2 separate config files instead of one with two sections; for instance (even though it's not necessarily a deal breaker) it would break this PR llm-d/llm-d-batch-gateway#458

maybe let's split the config file into 2 sections, and let's make sure that it's clear that "pool" here means worker pool (maybe we should use that name in structs/interfaces as well), because otherwise it can be easily mistaken for inference pool!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pool in this case is actually closer to inference pool than worker pool. If IGW is being used then the pool config will be roughly mapped with inference pools.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I'm just saying that the term "pool" is overloaded here even if there was a 1:1 mapping; so I'd advise being explicit

Comment thread pipeline/pool.go Outdated
Comment on lines +23 to +29
type PoolConfig struct {
ID string `json:"id"`
IGWBaseURL string `json:"igw_base_url"`
RequestPathURL string `json:"request_path_url"`
Workers int `json:"workers"`
HTTPHeaders map[string]string `json:"http_headers,omitempty"`
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% convinced about bundling routing fields to the worker config.

If it's a "worker pool", should the routing config stay on the queue side (where it was before) and the worker pool only define concurrency + grouping?

it would make sense if the worker was tied to a specific endpoint, but routing is handled at the message level (each message carries its own RequestURL and HttpHeaders)

Otherwise you could keep the queue config:

  {
    "queue_name": "critical-queue",
    "worker_pool_id": "fast-pool",
    "igw_base_url": "http://fast-model:8000",
    "request_path_url": "/v1/completions",
    "gate_type": "constant"
  }

and then have an optional worker_pools section:

  {
    "id": "fast-pool",
    "workers": 4
  }

the default fallback policy could be either 1 single pool for all queues (possibly suboptimal) or one pool per queue (maybe wasteful).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The routing config should be in the pool config. This is intentional because when multiple queues are specified and they are using the same endpoint they should specify the same pool because the queues are saturating the same endpoint.

the default fallback policy could be either 1 single pool for all queues (possibly suboptimal) or one pool per queue (maybe wasteful).

When specifying a single queue via command line then we fall back to creating a single pool. This is backwards compatible as that queue will get all workers.

If multiple queues are used we should force pools to be used otherwise the fallback logic becomes complicated and we can't make assumptions about what the user wants. The current logic will error if this is the case.

it would make sense if the worker was tied to a specific endpoint, but routing is handled at the message level (each message carries its own RequestURL and HttpHeaders)

This makes more sense than putting the endpoint in the queue itself. Because saturation (and per-pool gating in the future) happens independently at the endpoint level. Different endpoints should always be in different pools because saturating one endpoint does not impact a different endpoint.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pool's purpose is worker isolation and concurrency control (don't let one saturated group block another). I think that routing (where to send) is a separate concern.

The Worker function is a generic HTTP dispatcher - it sends to whatever msg.RequestURL it receives, regardless of pool. The routing default (base URL, path) logically belongs on the queue, where the request enters the system, with the request's own endpoint always taking precedence.

The argument that - two queues sharing a pool must share the same endpoint - looks to conflate worker grouping with endpoint identity. You could have two queues with different default endpoints sharing the same worker pool - the workers don't care, they just dispatch whatever URL the merge policy resolved.

Keeping the pool config to {id, workers} and routing on the queue side also avoids the need for a second config file, which addresses the concern about breaking batch-gateway PR #458.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workers should be roughly grouped by endpoint identity. If a backend endpoint hangs, it should only block workers in the same pool.

You could have two queues with different default endpoints sharing the same worker pool - the workers don't care, they just dispatch whatever URL the merge policy resolved.

Is there a use case for this that cannot be achieved via queue gates and (future) pool gates? Why not just create multiple pools and use gates to ensure some workers remain idle? Or implement global concurrency limit to be shared between all workers? IMO The global concurrency limit is a bad idea because it introduces back the problems pool topology is trying to fix.

In the future we could add support for overflow pools could be configured to pull extra capacity from queues if they have extra capacity.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtechapps Interesting points.

I think that in frequent deployments there would be multiple queues dedicated per model (e.g. each queue is associated with a different workload class), and each request will carry its own router URL and API path. The pool's or queue's base+path url defaults would be typically overridden by the request-level routing.

If we couple routing to the pool, it means either:

  • One pool per endpoint combination — which requires arbitrarily many pools, and forces reconfig and redeploy of async-processor whenever a backend adds or changes endpoints. In other words, this adds sensitivity in async-processor - to an aspect that it shouldn't be sensitive to.
  • Or the pool's routing fields are dead config that's always overridden.

The pool should be a generic worker isolation primitive — admins decide the granularity of pools (per-model, per-queue, per-workload-class, etc.). Bundling routing into the pool config forces a specific topology (one endpoint per isolation group), making a decision that should be the admin's decision.

I think that default routing belongs on the queue, which is typically associated with a specific model and workload class. The pool should be pure worker isolation.

// pool config
{"id": "priority-pool", "workers": 8}

// queue config
{"queue_name": "model-a-critical-workload", "pool_id": "priority-pool",
 "igw_base_url": "http://router:8080", "request_path_url": "/v1/chat/completions"}

This also avoids needing a second config file (which adds complexity). We can use a single config file — pool definitions ({id, workers}) can be an optional worker_pools section in the existing queue config. And since the pool is purely about worker isolation, naming it "worker pool" would avoid confusion with the K8s InferencePool CRD.

Comment thread pkg/redis/redisimpl.go Outdated
Comment thread pkg/async/random_robin_policy.go Outdated
Comment thread cmd/main.go Outdated
Comment thread pipeline/pool.go Outdated
Comment thread pkg/redis/redisimpl.go Outdated
Comment thread cmd/main.go Outdated
Comment thread pkg/pubsub/pubsubimpl_test.go Outdated
add a new pool object for grouping and isolating backends

each pool configures an inference backend and has its own workers.
when one pool is saturated it does not block processing other requests
on different pools.

if queue config file is specified, pool config must also be specified.
each queue in the config must now specify the pool id that should
handle requests from the queue.

if a queue config is not specified and only the command line parameters
are provided, a single "default" pool will be created.

multiple queues can specify the same pool id.
the merge policy has been changed to merge channels per pool.
round robin merge policy will perform round robin balancing for all the
queues sharing the same pool id.

TESTED=unit tests + e2e pub/sub queue with llm-d simulator

Signed-off-by: Jacob Murry <jacobmurry@google.com>
@github-actions

Copy link
Copy Markdown

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

add metric field to existing metrics for grouping by pool name.

Signed-off-by: Jacob Murry <jacobmurry@google.com>
worker pools only defines concurrency and grouping. Move endpoint and
request fields back to queues.

make pool config optional. if it's not specified create a "default"
worker pool with the global concurrency level. This makes configurations
backwards compatible.

Signed-off-by: Jacob Murry <jacobmurry@google.com>
Signed-off-by: Jacob Murry <jacobmurry@google.com>
update helm chart to support:
1. pub/sub topic queues config for specifying multiple queues
2. worker pools config for specifying worker pools

Signed-off-by: Jacob Murry <jacobmurry@google.com>
Signed-off-by: Jacob Murry <jacobmurry@google.com>
@jtechapps

Copy link
Copy Markdown
Contributor Author

I kept the worker pool config and queue configs as separate files for now to preserve backwards compatibility because topics/queues are stored as top level JSON array. The worker pool config would also have to be loaded in each queue implementation.

@jtechapps

Copy link
Copy Markdown
Contributor Author

This also fixes #250

Fixes: #250

instead of silently overriding the pool id in queue config when worker
pool config is omitted.

Signed-off-by: Jacob Murry <jacobmurry@google.com>
@shimib shimib merged commit 6b8783c into llm-d-incubation:main Jun 16, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants