chore(engine)!: Share worker threads across all scheduler connections #20229

rfratto · 2025-12-12T22:18:12Z

Note

I split this up across multiple commits; it's probably a bit easier to review commit-by-commit.

Previously, worker threads would be assigned to a random connected scheduler, with that assignment sticking until:

* That thread completes a task from the assigned scheduler, or
* the scheduler disconnects.

This is fine if there's only one scheduler, but it causes issues when there are multiple schedulers:

* Idle schedulers may "hog" a thread that could be used by a busy scheduler.
* While the random selection is even over time, the distribution of threads at any given instant is not. In the worst case, some schedulers may have zero threads assigned to them.

This is a particular problem because the total compute capacity per scheduler decreases as the number of schedulers goes up, even if queries are distributed amongst the schedulers.

This PR fixes this issue by allowing any scheduler to give a task to any ready worker thread. This permits scaling schedulers independently of the number of workers without risking saturation.

Details

A new message type, WorkerSubscribe, is introduced. This message is sent by a scheduler to a worker, asking the worker to send a WorkerReady when there is at least one ready thread.

WorkerSubscribe is sent by the scheduler when a worker connects, and again after a worker rejects a task assignment with an HTTP 429.

A new mechanism, jobManager, is used by the worker to bridge the connection to schedulers and running worker threads. It is implemented as a cancellable condition variable.

BREAKING CHANGE: Workers now expect schedulers to send WorkerSubscribe before any WorkerReady message is sent.

This adds a new message, WorkerSubscribe, used by the scheduler to explicitly request a notification when a worker has at least one worker thread available.

Workers that receive a WorkerSubscribe message *must* send a WorkerReady to the subscribed scheduler the next time a worker thread is available, or immediately if a worker thread is already available. Sending a WorkerReady message in response to a WorkerSubscribe should clear the subscription to reduce message noise.

Introducing this message is backwards-compatible, as schedulers can still receive WorkerReady messages without a subscription.

This commit only updates the scheduler to send the message, but workers are not yet updated to act on it.

Previously, worker threads would be assigned to a random connected scheduler, with that assignment sticking until:

* That thread completes a task from the assigned scheduler, or
* the scheduler disconnects.

This is fine if there's only one scheduler, but it causes issues when there are multiple schedulers:

* Idle schedulers may "hog" a thread that could be used by a busy scheduler.
* While the random selection is even over time, the distribution of threads at any given instant is not. In the worst case, some schedulers may have zero threads assigned to them.

This is a particular problem because the total compute capacity per scheduler decreases as the number of schedulers goes up, even if queries are distributed amongst the schedulers.

This commit fixes this issue by allowing any scheduler to give a task to any ready worker thread. This permits scaling schedulers independently of the number of workers without risking saturation.

Details
-------

A new message type, WorkerSubscribe, is introduced. This message is sent by a scheduler to a worker, asking the worker to send a WorkerReady when there is at least one ready thread.

WorkerSubscribe is sent by the scheduler when a worker connects, and again after a worker rejects a task assignment with an HTTP 429.

A new mechanism, `jobManager`, is used by the worker to bridge the connection to schedulers and running worker threads. It is implemented as a cancellable condition variable.

BREAKING CHANGE: Workers now expect schedulers to send WorkerSubscribe before any WorkerReady message is sent.

Signed-off-by: Robert Fratto <[email protected]>

Adds basic worker metrics to track state:

* `loki_engine_worker_tasks_assigned_total` reports the total number of successfully assigned tasks.
* `loki_engine_worker_task_exec_seconds` reports a histogram of task execution time, making it possible to drill down to task time at the worker level to detect symptoms of CPU saturation.
* `loki_engine_worker_threads` reports the number of worker threads, by state (idle, ready, busy).

`loki_engine_worker_threads` can be used in combination with the scheduler load metric to compute scheduler saturation on the fly.

Signed-off-by: Robert Fratto <[email protected]>

trevorwhitney

this looks pretty good to me. I have one question about naming, but other than that (and especially since we've already tested it) I'm happy to ✅. I'm commenting for now to give others who have been more involved with scheduling a chance to review.

trevorwhitney · 2025-12-12T23:28:19Z

pkg/engine/internal/proto/wirepb/wirepb.proto

+// from workers once they have at least one worker thread available.
+//
+// The subscription is cleared once the next WorkerReadyMessage is sent.
+message WorkerSubscribeMessage {}


do we want to capture the flow of messages in their name? for example, WorkerReady and WorkerHello are from workers, but WorkerSubscribe is now to the workers. Do we want the name to capture that this is from the Scheduler asking the worker for a subscription, and not from the Worker asking to subscribe? I'm thinking something like SchedulerReadyForSubscription or SchedulerPing?

trevorwhitney · 2025-12-12T23:29:59Z

pkg/engine/internal/scheduler/scheduler.go

+	}
+
+	// Request to be notified when the worker is ready.
+	s.workerSubscribe(ctx, worker)


in which case this would become s.readyForSubscription?

rfratto requested a review from a team as a code owner December 12, 2025 22:18

pull-request-size bot added the size/XL label Dec 12, 2025

rfratto added 2 commits December 12, 2025 17:35

rfratto force-pushed the scheduler-improve-worker-thread-distribution branch from d432232 to 31ee43c Compare December 12, 2025 22:35

trevorwhitney reviewed Dec 12, 2025
