Poller scaling #874
Conversation
Force-pushed from 7b4e70c to 14a75ad
Force-pushed from d70715f to 7be4134
cretz left a comment
LGTM, but I didn't get into the details. I think an English description of what's happening could help; I tried to guess at one in the comments. But since the default behavior is unchanged for users, no problem.
I do think we should consider holding off (or at least holding off lang-side) until a server is released that can even take advantage of this. We will want some end-to-end automated test somewhere that just proves poller auto-scaling works (often we can just do this in lang with a smoke-test kind of integration test, just to confirm).
```rust
Some(tokio::task::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_millis(100));
    loop {
        tokio::select! {
            _ = interval.tick() => {}
            _ = shutdown.cancelled() => { break; }
        }
        let ingested = rhc.ingested_this_period.swap(0, Ordering::Relaxed);
        let ingested_last = rhc.ingested_last_period.swap(ingested, Ordering::Relaxed);
        rhc.scale_up_allowed
            .store(ingested_last >= ingested, Ordering::Relaxed);
    }
}))
```
So to confirm, the algorithm is:
The server returns a number of pollers to scale up by (or down, if negative) on each poll response, and the SDK respects that decision so long as it stays within the min/max bounds and, in the case of a scale-up, at least as many polls were accepted during the last 100ms period as during this one?
So if I accepted a poll 50ms ago that told me to scale up but my last poll response was 250ms ago, I would not scale up? (because ingested_last is 0 and ingested is 1)
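For my own understanding, here's roughly how I picture that being applied; all of these names (`PollerScalingDecision`, `poll_request_delta_suggestion`, `apply_decision`, etc.) are placeholders rather than the actual types in this PR:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Placeholder for the scaling hint the server attaches to a poll response.
struct PollerScalingDecision {
    /// Positive = add pollers, negative = remove pollers.
    poll_request_delta_suggestion: i32,
}

/// Rough sketch: apply the server's suggestion, clamped to the configured
/// min/max, and gate scale-ups on the "ingestion kept up last period" flag.
fn apply_decision(
    decision: &PollerScalingDecision,
    target_pollers: &AtomicUsize,
    scale_up_allowed: &AtomicBool,
    min_pollers: usize,
    max_pollers: usize,
) {
    let delta = decision.poll_request_delta_suggestion;
    if delta > 0 && !scale_up_allowed.load(Ordering::Relaxed) {
        // Skip scale-ups while the ingestion check says no.
        return;
    }
    let current = target_pollers.load(Ordering::Relaxed) as i64;
    let new_target =
        (current + i64::from(delta)).clamp(min_pollers as i64, max_pollers as i64);
    target_pollers.store(new_target as usize, Ordering::Relaxed);
}
```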
I'm glad you looked at this again, because as written it sort of obviously makes no sense. I was wondering why my testing wasn't showing good restriction of overshooting like I know it did at one point. I had tried a bazillion different methods, but I knew this one worked and was simple, so I went back to it -- but the results were inconsistent.
Somehow this comparison got flipped at some point. I've changed it back, and the results are more consistent again.
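In other words, the gate should effectively read like this (modulo exact variable names, same atomics as in the snippet above):

```rust
// Allow a scale-up only while this period ingested at least as many polls as
// the previous one, i.e. ingestion is keeping pace rather than dropping off.
rhc.scale_up_allowed
    .store(ingested >= ingested_last, Ordering::Relaxed);
```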
Honestly though, I'm tempted to get rid of this entirely. My tests show it can still be "defeated" semi-often by rapid scale-ups -- ingestion is indeed going up in those cases, so there's not really any reason to say no -- and then you end up with a bunch of polls sitting there once the backlog clears.
Smoothing out the calculation so that the shorter timeout gets set more reliably might have more value.
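Something like an exponential moving average over the per-period counts is the kind of smoothing I have in mind -- purely illustrative, not what's in this PR, and the `alpha`/period values are arbitrary:

```rust
/// Illustrative only: smooth the per-period ingestion counts with an
/// exponential moving average so a single noisy 100ms bucket doesn't
/// flip the scale-up gate on its own.
struct IngestionSmoother {
    ema: f64,
    alpha: f64,
}

impl IngestionSmoother {
    fn new(alpha: f64) -> Self {
        Self { ema: 0.0, alpha }
    }

    /// Feed the latest period's count; returns true if ingestion is holding
    /// steady or rising relative to the smoothed history.
    fn observe(&mut self, ingested_this_period: u64) -> bool {
        let x = ingested_this_period as f64;
        let allow_scale_up = x >= self.ema;
        self.ema = self.alpha * x + (1.0 - self.alpha) * self.ema;
        allow_scale_up
    }
}
```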
Agreed. I am always a bit wary when I see hardcoded time expectations like 100ms unless it's a good number arrived at by testing or something (which maybe it is). Maybe a more naive "do not make poller count changes more frequently than X interval" rule could help, but I don't understand the details and haven't run the tests.
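By "more naive" I mean something along these lines -- a rough sketch with made-up names, untested:

```rust
use std::time::{Duration, Instant};

/// Hypothetical guard that refuses to change the poller count more often
/// than `min_interval`, regardless of what the server suggests.
struct ScaleCooldown {
    last_change: Option<Instant>,
    min_interval: Duration,
}

impl ScaleCooldown {
    fn new(min_interval: Duration) -> Self {
        Self { last_change: None, min_interval }
    }

    /// Returns true (and records the change) if enough time has passed since
    /// the last poller-count adjustment; otherwise the caller skips the change.
    fn try_change(&mut self) -> bool {
        let now = Instant::now();
        match self.last_change {
            Some(prev) if now.duration_since(prev) < self.min_interval => false,
            _ => {
                self.last_change = Some(now);
                true
            }
        }
    }
}
```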
Yeah, I'm happy to wait for a server release so that we have an integ test. I'll add a high-level prose description of the approach somewhere too.
Force-pushed from c854037 to 1fc55b0
Add sustained load test
lower short-timeout threshold
Force-pushed from 1fc55b0 to 96db34d
I intend to add a few unit tests too, and some simpler integ tests once the server is actually released with this. This PR will have to sit and wait for a server release that includes the changes from temporalio/temporal#7300.
What was changed
Read and handle poller scaling decisions from server
Why?
Part of the worker management effort to simplify configuration of workers for users.
Checklist
Closes
How was this tested:
Big manual tests + integ tests (to come)
Any docs updates needed?
Doc updates will come as part of adding this to all SDKs.