Added SharedReplaceableClient and made worker client replaceable between retries #1000

Merged
maciejdudko merged 3 commits into temporalio:master from maciejdudko:replaceable-client
Sep 18, 2025

Conversation

Contributor

@maciejdudko commented Sep 4, 2025

What was changed

  • Added SharedReplaceableClient type, a client wrapper that allows replacing the underlying client post-creation. When wrapped inside RetryClient, it allows replacing the client between retries of the same RPC call.
  • Changed WorkerClientBag to use SharedReplaceableClient so that replacing the worker client works as expected for polling.
  • Extracted dev server options used by integration test runner into a function in test-utils to make it easier to spawn additional server instances inside tests.
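The first bullet can be illustrated with a minimal sketch. It is not the PR's implementation: `FakeClient`, `ReplaceableSlot`, and `poll_with_retry` are hypothetical names, and a plain `Arc<RwLock<_>>` stands in for the real wrapper. The point it shows is that a retry loop which re-reads the shared slot before every attempt picks up a replacement made between retries of the same call.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a real client: it only records an endpoint
/// and whether calls to that endpoint succeed.
#[derive(Clone, Debug)]
struct FakeClient {
    endpoint: String,
    healthy: bool,
}

impl FakeClient {
    fn poll(&self) -> Result<String, String> {
        if self.healthy {
            Ok(format!("task from {}", self.endpoint))
        } else {
            Err(format!("{} unreachable", self.endpoint))
        }
    }
}

/// Shared slot holding the current client; cloning the handle shares the slot.
#[derive(Clone)]
struct ReplaceableSlot {
    inner: Arc<RwLock<FakeClient>>,
}

impl ReplaceableSlot {
    fn new(client: FakeClient) -> Self {
        Self { inner: Arc::new(RwLock::new(client)) }
    }

    fn replace(&self, client: FakeClient) {
        *self.inner.write().unwrap() = client;
    }

    fn current(&self) -> FakeClient {
        self.inner.read().unwrap().clone()
    }
}

/// Retries `poll`, re-reading the slot before every attempt. `on_failure`
/// runs after each failed attempt and models someone replacing the client.
fn poll_with_retry(
    slot: &ReplaceableSlot,
    max_attempts: usize,
    mut on_failure: impl FnMut(usize),
) -> Result<String, String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match slot.current().poll() {
            Ok(task) => return Ok(task),
            Err(e) => {
                last_err = e;
                on_failure(attempt);
            }
        }
    }
    Err(last_err)
}

fn main() {
    let slot = ReplaceableSlot::new(FakeClient { endpoint: "old:7233".into(), healthy: false });
    let replacer = slot.clone();
    // The client is swapped after the second failed attempt; attempt three succeeds.
    let result = poll_with_retry(&slot, 5, |attempt| {
        if attempt == 2 {
            replacer.replace(FakeClient { endpoint: "new:7233".into(), healthy: true });
        }
    });
    println!("{:?}", result);
}
```

This sketch clones the client on every attempt for brevity. If the retry loop instead captured one clone up front, the replacement would never be observed, which is the failure mode described under "Why?".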

Why?

Fixes an issue where a worker could get stuck in a retry loop while polling, and replacing its client had no effect.

Checklist

  1. Closes [Bug] Replacing a client on a worker that is failing to poll doesn't start using the new client #976

  2. How was this tested:

  • Unit tests in client/src/replaceable.rs
  • Integration test replace_client_works_after_polling_failure in tests/integ_tests/polling_tests.rs

Comment thread client/src/replaceable.rs
Comment on lines +24 to +34
struct SharedClientData<C>
where
    C: Clone + Send + Sync,
{
    client: RwLock<C>,
    generation: AtomicU32,
}
Member

This seems more or less the same as https://docs.rs/arc-swap/latest/arc_swap/

Any reason not to just use that?

Contributor Author

We need both (A) mutable access to the obtained value and (B) to avoid unnecessary clones when the value hasn't changed. As far as I can tell, there's no easy way to do that with ArcSwap.

Member

@Sushisource Sep 11, 2025

You don't actually need mutable access though, no? The only place the write portion of the lock is used is when you're replacing the value, which is the swap part of arcswap. Either way, in fetch(), the entire client is cloned.

Contributor Author

I mean the mutable access to the local clone. In arc_swap, fetch-if-updated is implemented through arc_swap::cache::Cache type, and that type only provides immutable references to the inner value, which would prevent us from calling client methods.

Member

I'm not sure how Cache plays into it - the main things you'd use are either https://docs.rs/arc-swap/latest/arc_swap/type.ArcSwap.html#method.load-1 or https://docs.rs/arc-swap/latest/arc_swap/type.ArcSwap.html#method.load_full

In this case, since we literally always just immediately clone the client after taking it out of the lock, you can just use load_full and it directly gives you a cloned, ownable client. Seems like that'd work fine?

Contributor Author

load_full doesn't allow checking if the value changed before making a clone, and we'd have to call it before every client call. Considering that "client was not replaced" is the >99.9% case, I think it's worth optimizing for. The PR implementation does 1 atomic load and no copying on the hot path.
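The hot path described here can be sketched roughly as follows. This is an illustration under assumptions, not the code in client/src/replaceable.rs: `Shared` mirrors the struct quoted at the top of the thread, while `Handle`, `share`, and the `refreshes` counter are hypothetical, and the method name `get_mut_refreshing` is borrowed from a naming suggestion elsewhere in this review. Each handle keeps its own clone plus the generation it was cloned at, so the common case costs one atomic load and no copy.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, RwLock};

struct Shared<C> {
    client: RwLock<C>,
    generation: AtomicU32,
}

/// Hypothetical per-owner handle: a local clone plus the generation it came from.
struct Handle<C: Clone> {
    shared: Arc<Shared<C>>,
    local: C,
    seen_generation: u32,
    refreshes: u32, // counts slow-path clones, for illustration only
}

impl<C: Clone> Handle<C> {
    fn new(client: C) -> Self {
        let shared = Arc::new(Shared {
            client: RwLock::new(client.clone()),
            generation: AtomicU32::new(0),
        });
        Self { shared, local: client, seen_generation: 0, refreshes: 0 }
    }

    /// Creates another handle backed by the same shared state.
    fn share(&self) -> Self {
        Self {
            shared: Arc::clone(&self.shared),
            local: self.local.clone(),
            seen_generation: self.seen_generation,
            refreshes: 0,
        }
    }

    fn replace(&self, client: C) {
        let mut guard = self.shared.client.write().unwrap();
        *guard = client;
        // Bump while still holding the write lock, so a reader that takes the
        // read lock always sees a matching (client, generation) pair.
        self.shared.generation.fetch_add(1, Ordering::Release);
    }

    /// Returns mutable access to the local clone, refreshing it first only
    /// if the shared generation has moved.
    fn get_mut_refreshing(&mut self) -> &mut C {
        // Hot path: a single atomic load, no lock, no clone.
        if self.shared.generation.load(Ordering::Acquire) != self.seen_generation {
            let guard = self.shared.client.read().unwrap();
            self.local = (*guard).clone();
            self.seen_generation = self.shared.generation.load(Ordering::Acquire);
            self.refreshes += 1;
        }
        &mut self.local
    }
}

fn main() {
    let mut worker = Handle::new(String::from("old"));
    let admin = worker.share();

    worker.get_mut_refreshing();
    worker.get_mut_refreshing();
    assert_eq!(worker.refreshes, 0); // nothing replaced, so nothing cloned

    admin.replace(String::from("new"));
    worker.get_mut_refreshing().push_str("!"); // mutable access to the local clone
    assert_eq!(worker.local, "new!");
    assert_eq!(worker.refreshes, 1);
    println!("local = {}, refreshes = {}", worker.local, worker.refreshes);
}
```

The `&mut C` return is the property point (A) earlier in the thread asks for. Mutations apply only to the handle's local clone and are discarded the next time a replacement is picked up.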

Member

@Sushisource Sep 11, 2025

Aaaah, ok sorry, I see what you're getting at now, with the whole pathway that goes through refresh_inner (side note: I might rename that to get_mut_refreshing or something else that indicates the action is a get with a side effect of maybe refreshing).

Yeah, that makes sense. And now I see why you were talking about Cache since it has that functionality, but doesn't do mut refs.

Overall then I think this makes sense and works for me. I would say it might be good to have a unit test that does some loading and replacing from multiple threads at once, to ensure nothing unexpected happens when stressed that way.
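A multi-threaded check of the kind suggested here might look roughly like the sketch below. It is not the test added in the PR: the "client" is just a `u64` version number, and `replace`, `fetch`, and `stress` are hypothetical free functions that follow the same lock-plus-generation scheme. One writer replaces the value repeatedly while several readers assert that they never observe an older value after a newer one and that they all converge on the final replacement.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;

struct Shared {
    client: RwLock<u64>, // hypothetical "client": just a version number
    generation: AtomicU32,
}

fn replace(shared: &Shared, value: u64) {
    let mut guard = shared.client.write().unwrap();
    *guard = value;
    // Bumped under the write lock so readers see a consistent pair.
    shared.generation.fetch_add(1, Ordering::Release);
}

/// Refreshes `local` (value, generation) if the shared generation moved,
/// then returns the local value.
fn fetch(shared: &Shared, local: &mut (u64, u32)) -> u64 {
    if shared.generation.load(Ordering::Acquire) != local.1 {
        let guard = shared.client.read().unwrap();
        *local = (*guard, shared.generation.load(Ordering::Acquire));
    }
    local.0
}

/// Runs one writer against `readers` reader threads and returns the final
/// value each reader observed.
fn stress(readers: usize, replacements: u64) -> Vec<u64> {
    let shared = Arc::new(Shared {
        client: RwLock::new(0),
        generation: AtomicU32::new(0),
    });
    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                let mut local = (0u64, 0u32);
                let mut last = 0u64;
                // Spin until the final replacement becomes visible.
                while last < replacements {
                    let seen = fetch(&shared, &mut local);
                    assert!(seen >= last, "reader went backwards: {} < {}", seen, last);
                    last = seen;
                }
                last
            })
        })
        .collect();
    for value in 1..=replacements {
        replace(&shared, value);
    }
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let finals = stress(4, 10_000);
    assert!(finals.iter().all(|&v| v == 10_000));
    println!("all {} readers converged on the final value", finals.len());
}
```

A test like this cannot prove the absence of races, but a stale or torn read would show up as a failed monotonicity assertion or as a reader that never terminates.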

@maciejdudko marked this pull request as ready for review September 10, 2025 22:19
@maciejdudko requested a review from a team as a code owner September 10, 2025 22:19
@maciejdudko force-pushed the replaceable-client branch 3 times, most recently from db3cd8d to f2f2a36, September 11, 2025 16:55
Member

@Sushisource left a comment

Nice! Thanks

@maciejdudko merged commit 8bd6f59 into temporalio:master Sep 18, 2025
19 checks passed
@maciejdudko deleted the replaceable-client branch September 18, 2025 13:56