session: Prepare on one shard per node only #1320
Conversation
Looks really good. I just wonder - maybe a better idea than spawning a separate task would be to introduce a configurable timeout for each attempt (see my other comment).
Force-pushed from b318d22 to c747a25.
v1.0.1: Addressed @muzarski's nits.
Force-pushed from c747a25 to 5a5a8db.
v1.0.2: rebased on main.
Force-pushed from 5a5a8db to bd62bba.
v1.1:
Force-pushed from bd62bba to 87aee23.
v1.1: rebased on main. Incorporated changes from #1329.
The change to prepare on a single shard per node is fine.
I'm not sure about the second change (waiting only for the first response).
I do agree that indefinite waiting (if really possible) is a problem, but the approach taken does not convince me.
- This is a totally separate change from the one we discussed and planned, so it should be a separate PR. I would like not to block the change we planned, so please split those changes into 2 PRs. I understand we do have a habit of making small side improvements in PRs, but this is not that; it is another feature comparable in size to the original one that motivated this PR.
- The rationale that cpp-driver does this too is not enough. cpp-driver can't really be called a good driver, and it has many, many questionable design decisions. What do other drivers do? If the Python, Java, and Gocql drivers also do this, then it is a much stronger argument.
- You did not fix the indefinite waiting problem (btw, is it really indefinite? Won't the connection get broken?). The way to fix it is to utilize client-side timeouts (probably using the timeout from the global execution profile). If we do this, then one of the 2 stated benefits of this change ceases to exist.
- Previously, indefinite waiting would cause the Future returned from `Prepare` to never resolve.
  - It was easily noticeable by the user. Now the user will not know about such an issue.
  - Such a Future could be dropped by the user, and the only remnant of this would be some entry in `ResponseHandlerMap`. Now, for each such case, we will get a stale Tokio task running indefinitely in the background.
- The other stated benefit is the latency of preparation. This does not convince me. I don't remember the latency of preparation ever being brought up as an issue by anyone, so why are we working on it?
- This is probably a rare case, but please note that the following is possible after your change:
  - User prepares a statement; `prepare()` returns after receiving one response.
  - User executes this prepared statement.
  - It lands on a different node than the one that responded first.
  - That node did not yet process the prepare request, so it returns UNPREPARED.
  - We now get an unnecessary round trip and extra network calls.
The complex logic of the `preparation_worker` is not a problem to me. It's encapsulated, it's (I hope you agree) well-documented, and it uses tokio abstractions to be more high-level.
No matter how well implemented and encapsulated it is, it is still additional code, additional complexity. It increases the amount of stuff that contributors need to read and understand.
We should always have a good reason to introduce additional code, in order to minimize future maintenance burden.
Force-pushed from 87aee23 to 40c507f.
v2.0:
Opened #1332.
Force-pushed from 40c507f to 0236ffe.
v2.0.1: fixed outdated comments.
Force-pushed from 0236ffe to 0021871.
v2.1:
Force-pushed from 0021871 to b209464.
`Session::prepare()` is split into a thin generic `prepare()` function and a non-generic `prepare_nongeneric()` function that contains all the logic. This prevents monomorphisation of the whole logic just because callers pass various types that implement `Into<Statement>` to `Session::prepare()`.
There's no need for `Session::prepare_nongeneric()` to accept an owned `Statement`. It is therefore made to accept `&Statement`, which elides a clone in at least one call site (`Session::prepare_batch()`).
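A minimal sketch of this split, using stub types in place of the driver's real `Session`, `Statement`, `PreparedStatement`, and error types (illustrative only, not the actual implementation):

```rust
// Stub types for the sketch; the real driver types are richer.
pub struct Statement {
    contents: String,
}

impl From<&str> for Statement {
    fn from(s: &str) -> Self {
        Statement { contents: s.to_owned() }
    }
}

pub struct PreparedStatement;
pub struct PrepareError;
pub struct Session;

impl Session {
    /// Thin generic wrapper: only the `Into<Statement>` conversion is
    /// monomorphised per caller type.
    pub async fn prepare(
        &self,
        statement: impl Into<Statement>,
    ) -> Result<PreparedStatement, PrepareError> {
        let statement: Statement = statement.into();
        // Borrow the statement: the non-generic path needs no ownership.
        self.prepare_nongeneric(&statement).await
    }

    /// Non-generic function containing all the heavy logic; compiled once,
    /// regardless of which `Into<Statement>` types callers use.
    async fn prepare_nongeneric(
        &self,
        statement: &Statement,
    ) -> Result<PreparedStatement, PrepareError> {
        let _ = &statement.contents; // ...actual preparation logic would go here...
        Ok(PreparedStatement)
    }
}
```

This follows the common Rust pattern of keeping generic shims thin so that the bulk of the code is compiled only once.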
`Session::extract_partitioner_name()` was needlessly a method. It does not require `Self` to perform its work. Therefore, I made it an associated function.
`ClusterState::iter_working_connections` gains the `_to_shards` suffix to stress that it returns an iterator over connections to **all shards**. This differs from the semantics of the function that is going to be implemented in subsequent commits, hence the name change.
…lity The convoluted iterator logic in that method caused mental trouble many times. Now it's better split into parts and documented with comments. I hope it helps future readers and reviewers.
This is analogous to iter_working_connections_to_shards(), but it returns only one (random) connection for each node, not for each shard.
Scylla prepares the statement on every shard when processing a PREPARE request: https://github.com/scylladb/scylladb/blob/8f0d0daf53397aa68312571ab9d01d8b75cd1770/transport/server.cc#L1104-L1114. At the same time, the driver unnecessarily tries to prepare the statement on all connections, which means connections to every shard, so possibly multiple connections to every node. This commit makes the driver prepare the statement only on a single (random) connection to every node. There is one catch: with the new logic, we might sometimes fail preparation even though we could succeed if we tried different shards (some shards might be, for example, overloaded). This issue is solved in further commits.
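A rough sketch of the per-node selection. The types here are simplified stand-ins for the driver's pool structures, the `rand` 0.8 API is assumed, and the real `iter_working_connections_to_nodes()` also filters out broken connections:

```rust
use std::sync::Arc;

use rand::seq::SliceRandom;

// Simplified stand-ins for the driver's pool structures.
pub struct Connection;
pub struct Node {
    // Pooled connections to this node, one or more per shard.
    connections: Vec<Arc<Connection>>,
}

/// Yield one (random) connection per node, instead of one connection per shard.
pub fn iter_connections_to_nodes(
    nodes: &[Node],
) -> impl Iterator<Item = Arc<Connection>> + '_ {
    nodes.iter().filter_map(|node| {
        node.connections
            .choose(&mut rand::thread_rng())
            .cloned()
    })
}
```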
The docstring now mentions that the statement is prepared on all nodes.
This prepares for the fallback logic introduced in the next commit.
Before this commit, we could unnecessarily return an error from prepare: say we have a 1-node cluster with a broken connection and we don't retry on another one - we would return an error to the user despite possibly being able to prepare. This commit introduces fallback logic to `Session::prepare()`: if preparation on a single (random) connection to every node fails, the whole preparation is retried, this time on a single connection to every shard.
v2.1.1: addressed @Lorak-mmk comments.
Force-pushed from b209464 to 00d31ab.
Rebased on main.
Overview
This PR comes with two main enhancements / optimisations:
1. Prepare on every node, not on every shard.
2. Wait only for one prepare attempt to succeed, not for all prepare attempts. This is inspired by cpp-driver's logic. (dropped)
Below, I describe both changes in detail.
Prepare on every node, not on every shard
Just as other drivers do, the Rust driver now prepares statements on a single connection to every node, not on a single connection to every shard. This benefits both the driver and the cluster, as it lowers the overhead of handling duplicated prepare requests on both sides.
Problem: unlucky random choice
As mentioned in #1290 (comment), we might be unlucky enough to end up with all randomly chosen connections being non-working. For example, the targeted shards might be overloaded.
Solution
Fallback logic is introduced. If preparing the statement through the randomly chosen connections fails on every node, the preparation is re-attempted on all connections (which is essentially how it worked before this PR).
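A simplified, self-contained sketch of this fallback flow. The types and helper names are stand-ins rather than the driver's API, and preparation runs sequentially here for brevity, whereas the driver issues the requests concurrently:

```rust
// Stand-in types; the real logic lives on Session/ClusterState.
pub struct Statement;
pub struct PreparedStatement;
pub struct PrepareError;
pub struct Connection;

impl Connection {
    async fn prepare(&self, _stmt: &Statement) -> Result<PreparedStatement, PrepareError> {
        // ...send a PREPARE frame over this connection...
        Ok(PreparedStatement)
    }
}

pub struct ConnectionSets {
    per_node: Vec<Connection>,  // one random connection per node
    per_shard: Vec<Connection>, // one connection per shard (all connections)
}

/// Prepare on every given connection; succeed if at least one prepare succeeds.
async fn prepare_on(
    conns: &[Connection],
    stmt: &Statement,
) -> Result<PreparedStatement, PrepareError> {
    let mut first_ok = None;
    let mut last_err = PrepareError;
    for conn in conns {
        // The driver fires these concurrently; sequential here for brevity.
        match conn.prepare(stmt).await {
            Ok(prepared) if first_ok.is_none() => first_ok = Some(prepared),
            Ok(_) => {}
            Err(e) => last_err = e,
        }
    }
    first_ok.ok_or(last_err)
}

pub async fn prepare_with_fallback(
    conns: &ConnectionSets,
    stmt: &Statement,
) -> Result<PreparedStatement, PrepareError> {
    // First attempt: a single random connection per node.
    match prepare_on(&conns.per_node, stmt).await {
        Ok(prepared) => Ok(prepared),
        // All randomly chosen connections failed (e.g. their shards were
        // overloaded): retry on connections to every shard, which is
        // essentially the pre-PR behaviour.
        Err(_) => prepare_on(&conns.per_shard, stmt).await,
    }
}
```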
Wait only for one prepare attempt to succeed (dropped)
Having taken a look at cpp-driver's handling of statement preparation, I noticed that cpp-driver waits only for the first preparation to succeed. Preparation on all remaining nodes is done in the background, which decreases the latency of the prepare operation from the driver user's PoV. To understand why, consider a node that is stuck/overloaded/temporarily unavailable. If we wait for prepare responses from all nodes, we are limited by the slowest response. This is not what ScyllaDB is designed for - it should be available and fast even if some nodes happen to be unavailable or overloaded.
I decided to implement this behaviour in the Rust driver. This is achieved by spawning a tokio worker task which prepares the statement on all nodes. It feeds all results, successes or failures, into a channel and in this way signals the parent task. The parent task finishes early once it receives a successful response (a deserialized `PreparedStatement`). Meanwhile, the worker task keeps handling the remaining preparations in the background.
The change brings two main benefits:
1. It reduces the driver's latency of statement preparation.
2. It prevents the situation in which one stuck node freezes the driver during preparation - there are no preparation timeouts! (sic!)
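Since this part was dropped, the sketch below only illustrates the described mechanism: the types are stand-ins, and the worker prepares nodes sequentially for brevity, whereas the proposal used the driver's connection machinery:

```rust
use tokio::sync::mpsc;

// Stand-in types for the sketch of the dropped design.
pub struct PreparedStatement;
#[derive(Debug)]
pub struct PrepareError;

async fn prepare_on_node(node_id: usize) -> Result<PreparedStatement, PrepareError> {
    // ...send PREPARE over one connection to this node...
    let _ = node_id;
    Ok(PreparedStatement)
}

pub async fn prepare_first_success(node_count: usize) -> Result<PreparedStatement, PrepareError> {
    let (tx, mut rx) = mpsc::unbounded_channel();

    // Worker task: prepares on every node and reports each outcome.
    tokio::spawn(async move {
        for node_id in 0..node_count {
            let result = prepare_on_node(node_id).await;
            // The receiver may already be gone if the parent resolved early;
            // the worker keeps preparing the remaining nodes regardless.
            let _ = tx.send(result);
        }
    });

    // Parent: resolve as soon as any node succeeds; fail only if all fail.
    let mut last_err = PrepareError;
    while let Some(result) = rx.recv().await {
        match result {
            Ok(prepared) => return Ok(prepared),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```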
Tests
TODO. Waiting for #1246 to be merged. Then I'll add a new feature to the proxy to be aware of which shard was targeted by the intercepted frame. With that feature available, I'll write tests.
Fixes: #1290
Pre-review checklist
- [ ] I have provided docstrings for the public items that I want to introduce.
- [ ] I have adjusted the documentation in `./docs/source/`.
- [ ] I added appropriate `Fixes:` annotations to the PR description.