
session: Prepare on one shard per node only #1320


Merged
merged 12 commits into scylladb:main from prepare-on-one-shard-per-node-only on May 7, 2025

Conversation

wprzytula
Collaborator

@wprzytula wprzytula commented Apr 12, 2025

Overview

This PR comes with two main enhancements / optimisations:

  1. Prepare on every node, not on every shard. This corresponds to Don't prepare the statement on all connections #1290.
  2. Wait only for one prepare attempt to succeed, not for all prepare attempts. This is inspired by cpp-driver's logic. (dropped)

Below, I describe both changes in detail.

Prepare on every node, not on every shard

Just as other drivers do, the Rust driver now prepares statements on a single connection to every node, not on a single connection to every shard. This benefits both the driver and the cluster, as it lowers the overhead of handling duplicated prepare requests on both sides.

Problem: unlucky random choice

As mentioned in #1290 (comment), we might be unlucky enough to end up with all randomly chosen connections being non-working. For example, the targeted shards might be overloaded.

Solution

Fallback logic is introduced. Once we fail to prepare the statement on any node through the randomly chosen connections, the preparation is re-attempted on all connections (which is essentially how it used to work before this PR).
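A minimal sketch of that flow in Rust; `prepare_on()` and `iter_working_connections_to_nodes()` are hypothetical stand-ins for the driver's internals (the `_to_shards` variant's name comes from the commits below), and real signatures differ:

```rust
// Minimal sketch of the fallback flow; `prepare_on()` and the two
// iterator methods are hypothetical stand-ins for driver internals.
async fn prepare_with_fallback(
    state: &ClusterState,
    statement: &Statement,
) -> Result<PreparedStatement, PrepareError> {
    // Fast path: a single random connection per node.
    match prepare_on(state.iter_working_connections_to_nodes(), statement).await {
        Ok(prepared) => Ok(prepared),
        // Fallback: one connection per shard - essentially the pre-PR
        // behaviour - in case the random per-node choices were unlucky.
        Err(_) => prepare_on(state.iter_working_connections_to_shards(), statement).await,
    }
}
```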

Wait only for one prepare attempt to succeed (dropped)

Having taken a look at cpp-driver's handling of statement preparation, I noticed that cpp-driver waits only for the first preparation to succeed. Preparation on all remaining nodes is done in the background, which decreases the latency of the prepare operation from the driver user's PoV. To understand why, consider a node that is stuck, overloaded, or temporarily unavailable. If we wait for prepare responses from all nodes, we are limited by the slowest response. This is not what ScyllaDB is designed for - it should be available and fast even if some nodes happen to be unavailable or overloaded.

I decided to implement this behaviour in the Rust driver. This is achieved by spawning a tokio worker task which prepares the statement on all nodes. It feeds all results, successes or failures, into a channel and this way signals the parent task. The parent task finishes early once it receives a successful response (a deserialized PreparedStatement). Meanwhile, the worker task keeps handling the remaining preparations in the background.
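A rough sketch of that (later dropped) scheme, assuming a hypothetical per-connection `prepare()` call; for brevity, the sketch spawns one task per connection instead of the single worker task the PR used:

```rust
use tokio::sync::mpsc;

// Sketch of the early-return scheme (later dropped from this PR).
// `conn.prepare()` is a hypothetical stand-in for the driver's
// per-connection preparation call.
async fn prepare_first_success(
    connections: Vec<Connection>,
    statement: Statement,
) -> Result<PreparedStatement, PrepareError> {
    let (tx, mut rx) = mpsc::unbounded_channel();
    for conn in connections {
        let tx = tx.clone();
        let statement = statement.clone();
        // Each task feeds its result, success or failure, into the
        // channel, and keeps running in the background even after the
        // parent has returned.
        tokio::spawn(async move {
            let _ = tx.send(conn.prepare(&statement).await);
        });
    }
    // Drop the original sender so that `recv()` yields `None` once
    // every task has reported.
    drop(tx);

    let mut last_error = None;
    while let Some(result) = rx.recv().await {
        match result {
            // Finish early on the first success.
            Ok(prepared) => return Ok(prepared),
            Err(err) => last_error = Some(err),
        }
    }
    // All attempts failed; surface the last error.
    Err(last_error.expect("no connections to prepare on"))
}
```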

The change brings two main benefits:

  1. it reduces the driver's latency upon statement preparation;
  2. it prevents a single stuck node from freezing the driver upon preparation (note that there are no preparation timeouts!).

Tests

TODO: waiting for #1246 to be merged. Then I'll add a new feature to the proxy so that it is aware of which shard was targeted by the intercepted frame. With that feature available, I'll write tests.

Fixes: #1290

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • [ ] I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@wprzytula wprzytula added the performance Improves performance of existing features label Apr 12, 2025
@wprzytula wprzytula added this to the 1.2.0 milestone Apr 12, 2025
@wprzytula wprzytula requested review from Lorak-mmk and muzarski April 12, 2025 15:01
@wprzytula wprzytula self-assigned this Apr 12, 2025

github-actions bot commented Apr 12, 2025

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: 00d31ab

Contributor

@muzarski muzarski left a comment


Looks really good. I just wonder - maybe a better idea than spawning a separate task would be to introduce a configurable timeout for each attempt (see my other comment).

@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from b318d22 to c747a25 on April 16, 2025 08:48
@wprzytula
Collaborator Author

wprzytula commented Apr 16, 2025

v1.0.1: Addressed @muzarski's nits.

@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from c747a25 to 5a5a8db on April 16, 2025 08:49
@wprzytula
Collaborator Author

v1.0.2: rebased on main.

@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from 5a5a8db to bd62bba on April 19, 2025 14:30
@wprzytula
Collaborator Author

v1.1:

  • replaced internal calls to Session::prepare() with Session::prepare_nongeneric();
  • added verification of prepared id equality in the worker task.

@wprzytula wprzytula requested a review from muzarski April 19, 2025 14:32
@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from bd62bba to 87aee23 on April 22, 2025 08:51
@wprzytula
Collaborator Author

v1.1: rebased on main. Incorporated changes from #1329.

Collaborator

@Lorak-mmk Lorak-mmk left a comment


The change to prepare on single shard per node is fine.

I'm not sure about the second change (waiting only for the first response).
I do agree that indefinite waiting (if really possible) is a problem, but the approach taken does not convince me.

  1. This is a totally separate change from the one we discussed and planned, so it should be a separate PR. I would like not to block the change we planned, so please split these changes into 2 PRs. I understand we have a habit of making small side improvements in PRs, but this is not one of them; it is another feature, comparable in size to the original one that prompted this PR.
  2. The rationale that cpp-driver does this too is not enough. cpp-driver can't really be called a good driver, and it has many, many questionable design decisions. What do other drivers do? If the Python, Java, and gocql drivers also do this, that is a much stronger argument.
  3. You did not fix the indefinite-waiting problem (by the way, is it really indefinite? Won't the connection get broken?). The way to fix it is to utilize client-side timeouts, probably using the timeout from the global execution profile (a minimal sketch follows this list). If we do this, then one of the 2 stated benefits of this change ceases to exist.
  4. Previously, an indefinite wait would cause the Future returned from prepare() to never resolve.
    • It was easily noticeable by the user. Now the user will not know about such an issue.
    • Such a Future could be dropped by the user, and the only remnant of this would be some entry in ResponseHandlerMap. Now, for each such case, we will get a stale Tokio task running indefinitely in the background.
  5. The other stated benefit is the latency of preparation. This does not convince me. I don't remember the latency of preparation ever being brought up as an issue by anyone, so why are we working on it?
  6. This is probably a rare case, but please note that the following is possible after your change:
    • User prepared a statement
    • prepare() returns after receiving one response
    • User executes this prepared statement
    • It lands on a different node than the one that responded first
    • The node has not yet processed the prepare request, so it returns UNPREPARED
    • We now incur an unnecessary round trip and extra network calls.
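A minimal sketch of the client-side timeout idea from point 3, assuming the per-attempt timeout comes from the execution profile (`Connection::prepare()` is again a hypothetical stand-in):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Sketch: bound a single prepare attempt with a client-side timeout.
async fn prepare_attempt_with_timeout(
    conn: &Connection,
    statement: &Statement,
    request_timeout: Duration, // would come from the execution profile
) -> Result<PreparedStatement, PrepareError> {
    match timeout(request_timeout, conn.prepare(statement)).await {
        // The attempt completed (successfully or not) in time.
        Ok(result) => result,
        // The deadline elapsed first; map it to a driver error.
        Err(_elapsed) => todo!("map the elapsed timeout to a PrepareError"),
    }
}
```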

The complex logic of the preparation_worker is not a problem to me. It's encapsulated, it's (I hope you agree) well-documented, and it uses tokio abstractions to be more high-level.

No matter how well implemented and encapsulated it is, it is still additional code, additional complexity. It increases the amount of stuff that contributors need to read and understand.
We should always have a good reason to introduce additional code, in order to minimize future maintenance burden.

@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from 87aee23 to 40c507f on April 24, 2025 12:35
@wprzytula
Collaborator Author

v2.0:

  • dropped the "wait for only first successful response" part of this PR.

@wprzytula
Collaborator Author

Opened #1332.

@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from 40c507f to 0236ffe on April 25, 2025 06:45
@wprzytula
Collaborator Author

v2.0.1: fixed outdated comments.

@wprzytula wprzytula requested review from Lorak-mmk and muzarski April 25, 2025 07:19
@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from 0236ffe to 0021871 on May 7, 2025 07:03
@wprzytula
Collaborator Author

v2.1:

  • squashed the &mut dyn Iterator commit into the fallback commit,
  • extracted more commits to make review smoother,
  • improved readability of the iter_working_connections*() methods,
  • brought back the accidentally removed logic that collects tracing IDs from prepared statements on all connections.

@wprzytula wprzytula requested a review from Lorak-mmk May 7, 2025 07:19
@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from 0021871 to b209464 on May 7, 2025 08:38
wprzytula added 12 commits May 7, 2025 10:38
`Session::prepare()` is split into a thin generic `prepare()` function
and a non-generic `prepare_nongeneric()` function that contains all the
logic. This prevents monomorphisation of the whole logic just because
one passes several types that implement `Into<Statement>` to
`Session::prepare()`.
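A sketch of the split (signatures simplified; the `&Statement` parameter reflects the next commit):

```rust
// The generic shim is tiny, so monomorphising it once per
// `Into<Statement>` type is cheap; the heavy body compiles only once.
impl Session {
    pub async fn prepare(
        &self,
        statement: impl Into<Statement>,
    ) -> Result<PreparedStatement, PrepareError> {
        let statement: Statement = statement.into();
        self.prepare_nongeneric(&statement).await
    }

    async fn prepare_nongeneric(
        &self,
        statement: &Statement,
    ) -> Result<PreparedStatement, PrepareError> {
        // ...all the actual preparation logic lives here...
        todo!()
    }
}
```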
There's no need for `Session::prepare_nongeneric()` to accept an owned
`Statement`. It's thus made to accept `&Statement`, which elides a clone
in at least one call site (`Session::prepare_batch()`).
`Session::extract_partitioner_name()` was needlessly a method. It does
not require `Self` to perform its work.
Therefore, I made it an associated function.
`ClusterState::iter_working_connections` gets a `_to_shards` suffix
to stress that it returns an iterator over connections to
**all shards**. This differs from the semantics of the function
that is going to be implemented in subsequent commits, hence the name
change.
…lity

The convoluted iterator logic in that method has caused mental trouble
many times. It is now split into clearer parts and documented with
comments. I hope this helps future readers and reviewers.
This is analogous to iter_working_connections_to_shards(), but it
returns only one (random) connection for each node, not for each shard.
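A sketch of the per-node selection, with hypothetical `Node`/`Connection` pool types:

```rust
use rand::seq::SliceRandom;

// Sketch: yield one random working connection per node instead of one
// per shard. `Node` and `Connection` are hypothetical stand-ins for
// the driver's internal pool types.
fn one_working_connection_per_node<'a>(
    nodes: &'a [Node],
) -> impl Iterator<Item = &'a Connection> + 'a {
    nodes.iter().filter_map(|node| {
        // Keep only the connections that are currently usable...
        let working: Vec<&Connection> = node
            .connections
            .iter()
            .filter(|conn| conn.is_working())
            .collect();
        // ...and pick one at random; `None` (a node with no working
        // connections) is simply skipped by `filter_map`.
        working.choose(&mut rand::thread_rng()).copied()
    })
}
```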
Scylla prepares the statement on every shard when processing a PREPARE
request:
https://github.com/scylladb/scylladb/blob/8f0d0daf53397aa68312571ab9d01d8b75cd1770/transport/server.cc#L1104-L1114.

At the same time, the driver unnecessarily tries to prepare the
statement on all connections, which means connections to every shard,
so possibly multiple connections to every node.

This commit makes the driver prepare the statement only on a single
(random) connection to every node.

There is one catch: with the new logic, we might sometimes fail
preparation even though we could succeed if tried on different shards
(some shards might be, for example, overloaded). This issue is solved
in further commits.
The docstring now mentions that the statement is prepared on all nodes.
This prepares for the fallback logic introduced in the next commit.
Before this commit, we could unnecessarily return an error from
prepare: say we have a 1-node cluster with a broken connection, and we
don't retry on another connection - we would return an error to the
user despite possibly being able to prepare successfully elsewhere.

This commit introduces fallback logic to `Session::prepare()`:
if preparation on a single (random) connection to every node fails,
the whole preparation is retried, this time on a single connection to
every shard.
@wprzytula
Collaborator Author

v2.1.1: addressed @Lorak-mmk's comments.

@wprzytula wprzytula force-pushed the prepare-on-one-shard-per-node-only branch from b209464 to 00d31ab on May 7, 2025 08:39
@wprzytula
Collaborator Author

Rebased on main.

@wprzytula wprzytula requested review from Lorak-mmk and muzarski May 7, 2025 08:39
@wprzytula wprzytula merged commit 506027c into scylladb:main May 7, 2025
12 checks passed
@wprzytula wprzytula deleted the prepare-on-one-shard-per-node-only branch May 7, 2025 08:45
@wprzytula wprzytula mentioned this pull request May 26, 2025