Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce intra cluster conflicts #5371

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rnewson
Copy link
Member

@rnewson rnewson commented Dec 29, 2024

Overview

CouchDB issues all write requests in parallel without coordination, applying a quorum on the results of those independent actions. When updating a document concurrently this can lead to the introduction of a stored conflict if two different writes reach separate nodes first. This is undesirable.

This patch changes fabric_doc_update in the following ways;

  1. Workers are no longer started immediately, but are given a unique reference each.
  2. For each range in the write request, one node is chosen to "lead" the write decision (calculated as the lowest live node that hosts the shard range)
  3. "Leader" workers are started.
  4. Any doc update that receives "conflict" from a Leader is added to the reply dict W times and the doc updates are removed from the other (unstarted) workers. If that leaves the worker with nothing to do, it is removed entirely.

Testing recommendations

There is some existing coverage in the module itself but more testing is needed before this can be merged.

Related Issues or Pull Requests

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch 2 times, most recently from b727716 to 04000f7 Compare December 29, 2024 20:56
@rnewson
Copy link
Member Author

rnewson commented Jan 1, 2025

noting need for a liveness check at the start (so the 'leader' is always a live node as of the invocation) and also to consider maintenance mode (receipt of that message from a 'leader' specifically)

@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch 3 times, most recently from 577dc25 to e0807f5 Compare January 6, 2025 11:18
Comment on lines 384 to 397
append_update_replies(
Rest1, Rest2, W, dict:append_list(Doc, lists:duplicate(W, conflict), Dict0)
);
append_update_replies([Doc | Rest1], [Reply | Rest2], W, Dict0) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we hit a conflict on the 2nd reply of a W=3 request, would we still want to do a lists:duplicate(W, conflict) set of a fake replies?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm good question. the 2nd reply is only happening if we started the followers, and if we do that we're doing them all in parallel, so whether we send conflict based on just the 2nd reply or conflict|accepted if we got 201/409 from both, there's still the same chance of a stored conflict, it's only the status code of the response that would be wrong (we'd send 409 instead of 202, but that can happen anyway).

I had considered doing this in tiers. that is, if we hear that the 'leader' shard is down (rexi DOWN or the rexi EXIT for maintenance mode) we could recalculate and deem one of the N-1 remaining shards as the leader, and do the same logic. I didn't do it because I didn't like the latency implications, but it leads to the situation above.

I agree it would be clearer if the fake reply thing only triggers if the reply is from a leader. It won't make things much better but it'll be easier to understand later.

Copy link
Contributor

@nickva nickva Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it seems unusual that on a 2nd conflict reply we'd end up with with 4 total replies on an N=3 db -- the original 1 + (W=3) conflict ones. Just worried that has a chance to mess up some our logic since we're, in general, careful around getting just the right number of replies and decide what to based on that.

Rest1, Rest2, W, dict:append_list(Doc, lists:duplicate(W, conflict), Dict0)
);
append_update_replies([Doc | Rest1], [Reply | Rest2], W, Dict0) ->
append_update_replies(Rest1, Rest2, W, dict:append(Doc, Reply, Dict0)).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny style nit: DocRest vs ReplyRest might be more clear, and we're already using them in remove_conflicts function, so it would be more consistent, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change append_update_replies, so Rest1 / Rest2 are in the original.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point. A clean up like that (a rename for clarity) would go a long with a separate #acc{} record PR ;-)

Copy link
Contributor

@nickva nickva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very nice improvement!

I haven't looked at it in depth yet, so had only a few style suggestions and questions.

Definitely like introducing the acc record. To help review the change easier, would be possible to split it out as a separate PR and and merge that first? It's a great improvement on its own, and it would help minimize the leader/follow bits changes only in a subsequent PR.

@rnewson
Copy link
Member Author

rnewson commented Jan 6, 2025

I put the acc introduction in its own commit for exactly that reason and intend to keep those separate commits when merging.

@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch 3 times, most recently from 79f92d7 to ef74641 Compare January 6, 2025 22:27
@nickva
Copy link
Contributor

nickva commented Jan 6, 2025

I put the acc introduction in its own commit for exactly that reason and intend to keep those separate commits when merging.

That would look nice when merging, however since a bunch of notes and comments are related to just the acc change may still look nice to do a preliminary cleanup / #acc{} record PR and keep the the overall discussion concerning it in that PR then move on the main one.

@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch from ef74641 to e30f426 Compare January 7, 2025 09:22
@rnewson
Copy link
Member Author

rnewson commented Jan 7, 2025

it's hard enough to get CI to run for one pull request. I'll do it this time but I really don't like this idea at all. I separated out the refactoring from the functional change within the PR and it makes little sense to commit one without the other.

@rnewson rnewson mentioned this pull request Jan 7, 2025
@rnewson
Copy link
Member Author

rnewson commented Jan 7, 2025

#5385 for acc record

@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch 2 times, most recently from 7871979 to e482d53 Compare January 7, 2025 12:01
@nickva
Copy link
Contributor

nickva commented Jan 7, 2025

it's hard enough to get CI to run for one pull request. I'll do it this time but I really don't like this idea at all. I separated out the refactoring from the functional change within the PR and it makes little sense to commit one without the other.

I think it's worthwhile. At least in this case the test failures do seem related to bulk_docs:

07:52:20  7) test bulk docs emits conflict error for duplicate doc `_id`s (BulkDocsTest)
07:52:20       test/elixir/test/bulk_docs_test.exs:124
07:52:20       Expected 201 and the same number of response rows as in request, but got
07:52:20       %HTTPotion.Response{
07:52:20         status_code: 500,
07:52:20         body: [
07:52:20           %{
07:52:20             "error" => "conflict",
07:52:20             "id" => "0",
07:52:20             "reason" => "Document update conflict."
07:52:20           },
07:52:20           %{
07:52:20             "error" => "conflict",
07:52:20             "id" => "1",
07:52:20             "reason" => "Document update conflict."
07:52:20           },
07:52:20           %{
07:52:20             "error" => "conflict",
07:52:20             "id" => "1",
07:52:20             "reason" => "Document update conflict."
07:52:20           },
07:52:20           %{
07:52:20             "error" => "error",
07:52:20             "id" => "3",
07:52:20             "reason" => "internal_server_error"
07:52:20           }
07:52:20         ],

I keep an eye on flaky failures in the CI and this not a test that usually fails. internal_server_error seems worrying there.

@rnewson
Copy link
Member Author

rnewson commented Jan 7, 2025

for sure, that indicates a bug here. It took 8 runs locally to get an internal_server_error so this will be fun to track down, but it needs to be found

@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch from e482d53 to 2b006b8 Compare January 10, 2025 17:41
@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch 2 times, most recently from fdf909c to 5c70ed7 Compare January 10, 2025 18:39
@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch 3 times, most recently from 3e22ba1 to 50d5248 Compare January 14, 2025 11:09
This should prevent spurious intra-cluster conflicts most of the
time. It is not true consistency, however. spurious conflicts are
still possible whenever the nodes in the cluster disagree on the
current live set of other nodes.
@rnewson rnewson force-pushed the reduce-intra-cluster-conflicts branch from 50d5248 to 5869098 Compare January 14, 2025 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants