Reduce intra cluster conflicts #5371

rnewson · 2024-12-29T18:06:45Z

Overview

CouchDB issues all write requests in parallel without coordination, applying a quorum on the results of those independent actions. When a document is updated concurrently, this can lead to a stored conflict if two different writes reach separate nodes first. This is undesirable.

This patch changes fabric_doc_update in the following ways:

Workers are no longer started immediately, but are given a unique reference each.
For each range in the write request, one node is chosen to "lead" the write decision (calculated as the lowest live node that hosts the shard range).
"Leader" workers are started.
Any doc update that receives "conflict" from a Leader is added to the reply dict W times and the doc updates are removed from the other (unstarted) workers. If that leaves the worker with nothing to do, it is removed entirely.
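
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in fabric_doc_update.erl: the module name, the function names, the {Range, Node} and {Ref, Docs} tuple shapes, and the reading of "lowest" as Erlang term order are all assumptions made for the example. Only the dict:append_list call with lists:duplicate(W, conflict) mirrors the patch itself.

```erlang
%% Hypothetical sketch; names and data shapes are assumptions, not the patch.
-module(leader_sketch).
-export([choose_leaders/2, fake_conflicts/3, drop_conflicted/2]).

%% Shards is a list of {Range, Node} pairs (a stand-in for shard records).
%% LiveNodes is the list of nodes believed to be up at invocation time.
%% Returns a map Range => Leader, taking the lowest live node (assumed here
%% to mean lowest in Erlang term order) among the nodes hosting each range.
choose_leaders(Shards, LiveNodes) ->
    lists:foldl(
        fun({Range, Node}, Acc) ->
            case lists:member(Node, LiveNodes) of
                false ->
                    Acc;
                true ->
                    maps:update_with(
                        Range, fun(Cur) -> min(Cur, Node) end, Node, Acc
                    )
            end
        end,
        #{},
        Shards
    ).

%% Records W conflict replies for Doc in the reply dict, as the patch does
%% when a leader answers "conflict".
fake_conflicts(Doc, W, Dict) ->
    dict:append_list(Doc, lists:duplicate(W, conflict), Dict).

%% Workers is a list of {Ref, Docs} for followers that are not yet started.
%% Removes the conflicted docs from each worker and drops any worker that
%% is left with nothing to do.
drop_conflicted(Conflicted, Workers) ->
    [
        {Ref, Remaining}
     || {Ref, Docs} <- Workers,
        Remaining <- [Docs -- Conflicted],
        Remaining =/= []
    ].
```

With shards [{r1, n2}, {r1, n1}, {r1, n3}] and live nodes [n2, n3], choose_leaders/2 picks n2 for r1, because n1 is not live and n2 sorts before n3.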

Testing recommendations

There is some existing coverage in the module itself but more testing is needed before this can be merged.

Related Issues or Pull Requests

Checklist

Code is written and works correctly
Changes are covered by tests
Any new configurable parameters are documented in rel/overlay/etc/default.ini
Documentation changes were made in the src/docs folder
Documentation changes were backported (separated PR) to affected branches

rnewson · 2025-01-01T11:12:05Z

Noting the need for a liveness check at the start (so the 'leader' is always a live node as of the invocation), and also the need to consider maintenance mode (receipt of that message from a 'leader' specifically).

nickva · 2025-01-06T18:01:30Z

src/fabric/src/fabric_doc_update.erl

+    append_update_replies(
+        Rest1, Rest2, W, dict:append_list(Doc, lists:duplicate(W, conflict), Dict0)
+    );
+append_update_replies([Doc | Rest1], [Reply | Rest2], W, Dict0) ->


What if we hit a conflict on the 2nd reply of a W=3 request? Would we still want to add a lists:duplicate(W, conflict) set of fake replies?

Hmm, good question. The 2nd reply only happens if we started the followers, and if we did that we're running them all in parallel. So whether we send conflict based on just the 2nd reply, or conflict|accepted if we got 201/409 from both, there's still the same chance of a stored conflict; only the status code of the response would be wrong (we'd send 409 instead of 202, but that can happen anyway).

I had considered doing this in tiers. That is, if we hear that the 'leader' shard is down (rexi DOWN, or the rexi EXIT for maintenance mode), we could recalculate and deem one of the N-1 remaining shards the leader, and apply the same logic. I didn't do it because I didn't like the latency implications, but it leads to the situation above.

I agree it would be clearer if the fake reply thing only triggers if the reply is from a leader. It won't make things much better but it'll be easier to understand later.

Overall it seems unusual that on a 2nd conflict reply we'd end up with 4 total replies on an N=3 db: the original 1 plus the (W=3) conflict ones. I'm just worried that has a chance to mess up some of our logic, since we're generally careful about getting just the right number of replies and deciding what to do based on that.
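
To make the reply count concrete, here is a small sketch of the scenario described above. The module name, the function name, and the doc1 and {ok, rev1} placeholders are hypothetical; only the dict:append and dict:append_list calls follow the shape of append_update_replies in the diff.

```erlang
%% Hypothetical sketch of the reply-count concern; not code from the patch.
-module(reply_count_sketch).
-export([replies_after_late_conflict/1]).

%% Simulates one doc whose first reply is a success and whose second reply
%% is a conflict that triggers W fake conflict replies.
replies_after_late_conflict(W) ->
    Doc = doc1,
    %% First worker reply: a normal success.
    D0 = dict:append(Doc, {ok, rev1}, dict:new()),
    %% Second worker reply: conflict, recorded as W conflict entries.
    D1 = dict:append_list(Doc, lists:duplicate(W, conflict), D0),
    dict:fetch(Doc, D1).
```

Calling replies_after_late_conflict(3) yields a list of four entries, one success followed by three conflicts, which is more replies than an N=3 database has copies.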

nickva · 2025-01-06T18:08:54Z

src/fabric/src/fabric_doc_update.erl

+        Rest1, Rest2, W, dict:append_list(Doc, lists:duplicate(W, conflict), Dict0)
+    );
+append_update_replies([Doc | Rest1], [Reply | Rest2], W, Dict0) ->
+    append_update_replies(Rest1, Rest2, W, dict:append(Doc, Reply, Dict0)).


Tiny style nit: DocRest vs ReplyRest might be clearer, and we're already using them in the remove_conflicts function, so it would be more consistent, too.

I didn't change append_update_replies, so Rest1 / Rest2 are in the original.

Ah, good point. A cleanup like that (a rename for clarity) would go along with a separate #acc{} record PR ;-)

nickva

A very nice improvement!

I haven't looked at it in depth yet, so I only had a few style suggestions and questions.

Definitely like introducing the acc record. To make the change easier to review, would it be possible to split it out as a separate PR and merge that first? It's a great improvement on its own, and it would keep the subsequent PR limited to just the leader/follower changes.

rnewson · 2025-01-06T18:34:47Z

I put the acc introduction in its own commit for exactly that reason and intend to keep those separate commits when merging.

nickva · 2025-01-06T23:44:07Z

> I put the acc introduction in its own commit for exactly that reason and intend to keep those separate commits when merging.

That would look nice when merging. However, since a bunch of the notes and comments relate to just the acc change, it may still be nice to do a preliminary cleanup / #acc{} record PR, keep the discussion concerning it in that PR, and then move on to the main one.

rnewson · 2025-01-07T09:23:17Z

it's hard enough to get CI to run for one pull request. I'll do it this time but I really don't like this idea at all. I separated out the refactoring from the functional change within the PR and it makes little sense to commit one without the other.

rnewson · 2025-01-07T09:25:34Z

#5385 for acc record

nickva · 2025-01-07T16:49:13Z

> it's hard enough to get CI to run for one pull request. I'll do it this time but I really don't like this idea at all. I separated out the refactoring from the functional change within the PR and it makes little sense to commit one without the other.

I think it's worthwhile. At least in this case the test failures do seem related to bulk_docs:

07:52:20  7) test bulk docs emits conflict error for duplicate doc `_id`s (BulkDocsTest)
07:52:20       test/elixir/test/bulk_docs_test.exs:124
07:52:20       Expected 201 and the same number of response rows as in request, but got
07:52:20       %HTTPotion.Response{
07:52:20         status_code: 500,
07:52:20         body: [
07:52:20           %{
07:52:20             "error" => "conflict",
07:52:20             "id" => "0",
07:52:20             "reason" => "Document update conflict."
07:52:20           },
07:52:20           %{
07:52:20             "error" => "conflict",
07:52:20             "id" => "1",
07:52:20             "reason" => "Document update conflict."
07:52:20           },
07:52:20           %{
07:52:20             "error" => "conflict",
07:52:20             "id" => "1",
07:52:20             "reason" => "Document update conflict."
07:52:20           },
07:52:20           %{
07:52:20             "error" => "error",
07:52:20             "id" => "3",
07:52:20             "reason" => "internal_server_error"
07:52:20           }
07:52:20         ],

I keep an eye on flaky failures in the CI and this is not a test that usually fails. The internal_server_error seems worrying there.

rnewson · 2025-01-07T23:48:17Z

For sure, that indicates a bug here. It took 8 runs locally to get an internal_server_error, so this will be fun to track down, but it needs to be found.

rnewson force-pushed the reduce-intra-cluster-conflicts branch 2 times, most recently from b727716 to 04000f7 on December 29, 2024 20:56

rnewson force-pushed the reduce-intra-cluster-conflicts branch 3 times, most recently from 577dc25 to e0807f5 on January 6, 2025 11:18


rnewson force-pushed the reduce-intra-cluster-conflicts branch 3 times, most recently from 79f92d7 to ef74641 on January 6, 2025 22:27

rnewson force-pushed the reduce-intra-cluster-conflicts branch from ef74641 to e30f426 on January 7, 2025 09:22

rnewson mentioned this pull request Jan 7, 2025

introduce acc record #5385

Merged

rnewson force-pushed the reduce-intra-cluster-conflicts branch 2 times, most recently from 7871979 to e482d53 on January 7, 2025 12:01

rnewson force-pushed the reduce-intra-cluster-conflicts branch from e482d53 to 2b006b8 on January 10, 2025 17:41

nickva reviewed Jan 10, 2025


rnewson force-pushed the reduce-intra-cluster-conflicts branch 2 times, most recently from fdf909c to 5c70ed7 on January 10, 2025 18:39

rnewson force-pushed the reduce-intra-cluster-conflicts branch 3 times, most recently from 3e22ba1 to 50d5248 on January 14, 2025 11:09

reject write at leader if conflict

5869098

This should prevent spurious intra-cluster conflicts most of the time. It is not true consistency, however. Spurious conflicts are still possible whenever the nodes in the cluster disagree on the current live set of other nodes.

rnewson force-pushed the reduce-intra-cluster-conflicts branch from 50d5248 to 5869098 on January 14, 2025 11:09

rnewson mentioned this pull request Mar 4, 2025

upgrade clause for fabric_doc_update #5459

Closed

5 tasks
