
Implement parallel preads #5399


Merged

nickva merged 1 commit into main from parallel-preads on Feb 19, 2025

Conversation

nickva (Contributor) commented Jan 14, 2025

Implement parallel preads

Let clients issue concurrent pread calls without blocking each other or having to wait for writes and fsync calls to complete.

Even though pread calls are thread-safe at the POSIX level [1], the Erlang/OTP file backend forces a single controlling process for raw file handles. As a result, all our reads were funneled through the couch_file gen_server and had to queue up behind potentially slower writes. This is particularly problematic on remote file systems, where fsyncs and writes may take much longer while preads can hit the cache and return quickly.
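For illustration, here is a minimal sketch of that restriction (the exact error term depends on the OTP version): a raw handle opened by one process rejects preads issued from any other process, which is why every read had to be proxied through the owning couch_file gen_server.

```erlang
%% Minimal sketch: a raw file handle is owned by the process that
%% opened it, so a pread from any other process is rejected by OTP.
-module(raw_owner_demo).
-export([demo/1]).

demo(Path) ->
    {ok, Fd} = file:open(Path, [raw, read, binary]),
    Self = self(),
    spawn(fun() ->
        %% On recent OTP versions this is expected to return
        %% {error, not_on_controlling_process}
        Self ! {pread_result, file:pread(Fd, 0, 16)}
    end),
    receive
        {pread_result, Result} -> Result
    after 1000 -> timeout
    end.
```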

Parallel pread calls are implemented via a NIF which copies some of the file functions from OTP's prim_file NIF [2]. The original OTP handle is dup()-ed [4] and then closed, and our NIF takes control of the new, duplicated file descriptor. This is necessary to allow access by multiple readers via reader/writer locks, and to carefully manage the closing state.

To keep things simple, the new handles created by couch_cfile implement the `#file_descriptor{module = $Module, data = $Data}` protocol, so that once a handle is opened, the regular `file` module in OTP knows how to dispatch calls on it to our couch_cfile.erl functions. This way most of couch_file stays the same, with all the same `file:` calls in the main data path.
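As a rough sketch of that dispatch protocol (module and function names below are illustrative, not the actual couch_cfile.erl), the OTP `file` module pattern-matches on the `module` field of `#file_descriptor{}` and forwards the call:

```erlang
%% Illustrative module (not the actual couch_cfile.erl): the OTP `file`
%% module sees #file_descriptor{module = cfile_sketch} and dispatches
%% file:pread/3 to cfile_sketch:pread/3.
-module(cfile_sketch).
-export([pread/3]).

-include_lib("kernel/include/file.hrl").

pread(#file_descriptor{module = ?MODULE, data = NifHandle}, Pos, Len) ->
    %% nif_pread/3 stands in for the NIF-backed read; a real module
    %% would replace it via erlang:load_nif/2.
    nif_pread(NifHandle, Pos, Len).

nif_pread(_NifHandle, _Pos, _Len) ->
    erlang:nif_error(nif_not_loaded).
```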

The couch_cfile bypass is also opportunistic: if it is not available (on Windows) or not enabled, things proceed as before.

The reason we need a new dup()-ed file descriptor is to manage closing very carefully. Since on POSIX systems file descriptors are just integers, it's very easy to accidentally read from a file descriptor that has already been closed and re-opened by something else. That's why there are locks and a whole new file descriptor which our NIF controls. As long as we control the file descriptor with our resource "handle", we can be sure it will stay open and won't be re-used by any other process.

Since none of the three compatible IOQ systems currently knows how to call a simple MFA, and instead they only send a `$gen_call` message to a gen_server, parallel cfile reads are only available when we bypass the IOQ. By default, requests that are already configured to bypass the IOQ will use parallel preads. To enable parallel preads for all requests, toggle the `[couchdb] cfile_skip_ioq` setting to `true`.
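A minimal sketch of the resulting decision, assuming hypothetical helpers `couch_cfile_available/0` and `ioq_would_bypass/2` (only the `cfile_skip_ioq` config key comes from this PR; everything else here is illustrative):

```erlang
%% Sketch only: choose the parallel pread path when the cfile handle is
%% usable AND the request would not actually go through the IOQ, either
%% via the global toggle or a per-class IOQ bypass.
use_parallel_pread(IoqMsg, Priority) ->
    couch_cfile_available() andalso
        (config:get_boolean("couchdb", "cfile_skip_ioq", false) orelse
            ioq_would_bypass(IoqMsg, Priority)).
```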

To gain confidence that the new couch_cfile behaves the same way as the Erlang/OTP one, there is a property test which asserts that for any pair of {Raw, CFile} handles, any supported file operation returns exactly the same results. The test itself was validated by modifying some of the couch_file.c arguments and observing the property tests start to fail.
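A minimal sketch of the shape of such a property, using PropEr, where `open_pair/0` is a hypothetical helper that opens a raw OTP handle and a cfile handle over identical file contents (the real test exercises many more operations):

```erlang
%% Sketch of the equivalence property: any pread must return exactly the
%% same result on the raw OTP handle and on the cfile handle.
-include_lib("proper/include/proper.hrl").

prop_pread_results_match() ->
    ?FORALL({Pos, Len}, {range(0, 65536), range(0, 4096)},
        begin
            {Raw, CFile} = open_pair(),
            Match = file:pread(Raw, Pos, Len) =:= file:pread(CFile, Pos, Len),
            ok = file:close(Raw),
            ok = file:close(CFile),
            Match
        end).
```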

A simple sequential benchmark was run initially to show that even in the most unfavorable case, all sequential operations, we haven't gotten worse:

```
> fabric_bench:go(#{q=>1, n=>1, doc_size=>small, docs=>100000}).
 *** Parameters
 * batch_size       : 1000
 * doc_size         : small
 * docs             : 100000
 * individual_docs  : 1000
 * n                : 1
 * q                : 1

 *** Environment
 * Nodes        : 1
 * Bench ver.   : 1
 * N            : 1
 * Q            : 1
 * OS           : unix/linux
```

Each case was run 5 times and the best rate picked; rates are in ops/sec, so higher is better:

```
                                                Default  CFile

* Add 100000 docs, ok:100/accepted:0     (Hz):   16000    16000
* Get random doc 100000X                 (Hz):    4900     5800
* All docs                               (Hz):  120000   140000
* All docs w/ include_docs               (Hz):   24000    31000
* Changes                                (Hz):   49000    51000
* Single doc updates 1000X               (Hz):     380      410
```

[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
[2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
[3] https://github.com/saleyn/emmap
[4] https://www.man7.org/linux/man-pages/man2/dup.2.html

big-r81 (Contributor) commented Jan 14, 2025

Ran an erlfmt-format to make the CI happy ...

nickva force-pushed the parallel-preads branch 2 times, most recently from 43a53ae to 54751ac on January 14, 2025 16:01
nickva force-pushed the parallel-preads branch 6 times, most recently from d081c04 to 1c32542 on January 18, 2025 08:41
nickva (Contributor, Author) commented Jan 20, 2025

Another benchmark with concurrent reads on a 12 node cluster. This one exercises concurrency but is still read-only, so it may not show off the best part of the PR (readers proceeding without waiting for writes to complete), but it still shows a roughly 40% improvement in requests per second: 35k rps on main vs 50k rps on the preads branch.

Environment: Erlang 25, 12 node cluster, Debian 11, x86_64

Test parameters:

```
Concurrency (number of clients): 2000
Q: 64
Doc count: 1000000
Length: 1 hour
```

Config settings:

```
couchdb.max_dbs_open = 50000
ioq.bypass.db_update = true
ioq.bypass.interactive = true
```

Summary (ops = number of get_doc operations completed in 1 hour; other columns are latencies in milliseconds):

```
release  ops        median  p75    p90
MAIN     127701394  15469   21095  27490
PREADS   173440163   7595   18967  39236
```
[Screenshot: benchmark result graphs, 2025-01-20]

nickva force-pushed the parallel-preads branch 8 times, most recently from 689204c to c709a32 on January 23, 2025 05:55
janl (Member) commented Jan 24, 2025

I did some testing here as well, all this on a q=1 db on a single node.

It’s quite impressive, especially in terms of added concurrency. main does 10k rps on a single node with a concurrency level of 10–15 requesting a single doc, and it drops with larger concurrency. parallel_preads does 25k rps (!) and handily goes up to 200 concurrent requests. Couldn’t go further for ulimit reasons that I can’t be bothered to sort out tonight. But that’s quite the win.

another preads test: I’m using test/bench/benchbulk to write 1M 10-byte docs in batches of 1000 into a db, and ab to read the first inserted doc with concurrency 15, 100k times. In both main and parallel_preads, adding the reads slows the writes down noticeably, but the read rps stays roughly the same (only parallel_preads does roughly 2x over main).
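(For reference, that kind of read load can be generated with something along the lines of `ab -n 100000 -c 15 http://127.0.0.1:5984/bench-db/doc-1`; the host, database, and document names here are made up.)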

dialled it up to 1M reads, so the read and write tests have roughly the same duration / effect on each other

inserting 1M docs while reading the same doc 1M times with 15 concurrent readers under parallel_preads:

```
real	1m27.434s
user	0m21.057s
sys	0m14.472s
```

the reads test ran for 84.316 seconds vs the 87.434 s above

just inserting the docs without reads:

```
real	0m17.923s
user	0m19.268s
sys	0m8.213s
```

for the r/w test on main we come out at 7100 rps, 2–3ms response times, with the longest request at 69ms

vs parallel_preads: 11800 rps, 1–2ms, worst case 67ms.

so while we get quite a bit more throughput, concurrent reads and writes still block each other.

this is on an M4 Pro 14-core box with essentially infinite IOPS and memory for the sake of this test, 10 4.5GHz CPUs + 4 slower ones (fastest consumer CPU on the market atm).

nickva force-pushed the parallel-preads branch 3 times, most recently from 3d9694b to 1ac7b23 on February 1, 2025 04:02
nickva (Contributor, Author) commented Feb 8, 2025

A full CI run testing on a variety of OSes (FreeBSD, macOS, RHEL-like):

https://ci-couchdb.apache.org/job/jenkins-cm1/job/FullPlatformMatrix/job/jenkins-parallel-preads/3/pipeline-graph/

[Screenshot: CI pipeline graph, 2025-02-08]

nickva (Contributor, Author) commented Feb 17, 2025

To gain more confidence, I added new tests that run concurrent preads from multiple processes for various block sizes. To verify that each process reads the correct bytes, they use a simple checksumming scheme: write byte X rem 256 at position X, so the file looks like 0,1,2,...,255,0,1,2,...,255,0,... This way each reader can check not only that it read the correct number of bytes but also that the bytes are what they should be at that position.
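A minimal sketch of that verification scheme (helper names are illustrative):

```erlang
%% The byte at absolute position X is X rem 256, so a reader can verify
%% any pread result on its own, without coordinating with the writer.
expected_pattern(Pos, Len) ->
    << <<(X rem 256)>> || X <- lists:seq(Pos, Pos + Len - 1) >>.

verify_pread(Fd, Pos, Len) ->
    {ok, Bin} = file:pread(Fd, Pos, Len),
    Bin =:= expected_pattern(Pos, Len).
```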

Locally I ran the block size tests for 30 minutes with this diff, and they passed:

```diff
diff --git a/src/couch/test/eunit/couch_cfile_tests.erl b/src/couch/test/eunit/couch_cfile_tests.erl
index e1e99951c..ab0fe5d3b 100644
--- a/src/couch/test/eunit/couch_cfile_tests.erl
+++ b/src/couch/test/eunit/couch_cfile_tests.erl
@@ -41,9 +41,9 @@ couch_cfile_test_() ->
                     ?TDEF_FE(t_gc_is_closing_file_handles),
                     ?TDEF_FE(t_monitor_is_closing_file_handles),
                     ?TDEF_FE(t_janitor_proc_is_up),
-                    ?TDEF_FE(t_concurrent_reads_512b),
-                    ?TDEF_FE(t_concurrent_reads_4kb),
-                    ?TDEF_FE(t_concurrent_reads_1mb)
+                    ?TDEF_FE(t_concurrent_reads_512b, 3600),
+                    ?TDEF_FE(t_concurrent_reads_4kb, 3600),
+                    ?TDEF_FE(t_concurrent_reads_1mb, 3600)
                 ]
         end
     }.
@@ -397,7 +397,7 @@ t_concurrent_reads_512b(Path) ->
     Fd = cfile(Path),
     Eof = write(Fd, 0, 512),
     ReadersPidRefs = spawn_readers(20, Fd, Eof),
-    timer:sleep(2000),
+    timer:sleep(1800 * 1000),
     [Pid ! stop_reading || {Pid, _} <- ReadersPidRefs],
     Count = gather_read_results(ReadersPidRefs, 0),
     ?assert(is_integer(Count) andalso Count > 1000).
@@ -406,7 +406,7 @@ t_concurrent_reads_4kb(Path) ->
     Fd = cfile(Path),
     Eof = write(Fd, 0, 4096),
     ReadersPidRefs = spawn_readers(10, Fd, Eof),
-    timer:sleep(2000),
+    timer:sleep(1800 * 1000),
     [Pid ! stop_reading || {Pid, _} <- ReadersPidRefs],
     Count = gather_read_results(ReadersPidRefs, 0),
     ?assert(is_integer(Count) andalso Count > 100).
@@ -415,7 +415,7 @@ t_concurrent_reads_1mb(Path) ->
     Fd = cfile(Path),
     Eof = write(Fd, 0, 1048576),
     ReadersPidRefs = spawn_readers(2, Fd, Eof),
-    timer:sleep(2000),
+    timer:sleep(1800 * 1000),
     [Pid ! stop_reading || {Pid, _} <- ReadersPidRefs],
     Count = gather_read_results(ReadersPidRefs, 0),
     ?assert(is_integer(Count) andalso Count > 10).
```

```
==> asf (setup_eunit)
==> couch (eunit)
======================== EUnit ========================
module 'couch_cfile_tests'
  ...
  couch_cfile_tests:43: -couch_cfile_test_/0-fun-8- (t_janitor_proc_is_up)...[0.001 s] ok
  couch_cfile_tests:44: -couch_cfile_test_/0-fun-6- (t_concurrent_reads_512b)...[1800.023 s] ok
  couch_cfile_tests:45: -couch_cfile_test_/0-fun-4- (t_concurrent_reads_4kb)...[1800.001 s] ok
  couch_cfile_tests:46: -couch_cfile_test_/0-fun-2- (t_concurrent_reads_1mb)...[1800.137 s] ok
  [done in 5400.849 s]
=======================================================
  All 16 tests passed.
Cover analysis: /Users/nvatama/asf/src/couch/.eunit/index.html
==> rel (eunit)
==> asf (eunit)
```

nickva (Contributor, Author) commented Feb 17, 2025

For the property tests comparing the behavior of cfile handles vs raw Erlang file handles, I increased the number of tests to 10M from the default 25k. The test ran for about 30 min and passed as well:

```diff
git diff src/couch/include/couch_eunit_proper.hrl src/couch/test/eunit/couch_cfile_prop_tests.erl
diff --git a/src/couch/include/couch_eunit_proper.hrl b/src/couch/include/couch_eunit_proper.hrl
index dcf07701a..86d9b245b 100644
--- a/src/couch/include/couch_eunit_proper.hrl
+++ b/src/couch/include/couch_eunit_proper.hrl
@@ -19,7 +19,7 @@
             atom_to_list(F),
             {timeout, QuickcheckTimeout,
                 ?_assert(proper:quickcheck(?MODULE:F(), [
-                    {to_file, user},
+                    %{to_file, user},
                     {start_size, 2},
                     {numtests, NumTests},
                     long_result
diff --git a/src/couch/test/eunit/couch_cfile_prop_tests.erl b/src/couch/test/eunit/couch_cfile_prop_tests.erl
index a2d954e5c..9ada2d8fa 100644
--- a/src/couch/test/eunit/couch_cfile_prop_tests.erl
+++ b/src/couch/test/eunit/couch_cfile_prop_tests.erl
@@ -19,7 +19,7 @@
 -include_lib("kernel/include/file.hrl").

 property_test_() ->
-    ?EUNIT_QUICKCHECK(60, 25000).
+    ?EUNIT_QUICKCHECK(3600, 10000000).
```

```
======================= EUnit ========================
module 'couch_cfile_prop_tests'
  couch_cfile_prop_tests:22: -property_test_/0-fun-2- (prop_file_ops_results_match_raw_file)...[1722.006 s] ok
  [done in 1723.859 s]
=======================================================
  Test passed.
Cover analysis: /Users/nvatama/asf/src/couch/.eunit/index.html

Code Coverage:
couch_cfile_prop_tests :  98%

Total                  : 98%
```

jaydoane (Contributor) left a comment:

Agree with @janl that the results are quite impressive, as is the test coverage!

Comment on lines +216 to +217:

```erlang
% Use parallel preads only if the request would be bypassed by the
% IOQ. All three compatible ioqs (the two from the
```

jaydoane (Contributor):

An elegant workaround to this IOQ limitation!

nickva merged commit c1a539b into main on Feb 19, 2025
23 checks passed
nickva deleted the parallel-preads branch on February 19, 2025 23:07
nickva added a commit to apache/couchdb-ioq that referenced this pull request Feb 20, 2025
Given a message and a priority, return `true` if that operation would be
bypassed. This is used by the parallel preads commit [1] with the cfile handle
when we know the request would be bypassed by the IOQ.

If/when we figure out how to teach the IOQ server to call a regular MFA, as
opposed to always sending a '$gen_call' message, we won't need this API and
can remove it.

While at it, do some minor test housekeeping and use the standard `?TDEF_FE`
helper macro to remove the eunit `?_test/begin` boilerplate code.

[1] apache/couchdb#5399