Implement parallel preads
Let clients issue concurrent pread calls without blocking each other or having
to wait for all the writes and fsync calls.

Even though pread calls are thread-safe at the POSIX level [1], the Erlang/OTP
file backend forces a single controlling process for raw file handles. So all
our reads were always funnelled through the couch_file gen_server, where they
had to queue up behind potentially slower writes. This is particularly
problematic with remote file systems, where fsyncs and writes may take a lot
longer while preads can hit the cache and return quickly.

Parallel pread calls are implemented via a NIF which copies the pread and file
closing bits from OTP's prim_file NIF [2]. Access to the shared handle is
controlled via RW locks similar to how emmap does it [3]. Multiple readers can
"read" acquire the RW lock and issue pread calls in parallel on the same file
descriptor. If a writer acquires it, all the readers will have to wait for it.
This kind of synchronization is necessary to carefully manage the closing
state.

In order to keep things simple, the write path and the opening and handling of
the main couch_file aren't affected. The parallel pread bypass is a purely
opportunistic optimization when it's enabled; if not enabled, reads proceed
as they always did - through the gen_server.

The cost of enabling it is using at most one extra file descriptor reference
obtained via the dup() [4] system call from the main couch_file handle. Unlike
another, newly opened file "description", the new "descriptor" is just a
reference pointing to the exact same file description in the kernel and sharing
all the buffers, position, modes, etc, with the main couch_file. The reason we
need a new dup()-ed file descriptor is to manage closing very carefully. Since
on POSIX systems file descriptors are just integers, it's very easy to
accidentally read from an already closed and re-opened (by something else) file
descriptor. That's why there are locks and a whole new file descriptor which
our NIF controls.
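
The difference between a dup()-ed descriptor and a separately opened file can
be checked with a small experiment. The helpers below are hypothetical and are
not part of this commit; they only illustrate that a duplicate shares the file
position with the original, and that pread itself never moves that position.

```c
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if a seek made through `fd` is visible through a dup()-ed
 * descriptor, 0 if it is not, and -1 on error. */
int dup_shares_offset(int fd) {
    int fd2 = dup(fd);
    if (fd2 < 0) return -1;
    off_t moved = lseek(fd, 3, SEEK_SET);  /* move via the original */
    off_t seen = lseek(fd2, 0, SEEK_CUR);  /* observe via the duplicate */
    close(fd2); /* closing the duplicate leaves `fd` itself open */
    if (moved < 0 || seen < 0) return -1;
    return seen == moved;
}

/* Returns 1 if a pread at offset 0 leaves the current file position
 * unchanged, 0 if the position moved, and -1 on error. */
int pread_keeps_offset(int fd) {
    char c;
    off_t before = lseek(fd, 0, SEEK_CUR);
    if (before < 0 || pread(fd, &c, 1, 0) < 0) return -1;
    return lseek(fd, 0, SEEK_CUR) == before;
}
```

Both helpers are expected to return 1 on a regular file, which is why reads
through the duplicate do not disturb the position used by the main handle.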

An alternative was to use the exact same file descriptor as the main file and,
after every single pread, validate that the data was read from the same file
by calling fstat and matching major/minor/inode numbers. We would then also
have to hope that a pread on some random pipe/socket/stdio handle never causes
any issue or blocks, and instead just quickly returns an error.
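
For comparison, the rejected fstat check might have looked roughly like the
sketch below. It is an assumption about what that alternative would involve,
not code from this commit; file_id_t and the file_id_* helpers are
hypothetical names. The device number (st_dev, which encodes major/minor) and
the inode number are recorded once and compared again after each read.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical identity record for an open file. */
typedef struct {
    dev_t dev; /* device number, encodes major/minor */
    ino_t ino; /* inode number */
} file_id_t;

/* Records the identity of the file behind `fd`; 0 on success, -1 on error. */
int file_id_get(int fd, file_id_t *id) {
    struct stat st;
    if (fstat(fd, &st) != 0) return -1;
    id->dev = st.st_dev;
    id->ino = st.st_ino;
    return 0;
}

/* Returns 1 if `fd` still refers to the expected file, 0 if it refers to
 * a different file or if fstat fails (for example, fd already closed). */
int file_id_matches(int fd, const file_id_t *expected) {
    file_id_t now;
    if (file_id_get(fd, &now) != 0) return 0;
    return now.dev == expected->dev && now.ino == expected->ino;
}
```

Note that such a check can only run after the pread has already been issued
against whatever the descriptor number happens to point to at that moment.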

So far, the only checks were that the cluster starts up and that reads and
writes go through. A quick sequential benchmark indicates that the plain,
sequential reads and writes haven't gotten worse; they all seem to have
improved a bit:

```
> fabric_bench:go(#{q=>1, n=>1, doc_size=>small, docs=>100000}).
 *** Parameters
 * batch_size       : 1000
 * doc_size         : small
 * docs             : 100000
 * individual_docs  : 1000
 * n                : 1
 * q                : 1

 *** Environment
 * Nodes        : 1
 * Bench ver.   : 1
 * N            : 1
 * Q            : 1
 * OS           : unix/linux
```

Each case was run 5 times and the best rate, in ops/sec, was picked, so higher
is better:

```
                                                Default  CFile

* Add 100000 docs, ok:100/accepted:0     (Hz):   16000    16000
* Get random doc 100000X                 (Hz):    4900     5800
* All docs                               (Hz):  120000   140000
* All docs w/ include_docs               (Hz):   24000    31000
* Changes                                (Hz):   49000    51000
* Single doc updates 1000X               (Hz):     380      410
```

[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
[2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
[3] https://github.com/saleyn/emmap
[4] https://www.man7.org/linux/man-pages/man2/dup.2.html
nickva authored and big-r81 committed Jan 14, 2025
1 parent 5ed0654 commit 3bebbf8
Showing 5 changed files with 760 additions and 24 deletions.
1 change: 1 addition & 0 deletions src/couch/.gitignore
@@ -5,6 +5,7 @@ ebin/
priv/couch_js/config.h
priv/couchjs
priv/couchspawnkillable
priv/couch_cfile/*.d
priv/*.exp
priv/*.lib
priv/*.dll