Implement parallel preads #5399 (Open)

wants to merge 1 commit into main from parallel-preads
Conversation

@nickva (Contributor) commented Jan 14, 2025

Implement parallel preads

Let clients issue concurrent pread calls without blocking each other or having to wait for all the writes and fsync calls.

Even though at the POSIX level pread calls are thread-safe [1], the Erlang/OTP file backend forces a single controlling process for raw file handles. So, all our reads were always funneled through the couch_file gen_server, having to queue up behind potentially slower writes. This is particularly problematic with remote file systems, where fsyncs and writes may take a lot longer while preads can hit the cache and return more quickly.

Parallel pread calls are implemented via a NIF which copies some of the file functions from OTP's prim_file NIF [2]. The original OTP handle is dup-ed and then closed, and our NIF takes control of the new duplicated file descriptor. This is necessary in order to allow multiple-reader access via reader/writer locks, and also to carefully manage the closing state.

To keep things simple, the new handles created by couch_cfile implement the #file_descriptor{module = $Module, data = $Data} protocol, so that once opened, OTP's regular file module knows how to dispatch calls on this handle to our couch_cfile.erl functions. In this way most of couch_file stays the same, with all the same file: calls in the main data path.

The couch_cfile bypass is also opportunistic: if it is not available (as on Windows) or not enabled, things proceed as before.

The reason we need a new dup()-ed file descriptor [4] is to manage closing very carefully. Since on POSIX systems file descriptors are just integers, it's very easy to accidentally read from a file descriptor that was already closed and re-opened (by something else). That's why there are locks and a whole new file descriptor which our NIF controls. As long as we control the file descriptor with our resource "handle", we can be sure it will stay open and won't be re-used by any other process.

So far I have only checked that the cluster starts up and that reads and writes go through; a quick sequential benchmark indicates that plain, sequential reads and writes haven't gotten worse, and they all seemed to improve a bit:

```
> fabric_bench:go(#{q=>1, n=>1, doc_size=>small, docs=>100000}).
 *** Parameters
 * batch_size       : 1000
 * doc_size         : small
 * docs             : 100000
 * individual_docs  : 1000
 * n                : 1
 * q                : 1

 *** Environment
 * Nodes        : 1
 * Bench ver.   : 1
 * N            : 1
 * Q            : 1
 * OS           : unix/linux
```

Each case ran 5 times and picked the best rate in ops/sec, so higher is better:

```
                                                Default  CFile

* Add 100000 docs, ok:100/accepted:0     (Hz):   16000    16000
* Get random doc 100000X                 (Hz):    4900     5800
* All docs                               (Hz):  120000   140000
* All docs w/ include_docs               (Hz):   24000    31000
* Changes                                (Hz):   49000    51000
* Single doc updates 1000X               (Hz):     380      410
```
[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
[2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
[3] https://github.com/saleyn/emmap
[4] https://www.man7.org/linux/man-pages/man2/dup.2.html

@big-r81 (Contributor) commented Jan 14, 2025

Ran an erlfmt format pass to make the CI happy ...

@nickva force-pushed the parallel-preads branch 2 times, most recently from 43a53ae to 54751ac on January 14, 2025 16:01
@nickva force-pushed the parallel-preads branch 5 times, most recently from 12e6a98 to d081c04 on January 18, 2025 00:36