Implement parallel preads
Let clients issue concurrent pread calls without blocking each other or having to wait for all the writes and fsync calls.
Even though pread calls are thread-safe at the POSIX level [1], the Erlang/OTP file backend forces a single controlling process for raw file handles. As a result, all our reads were funneled through the couch_file gen_server and had to queue up behind potentially slower writes. This is particularly problematic with remote file systems, where fsyncs and writes may take much longer while preads can hit the cache and return quickly.
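To illustrate the POSIX-level property this relies on, here is a minimal sketch (not code from this PR) in which several threads call pread on the same file descriptor at different offsets without any locking. The names `read_job`, `reader_main` and `parallel_pread_demo` are hypothetical, and the temporary file under /tmp is only an assumption for the demo.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NUM_READERS 4
#define CHUNK 8

/* Hypothetical per-thread job: read CHUNK bytes at a fixed offset. */
struct read_job {
    int fd;
    off_t offset;
    char buf[CHUNK];
    ssize_t nread;
};

static void *reader_main(void *arg) {
    struct read_job *job = arg;
    assert(job != NULL);
    /* pread takes an explicit offset and does not move the shared file
       position, so the threads need no coordination among themselves. */
    job->nread = pread(job->fd, job->buf, CHUNK, job->offset);
    return NULL;
}

/* Returns 0 if every thread read back exactly the bytes at its offset. */
int parallel_pread_demo(void) {
    char path[] = "/tmp/pread_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path); /* the file lives on until fd is closed */

    /* Fill the file with "AAAAAAAA", "BBBBBBBB", ... one chunk per reader. */
    char data[NUM_READERS * CHUNK];
    for (int i = 0; i < NUM_READERS; i++)
        memset(data + i * CHUNK, 'A' + i, CHUNK);
    if (write(fd, data, sizeof data) != (ssize_t)sizeof data) {
        close(fd);
        return -1;
    }

    pthread_t threads[NUM_READERS];
    struct read_job jobs[NUM_READERS];
    int started = 0;
    for (int i = 0; i < NUM_READERS; i++) {
        jobs[i].fd = fd;
        jobs[i].offset = (off_t)i * CHUNK;
        jobs[i].nread = -1;
        if (pthread_create(&threads[i], NULL, reader_main, &jobs[i]) != 0)
            break;
        started++;
    }

    int rc = (started == NUM_READERS) ? 0 : -1;
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        if (jobs[i].nread != CHUNK ||
            memcmp(jobs[i].buf, data + i * CHUNK, CHUNK) != 0)
            rc = -1;
    }
    close(fd);
    return rc;
}
```

Calling `parallel_pread_demo()` from a `main` function and checking for a 0 return is enough to try it out. On older glibc versions the program may need to be linked with `-lpthread`.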
Parallel pread calls are implemented via a NIF which copies some of the file functions from OTP's prim_file NIF [2]. The original OTP handle is dup-ed and then closed, and our NIF takes control of the new duplicated file descriptor. This is necessary in order to allow access by multiple readers via reader/writer locks, and also to carefully manage the closing state.
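The dup-then-close step can be sketched in plain C as follows. This is an illustration under stated assumptions, not the NIF's actual code: `take_over_fd` and `dup_takeover_demo` are hypothetical names, and in the real change the original handle is an OTP file handle rather than a descriptor opened directly by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: duplicate orig_fd, close the original, and return
   the duplicate, which the caller now owns. Returns -1 if dup fails, in
   which case the original is left open. */
int take_over_fd(int orig_fd) {
    assert(orig_fd >= 0);
    int new_fd = dup(orig_fd);
    if (new_fd < 0) return -1;
    close(orig_fd);
    return new_fd;
}

/* Returns 0 if the duplicate stays readable after the original is closed. */
int dup_takeover_demo(void) {
    char path[] = "/tmp/dup_demo_XXXXXX";
    int orig = mkstemp(path);
    if (orig < 0) return -1;
    unlink(path);
    if (write(orig, "hello", 5) != 5) {
        close(orig);
        return -1;
    }

    int owned = take_over_fd(orig);
    if (owned < 0) {
        close(orig);
        return -1;
    }

    /* The original integer no longer names an open file, so fcntl fails. */
    int orig_closed = (fcntl(orig, F_GETFD) == -1);

    /* The duplicate refers to the same open file and is still readable. */
    char buf[5];
    int read_ok = pread(owned, buf, 5, 0) == 5 &&
                  memcmp(buf, "hello", 5) == 0;

    close(owned);
    return (orig_closed && read_ok) ? 0 : -1;
}
```

A descriptor returned by dup shares the file offset and status flags with the original, which does not matter here because pread never uses the shared offset.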
In order to keep things simple, the new handles created by `couch_cfile` implement the `#file_descriptor{module = $Module, data = $Data}` protocol, such that once opened the regular `file` module in OTP will know how to dispatch calls with this handle to our `couch_cfile.erl` functions. In this way most of couch_file stays the same, with all the same `file:` calls in the main data path. The `couch_cfile` bypass is also opportunistic: if it is not available (on Windows) or not enabled, things proceed as before.
The reason we need a new dup()-ed file descriptor is to manage closing very carefully. Since file descriptors on POSIX systems are just integers, it's very easy to accidentally read from a file descriptor that has already been closed and re-opened by something else. That's why there are locks and a whole new file descriptor which our NIF controls. As long as we control the file descriptor through our resource "handle", we can be sure it will stay open and won't be re-used by any other process.
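One way to picture the closing discipline described above is a handle that pairs the descriptor with a reader/writer lock: readers hold the read lock for the duration of pread, and close takes the write lock and marks the descriptor invalid. This is a minimal sketch with hypothetical names (`guarded_fd`, `guarded_pread`, `guarded_close`), not the NIF's actual resource type.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical handle: fd is set to -1 once the handle has been closed. */
struct guarded_fd {
    pthread_rwlock_t lock;
    int fd;
};

int guarded_init(struct guarded_fd *h, int fd) {
    assert(h != NULL);
    h->fd = fd;
    return pthread_rwlock_init(&h->lock, NULL);
}

/* Many readers may hold the read lock at once, so preads run in parallel.
   A closed handle yields -1 with errno set to EBADF. */
ssize_t guarded_pread(struct guarded_fd *h, void *buf, size_t len, off_t off) {
    ssize_t n;
    pthread_rwlock_rdlock(&h->lock);
    if (h->fd < 0) {
        errno = EBADF;
        n = -1;
    } else {
        n = pread(h->fd, buf, len, off);
    }
    int saved_errno = errno; /* keep errno stable across the unlock */
    pthread_rwlock_unlock(&h->lock);
    errno = saved_errno;
    return n;
}

/* The write lock waits for in-flight readers, so the descriptor is never
   closed underneath a pread. Closing twice is a harmless no-op. */
int guarded_close(struct guarded_fd *h) {
    int rc = 0;
    pthread_rwlock_wrlock(&h->lock);
    if (h->fd >= 0) {
        rc = close(h->fd);
        h->fd = -1;
    }
    pthread_rwlock_unlock(&h->lock);
    return rc;
}

/* Returns 0 if reads work before close and fail cleanly after it. */
int guarded_fd_demo(void) {
    char path[] = "/tmp/guarded_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);
    if (write(fd, "data", 4) != 4) {
        close(fd);
        return -1;
    }

    struct guarded_fd h;
    if (guarded_init(&h, fd) != 0) {
        close(fd);
        return -1;
    }

    char buf[4];
    int read_ok = guarded_pread(&h, buf, 4, 0) == 4 &&
                  memcmp(buf, "data", 4) == 0;
    guarded_close(&h);
    int rejected = guarded_pread(&h, buf, 4, 0) == -1 && errno == EBADF;
    int second_close_ok = guarded_close(&h) == 0;

    pthread_rwlock_destroy(&h.lock);
    return (read_ok && rejected && second_close_ok) ? 0 : -1;
}
```

Because the integer is only ever used while the lock is held and the handle still reports it as open, a reader can never end up reading from a descriptor number that was closed and then handed out again to an unrelated open call.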
So far I have only checked that the cluster starts up and that reads and writes go through. A quick sequential benchmark indicates that plain sequential reads and writes haven't gotten worse; they all seem to have improved a bit.
Each case was run 5 times and the best rate in ops/sec was kept, so higher is better.
[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
[2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
[3] https://github.com/saleyn/emmap
[4] https://www.man7.org/linux/man-pages/man2/dup.2.html