Commit 3bebbf8
Implement parallel preads
Let clients issue concurrent pread calls without blocking each other or having
to wait for all the writes and fsync calls.

Even though pread calls are thread-safe at the POSIX level [1], the Erlang/OTP
file backend enforces a single controlling process for raw file handles. So all
our reads were funnelled through the couch_file gen_server and had to queue up
behind potentially slower writes. This is particularly problematic with remote
file systems, where fsyncs and writes may take much longer while preads can hit
the cache and return quickly.
Parallel pread calls are implemented via a NIF which copies the pread and file
closing bits from OTP's prim_file NIF [2]. Access to the shared handle is
controlled via RW locks, similar to how emmap does it [3]. Multiple readers can
acquire the RW lock in read mode and issue pread calls in parallel on the same
file descriptor. If a writer acquires it, all the readers have to wait for it.
This kind of synchronization is necessary to manage the closing state
carefully.
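
As a rough illustration of that locking scheme (a minimal C sketch with made-up
names, not the actual couch_cfile NIF code): readers hold the lock in read mode
only for the duration of the pread, while close takes it in write mode, so it
waits for in-flight reads and no later read can touch a stale descriptor.

```
#include <pthread.h>
#include <unistd.h>
#include <errno.h>

typedef struct {
    int fd;                  /* dup()-ed descriptor owned by the NIF */
    int closed;              /* only flipped while holding the write lock */
    pthread_rwlock_t lock;
} shared_handle;

ssize_t handle_pread(shared_handle *h, void *buf, size_t count, off_t offset) {
    ssize_t n;
    pthread_rwlock_rdlock(&h->lock);       /* many readers may hold this at once */
    if (h->closed) {
        pthread_rwlock_unlock(&h->lock);
        errno = EBADF;
        return -1;
    }
    n = pread(h->fd, buf, count, offset);  /* thread-safe per POSIX */
    pthread_rwlock_unlock(&h->lock);
    return n;
}

void handle_close(shared_handle *h) {
    pthread_rwlock_wrlock(&h->lock);       /* waits for all in-flight preads */
    if (!h->closed) {
        close(h->fd);
        h->closed = 1;
    }
    pthread_rwlock_unlock(&h->lock);
}
```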
To keep things simple, the write path and the opening and handling of the main
couch_file are not affected. The parallel pread bypass is a purely
opportunistic optimization when enabled; if it is not enabled, reads proceed as
they always did, through the gen_server.
The cost of enabling it is at most one extra file descriptor reference,
obtained via the dup() [4] system call from the main couch_file handle. Unlike
a newly opened file "description", the new "descriptor" is just a reference
pointing to the exact same file description in the kernel, sharing all the
buffers, position, modes, etc., with the main couch_file. The reason we need a
new dup()-ed file descriptor is to manage closing very carefully: since on
POSIX systems file descriptors are just integers, it's very easy to
accidentally read from a file descriptor that has already been closed and
re-opened by something else. That's why there are locks and a whole new file
descriptor which our NIF controls.
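
A sketch of that setup step (hypothetical helper name, not the actual NIF
code): dup() hands the NIF its own descriptor number for the same open file
description, so the NIF alone decides when that number is closed and can be
recycled.

```
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>

/* Duplicate the main couch_file descriptor and set up the lock that will
 * guard every pread and the eventual close of the duplicate. */
int cfile_dup_handle(int main_fd, int *out_fd, pthread_rwlock_t *lock) {
    int fd = dup(main_fd);              /* same file description, new number */
    if (fd < 0)
        return -1;
    fcntl(fd, F_SETFD, FD_CLOEXEC);     /* keep it out of child processes */
    if (pthread_rwlock_init(lock, NULL) != 0) {
        close(fd);
        return -1;
    }
    *out_fd = fd;
    return 0;
}
```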
An alternative was to use the exact same file descriptor as the main file and
then, after every single pread, validate that the data was read from the same
file by calling fstat and matching the major/minor/inode numbers. That also
means hoping that a pread on some random pipe/socket/stdio handle would never
cause an issue or block, and would just quickly return an error.
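
For contrast, a rough sketch of that rejected approach (hypothetical helper,
not part of this change): every pread would be followed by an fstat() and a
device/inode comparison, which can only detect a recycled descriptor after the
read has already been issued against it.

```
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

/* Read via the shared descriptor, then verify it still refers to the expected
 * file by comparing device and inode numbers. */
ssize_t checked_pread(int fd, dev_t want_dev, ino_t want_ino,
                      void *buf, size_t count, off_t offset) {
    struct stat st;
    ssize_t n = pread(fd, buf, count, offset);
    if (n < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_dev != want_dev || st.st_ino != want_ino) {
        errno = ESTALE;  /* descriptor was closed and reused by something else */
        return -1;
    }
    return n;
}
```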
So far, I've only checked that the cluster starts up, reads and writes go
through, and that a quick benchmark indicates plain, sequential reads and
writes haven't gotten worse; they all seem to have improved a bit:
```
> fabric_bench:go(#{q=>1, n=>1, doc_size=>small, docs=>100000}).
*** Parameters
* batch_size : 1000
* doc_size : small
* docs : 100000
* individual_docs : 1000
* n : 1
* q : 1
*** Environment
* Nodes : 1
* Bench ver. : 1
* N : 1
* Q : 1
* OS : unix/linux
```
Each case was run 5 times and the best rate in ops/sec was taken, so higher is better:
```
                                             Default    CFile
* Add 100000 docs, ok:100/accepted:0 (Hz):     16000    16000
* Get random doc 100000X (Hz):                  4900     5800
* All docs (Hz):                              120000   140000
* All docs w/ include_docs (Hz):               24000    31000
* Changes (Hz):                                49000    51000
* Single doc updates 1000X (Hz):                 380      410
```
[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
[2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
[3] https://github.com/saleyn/emmap
[4] https://www.man7.org/linux/man-pages/man2/dup.2.html