Let clients issue concurrent pread calls without blocking each other or having to wait for all the writes and fsync calls.

Even though pread calls are thread-safe at the POSIX level [1], the Erlang/OTP file backend forces a single controlling process for raw file handles, so all our reads were funnelled through the couch_file gen_server and had to queue up behind potentially slower writes. This is particularly problematic with remote file systems, where fsyncs and writes may take a lot longer while preads can hit the cache and return quickly.

Parallel pread calls are implemented via a NIF which copies the pread and file-closing bits from OTP's prim_file NIF [2]. Access to the shared handle is controlled via RW locks, similar to how emmap does it [3]. Multiple readers can "read"-acquire the RW lock and issue pread calls in parallel on the same file descriptor. If a writer acquires it, all the readers have to wait for it. This kind of synchronization is necessary to carefully manage the closing state; a rough sketch of the scheme follows the reference list below.

To keep things simple, the write path and the opening and handling of the main couch_file aren't affected. The parallel pread bypass is a purely opportunistic optimization when enabled; when not enabled, reads proceed as they always did, through the gen_server.

The cost of enabling it is at most one extra file descriptor, obtained via the dup() [4] system call from the main couch_file handle. Unlike a newly opened file "description", the new "descriptor" is just a reference pointing to the exact same file description in the kernel, sharing all the buffers, position, modes, etc., with the main couch_file.

The reason we need a new dup()-ed file descriptor is to manage closing very carefully. Since on POSIX systems file descriptors are just integers, it's very easy to accidentally read from an already closed and re-opened (by something else) file descriptor. That's why there are locks and a whole new file descriptor which our NIF controls. The alternative was to use the exact same file descriptor as the main file and then, after every single pread, validate via fstat that the data was read from the same file by matching major/minor/inode numbers, while also hoping that a pread on some random pipe/socket/stdio handle would never block or cause any issue instead of quickly returning an error (a sketch of that rejected approach also follows below).

So far, only basic checks were done: the cluster starts up, reads and writes go through, and a quick sequential benchmark indicates that plain, sequential reads and writes haven't gotten worse. If anything, they all seem to have improved a bit:

```
> fabric_bench:go(#{q=>1, n=>1, doc_size=>small, docs=>100000}).
 *** Parameters
  * batch_size      : 1000
  * doc_size        : small
  * docs            : 100000
  * individual_docs : 1000
  * n               : 1
  * q               : 1
 *** Environment
  * Nodes      : 1
  * Bench ver. : 1
  * N          : 1
  * Q          : 1
  * OS         : unix/linux
```

Each case ran 5 times and the best rate (in ops/sec) was picked, so higher is better:

```
                                             Default   CFile
 * Add 100000 docs, ok:100/accepted:0 (Hz):    16000   16000
 * Get random doc 100000X (Hz):                 4900    5800
 * All docs (Hz):                             120000  140000
 * All docs w/ include_docs (Hz):              24000   31000
 * Changes (Hz):                               49000   51000
 * Single doc updates 1000X (Hz):                380     410
```

[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
[2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
[3] https://github.com/saleyn/emmap
[4] https://www.man7.org/linux/man-pages/man2/dup.2.html
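For illustration, here is a minimal C sketch of the scheme described above: a dup()-ed descriptor guarded by an RW lock, where concurrent readers issue pread calls under the read lock and closing takes the write lock. The names (cfile_t, cfile_dup, cfile_pread, cfile_close) are hypothetical, and an actual NIF would typically use the erl_nif lock API rather than raw pthreads; this is a sketch of the synchronization, not the implementation itself.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

typedef struct {
    pthread_rwlock_t lock;
    int fd;       /* dup()-ed from the main couch_file descriptor */
    bool closed;  /* flipped exactly once, under the write lock */
} cfile_t;

/* Create a second descriptor referring to the same open file description. */
int cfile_dup(cfile_t *h, int main_fd) {
    h->fd = dup(main_fd);
    if (h->fd < 0)
        return -1;
    if (pthread_rwlock_init(&h->lock, NULL) != 0) {
        close(h->fd);
        return -1;
    }
    h->closed = false;
    return 0;
}

/* Many readers may hold the lock at once; pread itself is thread-safe. */
ssize_t cfile_pread(cfile_t *h, void *buf, size_t len, off_t off) {
    ssize_t n;
    pthread_rwlock_rdlock(&h->lock);
    if (h->closed) {
        errno = EBADF;
        n = -1;
    } else {
        n = pread(h->fd, buf, len, off);
    }
    pthread_rwlock_unlock(&h->lock);
    return n;
}

/* Closing takes the lock exclusively, so no reader can race the close
 * and accidentally pread from a recycled descriptor number. */
int cfile_close(cfile_t *h) {
    int rc = 0;
    pthread_rwlock_wrlock(&h->lock);
    if (!h->closed) {
        rc = close(h->fd);
        h->closed = true;
    }
    pthread_rwlock_unlock(&h->lock);
    return rc;
}
```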
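To see why pread, with its explicit per-call offset, is the right call on a dup()-ed descriptor, this small standalone program demonstrates that the duplicated descriptor shares the file offset with the original while pread leaves that offset untouched. The file path is arbitrary; any readable file works.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);  /* any readable file */
    if (fd < 0)
        return 1;
    int fd2 = dup(fd);

    char c;
    read(fd, &c, 1);  /* advances the shared offset... */
    printf("offset seen via fd2: %lld\n",
           (long long)lseek(fd2, 0, SEEK_CUR));  /* ...prints 1 */

    /* pread() takes an explicit offset and leaves the shared
     * offset alone, so parallel readers can't perturb the writer. */
    pread(fd2, &c, 1, 0);
    printf("offset after pread:  %lld\n",
           (long long)lseek(fd, 0, SEEK_CUR));   /* still 1 */

    close(fd);
    close(fd2);
    return 0;
}
```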
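For contrast, the rejected alternative might have looked roughly like the following sketch: reuse the main descriptor and, after every pread, verify via fstat that the device and inode numbers still match. The names are illustrative. Note that the check runs only after the pread returns, so a read against a recycled descriptor number (say, a pipe) could still block before the mismatch is ever detected, which is exactly the fragility described above.

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

ssize_t checked_pread(int fd, void *buf, size_t len, off_t off,
                      dev_t want_dev, ino_t want_ino) {
    ssize_t n = pread(fd, buf, len, off);
    if (n < 0)
        return n;

    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;

    /* If the fd number was closed and recycled, dev/inode won't match. */
    if (st.st_dev != want_dev || st.st_ino != want_ino) {
        errno = ESTALE;
        return -1;
    }
    return n;
}
```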