Fjall batch.commit() stalls under sustained insert workload #294
Replies: 3 comments
Hard to say. Your workload writes small values as far as I can tell, so I wouldn't expect compaction to be too much of an issue, unless you have a very underpowered CPU or hundreds of thousands of writes. You could try disabling data block compression as a start.
Compaction strategy and compression (see °1) are the largest contributors to how compaction behaves, I would say. Currently only Leveled compaction is implemented, which is more prone to write stalls when write throughput is heavy. The alternative would be (Size-)Tiered compaction, but that is currently not available.
Write batches are serialized internally.
Yes, as explained above, if data is written more quickly than it can be reorganized through compaction, the system may stall itself to keep read performance from getting out of hand.
Yes. you should both see |
Hi Marvin,
Thanks for your quick response. I really appreciate it.
I continued investigating the write stalls we discussed and now have more precise data from Fjall’s internal metrics.
My workload is a write-heavy secondary index for a cache system. Each logical insert writes two small entries into one Fjall keyspace:
data_key -> expires_at
The workload is mostly insert-only.
Current test configuration:
I added logging for:
According to the docs, write_buffer_size is active + sealed memtables. The stalls correlate very strongly with write_buffer_size.
A typical sequence looks like this:
commit_duration_ms ≈ 85–87 ms
then:
commit_duration_ms = 1111 ms
Queue full starts immediately after this. After that slow commit, commits become faster again, around 24–60 ms, but the queue remains under pressure for a while because a backlog has already accumulated.
During that period, write_buffer_size continues growing:
write_buffer_size = 544 MB
Then suddenly Fjall creates new L0 tables and the write buffer drops sharply:
l0_table_count: 18 -> 21
So the pattern seems to be:
write_buffer_size grows
The interesting part is that this happens even with l0_threshold = 64, while l0_table_count is only around 18, so this does not look like the L0 threshold alone. It looks more related to active + sealed memtable pressure / flush behavior. The slow commit and queue pressure start consistently around write_buffer_size ≈ 540 MB.
Questions (I hope not too many …):
At this point, the bottleneck does not seem to be caused by the filesystem, io_uring reads/writes, compression, compaction filter, or concurrent commits. It seems tied to Fjall’s internal write buffer / sealed memtable / L0 flush behavior.
Any help would be very much appreciated. Thanks a lot, and congrats on this amazing crate.
Joan
This is possibly one of the culprits. Fjall is hard-coded to start slowing down writes once there are >= 10 L0 tables. The default threshold is quite sensible at 4. I would not blindly change this unless you can verify via a benchmark that changing it improves performance for your specific workload.
This is also possibly an issue. The default journal (soft) max size is 512 MiB < 1024 MiB.
The active memtable. It's basically the threshold at which a memtable is sealed and queued for flushing.
See above.
Currently no. At least not directly.
Hard to say. Flushes should be fairly quick, so I'm not sure at which interval you are polling the metrics.
If you see sealed memtables building up that means the worker threads are busy compacting, so they can't take over a flush task. If your worker thread count > 1, one thread is normally reserved to only serve flush tasks.
Unlikely unless you see
For write-heavy workloads, the Size-Tiered (Universal in RocksDB terminology) compaction strategy is preferred, as long as you can accept higher temporary space usage. However that is currently not implemented in |
Hi,
I am using Fjall as a disk-based secondary index for a high-throughput cache system written in Rust.
The workload is basically insert-only most of the time. Deletes are very rare. The index is used to find cache keys by composite fields such as hotel_code + check_in + check_out.
The current test case is:
1 logical index
1 keyspace
insert-only workload
32,768 rows per batch
approximately 4 flushes per second
Each indexed row writes:
data_key -> expires_at
reverse_key(cache_key) -> reverse_value containing expires_at + data_key
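The two entries per row could be encoded along the following lines. This is only a sketch: the byte layout (an 8-byte big-endian expires_at followed by the raw data_key bytes) and the function names are assumptions made for illustration, not taken from the actual indexer code.

```rust
// Hypothetical encoding helpers; the real key and value layout is not shown in the post.

/// Value stored under `data_key`: the expiry timestamp as 8 big-endian bytes.
fn encode_expires_at(expires_at: u64) -> [u8; 8] {
    expires_at.to_be_bytes()
}

/// Value stored under the reverse key: `expires_at` followed by `data_key`.
fn encode_reverse_value(expires_at: u64, data_key: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + data_key.len());
    out.extend_from_slice(&expires_at.to_be_bytes());
    out.extend_from_slice(data_key);
    out
}

/// Splits a reverse value back into `(expires_at, data_key)`.
/// Returns `None` when the value is too short to hold the 8-byte prefix.
fn decode_reverse_value(value: &[u8]) -> Option<(u64, &[u8])> {
    if value.len() < 8 {
        return None;
    }
    let (head, tail) = value.split_at(8);
    let mut buf = [0u8; 8];
    buf.copy_from_slice(head);
    Some((u64::from_be_bytes(buf), tail))
}
```

A fixed-width prefix keeps decoding trivial, and both values stay small, which matches the small-value workload described here.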
The batch commit path is roughly:
I originally had a single indexer thread doing the whole flush synchronously. Later I moved the heavy commit work to a dedicated commit thread/pool. The indexer now only buffers inserts and sends owned buffers to the commit thread. This confirmed that the issue is not in my queueing logic: the commit job starts, reaches batch.commit(), and then sometimes does not return for many seconds.
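The hand-off between the indexer and the commit thread can be pictured with a bounded channel from the standard library. This is a minimal sketch under assumptions: `Buffer`, `SubmitError`, `commit_queue` and `submit` are hypothetical names, and the real queue may track in-flight commits differently.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// One owned buffer of encoded (key, value) pairs handed to the commit thread.
/// Hypothetical type, for illustration only.
type Buffer = Vec<(Vec<u8>, Vec<u8>)>;

#[derive(Debug, PartialEq)]
enum SubmitError {
    /// The bounded queue already holds `max_inflight` buffers.
    QueueFull,
    /// The commit thread dropped its receiver.
    CommitThreadGone,
}

/// Creates a queue that holds at most `max_inflight` pending buffers.
fn commit_queue(max_inflight: usize) -> (SyncSender<Buffer>, Receiver<Buffer>) {
    sync_channel(max_inflight)
}

/// Tries to hand a buffer to the commit thread without blocking the indexer.
fn submit(tx: &SyncSender<Buffer>, buf: Buffer) -> Result<(), SubmitError> {
    tx.try_send(buf).map_err(|e| match e {
        TrySendError::Full(_) => SubmitError::QueueFull,
        TrySendError::Disconnected(_) => SubmitError::CommitThreadGone,
    })
}
```

With this shape, a commit thread that never returns from its commit call stops draining the receiver, so later `submit` calls fail with `QueueFull` once the bound is reached. Note that a buffer the commit thread has already received no longer counts against the channel bound.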
Typical normal commits are around 40–150 ms for 32,768 inserted rows. For example:
flush successfully performed in 56 ms = 5.7 ms pre-commit + 50.5 ms commit
flush successfully performed in 67 ms = 5.3 ms pre-commit + 61.7 ms commit
flush successfully performed in 110 ms = 2.8 ms pre-commit + 107 ms commit
But periodically, a commit stalls badly. With commit_threads=1 and max_inflight=1, I see this sequence:
indexer.commit_job.start method=36 buffer=15 inserts=32768
indexer.batch_commit.start method=36 buffer=15 <-- COMMIT STARTS, but 'indexer.batch_commit.end' does not appear below
indexer.wait_for_capacity.start method=36 pending_commits=1 max_inflight=1 active_entries=32768
indexer.wait_commit_result.timeout waiting_method=36 pending_commits=1 active_entries=32768
indexer.wait_commit_result.timeout waiting_method=36 pending_commits=1 active_entries=32768
...
cache handler :: 'put' error on method '36': indexer queue full
The important part is that batch_commit.end never appears for that buffer, so the commit thread is stuck inside batch.commit().
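To tell an occasional slow commit apart from the normal 40–150 ms range, per-phase timing plus a simple outlier check is enough. The sketch below is an assumption about how such logging could look, not the code that produced the lines above; `timed`, `is_stall` and the `factor` parameter are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Runs `f` and returns its result together with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// Flags a commit as a stall when it took more than `factor` times the
/// median of recent commit durations (upper median for even-length input).
/// With no history yet, nothing is flagged.
fn is_stall(recent: &[Duration], current: Duration, factor: u32) -> bool {
    if recent.is_empty() {
        return false;
    }
    let mut sorted = recent.to_vec();
    sorted.sort();
    let median = sorted[sorted.len() / 2];
    current > median * factor
}
```

Wrapping the pre-commit work and the commit call in separate `timed` closures yields the two numbers reported per flush. A commit that never returns will not show up here at all, so a watchdog timeout on the waiting side is still needed for the fully stuck case.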
I also tested with:
commit_threads = 2
max_inflight_commits_per_method = 2
and saw the same pattern, except now two commit jobs can get stuck at the same time.
When the stall happens, kernel stacks sometimes show Fjall-related activity around futex waits, and the commit thread may appear sleeping in hrtimer_nanosleep. In one earlier capture during a normal stall, I also saw a Fjall worker in the ext4 buffered write path:
ext4_buffered_write_iter
generic_perform_write
vfs_write
During the stuck state, the application keeps working for cache reads, but the indexer queue fills up because the commit result never arrives.
My questions are:
I can share more logs or a reduced benchmark if useful.
Thanks a lot,
Joan