Fjall batch.commit() stalls under sustained insert workload #294

joanbalaguero · 2026-05-21T12:14:24Z

Hi,

I am using Fjall as a disk-based secondary index for a high-throughput cache system written in Rust.

The workload is basically insert-only most of the time. Deletes are very rare. The index is used to find cache keys by composite fields such as hotel_code + check_in + check_out.
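For illustration only, a composite key of this kind could be encoded so that byte-wise ordering follows the field order, which is what makes prefix lookups by hotel possible. This is a minimal sketch, not the actual key layout used here: the field widths, the day-number representation of dates, and the names encode_key and hotel_prefix are all assumptions.

```rust
/// Encodes (hotel_code, check_in, check_out) into a fixed-width byte key.
/// Big-endian encoding makes the lexicographic byte order match the
/// numeric order of the fields, from left to right.
/// Dates are assumed to be day numbers that fit in a u16 (hypothetical).
fn encode_key(hotel_code: u32, check_in_day: u16, check_out_day: u16) -> [u8; 8] {
    let mut key = [0u8; 8];
    key[0..4].copy_from_slice(&hotel_code.to_be_bytes());
    key[4..6].copy_from_slice(&check_in_day.to_be_bytes());
    key[6..8].copy_from_slice(&check_out_day.to_be_bytes());
    key
}

/// Prefix shared by every key of one hotel, usable for a prefix scan.
fn hotel_prefix(hotel_code: u32) -> [u8; 4] {
    hotel_code.to_be_bytes()
}
```

With this layout, all keys for one hotel are contiguous in the keyspace and sorted by check-in day, then by check-out day.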

The current test case is:

1 logical index
1 keyspace
insert-only workload
32,768 rows per batch
approximately 4 flushes per second

Each indexed row writes:
data_key -> expires_at
reverse_key(cache_key) -> reverse_value containing expires_at + data_key
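As a back-of-the-envelope check, the figures above imply the following write rate. The constant and function names are illustrative, and the flush rate is the approximate value quoted in the test case.

```rust
/// Rows per batch, as stated in the test case.
const ROWS_PER_BATCH: u64 = 32_768;
/// Each indexed row writes one data_key entry and one reverse_key entry.
const ENTRIES_PER_ROW: u64 = 2;
/// Approximate number of batch commits per second.
const FLUSHES_PER_SEC: u64 = 4;

/// Key-value entries written by a single batch.commit().
fn entries_per_batch() -> u64 {
    ROWS_PER_BATCH * ENTRIES_PER_ROW
}

/// Approximate key-value entries written per second.
fn entries_per_sec() -> u64 {
    entries_per_batch() * FLUSHES_PER_SEC
}
```

Each commit therefore carries 65,536 entries, and the sustained rate is roughly 262,144 entries per second.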

The batch commit path is roughly:

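let mut batch = db.batch();

// Each indexed row contributes two entries to the same keyspace.
for insert in inserts {
    // Forward entry: data_key -> expires_at
    batch.insert(keyspace, data_key, expires_at_encoded);
    // Reverse entry: reverse_key(cache_key) -> expires_at + data_key
    batch.insert(keyspace, reverse_key, reverse_value);
}

// This is the call that occasionally does not return for many seconds.
batch.commit()?;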

I originally had a single indexer thread doing the whole flush synchronously. Later I moved the heavy commit work to a dedicated commit thread/pool. The indexer now only buffers inserts and sends owned buffers to the commit thread. This confirmed that the issue is not in my queueing logic: the commit job starts, reaches batch.commit(), and then sometimes does not return for many seconds.

Typical normal commits are around 40–150 ms for 32,768 inserted rows. For example:

flush successfully performed in 56 ms = 5.7 ms pre-commit + 50.5 ms commit
flush successfully performed in 67 ms = 5.3 ms pre-commit + 61.7 ms commit
flush successfully performed in 110 ms = 2.8 ms pre-commit + 107 ms commit

But periodically, a commit stalls badly. With commit_threads=1 and max_inflight=1, I see this sequence:

indexer.commit_job.start method=36 buffer=15 inserts=32768
indexer.batch_commit.start method=36 buffer=15 <-- COMMIT STARTS, but 'indexer.batch_commit.end' does not appear below
indexer.wait_for_capacity.start method=36 pending_commits=1 max_inflight=1 active_entries=32768
indexer.wait_commit_result.timeout waiting_method=36 pending_commits=1 active_entries=32768
indexer.wait_commit_result.timeout waiting_method=36 pending_commits=1 active_entries=32768
...
cache handler :: 'put' error on method '36': indexer queue full

The important part is that batch_commit.end never appears for that buffer, so the commit thread is stuck inside batch.commit().
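For context, the max_inflight behaviour seen in the log can be approximated with a bounded channel between the indexer and the commit thread. This is a minimal sketch using only the standard library, not the actual indexer code: CommitQueue, Buffer and QueueFull are hypothetical names, and a real commit thread would call batch.commit() on each buffer it receives.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// An owned buffer of (key, value) inserts handed to the commit thread
/// (hypothetical type).
type Buffer = Vec<(Vec<u8>, Vec<u8>)>;

/// Returned when the queue cannot accept another buffer.
#[derive(Debug, PartialEq)]
struct QueueFull;

/// Producer side of a bounded hand-off to the commit thread.
struct CommitQueue {
    tx: SyncSender<Buffer>,
}

impl CommitQueue {
    /// Creates a queue that holds at most `max_inflight` pending buffers.
    fn new(max_inflight: usize) -> (Self, Receiver<Buffer>) {
        let (tx, rx) = sync_channel(max_inflight);
        (Self { tx }, rx)
    }

    /// Non-blocking submit: fails instead of waiting when the queue is full.
    fn submit(&self, buf: Buffer) -> Result<(), QueueFull> {
        match self.tx.try_send(buf) {
            Ok(()) => Ok(()),
            // Full: the commit thread has not drained earlier buffers yet.
            // Disconnected: the commit thread is gone; treated the same here.
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => Err(QueueFull),
        }
    }
}
```

If the commit thread is blocked inside batch.commit(), it stops receiving, the channel stays full, and every later submit fails, which matches the "indexer queue full" error. In this sketch a buffer counts as pending only until the commit thread receives it, so it is slightly looser than counting commits in progress.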

I also tested with:
commit_threads = 2
max_inflight_commits_per_method = 2

and saw the same pattern, except now two commit jobs can get stuck at the same time.

When the stall happens, kernel stacks sometimes show Fjall-related activity around futex waits, and the commit thread may appear sleeping in hrtimer_nanosleep. In one earlier capture during a normal stall, I also saw a Fjall worker in the ext4 buffered write path:

ext4_buffered_write_iter
generic_perform_write
vfs_write

During the stuck state, the application keeps working for cache reads, but the indexer queue fills up because the commit result never arrives.

My questions are:

1. Are there Fjall settings I should tune for this kind of workload?
2. Which parameters are most relevant here: journal size, memtable/write buffer size, compaction workers, segment size, compression, or something else?
3. Is it safe/recommended to run multiple concurrent batch.commit() calls against the same keyspace, or should writes to one keyspace be serialized?
4. Is there any internal write-stall/backoff mechanism that could explain a commit thread sleeping inside batch.commit() for many seconds, even forever?
5. Are there metrics or debug hooks in Fjall to inspect compaction backlog, journal pressure, stalled writes, or pending flushes?

I can share more logs or a reduced benchmark if useful.

Thanks a lot,
Joan

marvin-j97 (Maintainer) · 2026-05-21T19:37:55Z

1. Are there Fjall settings I should tune for this kind of workload?

Hard to say. From what I am reading, your workload writes small values, so I wouldn't expect compaction to be too much of an issue unless you have a very underpowered CPU or hundreds of thousands of writes. You could try disabling data block compression as a start.

2. Which parameters are most relevant here: journal size, memtable/write buffer size, compaction workers, segment size, compression, or something else?

Compaction strategy and compression (see point 1) are the largest contributors to how compaction behaves, I would say. Currently only Leveled compaction is really implemented, which is more prone to write stalls under heavy write throughput. The alternative would be (Size-)Tiered compaction, but that is currently not available.
Also, of course, your choice of disk affects the best possible write performance.

3. Is it safe/recommended to run multiple concurrent batch.commit() calls against the same keyspace, or should writes to one keyspace be serialized?

Write batches are serialized internally.

4. Is there any internal write-stall/backoff mechanism that could explain a commit thread sleeping inside batch.commit() for many seconds, even forever?

Yes, as explained above: if data is written more quickly than it can be reorganized through compaction, the system may stall itself to keep read performance from getting out of hand.

5. Are there metrics or debug hooks in Fjall to inspect compaction backlog, journal pressure, stalled writes, or pending flushes?

Yes. You should see debug logs if you have a log subscriber installed, and you can also query the number of L0 tables (l0_table_count), journal disk space, write buffer size and queued flushes.
If you see the L0 table count specifically growing to 10+, the system apparently can't keep up with compaction and has to throttle itself to keep read performance acceptable. There are multiple possible explanations for that. See points 1 and 4.
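A check along these lines could be sketched as follows. This is only an illustration: MetricsSnapshot, l0_under_pressure and log_line are hypothetical names, the fields would have to be filled from whatever metric getters your Fjall version exposes, and the threshold of 10 is taken from the description above.

```rust
/// One sample of the metrics mentioned above (hypothetical struct;
/// populate it from the getters your Fjall version provides).
#[derive(Debug, Clone, Copy)]
struct MetricsSnapshot {
    l0_table_count: usize,
    journal_disk_space: u64,
    write_buffer_size: u64,
    queued_flushes: usize,
}

/// L0 table count at which write throttling is said to begin.
const L0_THROTTLE_TABLES: usize = 10;

/// Returns true when the L0 table count suggests that compaction
/// is not keeping up with the write rate.
fn l0_under_pressure(s: &MetricsSnapshot) -> bool {
    s.l0_table_count >= L0_THROTTLE_TABLES
}

/// Formats one sample as a single log line.
fn log_line(s: &MetricsSnapshot) -> String {
    format!(
        "l0_table_count={} journal_disk_space={} write_buffer_size={} queued_flushes={} l0_pressure={}",
        s.l0_table_count,
        s.journal_disk_space,
        s.write_buffer_size,
        s.queued_flushes,
        l0_under_pressure(s)
    )
}
```

Logging such a line at a fixed interval next to the application's own commit timings makes it easier to see whether slow commits line up with a high L0 table count.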

joanbalaguero (Author) · 2026-05-22T14:07:31Z

Hi Marvin,

Thanks for your quick response. I really appreciate it.

I continued investigating the write stalls we discussed and now have more precise data from Fjall’s internal metrics.

My workload is a write-heavy secondary index for a cache system. Each logical insert writes two small entries into one Fjall keyspace:

data_key -> expires_at
reverse_key(cache_key) -> reverse_value

The workload is mostly insert-only.

Current test configuration:

One Fjall keyspace
Leveled compaction
l0_threshold = 64
table_target_size = 256 MiB
max_memtable_size = 1024 MiB
data block compression disabled
compaction filter disabled (or enabled, it does not matter, the result is the same)
one dedicated committer thread
batch.commit() calls are serialized by my application before reaching Fjall
io_uring reads/writes enabled, although the same issue was previously reproduced with them disabled too

I added logging for:

l0_table_count
journal_disk_space
write_buffer_size
outstanding_flushes
commit_duration_ms (our metric)

According to the docs, write_buffer_size is active + sealed memtables.

The stalls correlate very strongly with write_buffer_size.

A typical sequence looks like this:

commit_duration_ms ≈ 85–87 ms
l0_table_count = 18
write_buffer_size grows from ~525 MB to ~536 MB

then:

commit_duration_ms = 1111 ms
l0_table_count = 18
journal_disk_space = ~402 MB
write_buffer_size = ~540 MB
outstanding_flushes = 0

Queue full starts immediately after this.

After that slow commit, commits become faster again, around 24–60 ms, but the queue remains under pressure for a while because backlog has already accumulated. During that period, write_buffer_size continues growing:

write_buffer_size = 544 MB
write_buffer_size = 548 MB
...
write_buffer_size = 627 MB

Then suddenly Fjall creates new L0 tables and the write buffer drops sharply:

l0_table_count: 18 -> 21
journal_disk_space: ~402 MB -> 64 MB
write_buffer_size: ~627 MB -> ~93 MB

So the pattern seems to be:

write_buffer_size grows
-> one batch.commit() becomes slow, around 1 second
-> committer queue fills
-> write_buffer_size continues growing
-> eventually sealed memtables are flushed into new L0 tables
-> write_buffer_size drops sharply
-> l0_table_count increases by 3
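The last two steps of this pattern can be spotted automatically by comparing consecutive metric samples. A minimal sketch, assuming the application collects the samples itself at a fixed interval: Sample, FlushEvent and detect_flush are hypothetical names, and the "dropped by more than half" heuristic is an arbitrary choice, not something Fjall defines.

```rust
/// One metrics sample taken by the application (hypothetical struct).
#[derive(Debug, Clone, Copy)]
struct Sample {
    l0_table_count: u64,
    write_buffer_mb: u64,
}

/// Describes a detected memtable flush into new L0 tables.
#[derive(Debug, PartialEq)]
struct FlushEvent {
    new_l0_tables: u64,
    freed_mb: u64,
}

/// Reports a flush when the L0 table count grew and the write buffer
/// shrank to less than half of its previous size between two samples.
fn detect_flush(prev: Sample, cur: Sample) -> Option<FlushEvent> {
    let l0_grew = cur.l0_table_count > prev.l0_table_count;
    let buffer_dropped = cur.write_buffer_mb * 2 < prev.write_buffer_mb;
    if l0_grew && buffer_dropped {
        Some(FlushEvent {
            new_l0_tables: cur.l0_table_count - prev.l0_table_count,
            freed_mb: prev.write_buffer_mb - cur.write_buffer_mb,
        })
    } else {
        None
    }
}
```

Applied to the numbers above (18 to 21 tables, ~627 MB to ~93 MB), this reports 3 new L0 tables and about 534 MB freed, while the slow growth from 544 MB to 548 MB reports nothing.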

The interesting part is that this happens even with l0_threshold = 64, while l0_table_count is only around 18, so this does not look like the L0 threshold alone. It looks more related to active + sealed memtable pressure / flush behavior. The slow commit and queue pressure start consistently around write_buffer_size ≈ 540 MB.

Questions (I hope not too many …):

Is this behavior expected when write_buffer_size reaches this range?
Does max_memtable_size apply to a single active memtable, or to the total active + sealed memtables?
If max_memtable_size = 1024 MiB, why would a slow commit / queue pressure start around write_buffer_size ≈ 540 MB?
Is there another internal threshold that controls active + sealed memtable pressure?
Is it expected that outstanding_flushes = 0 even while write_buffer_size is high and just before new L0 tables are created?
Is there a setting to make sealed memtable flushing start earlier or more aggressively, to avoid the 1-second commit stall?
Would increasing Fjall worker threads help here, or is this limited by the current Leveled compaction / write buffer flush pipeline?
For a write-heavy workload with small values, where read performance is less important than avoiding write stalls, is there any recommended configuration?

At this point, the bottleneck does not seem to be caused by the filesystem, io_uring reads/writes, compression, compaction filter, or concurrent commits. It seems tied to Fjall’s internal write buffer / sealed memtable / L0 flush behavior.

Any help would be very appreciated.

Thanks a lot, and congrats on this amazing crate.

Joan

marvin-j97 (Maintainer) · 2026-05-22T14:13:01Z

l0_threshold = 64

This is possibly one of the culprits. Fjall is hard-coded to start slowing down writes once there are 10 or more L0 tables. The default threshold of 4 is quite sensible. I would not change it blindly, unless you can verify via a benchmark that changing it improves performance for your specific workload.

max_memtable_size = 1024 MiB

Is this behavior expected when write_buffer_size reaches this range?

This is also possibly an issue. The default (soft) maximum journal size is 512 MiB, which is smaller than your max_memtable_size of 1024 MiB.
I would set max_memtable_size to maybe 64-256 MiB at most. In my experience, 64 MiB already amortizes most of the flush and rotation costs, so I don't believe going all the way up to 1 GiB greatly improves performance. It also makes the tables in L0 very large, which is not optimal.
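The relationship between the two limits can be checked with a few lines of arithmetic. This is a sketch only: the function and constant names are made up for illustration, the 512 MiB journal figure is the default quoted above, and whether the logged write_buffer_size values are MB or MiB is not known.

```rust
/// Bytes in one mebibyte.
const MIB: u64 = 1024 * 1024;

/// Default soft journal limit quoted above.
const JOURNAL_SOFT_MAX: u64 = 512 * MIB;

/// True if the configured memtable limit is larger than the journal's
/// soft limit, so the journal limit may be reached first.
fn memtable_exceeds_journal(max_memtable_size: u64) -> bool {
    max_memtable_size > JOURNAL_SOFT_MAX
}

/// Converts bytes to decimal megabytes (10^6 bytes), rounded down.
fn to_decimal_mb(bytes: u64) -> u64 {
    bytes / 1_000_000
}
```

With max_memtable_size = 1024 MiB the check returns true, while 64 MiB and 256 MiB both pass. For comparison against the logged values, 512 MiB is 536,870,912 bytes, or about 536 MB in decimal units.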

Does max_memtable_size apply to a single active memtable, or to the total active + sealed memtables?

The active memtable. It's basically the threshold at which a memtable is sealed and queued for flushing.

If max_memtable_size = 1024 MiB, why would a slow commit / queue pressure start around write_buffer_size ≈ 540 MB?

See above.

Is there another internal threshold that controls active + sealed memtable pressure?

Currently no. At least not directly.

Is it expected that outstanding_flushes = 0 even while write_buffer_size is high and just before new L0 tables are created?

Hard to say. Flushes should be fairly quick, so I'm not sure at which interval you are polling the metrics.

Is there a setting to make sealed memtable flushing start earlier or more aggressively, to avoid the 1-second commit stall?

If you see sealed memtables building up, that means the worker threads are busy compacting, so they can't take over a flush task. If your worker thread count is > 1, one thread is normally reserved to serve only flush tasks.

Would increasing Fjall worker threads help here, or is this limited by the current Leveled compaction / write buffer flush pipeline?

Unlikely, unless you see active_compactions being quite high all the time. Chances are you are bottlenecked at the L0-L1 boundary.

For a write-heavy workload with small values, where read performance is less important than avoiding write stalls, is there any recommended configuration?

For write-heavy workloads, the Size-Tiered (Universal in RocksDB terminology) compaction strategy is preferred, as long as you can accept higher temporary space usage. However, that is currently not implemented in lsm-tree 3.x.x.
