
[storage/qmdb/current] fix bitmap batch parent chain growth with rwlock#3627

Merged
roberto-bayardo merged 1 commit into main from locked-bitmap on Apr 22, 2026

Collaborator

@roberto-bayardo roberto-bayardo commented Apr 18, 2026

Summary

Refactor the Current QMDB's committed bitmap from a layered BitmapBatch<N> (an Arc<BitMap> with a chain of Layers on top) to Arc<SharedBitmap<N>> where SharedBitmap wraps RwLock<BitMap<N>>. apply_batch, prune, and rewind now mutate the committed bitmap in place under the write lock instead of stacking a Layer.

Motivation

The old design had a memory/perf cliff:

  • Every apply_batch pushed a new Layer on Db::status; the chain only collapsed via Db::flatten().
  • flatten required unique ownership of the terminal Arc<BitMap>. Whenever a live MerkleizedBatch shared the base Arc (the common case — callers typically hold the batch they just applied), flatten fell back to a full (*arc).clone() of the bitmap. For bitmaps of tens/hundreds of MB, that's a silent, expensive memcpy.
  • Workloads with continuous live children never reach a quiescent moment where the base Arc is unique, so the chain grew unboundedly.

Mutating in place under a RwLock bounds memory to the bitmap's actual live size and removes the flatten hazard entirely.
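The flatten hazard described above follows from how `Arc` ownership works. The sketch below is a minimal illustration, not the code from this PR: `Bitmap` is a `Vec<u8>` stand-in and `take_or_clone` is a hypothetical helper that mimics the old fallback (reuse the allocation when the `Arc` is unique, copy it otherwise).

```rust
use std::sync::Arc;

/// Stand-in for the real bitmap type; only the ownership behavior matters here.
type Bitmap = Vec<u8>;

/// Hypothetical helper mirroring the old flatten fallback: take the bitmap out
/// of the Arc if it is uniquely owned, otherwise copy the whole thing.
/// Returns the owned bitmap and whether a full copy was needed.
fn take_or_clone(base: Arc<Bitmap>) -> (Bitmap, bool) {
    match Arc::try_unwrap(base) {
        // No other holder: the allocation is reused, nothing is copied.
        Ok(owned) => (owned, false),
        // Someone else (for example a live batch) still holds the Arc:
        // the only way to get an owned bitmap is a full clone.
        Err(shared) => ((*shared).clone(), true),
    }
}
```

Calling `take_or_clone` while another `Arc::clone` of the same bitmap is alive always takes the copying branch, which is the situation a caller holding the batch it just applied would create.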

Performance

Performance on existing benchmarks is unchanged. (No existing benchmark triggered the parent-chain-growth issue this change addresses.) The improvement on a new "chained_growth" benchmark is as follows:

| Variant | Improvement |
| --- | --- |
| current::unordered::fixed::mmb chunk=32 | ~35% |
| current::ordered::fixed::mmb chunk=32 | ~37% |
| current::unordered::fixed::mmb chunk=256 | ~21% |
| current::ordered::fixed::mmb chunk=256 | ~23% |

Design

SharedBitmap<N> (new)

Thin wrapper around RwLock<BitMap<N>> (parking_lot via commonware_utils::sync::RwLock).

  • read() returns a RwLockReadGuard<BitMap<N>>.
  • write() is pub(super) — only Db::apply_batch, Db::prune, Db::rewind can mutate.
  • Implements BitmapReadable<N> for proof-path reads.
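A rough sketch of the wrapper shape described in this list, under several assumptions: it uses `std::sync::RwLock` rather than the parking_lot lock the PR uses, `BitMap` is reduced to a vector of `N`-byte chunks, and the `BitmapReadable` trait is cut down to a single hypothetical `get_chunk` method. It is meant to show the locking pattern only.

```rust
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Simplified stand-in for BitMap<N>: a list of N-byte chunks.
pub struct BitMap<const N: usize> {
    chunks: Vec<[u8; N]>,
}

/// Read-only chunk access, loosely modeled on the BitmapReadable trait
/// named above (the real trait is assumed to be richer).
pub trait BitmapReadable<const N: usize> {
    fn get_chunk(&self, idx: usize) -> [u8; N];
}

/// Thin wrapper: one bitmap, shared behind a reader-writer lock.
pub struct SharedBitmap<const N: usize> {
    inner: RwLock<BitMap<N>>,
}

impl<const N: usize> SharedBitmap<N> {
    pub fn new(chunks: Vec<[u8; N]>) -> Self {
        Self { inner: RwLock::new(BitMap { chunks }) }
    }

    /// Shared read access; many readers may hold this at once.
    pub fn read(&self) -> RwLockReadGuard<'_, BitMap<N>> {
        self.inner.read().unwrap()
    }

    /// Exclusive write access. The PR restricts this with pub(super);
    /// pub(crate) is used here only so the sketch compiles at crate root.
    pub(crate) fn write(&self) -> RwLockWriteGuard<'_, BitMap<N>> {
        self.inner.write().unwrap()
    }
}

impl<const N: usize> BitmapReadable<N> for SharedBitmap<N> {
    fn get_chunk(&self, idx: usize) -> [u8; N] {
        // The read lock is held only for this single chunk copy.
        self.read().chunks[idx]
    }
}
```

Because each `get_chunk` call acquires and releases the lock on its own, a caller that needs several chunks to be mutually consistent would have to hold a `read()` guard for the whole sequence instead.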

BitmapBatch<N> simplified

pub(crate) enum BitmapBatch<const N: usize> {
    Base(Arc<SharedBitmap<N>>), // was Base(Arc<BitMap<N>>)
    Layer(Arc<BitmapBatchLayer<N>>),
}

Removed apply_overlay and flatten.

Db

  • status: Arc<SharedBitmap<N>> (was BitmapBatch<N>).
  • apply_batch collects overlay Arcs from the batch chain, drops the batch, then applies overlays under a single write guard.
  • prune and rewind take the write guard, mutate, drop before any .await.
  • Proof paths pass self.status.as_ref() (SharedBitmap implements BitmapReadable).
  • Db::flatten removed. build_grafted_tree is now generic over &impl BitmapReadable<N>.
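The apply_batch flow in the list above (collect overlays, drop the batch, apply under one write guard) might look roughly like the following. All types here are hypothetical simplifications: `Batch` is a linked chain of overlays, the committed bitmap is a plain `Vec` of 4-byte chunks, and `std::sync::RwLock` stands in for the lock used in the PR.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

type Chunk = [u8; 4];
/// Chunk index -> replacement chunk contents.
type Overlay = BTreeMap<usize, Chunk>;

/// Hypothetical speculative batch: its own overlay plus an optional parent.
struct Batch {
    overlay: Arc<Overlay>,
    parent: Option<Arc<Batch>>,
}

/// Walk the chain from the tip to the root, then reverse so that the
/// oldest overlay comes first and newer overlays win on conflicts.
fn collect_overlays(tip: &Batch) -> Vec<Arc<Overlay>> {
    let mut out = Vec::new();
    let mut cur = Some(tip);
    while let Some(batch) = cur {
        out.push(Arc::clone(&batch.overlay));
        cur = batch.parent.as_deref();
    }
    out.reverse();
    out
}

/// Apply every overlay in the chain to the committed bitmap in place.
fn apply_batch(status: &RwLock<Vec<Chunk>>, batch: Batch) {
    let overlays = collect_overlays(&batch);
    // Release the batch chain before taking the lock.
    drop(batch);
    // One write guard covers all overlays.
    let mut bitmap = status.write().unwrap();
    for overlay in &overlays {
        for (&idx, chunk) in overlay.iter() {
            if idx >= bitmap.len() {
                bitmap.resize(idx + 1, [0u8; 4]);
            }
            bitmap[idx] = *chunk;
        }
    }
}
```

Collecting the overlay `Arc`s first keeps the critical section limited to the chunk writes themselves; the chain walk and the batch drop happen outside the lock.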

Behavior differences:

In the old design, each Arc<BitMap> was an immutable snapshot, so reading through a stale MerkleizedBatch returned a hypothetical state — not matching the DB, but internally consistent. With the RwLock design, there's one live bitmap that evolves in place, so reads through a stale batch mix its overlays with post-divergence committed chunks, producing incoherent data. (In either case, the results are not useful.)
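To make the stale-read behavior concrete, here is a small sketch under simplifying assumptions: chunks are single bytes, `StaleableBatch` is a hypothetical type holding one overlay over the shared committed bitmap, and `std::sync::RwLock` replaces the PR's lock. A read resolves to the overlay when it has the chunk and falls through to the live committed bitmap otherwise.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// One-byte chunks keep the example small.
type Shared = Arc<RwLock<Vec<u8>>>;

/// Hypothetical speculative batch: an overlay on top of the shared,
/// in-place-mutated committed bitmap.
struct StaleableBatch {
    base: Shared,
    overlay: HashMap<usize, u8>,
}

impl StaleableBatch {
    /// Overlay first, then whatever the committed bitmap holds right now.
    fn get_chunk(&self, idx: usize) -> u8 {
        if let Some(chunk) = self.overlay.get(&idx) {
            return *chunk;
        }
        let committed = self.base.read().unwrap();
        committed[idx]
    }

    /// Read the first `len` chunks through this batch.
    fn snapshot(&self, len: usize) -> Vec<u8> {
        (0..len).map(|i| self.get_chunk(i)).collect()
    }
}
```

If the committed bitmap is mutated after the batch was built, a later `snapshot` combines the batch's old overlay with the newer committed chunks, which is the incoherent mix described above.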

A second behavior difference concerns pruning: if a batch was built before Db::prune advanced the committed bitmap's pruning boundary, and the caller later reads through that batch at a chunk index that is now pruned, the read panics inside Prunable::get_chunk's bounds assertion. The old design returned the frozen pre-prune snapshot (because Arc::make_mut cloned the bitmap at prune time, leaving external batches with an unpruned copy). The new design has nothing to clone, so reading a pruned index is a hard error.

In practice this only triggers on misuse: library-internal paths (build_chunk_overlay, proof generation, apply_batch's overlay write) already guard against reading pruned indices. But callers who reach into BitmapBatch's chain directly after holding a batch across a prune get a panic rather than silent stale data. The test_current_live_batch_safe_across_prune regression test confirms the common case (live batch across prune, apply after) stays panic-free.
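The difference between an asserting read and a guarded read can be sketched as follows. `Prunable` here is a hypothetical miniature, not the real type: it tracks how many leading chunks were pruned, `get_chunk` asserts on the boundary, and `checked_chunk` is an illustrative guarded variant of the kind the library-internal paths are described as using.

```rust
use std::collections::VecDeque;

/// Hypothetical prunable bitmap: chunks below `pruned_chunks` are gone.
struct Prunable {
    pruned_chunks: usize,
    chunks: VecDeque<u8>,
}

impl Prunable {
    /// Drop all chunks below the absolute index `new_boundary`.
    fn prune_to(&mut self, new_boundary: usize) {
        while self.pruned_chunks < new_boundary && !self.chunks.is_empty() {
            self.chunks.pop_front();
            self.pruned_chunks += 1;
        }
    }

    /// Asserting read: panics if `idx` is below the pruning boundary.
    fn get_chunk(&self, idx: usize) -> u8 {
        assert!(idx >= self.pruned_chunks, "chunk {} is pruned", idx);
        self.chunks[idx - self.pruned_chunks]
    }

    /// Guarded read: returns None for pruned or out-of-range indices.
    fn checked_chunk(&self, idx: usize) -> Option<u8> {
        idx.checked_sub(self.pruned_chunks)
            .and_then(|offset| self.chunks.get(offset).copied())
    }
}
```

A caller that may hold a batch across a prune can compare the index against the current boundary first (as `checked_chunk` does) rather than relying on the asserting accessor.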


cloudflare-workers-and-pages Bot commented Apr 18, 2026

Deploying with Cloudflare Workers

The latest updates on your project.

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | commonware-mcp | 77af6e3 | Apr 21 2026, 09:32 PM |


cloudflare-workers-and-pages Bot commented Apr 18, 2026

Deploying monorepo with Cloudflare Pages

Latest commit: 77af6e3
Status: ✅ Deploy successful!
Preview URL: https://f424fefc.monorepo-eu0.pages.dev
Branch Preview URL: https://locked-bitmap.monorepo-eu0.pages.dev

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the Current QMDB committed bitmap representation to avoid unbounded layer-chain growth and expensive flatten-time cloning by switching the committed bitmap to a shared, in-place-mutable Arc<SharedBitmap<N>> (internally RwLock<BitMap<N>>), while keeping speculative overlays as batch layers.

Changes:

  • Introduces SharedBitmap<N> and updates BitmapBatch<N>::Base to reference Arc<SharedBitmap<N>>.
  • Refactors Db to store status: Arc<SharedBitmap<N>> and to apply overlays/prune/rewind by mutating the committed bitmap under a write lock.
  • Updates tests by removing flatten()-focused cases and adding a regression test for extending an applied batch.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| storage/src/qmdb/current/sync/mod.rs | Initializes Db::status as Arc<SharedBitmap> during sync construction. |
| storage/src/qmdb/current/mod.rs | Initializes Db::status as Arc<SharedBitmap> and updates tests to remove flatten() reliance and cover post-apply extension. |
| storage/src/qmdb/current/db.rs | Refactors Db to use shared bitmap and apply/prune/rewind by in-place mutation under RwLock. |
| storage/src/qmdb/current/batch.rs | Adds SharedBitmap, updates batch/base bitmap plumbing, and removes apply_overlay/flatten. |

Comments suppressed due to low confidence (2)

storage/src/qmdb/current/db.rs:235

  • self.status.read() is a blocking RwLockReadGuard, but this guard is kept alive for the entire RangeProof::new_with_ops(...).await. Per utils/src/sync/mod.rs:19, blocking lock guards must not be held across .await. Prefer passing self.status.as_ref() (SharedBitmap implements BitmapReadable) so each bitmap read briefly acquires/releases the lock, or use an async lock if you require a consistent snapshot across awaits.
        let storage = self.grafted_storage();
        let ops_root = self.any.log.root();
        let guard = self.status.read();
        RangeProof::new_with_ops(
            hasher,
            &*guard,
            &storage,
            &self.any.log,
            start_loc,
            max_ops,
            ops_root,
        )
        .await

storage/src/qmdb/current/db.rs:534

  • A blocking RwLockReadGuard is held across build_grafted_tree(...).await here, which violates the repo’s async locking guidance (utils/src/sync/mod.rs:19: do not hold blocking guards across .await). Consider passing self.status.as_ref() directly (it implements BitmapReadable) instead of a guard, or switch to an async lock if a consistent snapshot must span awaits.
        let hasher = StandardHasher::<H>::new();
        let guard = self.status.read();
        let grafted_tree = build_grafted_tree::<F, H, N>(
            &hasher,
            &*guard,
            &pinned_nodes,
            &self.any.log.merkle,
            self.thread_pool.as_ref(),
        )
        .await?;
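The guard-scoping pattern these two comments ask for can be illustrated without an async runtime. In the sketch below, which uses `std::sync::RwLock` and hypothetical function names, the `slow_step` closure stands in for the awaited work: the first function copies what it needs and releases the read guard before running it, while the second keeps the guard alive throughout.

```rust
use std::sync::RwLock;

/// Copy the needed chunk inside a short scope so the read guard is
/// released before `slow_step` (a stand-in for awaited work) runs.
fn read_then_work<T>(
    status: &RwLock<Vec<u8>>,
    idx: usize,
    slow_step: impl FnOnce(u8) -> T,
) -> T {
    let chunk = {
        let guard = status.read().unwrap();
        guard[idx]
    }; // guard dropped here
    slow_step(chunk)
}

/// For comparison: the guard stays alive while `slow_step` runs, so
/// writers are blocked for the whole duration.
fn read_holding_guard<T>(
    status: &RwLock<Vec<u8>>,
    idx: usize,
    slow_step: impl FnOnce(u8) -> T,
) -> T {
    let guard = status.read().unwrap();
    slow_step(guard[idx])
}
```

One way to observe the difference is to call `status.try_write()` from inside `slow_step`: it can succeed under `read_then_work`, and it reports that the lock is busy under `read_holding_guard`.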

@roberto-bayardo roberto-bayardo force-pushed the locked-bitmap branch 4 times, most recently from 35455c3 to 95a5731 Compare April 18, 2026 15:49
@roberto-bayardo roberto-bayardo requested a review from Copilot April 18, 2026 15:50
@roberto-bayardo
Collaborator Author

bugbot run

Contributor

Copilot AI left a comment

Pull request overview

Refactors Current QMDB’s committed activity bitmap storage from an immutable layered BitmapBatch chain to a single Arc<SharedBitmap<N>> backed by a RwLock, so apply_batch/prune/rewind mutate the committed bitmap in place and avoid unbounded layer-chain growth and expensive flatten-time clones.

Changes:

  • Introduce SharedBitmap<N> (RwLock<BitMap<N>>) and make the committed Db::status an Arc<SharedBitmap<N>>.
  • Update Db::apply_batch, prune, and rewind to mutate the shared bitmap under a write lock (and remove Db::flatten).
  • Refresh tests to cover live-batch behavior across prune and extending an already-applied batch.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| storage/src/qmdb/current/sync/mod.rs | Initialize Db with Arc<SharedBitmap<N>> for committed bitmap state. |
| storage/src/qmdb/current/mod.rs | Initialize Db with SharedBitmap; update tests to remove flatten assumptions and add regressions for the new shared-bitmap semantics. |
| storage/src/qmdb/current/db.rs | Replace layered bitmap mutation/flattening with in-place overlay application and lock-scoped bitmap pruning/rewinding; update proof/root paths to accept BitmapReadable. |
| storage/src/qmdb/current/batch.rs | Add SharedBitmap, update BitmapBatch::Base to reference it, and adjust overlay building to avoid repeated pruned_chunks() reads in hot loops. |

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 95a5731.

Contributor

Copilot AI left a comment

Pull request overview

Refactors Current QMDB’s committed activity bitmap from an immutable layered BitmapBatch chain into a single shared SharedBitmap (Arc<RwLock<BitMap<N>>>) so apply_batch, prune, and rewind can mutate the committed bitmap in-place and avoid unbounded layer-chain growth and costly fallback cloning.

Changes:

  • Introduces SharedBitmap<N> and updates Db.status to Arc<SharedBitmap<N>>, with proof paths reading via BitmapReadable.
  • Reworks Db::apply_batch to collect overlay Arcs, drop the batch, then apply overlays under one write lock; updates prune/rewind similarly.
  • Removes flatten usage/tests and adds new regression tests around prune + live batches and extending an applied batch.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| storage/src/qmdb/current/sync/mod.rs | Constructs Db with Arc<SharedBitmap<N>> during sync initialization. |
| storage/src/qmdb/current/mod.rs | Updates DB initialization and replaces flatten-focused tests with new shared-bitmap regression tests. |
| storage/src/qmdb/current/db.rs | Updates Db.status type, proof plumbing to use BitmapReadable, and in-place bitmap mutation in apply_batch/prune/rewind. |
| storage/src/qmdb/current/batch.rs | Adds SharedBitmap, updates BitmapBatch::Base to reference it, removes apply_overlay/flatten, and documents stale-read caveat. |

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors Current QMDB’s committed activity bitmap from an immutable, layer-stacking BitmapBatch base to a shared, in-place-mutated Arc<SharedBitmap<N>> guarded by a blocking RwLock, eliminating unbounded parent-chain growth and avoiding expensive bitmap cloning during flattening.

Changes:

  • Introduces SharedBitmap<N> (wrapper around RwLock<BitMap<N>>) and updates the committed bitmap to be Arc<SharedBitmap<N>>.
  • Updates Db::apply_batch, Db::prune, and Db::rewind to mutate the committed bitmap under a write lock (and removes Db::flatten/layer-collapsing behavior).
  • Adjusts proof/grafting helpers to accept &impl BitmapReadable<N> and updates tests to cover the new shared-bitmap behavior and regressions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| storage/src/qmdb/current/sync/mod.rs | Updates DB construction to store the committed bitmap as Arc<SharedBitmap<N>>. |
| storage/src/qmdb/current/mod.rs | Updates initialization similarly and replaces flatten-focused tests with shared-bitmap/prune/extend regression tests. |
| storage/src/qmdb/current/db.rs | Refactors DB state and bitmap application/prune/rewind flows to mutate the committed bitmap under a write lock; makes grafted-tree rebuild generic over BitmapReadable. |
| storage/src/qmdb/current/batch.rs | Adds SharedBitmap, updates BitmapBatch base to reference it, removes overlay-application/flattening APIs, and documents the stale-read caveat. |

@roberto-bayardo roberto-bayardo marked this pull request as ready for review April 18, 2026 19:50
@roberto-bayardo roberto-bayardo force-pushed the locked-bitmap branch 3 times, most recently from 4ced04c to 962450a Compare April 18, 2026 21:17
@patrick-ogrady
Contributor

> Performance on existing benchmarks is unchanged. (There is not currently a benchmark that results in the parent-chain-growth issue this change is addressing.)

Before merging this, I think we need such a benchmark?

@roberto-bayardo roberto-bayardo marked this pull request as draft April 20, 2026 05:33
@roberto-bayardo roberto-bayardo force-pushed the locked-bitmap branch 2 times, most recently from 16f087c to 32fc597 Compare April 20, 2026 15:56
@roberto-bayardo
Collaborator Author

> Performance on existing benchmarks is unchanged. (There is not currently a benchmark that results in the parent-chain-growth issue this change is addressing.)
>
> Before merging this, I think we need such a benchmark?

Benchmark added (also created a standalone PR for it here: #3633)

@roberto-bayardo roberto-bayardo force-pushed the locked-bitmap branch 3 times, most recently from 992c028 to 094c8ee Compare April 20, 2026 18:36
@danlaine
Collaborator

#3635

^ PR to add some tests and tweak a few things but on the whole LGTM

@roberto-bayardo roberto-bayardo marked this pull request as ready for review April 20, 2026 22:40
@roberto-bayardo roberto-bayardo force-pushed the locked-bitmap branch 2 times, most recently from 26982f6 to 4d62cd5 Compare April 21, 2026 17:46
danlaine previously approved these changes Apr 21, 2026
Shares the committed bitmap behind Arc<SharedBitmap<N>> so live batches
can hold a reference while apply_batch mutates in place under the write
lock. Removes Db::flatten, drops BitmapBatch's Layer chain in favor of
apply-in-place, and adds valid_targets staleness checks.
@roberto-bayardo roberto-bayardo added this pull request to the merge queue Apr 22, 2026
Merged via the queue into main with commit bb296b7 Apr 22, 2026
179 checks passed
@roberto-bayardo roberto-bayardo deleted the locked-bitmap branch April 22, 2026 01:45
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 96.22642% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.86%. Comparing base (245a099) to head (77af6e3).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| storage/src/qmdb/current/batch.rs | 93.70% | 7 Missing and 1 partial ⚠️ |
| storage/src/qmdb/current/db.rs | 96.36% | 0 Missing and 2 partials ⚠️ |

@@           Coverage Diff           @@
##             main    #3627   +/-   ##
=======================================
  Coverage   95.86%   95.86%           
=======================================
  Files         441      441           
  Lines      171196   171277   +81     
  Branches     4001     4000    -1     
=======================================
+ Hits       164109   164187   +78     
- Misses       5827     5832    +5     
+ Partials     1260     1258    -2     

| Files with missing lines | Coverage Δ |
| --- | --- |
| storage/src/qmdb/current/mod.rs | 98.91% <100.00%> (+0.01%) ⬆️ |
| storage/src/qmdb/current/sync/mod.rs | 85.32% <100.00%> (ø) |
| storage/src/qmdb/current/db.rs | 93.61% <96.36%> (+0.57%) ⬆️ |
| storage/src/qmdb/current/batch.rs | 91.31% <93.70%> (-0.25%) ⬇️ |

... and 7 files with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 245a099...77af6e3. Read the comment docs.
