
[RangeIndex] Add bf-tree recover-then-drop crash repros #1889

Open
tiagonapoli wants to merge 1 commit into main from tiagonapoli/bftree-recover-drop-repro


@tiagonapoli

Collaborator

What

Adds three [Explicit] tests to RespRangeIndexTests that reproduce a crash in the native bf-tree 0.5.0 library: dropping a CprSnapshot-recovered, multi-level tree aborts the process in bftree_drop with 'assertion failed: next_level.is_null()' at mini_page_op.rs:429.

The panic fires on a bf-tree internal background thread (a native abort()), so it cannot be caught by --blame-crash/createdump. This is the root cause behind the intermittent crashes seen in the RangeIndex cluster-migration suite (the migration destination tree is a CprSnapshot-recovered tree).

Tests

  • RICheckpointRecoverThenDrop_Crashes — RESP-level: RI.CREATE DISK + RI.SET + SAVE + restart-recover; the recovered tree is dropped on server dispose.
  • BfTreeRecoverFromCprSnapshotThenDrop_Crashes — pure-native, self-contained: BfTreeService create -> snapshot -> recover -> drop.
  • BfTreeRecoverFromCprSnapshotThenDrop_BelowThreshold_NoCrash — control: same sequence with few records (single-level) drops cleanly.

Notes

  • The crash is driven by the tree becoming multi-level (total data volume versus the small 64 KiB cache), not by record size. The repros use tiny 4-byte keys and values and insert enough of them (1000) to cross the split threshold; every Insert is asserted to succeed.
  • Deterministic and runtime-independent (reproduces on both net8.0 and net10.0).
  • All three are [Explicit] so they are excluded from normal CI runs (they crash the host by design).
  • No fix is included here — this PR is a tracked reproduction only.
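
As a small illustration of the record shape described in the notes, the sketch below builds the same kind of tiny record that appears in the quoted test code (the zero-padded index, ASCII-encoded, used as both key and value) and checks that each one is 4 bytes. The MakeRecord helper name is hypothetical and not part of the PR; nothing here touches bf-tree itself.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the PR): encode the index as a zero-padded,
// four-digit ASCII string, as the repro does with i.ToString("D4").
static byte[] MakeRecord(int i) => Encoding.ASCII.GetBytes(i.ToString("D4"));

for (var i = 0; i < 1000; i++)
{
    var bytes = MakeRecord(i);
    // "D4" pads to four digits, so every index below 10000 encodes to 4 bytes.
    if (bytes.Length != 4)
        throw new InvalidOperationException($"record {i} is {bytes.Length} bytes");
}

Console.WriteLine(Encoding.ASCII.GetString(MakeRecord(7)));  // prints "0007"
Console.WriteLine(MakeRecord(999).Length);                   // prints 4
```

Because the key and the value are the same 4 bytes, record size stays constant across the run and only the record count varies.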

Copilot AI review requested due to automatic review settings June 19, 2026 02:46

Copilot AI left a comment

Contributor


⚠️ Not ready to approve

The new explicit tests leak temp directories on the passing control path and should ensure native resources are disposed even when assertions fail.

Pull request overview

Adds explicit, crash-by-design reproduction coverage for a native bf-tree 0.5.0 recover-then-drop abort affecting RangeIndex, to make the failure deterministic and easier to track while a fix is developed. The PR title and description accurately match the implementation and clearly document the failure mode and why the tests are [Explicit].

Changes:

  • Add [Explicit] RESP-level repro that checkpoints (SAVE), recovers, then triggers the abort on tree drop during server dispose.
  • Add [Explicit] pure-native repro using BfTreeService snapshot → recover → drop.
  • Add [Explicit] below-threshold control test showing single-level trees recover/drop cleanly.

File summaries

| File | Description |
| --- | --- |
| test/standalone/Garnet.test.rangeindex/RespRangeIndexTests.cs | Adds three [Explicit] crash repro/control tests for bf-tree CPR snapshot recovery followed by drop. |

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 2


Comment on lines +1204 to +1211
[Test]
[Explicit("Control: a below-threshold (single-level) recovered tree drops cleanly (no crash).")]
public void BfTreeRecoverFromCprSnapshotThenDrop_BelowThreshold_NoCrash()
{
    var dir = Path.Combine(Path.GetTempPath(), $"bftree_cprrepro_{Guid.NewGuid():N}");
    Directory.CreateDirectory(dir);

    // Seed: only 50 tiny records -> stays single-level, then CprSnapshot.

Comment on lines +1177 to +1189
// Seed: disk-backed tree with 1000 tiny 4-byte records -> multi-level, then CprSnapshot.
var seed = new BfTreeService(StorageBackendType.Disk,
    Path.Combine(dir, "seed.data.bftree"), Path.Combine(dir, "seed.scratch.cpr"),
    cbSizeByte: 65536, cbMinRecordSize: 8);
for (var i = 0; i < 1000; i++)
{
    var bytes = Encoding.ASCII.GetBytes(i.ToString("D4"));
    ClassicAssert.AreEqual(BfTreeInsertResult.Success, seed.Insert(bytes, bytes), $"insert {i} should succeed");
}
seed.CprSnapshot();
var snapshot = Path.Combine(dir, "snap.bftree");
File.Copy(Path.Combine(dir, "seed.scratch.cpr"), snapshot, overwrite: false);
seed.Dispose();
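
The review finding (temp directories leaking on the passing path, and native resources not disposed when an assertion fails) could be addressed along the following lines. This is a minimal, self-contained sketch, not the PR's code: FakeTree is a hypothetical stand-in for a native-backed resource such as BfTreeService, and the thrown exception simulates a failing assertion.

```csharp
using System;
using System.IO;

var dir = Path.Combine(Path.GetTempPath(), $"bftree_cprrepro_{Guid.NewGuid():N}");
Directory.CreateDirectory(dir);
var tree = new FakeTree();

try
{
    try
    {
        // Simulate an assertion failing part-way through the test body.
        throw new InvalidOperationException("simulated assertion failure");
    }
    finally
    {
        // Runs whether or not the body threw: release the resource first,
        // then delete the directory that held its files.
        tree.Dispose();
        Directory.Delete(dir, recursive: true);
    }
}
catch (InvalidOperationException)
{
    // Swallowed only so this sketch can report on the cleanup below;
    // a real test would let the failure propagate.
}

Console.WriteLine(tree.Disposed);          // prints True
Console.WriteLine(Directory.Exists(dir));  // prints False

// Hypothetical stand-in for a disposable native resource (assumption, not the real API).
sealed class FakeTree : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}
```

In the crashing repros the drop itself aborts the process, so cleanup of this kind mainly matters for the control test, where the process survives and the temp directory would otherwise be left behind.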
@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-recover-drop-repro branch from 0d49916 to 2fff782 Compare June 19, 2026 03:10
Add three [Explicit] tests to RespRangeIndexTests reproducing the bf-tree 0.5.0
crash where dropping a CprSnapshot-recovered, MULTI-LEVEL tree aborts in
bftree_drop with 'assertion failed: next_level.is_null()' at mini_page_op.rs:429
(panic on a bf-tree background thread, so --blame-crash cannot catch it):

- RICheckpointRecoverThenDrop_Crashes: RESP-level via RI.CREATE DISK + RI.SET
  + SAVE + restart-recover; the recovered tree is dropped on server dispose.
- BfTreeRecoverFromCprSnapshotThenDrop_Crashes: pure-native, calls a shared
  SeedRecoverDrop(records) helper with 378 records.
- BfTreeRecoverFromCprSnapshotThenDrop_BelowThreshold_NoCrash: control with 377
  records (one below the threshold) that drops cleanly.

The crash is driven by the tree becoming multi-level (total data volume vs the
small 64 KiB cache), not by record size. With tiny 4-byte key=value records,
CACHESIZE 65536, MINRECORD 8, the threshold is a sharp 378 records (377 stays
single-level), deterministic and identical on net8/net10 and Windows/Linux.
Every Insert is asserted to succeed. All [Explicit] (crash the host by design;
excluded from normal CI runs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-recover-drop-repro branch 2 times, most recently from ccd2ce1 to fe41bc6 Compare June 19, 2026 19:02