Fix handling of deletes and ID minting during concurrent insert/delete #1146

Open
metajack wants to merge 1 commit into main from push-oonxtvnsmnol

@metajack

Contributor

Garnet cluster testing discovered a race condition with ID minting. The issue was:

  1. Thread A's insert does not find a free ID, so it bumps next_id to mint a new ID, id0. That ID is now valid but unused.
  2. Other threads delete and insert simultaneously, fully consuming the free IDs.
  3. Thread B's insert detects that there might be a free ID, but the fast free list is empty, so it triggers a refill. It sees that id0 is unused, so it adds it to the fast free list. Thread B then writes vector data to id0.
  4. Thread A marks id0 as used and then writes vector data to id0.

Because there is a time period between minting the ID and marking it used, it is possible to give the ID out twice.
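The interleaving above can be replayed deterministically in a single thread. The following is a minimal sketch, not the diskann-garnet code: `ToyFsm`, `mint_unmarked`, `refill`, and `pop_free` are hypothetical names, and the real free-space map is assumed to be more elaborate. Step 2 is modeled by starting with an empty fast free list.

```rust
use std::collections::VecDeque;

/// Toy free-space map. All names are illustrative, not the real provider types.
struct ToyFsm {
    next_id: u32,             // IDs below this value have been minted
    used: Vec<bool>,          // used[id] is true once an insert has claimed the ID
    fast_free: VecDeque<u32>, // IDs that can be handed out without a scan
}

impl ToyFsm {
    fn new() -> Self {
        Self { next_id: 0, used: Vec::new(), fast_free: VecDeque::new() }
    }

    /// The racy first half of minting: bump next_id without marking the ID used.
    fn mint_unmarked(&mut self) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.used.push(false);
        id
    }

    /// Refill: queue every minted ID that is not marked used.
    fn refill(&mut self) {
        for id in 0..self.next_id {
            if !self.used[id as usize] && !self.fast_free.contains(&id) {
                self.fast_free.push_back(id);
            }
        }
    }

    /// Take an ID from the fast free list and mark it used.
    fn pop_free(&mut self) -> Option<u32> {
        let id = self.fast_free.pop_front()?;
        self.used[id as usize] = true;
        Some(id)
    }

    /// The second half of minting: mark the ID used.
    fn mark_used(&mut self, id: u32) {
        self.used[id as usize] = true;
    }
}

/// Replays the interleaving and returns (Thread A's ID, Thread B's ID).
fn replay_race() -> (u32, u32) {
    let mut fsm = ToyFsm::new();
    let a = fsm.mint_unmarked(); // Thread A, step 1: id0 is minted but unmarked
    fsm.refill();                // Thread B, step 3: refill sees id0 as unused
    let b = fsm.pop_free().expect("refill queued id0");
    fsm.mark_used(a);            // Thread A, step 4: marks id0 used, too late
    (a, b)
}
```

In this replay both threads end up holding the same ID, which is the double hand-out described above.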

One possible fix is to reorder the operations so that we mark IDs used before bumping the next_id atomic; however, this requires a manual CAS loop. Instead, I combined max_block and next_id into the same RwLock so that these operations are simpler and explicitly controlled.
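A rough sketch of the single-lock approach follows. The field names `next_id` and `max_block` and the `id_minter` lock come from the PR; everything else is an assumption made for illustration: the `used` bitmap living under the same lock, the meaning of `max_block` as a count of allocated blocks, the `IDS_PER_BLOCK` constant, and the helper names. The point is only that marking an ID used and publishing the new `next_id` happen under one write lock, so no reader can observe a minted but unmarked ID.

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};
use std::thread;

const IDS_PER_BLOCK: u32 = 64; // assumed block size, for illustration only

/// Hypothetical minting state: everything a mint touches sits behind one lock.
struct IdMinter {
    next_id: u32,
    max_block: u32,  // assumed to count allocated blocks
    used: Vec<bool>, // assumed to live under the same lock in this sketch
}

struct ToyFsm {
    id_minter: RwLock<IdMinter>,
}

impl ToyFsm {
    fn new() -> Self {
        let state = IdMinter { next_id: 0, max_block: 0, used: Vec::new() };
        Self { id_minter: RwLock::new(state) }
    }

    /// Mint a new ID. The ID is marked used before next_id becomes visible.
    fn mint(&self) -> u32 {
        let mut m = self.id_minter.write().unwrap();
        let id = m.next_id;
        if id >= m.max_block * IDS_PER_BLOCK {
            // Grow by one block while still holding the write lock.
            m.max_block += 1;
            let len = (m.max_block * IDS_PER_BLOCK) as usize;
            m.used.resize(len, false);
        }
        m.used[id as usize] = true;
        m.next_id = id + 1;
        id
    }

    /// What a refill scan would see: minted IDs that are not marked used.
    fn unused_minted(&self) -> Vec<u32> {
        let m = self.id_minter.read().unwrap();
        (0..m.next_id).filter(|&id| !m.used[id as usize]).collect()
    }
}

/// Mint from several threads at once and collect every ID handed out.
fn mint_concurrently(threads: usize, per_thread: usize) -> Vec<u32> {
    let fsm = Arc::new(ToyFsm::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let fsm = Arc::clone(&fsm);
            thread::spawn(move || (0..per_thread).map(|_| fsm.mint()).collect::<Vec<u32>>())
        })
        .collect();
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}

/// True if no ID appears twice.
fn all_unique(ids: &[u32]) -> bool {
    ids.iter().copied().collect::<HashSet<u32>>().len() == ids.len()
}
```

The trade-off is that every mint takes the write lock, which serializes minting; the sketch accepts that cost in exchange for avoiding a hand-written CAS loop.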

In addition, there was another race during delete: the ID was marked free inside delete(), so it could be recycled before the function exited. To address this, delete() now removes only the mapping and attributes, which prevents the vector from being returned from search. The release() call then finishes the deletion and marks the node free only after the writes complete. However, current diskann does not actually call release(), so a call was added to the end of inplace_delete().
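The split between delete() and release() can be sketched with in-memory maps. This is an illustration under stated assumptions, not the provider's implementation: `ToyProvider` and its fields are hypothetical, and the real code talks to Garnet storage and the FSM rather than `HashMap`s.

```rust
use std::collections::{HashMap, HashSet};

/// Toy provider. Field and method names are illustrative only.
#[derive(Default)]
struct ToyProvider {
    external_ids: HashMap<u32, String>, // internal ID -> external ID mapping
    vectors: HashMap<u32, Vec<f32>>,    // vector payload per internal ID
    free_ids: HashSet<u32>,             // IDs that may be handed out again
}

impl ToyProvider {
    fn insert(&mut self, id: u32, external: &str, vector: Vec<f32>) {
        self.free_ids.remove(&id);
        self.external_ids.insert(id, external.to_string());
        self.vectors.insert(id, vector);
    }

    /// Phase 1: drop only the mapping, so search can no longer return the vector.
    /// The ID is NOT marked free here, so it cannot be recycled yet.
    fn delete(&mut self, id: u32) -> bool {
        self.external_ids.remove(&id).is_some()
    }

    /// Phase 2: remove the payload, then mark the ID free as the last step.
    fn release(&mut self, id: u32) {
        self.vectors.remove(&id);
        self.free_ids.insert(id);
    }

    fn is_searchable(&self, id: u32) -> bool {
        self.external_ids.contains_key(&id)
    }

    fn is_recyclable(&self, id: u32) -> bool {
        self.free_ids.contains(&id)
    }
}
```

Between the two phases the ID is neither searchable nor recyclable, which is the window in which the remaining cleanup writes can complete safely.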

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses concurrency races in the Garnet provider’s ID lifecycle by tightening synchronization around ID minting and by splitting “soft delete” (remove mappings/attributes) from “final release” (delete vector payload + mark ID free), then wiring inplace_delete() to call release().

Changes:

  • Call data_provider.release() at the end of Index::inplace_delete() to complete deletions and safely recycle IDs.
  • Refactor Garnet provider delete/release responsibilities (mappings removed in delete(), payload + fsm.mark_free() in release()), and adjust search post-processing to skip candidates whose external-id mapping is missing.
  • Rework FreeSpaceMap ID minting state (next_id + max_block) under a single RwLock and serialize refills with a dedicated Mutex; bump diskann-garnet version to 2.0.3.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| diskann/src/graph/index.rs | Ensures inplace_delete() calls release() after dropping the adjacency list. |
| diskann-garnet/src/provider.rs | Splits delete vs release behavior; updates release to perform final cleanup; search post-process skips missing mappings. |
| diskann-garnet/src/fsm.rs | Consolidates ID minting/block expansion state under one lock and adds a refill mutex to avoid concurrent refills. |
| diskann-garnet/diskann-garnet.nuspec | Version bump to 2.0.3. |
| diskann-garnet/Cargo.toml | Version bump to 2.0.3. |
| Cargo.lock | Locks updated for diskann-garnet 2.0.3. |

Comment on lines +828 to 835

    // Mark the ID free in the FSM.
    if let Err(e) = self.fsm.mark_free(context, id) {
        return future::ready(Err(e.into()));
    };

    if !ok {
        return future::ready(Err(GarnetError::Delete.into()));
    }

Comment thread diskann-garnet/src/fsm.rs
Comment on lines +279 to 283

    let id_minter = self.id_minter.read().unwrap();
    if id >= id_minter.next_id {
        return Err(FsmError::IdOutOfRange(id));
    }

Comment thread diskann-garnet/src/fsm.rs

     }

    -/// Mark an ID according to value (true = used, false = free).
    +/// Mark an ID according to value (true = used, false = free), but don't check that its in range.

Comment on lines 1109 to 1112

     let id = match accessor.provider.to_external_id(accessor.context, n.id) {
         Ok(id) => id,
    -    Err(e) => return future::ready(Err(e)),
    +    Err(_) => continue, // vector got deleted; skip
     };
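
The skip-on-missing-mapping behavior in the snippet above can be illustrated in isolation. This is a simplified sketch: `Neighbor`, the free function `to_external_id`, and `post_process` are stand-ins invented here, with a `HashMap` replacing the provider lookup and no futures involved.

```rust
use std::collections::HashMap;

/// Stand-in for a search candidate. Not the diskann type.
struct Neighbor {
    id: u32,
    distance: f32,
}

/// Stand-in lookup: fails when the internal ID has no external mapping.
fn to_external_id(mapping: &HashMap<u32, String>, id: u32) -> Result<String, String> {
    mapping
        .get(&id)
        .cloned()
        .ok_or_else(|| format!("no mapping for internal id {id}"))
}

/// Translate candidates to external IDs, skipping any whose mapping is gone.
fn post_process(mapping: &HashMap<u32, String>, candidates: &[Neighbor]) -> Vec<(String, f32)> {
    let mut results = Vec::new();
    for n in candidates {
        let external = match to_external_id(mapping, n.id) {
            Ok(external) => external,
            Err(_) => continue, // mapping removed by a concurrent delete; skip
        };
        results.push((external, n.distance));
    }
    results
}
```

Note that this sketch treats every lookup error as a deletion. Telling a missing mapping apart from other failures would need a richer error type than the `String` used here.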