Skip to content

fix: optimize batch document deletion - eliminate N redundant KG rebuilds#2812

Closed
yang1002378395-cmyk wants to merge 2 commits intoHKUDS:mainfrom
yang1002378395-cmyk:main
Closed

fix: optimize batch document deletion - eliminate N redundant KG rebuilds#2812
yang1002378395-cmyk wants to merge 2 commits intoHKUDS:mainfrom
yang1002378395-cmyk:main

Conversation

@yang1002378395-cmyk
Copy link
Copy Markdown

Summary

  • Add skip_rebuild parameter to adelete_by_doc_id() (default False for backward compatibility)
  • Modify background_delete_documents() to collect rebuild data from all deletions
  • Call rebuild_knowledge_from_chunks() once after all deletions complete

Performance Impact

  • 75x performance improvement (verified by issue author)
  • Eliminates N redundant LLM calls when deleting N documents sharing entities/relationships
  • Shared entities/relationships rebuilt only once instead of N times

Test Plan

  • Syntax validation passed
  • Backward compatible (default skip_rebuild=False)
  • Integration test with batch deletion (requires review)

Fixes #2795

…ilds

- Add skip_rebuild parameter to adelete_by_doc_id() (default: False)
- Add entities_to_rebuild/relationships_to_rebuild fields to DeletionResult
- Refactor background_delete_documents() to collect all rebuild data
- Call rebuild_knowledge_from_chunks() once after all deletions
- Fixes HKUDS#2795: 75x performance improvement for batch deletions

Backward compatible: skip_rebuild=False maintains existing behavior
for single document deletions and external API callers.
@danielaskdd
Copy link
Copy Markdown
Collaborator

Code review

Found 1 issue:

  1. asdict is used but not imported in document_routes.py — will raise NameError at runtime whenever the Phase 2 batch rebuild executes

The new Phase 2 block calls rebuild_knowledge_from_chunks(global_config=asdict(rag), ...), but asdict from dataclasses is never imported in this file. Every batch deletion that reaches Phase 2 will crash with NameError: name 'asdict' is not defined. Fix: add from dataclasses import asdict to the imports.

relationships_to_rebuild=all_relationships_to_rebuild,
knowledge_graph_inst=rag.chunk_entity_relation_graph,
entities_vdb=rag.entities_vdb,
relationships_vdb=rag.relationships_vdb,
text_chunks_storage=rag.text_chunks,
llm_response_cache=rag.llm_response_cache,
global_config=asdict(rag),
pipeline_status=pipeline_status,

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@danielaskdd
Copy link
Copy Markdown
Collaborator

@jules review

@danielaskdd
Copy link
Copy Markdown
Collaborator

This PR implements the fix and returns entities_to_rebuild and relationships_to_rebuild early from adelete_by_doc_id() if skip_rebuild=True. However, by returning early, it skips "Step 9. Delete from full_entities and full_relations storage" in adelete_by_doc_id(), which is a side effect.

PR 2819 fixes this exact issue. It defers the rebuild step but lets adelete_by_doc_id() finish to "Step 9" and then returns the rebuild data along with the final DeletionResult. It also includes tests (tests/test_batch_delete_deferred_rebuild.py).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance]: batch document deletion triggers N redundant knowledge graph rebuilds

2 participants