perf(ingest/gc): batch hard-delete soft-deleted entities #17843
Draft
treff7es wants to merge 1 commit into
The soft-deleted-entity cleanup in datahub-gc previously issued three GMS round-trips per entity (status fetch, hard delete, reference cleanup), making large GC runs slow and CPU-intensive. Buffer eligible URNs and remove them in bulk via a single DELETE /openapi/entities/v1/?urns=...&soft=false request per delete_batch_size (default 1000), exposed as a new DataHubGraph.hard_delete_entities() helper. The per-entity retention check is unchanged; the per-entity delete_references_to_urn call is dropped. The buffer is swapped out under a lock and flushed outside it so the bulk request never blocks worker threads.
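The swap-under-lock pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name `UrnDeleteBuffer` and the `flush_fn` callable are hypothetical, and in the real change the flush would presumably call `DataHubGraph.hard_delete_entities()`.

```python
import threading
from typing import Callable, List


class UrnDeleteBuffer:
    """Hypothetical buffer that collects URNs and flushes them in batches."""

    def __init__(
        self, flush_fn: Callable[[List[str]], None], batch_size: int = 1000
    ) -> None:
        self._flush_fn = flush_fn  # assumed stand-in for the bulk delete call
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self._pending: List[str] = []

    def add(self, urn: str) -> None:
        batch: List[str] = []
        with self._lock:
            self._pending.append(urn)
            if len(self._pending) >= self._batch_size:
                # Swap the full buffer for a fresh list while holding the lock.
                batch, self._pending = self._pending, []
        if batch:
            # The slow bulk request runs outside the lock, so other
            # workers can keep appending to the new buffer meanwhile.
            self._flush_fn(batch)

    def flush(self) -> None:
        """Flush whatever is left, e.g. at the end of a GC run."""
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._flush_fn(batch)
```

Holding the lock only for the list swap keeps the critical section to a few pointer operations, which is why the bulk request does not block worker threads.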
Summary
datahub-gc's soft-deleted-entity cleanup deletes entities one by one: for each entity it issues three GMS round-trips (status fetch, hard delete, reference cleanup). On large stores (e.g. Query-entity GC) this is slow and CPU/IO-intensive, and it amplifies the MCL/reindex load.

What changed
- DataHubGraph.hard_delete_entities(urns): removes a batch of entities in a single request via DELETE /openapi/entities/v1/?urns=...&soft=false.
- soft_deleted_entity_cleanup.py: eligible URNs are buffered and removed in bulk batches (delete_batch_size, default 1000) instead of per entity. The buffer is swapped out under a lock and flushed outside it, so the bulk request never blocks worker threads. The per-entity retention check (status.removed + age vs retention_days) is unchanged.

Notes / trade-offs
- The per-entity delete_references_to_urn call is dropped: GC no longer actively scrubs inbound references to a deleted URN. Documented in docs/how/updating-datahub.md; run a separate reference-cleanup pass if you rely on it.
- delete_batch_size is tunable if your GMS rejects very large URN lists.
- The delete limit (limit_entities_delete) can overshoot by up to one batch, since num_hard_deleted now increments at flush time.

Checklist
- Tests (tests/unit/test_gc.py: batch delete + buffer flush)
- Docs (docs/how/updating-datahub.md)
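To illustrate the limit overshoot mentioned under Notes / trade-offs, here is a simplified single-threaded model. It is an assumption about the mechanism, not the PR's code: the function `simulate_gc` is hypothetical, and it only models the fact that the limit check reads a counter that moves at flush time.

```python
from typing import Iterable, List


def simulate_gc(urns: Iterable[str], limit: int, batch_size: int) -> int:
    """Return how many URNs get deleted when the counter moves only at flush."""
    num_hard_deleted = 0
    pending: List[str] = []
    for urn in urns:
        # The limit check sees flushed deletions only, not buffered URNs.
        if num_hard_deleted >= limit:
            break
        pending.append(urn)
        if len(pending) >= batch_size:
            num_hard_deleted += len(pending)  # counter increments at flush time
            pending = []
    # Final flush of whatever is still buffered when the loop ends.
    num_hard_deleted += len(pending)
    return num_hard_deleted


# With a limit of 2500 and batches of 1000, the third batch is already
# buffered before the counter crosses the limit.
candidates = [f"urn:li:query:{i}" for i in range(10_000)]
deleted = simulate_gc(candidates, limit=2500, batch_size=1000)
```

In this model the overshoot is always smaller than one batch, so lowering delete_batch_size tightens how closely the limit is honored.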