perf(ingest/gc): batch hard-delete soft-deleted entities#17843

Draft
treff7es wants to merge 1 commit into master from fix/gc-soft-delete-batch

@treff7es

Contributor

Summary

datahub-gc's soft-deleted-entity cleanup deletes entities one-by-one — for each entity it issues three GMS round-trips (status fetch, hard delete, reference cleanup). On large stores (e.g. Query-entity GC) this is slow and CPU/IO-intensive and amplifies the MCL/reindex load.

What changed

  • DataHubGraph.hard_delete_entities(urns): removes a batch of entities in a single request via DELETE /openapi/entities/v1/?urns=...&soft=false.
  • soft_deleted_entity_cleanup.py: eligible URNs are buffered and removed in bulk batches (delete_batch_size, default 1000) instead of per-entity. The buffer is swapped out under a lock and flushed outside it, so the bulk request never blocks worker threads. The per-entity retention check (status.removed + age vs retention_days) is unchanged.
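The swap-under-lock, flush-outside-lock pattern from the second bullet can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `BatchDeleter`, `add`, `flush_remaining`, and `send_batch` are hypothetical names, and the bulk call (in the PR, `DataHubGraph.hard_delete_entities`) is replaced by an injected callable so the sketch runs without a GMS.

```python
import threading
from typing import Callable, List


class BatchDeleter:
    """Hypothetical sketch: buffer URNs and flush them in batches."""

    def __init__(
        self, send_batch: Callable[[List[str]], None], batch_size: int = 1000
    ) -> None:
        self._send_batch = send_batch  # stand-in for a bulk hard-delete call
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self._buffer: List[str] = []
        self.num_hard_deleted = 0

    def add(self, urn: str) -> None:
        to_flush: List[str] = []
        with self._lock:
            self._buffer.append(urn)
            if len(self._buffer) >= self._batch_size:
                # Swap the full buffer out while holding the lock...
                to_flush, self._buffer = self._buffer, []
        # ...and send it after releasing the lock, so other workers keep buffering.
        if to_flush:
            self._flush(to_flush)

    def flush_remaining(self) -> None:
        # Drain whatever is left once all workers have finished.
        with self._lock:
            to_flush, self._buffer = self._buffer, []
        if to_flush:
            self._flush(to_flush)

    def _flush(self, urns: List[str]) -> None:
        self._send_batch(urns)
        with self._lock:
            # The counter only moves at flush time.
            self.num_hard_deleted += len(urns)


sent: List[List[str]] = []
deleter = BatchDeleter(send_batch=sent.append, batch_size=3)
for i in range(7):
    deleter.add(f"urn:li:query:q{i}")
deleter.flush_remaining()
print([len(batch) for batch in sent])  # [3, 3, 1]
```

Holding the lock only for the list swap keeps the critical section short; the slow network call runs on a list that no other thread can see.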

Notes / trade-offs

  • The per-entity delete_references_to_urn call is dropped — GC no longer actively scrubs inbound references to a deleted URN. Documented in docs/how/updating-datahub.md; run a separate reference-cleanup pass if you rely on it.
  • delete_batch_size is tunable if your GMS rejects very large URN lists.
  • The deletion soft-cap (limit_entities_delete) can overshoot by up to one batch, since num_hard_deleted now increments at flush time.
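The overshoot in the last bullet can be illustrated with a small single-threaded simulation. It assumes the cap is checked against the flushed count before each URN is buffered; that placement is an assumption for illustration, and `simulate_gc` is a hypothetical helper, not part of the PR.

```python
def simulate_gc(total_eligible: int, limit: int, batch_size: int) -> int:
    """Return how many entities get hard-deleted under a flush-time counter."""
    num_hard_deleted = 0  # only updated when a batch is flushed
    buffered = 0
    for _ in range(total_eligible):
        if num_hard_deleted >= limit:  # soft cap sees only the flushed count
            break
        buffered += 1
        if buffered >= batch_size:
            num_hard_deleted += buffered
            buffered = 0
    num_hard_deleted += buffered  # final flush of any partial batch
    return num_hard_deleted


print(simulate_gc(total_eligible=10_000, limit=2_500, batch_size=1_000))  # 3000
print(simulate_gc(total_eligible=10_000, limit=2_000, batch_size=1_000))  # 2000
```

Under these assumptions the cap is exact when it is a multiple of the batch size and is otherwise exceeded by less than one batch; lowering `delete_batch_size` tightens the bound.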

Checklist

  • PR conforms to the title format
  • Tests added/updated (tests/unit/test_gc.py: batch delete + buffer flush)
  • Notable-change entry added in docs/how/updating-datahub.md

The soft-deleted-entity cleanup in datahub-gc previously issued three GMS
round-trips per entity (status fetch, hard delete, reference cleanup),
making large GC runs slow and CPU-intensive.

Buffer eligible URNs and remove them in bulk via a single
DELETE /openapi/entities/v1/?urns=...&soft=false request per
delete_batch_size (default 1000), exposed as a new
DataHubGraph.hard_delete_entities() helper. The per-entity retention
check is unchanged; the per-entity delete_references_to_urn call is
dropped. The buffer is swapped out under a lock and flushed outside it so
the bulk request never blocks worker threads.
@github-actions bot added the ingestion and docs labels Jun 10, 2026
@codecov

codecov Bot commented Jun 10, 2026

Codecov Report

❌ Patch coverage is 70.96774% with 9 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ta-ingestion/src/datahub/ingestion/graph/client.py | 14.28% | 6 Missing ⚠️ |
| ...ingestion/source/gc/soft_deleted_entity_cleanup.py | 87.50% | 3 Missing ⚠️ |

@datahub-connector-tests

Connector Tests Results

Connector tests failed for commit 1337adb

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.
