Summary
Harbor's garbage collector is purely DB-driven: it finds orphaned blobs by querying the blob table and removes entries not referenced by any project. This works correctly for the blobs Harbor knows about. However, blobs that reach the final storage path (blobs/sha256/<2>/<64>/data) without a corresponding blob table record are permanently invisible to GC and accumulate indefinitely with no cleanup path.
We are calling these storage-only orphans to distinguish them from the DB-tracked orphans that GC already handles.
Evidence
We observed a growing discrepancy between registry storage size and the size Harbor reports in its UI/API on a dedicated BuildKit cache registry instance (~6 weeks old, S3 backend):
| Source | Blobs | Size |
|---|---|---|
| S3 storage | 1,254,732 | ~84 TB |
| Harbor blob table | 163,952 | ~14 TB |
| Storage-only orphans | 1,090,780 | ~74 TB |
We confirmed this by listing all blobs/sha256/*/data objects in S3, extracting digests, and cross-referencing against SELECT digest FROM blob. 87% of storage blobs have no DB record.
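The cross-reference itself is just a set difference between the two digest lists. A minimal sketch in Go (the digest values and in-memory sets here are illustrative stand-ins; in practice the inputs are the digests parsed from the S3 object keys and the rows returned by `SELECT digest FROM blob`):

```go
package main

import (
	"fmt"
	"sort"
)

// orphanDigests returns the digests present in storage but absent from the
// DB set -- the storage-only orphans described above.
func orphanDigests(storage, db map[string]bool) []string {
	var orphans []string
	for d := range storage {
		if !db[d] {
			orphans = append(orphans, d)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	// Tiny inline samples stand in for the real digest lists.
	storage := map[string]bool{"sha256:aaa": true, "sha256:bbb": true, "sha256:ccc": true}
	db := map[string]bool{"sha256:bbb": true}
	fmt.Println(orphanDigests(storage, db)) // prints: [sha256:aaa sha256:ccc]
}
```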
We also verified that sampled storage-only orphan blobs have no repository link files in repositories/<repo>/_layers/sha256/<digest>/link, meaning they are unreachable through the registry protocol too — not just missing from Harbor's DB.
The orphans accumulate steadily:
- 2026-04-27: 67,979 blobs (~4.1 TB that day)
- 2026-04-28: 61,875 blobs (~3.4 TB)
- 2026-04-29: 56,086 blobs (~3.1 TB)
- 2026-04-30: 32,594 blobs (~1.6 TB)
Root Cause
Harbor GC's mark phase (uselessBlobs / GetBlobsNotRefedByProjectBlob) queries:
```sql
SELECT b.* FROM blob AS b
LEFT JOIN project_blob pb ON b.id = pb.blob_id
WHERE pb.id IS NULL
AND b.update_time <= now() - interval '<time_window> hours'
```
This only surfaces blobs already in the blob table. Any blob that reaches blobs/sha256/<2>/<64>/data in the storage backend without a corresponding DB row — due to a transient DB write failure, a registry crash between the storage commit and the DB write, or any other failure between those two steps — is permanently outside Harbor GC's scope. The GC has no mechanism to discover it.
The Docker Distribution registry's built-in garbage-collect command handles this correctly: it marks blobs by walking all manifest files in storage (not the DB), then sweeps all storage blobs not in that mark set. Harbor's custom GC replaces this with a DB-driven approach, trading correctness for the ability to avoid read-only mode — but in doing so it loses the ability to clean storage-only orphans.
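For contrast, the mark-and-sweep shape that distribution's `garbage-collect` uses can be sketched as follows. This is a simplification, not the real implementation: manifest parsing is elided, and the map stands in for "manifest digest → digests of its config and layers".

```go
package main

import "fmt"

// markAndSweep sketches distribution's GC shape: mark every blob reachable
// from a manifest found in storage, then sweep everything in storage that
// is unmarked. Note that the DB is never consulted.
func markAndSweep(manifests map[string][]string, storageBlobs []string) []string {
	marked := make(map[string]bool)
	for manifestDigest, referenced := range manifests {
		marked[manifestDigest] = true // the manifest blob itself is reachable
		for _, d := range referenced {
			marked[d] = true
		}
	}
	var sweep []string
	for _, d := range storageBlobs {
		if !marked[d] {
			sweep = append(sweep, d) // in storage, referenced by nothing
		}
	}
	return sweep
}

func main() {
	manifests := map[string][]string{"sha256:m1": {"sha256:cfg1", "sha256:layer1"}}
	storage := []string{"sha256:m1", "sha256:cfg1", "sha256:layer1", "sha256:orphan"}
	fmt.Println(markAndSweep(manifests, storage)) // prints: [sha256:orphan]
}
```

Because the mark set is built from storage rather than from a database, a blob with no DB record is still swept — which is exactly the property Harbor's DB-driven GC gives up.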
Proposed Fix: Add a Storage-Scan Phase to Harbor GC
We propose adding an optional GC phase that inverts the current logic: instead of starting from the DB and checking storage, it starts from storage and checks the DB.
Sketch of the new phase:
```go
func (gc *GarbageCollector) sweepStorageOrphans(ctx job.Context) error {
	// Walk all blobs in the registry storage backend
	err := gc.registryCtlClient.WalkStorage("blobs/sha256/", func(digest string, modTime time.Time) error {
		// Respect the same time_window as the existing GC to protect
		// blobs that are mid-upload (blob data written, DB write pending)
		if time.Since(modTime) < time.Duration(gc.timeWindowHours)*time.Hour {
			return nil
		}
		// Check if Harbor DB knows about this blob
		_, err := gc.blobMgr.Get(ctx.SystemContext(), digest)
		if errors.IsNotFoundErr(err) {
			// Not in DB and old enough — delete from storage only
			// (no DB record to clean up)
			return gc.registryCtlClient.DeleteBlob(digest)
		}
		return err
	})
	return err
}
```
This would be gated behind a new GC parameter (e.g. scan_storage: true, default false) so operators can opt in. It could also be scheduled less frequently than the main GC given it involves a full storage walk.
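Concretely, the opt-in could ride alongside the existing GC job parameters. A sketch of the job parameter payload, with the caveat that `scan_storage` is the proposed (hypothetical) addition while `delete_untagged` and `dry_run` are parameters Harbor's GC accepts today:

```json
{
  "parameters": {
    "delete_untagged": true,
    "dry_run": false,
    "scan_storage": true
  }
}
```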
What this requires from registryctl: a new WalkStorage endpoint (or reuse of the existing storage driver's Walk function) that streams blob paths from the backend. The storage driver interface in distribution/distribution already supports Walk; it would need to be exposed via the registryctl API.
What this does NOT require: read-only mode, downtime, or changes to Harbor's quota or quota tracking logic (since storage-only orphans have no DB record, nothing needs to be decremented).
Alternative Already in the Codebase
markOrSweepUntaggedBlobs in garbage_collection.go is a related but distinct function — it cleans stale project_blob entries for DB-tracked blobs whose artifacts have been deleted. It is complementary to the proposed fix, not a substitute.
Questions for Maintainers
- Is the failure mode described above (blob reaching storage without a DB record) a known scenario? Are there existing mitigations we may have missed?
- Is there an existing registryctl API or storage driver hook that could support the WalkStorage operation without additional endpoints?
- Would a PR implementing the storage-scan phase with an opt-in parameter be welcome?
Happy to share more data, provide a proof-of-concept implementation, or discuss the tradeoffs further.