Summary
Harbor's garbage collector is purely DB-driven: it finds orphaned blobs by querying the blob table and removes entries not referenced by any project. This works correctly for the blobs Harbor knows about. However, blobs that reach the final storage path (blobs/sha256/<2>/<64>/data) without a corresponding blob table record are permanently invisible to GC and accumulate indefinitely with no cleanup path.
We are calling these storage-only orphans to distinguish them from the DB-tracked orphans that GC already handles.
Evidence
We observed a growing discrepancy between registry storage size and the size Harbor reports in its UI/API on a dedicated BuildKit cache registry instance (~6 weeks old, S3 backend):
| Source | Blobs | Size |
|---|---|---|
| S3 storage | 1,254,732 | ~84 TB |
| Harbor blob table | 163,952 | ~14 TB |
| Storage-only orphans | 1,090,780 | ~74 TB |
We confirmed this by listing all blobs/sha256/*/data objects in S3, extracting digests, and cross-referencing against SELECT digest FROM blob. 87% of storage blobs have no DB record.
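The cross-reference itself is just a set difference between the two digest lists. A minimal sketch in Go (the digest values and in-memory sets here are illustrative stand-ins; in practice the inputs are the digests parsed from the S3 object keys and the rows returned by `SELECT digest FROM blob`):

```go
package main

import (
	"fmt"
	"sort"
)

// orphanDigests returns the digests present in storage but absent from the
// DB set -- the storage-only orphans described above.
func orphanDigests(storage, db map[string]bool) []string {
	var orphans []string
	for d := range storage {
		if !db[d] {
			orphans = append(orphans, d)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	// Tiny inline samples stand in for the real digest lists.
	storage := map[string]bool{"sha256:aaa": true, "sha256:bbb": true, "sha256:ccc": true}
	db := map[string]bool{"sha256:bbb": true}
	fmt.Println(orphanDigests(storage, db)) // prints: [sha256:aaa sha256:ccc]
}
```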
We also verified that sampled storage-only orphan blobs have no repository link files in repositories/<repo>/_layers/sha256/<digest>/link, meaning they are unreachable through the registry protocol too — not just missing from Harbor's DB.
The orphans accumulate steadily:
- 2026-04-27: 67,979 blobs (~4.1 TB that day)
- 2026-04-28: 61,875 blobs (~3.4 TB)
- 2026-04-29: 56,086 blobs (~3.1 TB)
- 2026-04-30: 32,594 blobs (~1.6 TB)
Root Cause
Harbor GC's mark phase (uselessBlobs / GetBlobsNotRefedByProjectBlob) queries:
```sql
SELECT b.* FROM blob AS b
LEFT JOIN project_blob pb ON b.id = pb.blob_id
WHERE pb.id IS NULL
AND b.update_time <= now() - interval '<time_window> hours'
```
This only surfaces blobs already in the blob table. Any blob that reaches blobs/sha256/<2>/<64>/data in the storage backend without a corresponding DB row — due to a transient DB write failure, a registry crash between the storage commit and the DB write, or any other failure between those two steps — is permanently outside Harbor GC's scope. The GC has no mechanism to discover it.
The Docker Distribution registry's built-in garbage-collect command handles this correctly: it marks blobs by walking all manifest files in storage (not the DB), then sweeps all storage blobs not in that mark set. Harbor's custom GC replaces this with a DB-driven approach, trading correctness for the ability to avoid read-only mode — but in doing so it loses the ability to clean storage-only orphans.
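For contrast, the mark-and-sweep shape that distribution's `garbage-collect` uses can be sketched as follows. This is a simplification, not the real implementation: manifest parsing is elided, and the map stands in for "manifest digest → digests of its config and layers".

```go
package main

import "fmt"

// markAndSweep sketches distribution's GC shape: mark every blob reachable
// from a manifest found in storage, then sweep everything in storage that
// is unmarked. Note that the DB is never consulted.
func markAndSweep(manifests map[string][]string, storageBlobs []string) []string {
	marked := make(map[string]bool)
	for manifestDigest, referenced := range manifests {
		marked[manifestDigest] = true // the manifest blob itself is reachable
		for _, d := range referenced {
			marked[d] = true
		}
	}
	var sweep []string
	for _, d := range storageBlobs {
		if !marked[d] {
			sweep = append(sweep, d) // in storage, referenced by nothing
		}
	}
	return sweep
}

func main() {
	manifests := map[string][]string{"sha256:m1": {"sha256:cfg1", "sha256:layer1"}}
	storage := []string{"sha256:m1", "sha256:cfg1", "sha256:layer1", "sha256:orphan"}
	fmt.Println(markAndSweep(manifests, storage)) // prints: [sha256:orphan]
}
```

Because the mark set is built from storage rather than from a database, a blob with no DB record is still swept — which is exactly the property Harbor's DB-driven GC gives up.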
Proposed Fix: Add a Storage-Scan Phase to Harbor GC
We propose adding an optional GC phase that inverts the current logic: instead of starting from the DB and checking storage, it starts from storage and checks the DB.
Sketch of the new phase:
```go
func (gc *GarbageCollector) sweepStorageOrphans(ctx job.Context) error {
	// Walk all blobs in the registry storage backend
	err := gc.registryCtlClient.WalkStorage("blobs/sha256/", func(digest string, modTime time.Time) error {
		// Respect the same time_window as the existing GC to protect
		// blobs that are mid-upload (blob data written, DB write pending)
		if time.Since(modTime) < time.Duration(gc.timeWindowHours)*time.Hour {
			return nil
		}
		// Check if Harbor DB knows about this blob
		_, err := gc.blobMgr.Get(ctx.SystemContext(), digest)
		if errors.IsNotFoundErr(err) {
			// Not in DB and old enough — delete from storage only
			// (no DB record to clean up)
			return gc.registryCtlClient.DeleteBlob(digest)
		}
		return err
	})
	return err
}
```
This would be gated behind a new GC parameter (e.g. scan_storage: true, default false) so operators can opt in. It could also be scheduled less frequently than the main GC given it involves a full storage walk.
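Concretely, the opt-in could ride alongside the existing GC job parameters. A sketch of the job parameter payload, with the caveat that `scan_storage` is the proposed (hypothetical) addition while `delete_untagged` and `dry_run` are parameters Harbor's GC accepts today:

```json
{
  "parameters": {
    "delete_untagged": true,
    "dry_run": false,
    "scan_storage": true
  }
}
```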
What this requires from registryctl: a new WalkStorage endpoint (or reuse of the existing storage driver's Walk function) that streams blob paths from the backend. The storage driver interface in distribution/distribution already supports Walk; it would need to be exposed via the registryctl API.
What this does NOT require: read-only mode, downtime, or changes to Harbor's quota or quota tracking logic (since storage-only orphans have no DB record, nothing needs to be decremented).
Alternative Already in the Codebase
markOrSweepUntaggedBlobs in garbage_collection.go is a related but distinct function — it cleans stale project_blob entries for DB-tracked blobs whose artifacts have been deleted. It is complementary to the proposed fix, not a substitute.
Questions for Maintainers
- Is the failure mode described above (blob reaching storage without a DB record) a known scenario? Are there existing mitigations we may have missed?
- Is there an existing registryctl API or storage driver hook that could support the WalkStorage operation without additional endpoints?
- Would a PR implementing the storage-scan phase with an opt-in parameter be welcome?
Happy to share more data, provide a proof-of-concept implementation, or discuss the tradeoffs further.