- Status: Draft
- Authors: Tonbo team
- Created: 2026-03-13
- Updated: 2026-03-16
- Area: Manifest, Compaction, Retention
Define a simple GC model for obsolete SST objects based on one rule: delete only what is unreachable from the manifest root set. The first implementation targets correctness for the single-process Tonbo model: one process owns writes, retention, and GC, while many readers may observe retained manifest versions through that process. WAL GC stays on its current floor-driven path and is not merged into the SST GC design. This RFC is explicitly not a distributed GC design; a distributed Tonbo deployment will require a different mental model rather than a small extension of this one.
Tonbo already publishes new immutable objects through manifest CAS, but the
current SST GC story is still centered on a single mutable GcPlanState. That
shape is not sufficient for the target correctness model:
- a compaction outcome can identify obsolete SST inputs, but that alone does not prove they are safe to delete,
- a single mutable GC plan can be overwritten, which is acceptable for a hint but not for the source of truth,
- single-writer, multiple-reader correctness needs a clear reclaim rule now, before more worker machinery or distributed coordination is added,
- WAL GC is already reasonably isolated and should not be destabilized by trying to force one unified GC framework too early.
The immediate problem is therefore narrower than “full distributed GC”: Tonbo needs a small, defensible reclaim contract that is correct under manifest versioning and can evolve later.
- Define the root set that protects SST objects from deletion.
- Make the root set the only authority for SST deletion.
- Deliver a first implementation that is correct for the single-process Tonbo deployment model.
- Keep WAL GC standalone and floor-driven.
- Separate must-have behavior from later enhancements.
- Require observability and benchmarking as part of each GC story.
- Define a minimal metrics set that lets us measure GC cost and benefit.
- Unifying WAL GC and SST GC under one worker model.
- Designing append-only GC work-item queues in this RFC.
- Adding checkpoint-aware GC semantics now.
- Designing distributed or multi-process GC semantics.
- Finalizing rate limiting, adaptive cadence, or benchmark-tuned defaults.
- Covering orphan cleanup and repair scans in detail.
Today compaction can produce a single GC plan containing obsolete SST and WAL references. That is useful as a local hint, but it is not a sound correctness boundary for SST deletion even in a single-process system:
- readers reason from committed manifest versions,
- compaction only sees one moment in time,
- the mutable GC plan can be replaced by later work.
The consequence is simple: SST GC must not trust the GC plan as deletion authority.
The SST GC root set is:
- the current HEAD version,
- any explicit hard pins that already exist in the single-process deployment.
That is the entire must-have root set for the first implementation.
The first release intentionally does not require:
- explicit reader pin records,
- checkpoint-aware pins,
- distributed reader registries,
- separate pin objects in the manifest.
The first milestone deliberately avoids adding new retention semantics. If later evidence shows the initial root set is too small or too blunt, explicit pins or retention-driven root sets can be added as follow-up work.
This should not be read as a distributed design placeholder. It is a deliberate single-process simplification:
- one process decides retention,
- one process runs GC,
- one process has authority to translate retained versions into reclaimable files.
Once Tonbo supports distributed writers, distributed readers, or independent GC workers, this simplification stops being sufficient.
An SST data file or SST sidecar is eligible for deletion only if it is unreachable from every version in the root set.
Equivalently:
- build the root set,
- collect every SST object path reachable from that root set,
- delete only manifest-published SST objects that are not in that reachable set.
Candidate discovery from compaction is only a hint. It can speed up what the sweeper looks at first, but it cannot decide safety.
This keeps the safety rule simple:
- manifest versions define visibility,
- the current root set defines which versions remain protected,
- reachability defines deletability.
The first GC milestone is correctness for one Tonbo process managing manifest versions, the current root set, and GC.
The contract is:
- the owning process may publish new SSTs and remove old SST references from HEAD,
- SST GC may only delete objects that are unreachable from the current root set.
This means the implementation can be conservative without being wrong. If GC keeps extra files for a while, that is a liveness or cost issue. If it deletes a file still reachable from the current root set, that is a correctness failure.
This RFC intentionally does not claim that the same mental model scales to a distributed Tonbo deployment.
In a distributed deployment, at least these assumptions break:
- the process performing GC cannot assume it sees every active reader,
- the local root set may no longer be a sufficient proxy for reader protection,
- independent workers may race on planning, deletion, and version visibility,
- reclaim safety may depend on explicit distributed coordination rather than on local reachability plus a locally computed root set.
That means the distributed design is not just “this RFC plus a reader pin.” It will require a different mental model with different safety boundaries.
Version-retention policy is intentionally not part of the first SST GC milestone.
That means:
- no
max_ttlrequirement in this RFC, - no
max_versionsrequirement in this RFC, - no promise yet about time-travel retention as part of the initial GC design.
If Tonbo later wants retention-driven root sets, that should be added as a follow-up change after the basic single-process GC path is working and measured.
WAL GC remains floor-driven and standalone.
This RFC does not change the current WAL model:
WalFloorin the manifest remains the protection boundary,- WAL candidate discovery remains sequence-based,
- WAL pruning remains independent of SST reachability sweeps.
The only shared concept between WAL GC and SST GC is the manifest as metadata authority. Trying to merge them now would make the SST design more complicated without improving the immediate correctness target.
The initial SST GC execution model is intentionally minimal:
- load HEAD and the current root set from the manifest,
- compute the reachable SST object set,
- compare that set with known manifest-published SST objects,
- delete unreachable objects,
- treat missing objects as already reclaimed.
This may run in a simple in-process worker or via a manually triggered path. The implementation may use compaction output as an optimization hint, but not as deletion authority.
Writers do not need to wait for GC bookkeeping for correctness. If the first implementation blocks briefly around local cleanup because that is the simplest way to ship, that is acceptable, provided that:
- publish visibility is still decided by manifest CAS,
- delete failures only delay reclaim,
- GC never becomes part of the commit success condition.
| Category | Must have now | Enhancement later |
|---|---|---|
| Safety authority | Root set from HEAD and existing hard pins | Retention-driven root sets, explicit pins, reader registries, leases |
| Reader protection | Single-process protection from the current root set | Fine-grained pinned versions |
| SST discovery | Reachability over manifest-published SSTs | Durable per-event GC queues |
| WAL GC | Keep current floor-driven path | Optional operational convergence only if proven useful |
| Benchmarking | Baseline-first benchmark story before enablement | Remote-store simulations, cadence tuning |
| Retention policy | Deferred | max_ttl, max_versions, time-travel retention |
| Checkpoints | Out of GC scope; copy a chosen version elsewhere | Rich checkpoint/tag integration |
| Metrics | Minimal cost/benefit metrics | Full backlog, lease, and rate-limit dashboards |
| Rate limiting | Undefined for now | Delete throttling and service isolation |
For clarity, the right reading of the “enhancement later” column is:
- items that strengthen the single-process design without changing its core mental model,
- not a promise that distributed GC can be reached by incrementally extending this RFC.
The first implementation should measure only what helps judge whether GC is worth running and whether it harms the foreground path:
- bytes deleted,
- SST objects deleted,
- reclaim run duration,
- write latency before/after enabling GC,
- read latency before/after enabling GC,
- physical size before/after GC.
These metrics are enough to answer the practical question: what storage savings do we get, and what latency cost do we pay?
GC should not be enabled by default based only on correctness tests. Each GC milestone should start with a benchmark-baseline story that captures the pre-change behavior, then re-run the same scenarios after implementation.
The benchmark workflow should follow two simple rules:
- capture a baseline before the GC change or before enabling the new path,
- compare against that baseline after the change using the same workload.
The initial benchmark set should answer:
- what is write latency before and after GC,
- what is read latency before and after GC,
- how many bytes and objects GC reclaims,
- how much persistent size remains before and after GC,
- how much extra manifest or object-store work GC causes.
This is intentionally close to how neighboring projects already work:
aislekeeps explicit baseline docs and Criterion comparisons,fusio-manifestwraps benchmark storage paths with counters so I/O behavior is directly measured.
Tonbo should do the same for GC: define the baseline first, then use that baseline to justify shipping or enabling the feature.
Observability is part of the definition of done for every GC story.
At minimum, each story should specify:
- what counters or timings are added,
- how benchmark runs will capture them,
- which correctness or cost question those measurements answer.
If a story changes GC behavior but does not say how to observe or benchmark that change, the story is incomplete.
When Tonbo moves beyond the single-process deployment model, GC should be re-specified in a new RFC instead of treating this one as the distributed baseline.
That future RFC should start by redefining:
- who owns reader protection,
- who owns delete authority,
- what metadata forms the distributed root set,
- what coordination model replaces the single-process simplifications used here.
GC does not need checkpoint-specific semantics in the first design.
A checkpoint can stay simple: copy a chosen version and its referenced data to a separate location, bucket, or prefix. Once copied, that checkpoint is outside the online GC contract for the source database.
The narrow failure model for the first implementation is:
- if GC crashes before delete, nothing is reclaimed,
- if GC crashes after deleting some files, retry is safe because missing files are treated as already deleted,
- if the root set is computed incorrectly, SST GC can become unsafe,
- if the root set is conservative, reclaim is delayed.
The main correctness work is therefore not “build more GC machinery”; it is “compute and retain the right root set”.
Rejected because compaction does not define the full reader-visible protection surface. It can suggest likely garbage, but it cannot replace root-set reachability.
Rejected because WAL GC already has a simpler and more stable safety boundary. Forcing a shared worker or queue model now would make the SST design harder to ship without improving correctness.
Rejected for now because it delays the first correct implementation. The first milestone should not pull in extra reader-protection semantics before the basic single-process GC path exists.
- Should the first SST sweep run only as an explicit maintenance action, or as a lightweight in-process background loop?
- Should Tonbo keep GC benchmarks in the main crate, or in a dedicated bench workspace similar to adjacent repos once the scenarios grow?
- Explicit pinned versions for long-lived readers.
- Retention-driven root sets using
max_ttlandmax_versions. - Distributed reader protection for shared-storage multi-process deployments.
- Durable GC work queues if liveness or observability requires them.
- Checkpoint tags and named pins if the product needs in-place retained copies.
- Delete throttling once benchmark data exists.
- Orphan discovery and repair scans as a separate maintenance lane.