What happened
When using a time-based cutoff (-c P3D), the sweep phase can delete files that are still referenced by live snapshots. This happens when the same Iceberg table base location is associated with multiple content IDs in the Nessie commit history.
The sweep processes each content ID independently. When an older content ID and a newer content ID share the same S3 base location, the older content ID's sweep builds a bloom filter containing only the files of its own snapshots. It then lists all files in the shared base location and marks everything not in its bloom filter as expired, including files that belong to the newer (live) content ID's snapshots.
When deferred-deletes runs, it deletes the union of all deferred files, causing data loss.
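The failure mode described above can be modeled in a few lines. This is only an illustrative sketch, not the nessie-gc implementation: plain Python sets stand in for the bloom filters, and the function name, content IDs, and file paths are all hypothetical.

```python
def sweep_per_content_id(live_files_by_content_id, files_in_base_location):
    """Model of the reported behavior: every content ID is swept in isolation."""
    deferred = set()
    for live_files in live_files_by_content_id.values():
        # Everything listed under the shared base location that this one
        # content ID does not reference is treated as expired.
        deferred |= files_in_base_location - live_files
    return deferred


# Two content IDs (hypothetical) that point at the same base location.
live = {
    "cid-old": {"data/a.parquet"},
    "cid-new": {"data/b.parquet", "data/c.parquet"},
}
listing = {"data/a.parquet", "data/b.parquet", "data/c.parquet"}

deferred = sweep_per_content_id(live, listing)
print(sorted(deferred))
```

In this model each sweep defers the other content ID's files, so the union handed to deferred-deletes covers every file in the location even though all of them are still live.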
How to reproduce it
- Have Iceberg tables managed by Nessie with active Flink streaming jobs continuously committing snapshots
- Run nessie-gc.jar mark-live with -c P3D (3-day cutoff)
- Run nessie-gc.jar sweep --defer-deletes
- Run nessie-gc.jar deferred-deletes
Nessie server type (docker/uber-jar/built from source) and version
nessie-gc 0.107.5
Client type (Ex: UI/Spark/pynessie ...) and version
No response
Impact
After deferred-deletes completed:
- All Iceberg tables became unreadable (NoSuchKeyException on data files and metadata)
- Flink streaming jobs failed with file-not-found errors
- Downstream query engines (Doris) returned S3 404 errors on all tables
Expected behavior
The sweep phase should aggregate live files across ALL content IDs that share a base location before determining which files are orphaned. A file should only be considered expired if it is not referenced by ANY live content version for that base location.
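A minimal sketch of the expected behavior, under the same simplifying assumptions: sets instead of bloom filters, and hypothetical names throughout (sweep_by_base_location, the content IDs, the bucket path). It is meant to show the grouping step, not to propose a concrete patch.

```python
from collections import defaultdict


def sweep_by_base_location(live_files_by_content_id, base_location_of, listing_of):
    """Union live files per base location before deciding what is expired."""
    live_by_location = defaultdict(set)
    for content_id, live_files in live_files_by_content_id.items():
        live_by_location[base_location_of[content_id]] |= live_files

    # A file is expired only if no content ID sharing the location references it.
    return {
        location: files - live_by_location[location]
        for location, files in listing_of.items()
    }


location = "s3://example-bucket/warehouse/db/tbl"  # hypothetical path
live_sets = {
    "cid-old": {"data/a.parquet"},
    "cid-new": {"data/b.parquet", "data/c.parquet"},
}
base_locations = {"cid-old": location, "cid-new": location}
listings = {
    location: {
        "data/a.parquet",
        "data/b.parquet",
        "data/c.parquet",
        "data/orphan.parquet",
    }
}

expired = sweep_by_base_location(live_sets, base_locations, listings)
print(expired)
```

Here only data/orphan.parquet is reported as expired, because the live sets of both content IDs are merged before the listing is compared against them.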
Workaround
Using -c NONE (default, no cutoff) avoids the issue because all commits are considered live, so all content versions' files are protected. However, this means old snapshots and their data files are never cleaned up, which defeats the purpose of running GC.
Additional finding
The data loss only affected tables with upsert (equality delete / copy-on-write) enabled. Append-only tables were unaffected.
This is consistent with the root cause: upsert tables rewrite data files on updates, so old snapshots reference different files than new snapshots for the same base location. When the sweep processes an old content ID, its bloom filter does not contain the rewritten files from newer snapshots, so those files are marked as expired.
Append-only tables are safe because each snapshot only adds files — older snapshots reference a subset of the current files, so no content ID's sweep will mark current files as expired.
Environment
- nessie-gc 0.107.5
- Iceberg tables with S3FileIO
- Flink streaming jobs continuously writing to tables
- PostgreSQL as JDBC backend for GC live-content-sets
- Tested on AWS S3