[Bug]: Sweep phase deletes live files when multiple content IDs share the same base location #12363

@djdroy

Description

What happened

When using a time-based cutoff (-c P3D), the sweep phase can delete files that are still referenced by live snapshots. This happens when the same Iceberg table base location is associated with multiple content IDs in the Nessie commit history.

The sweep operates per content-id independently. When an older content ID and a newer content ID share the same S3 base location, the older content ID's sweep builds a bloom filter containing only its own snapshot's files. It then lists all files in the shared base location and marks everything not in its bloom filter as expired — including files that belong to the newer (live) content ID's snapshots.

When deferred-deletes runs, it deletes the union of all deferred files, causing data loss.
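To make the failure mode concrete, here is a minimal toy model of the behavior described above. It is not Nessie GC code: the function name, content IDs, and file paths are all hypothetical, and a plain set stands in for the bloom filter.

```python
# Toy model of the reported behavior. All names here are hypothetical
# illustrations, not Nessie GC APIs.

def sweep_per_content_id(live_files_by_content_id, files_in_location):
    """Return, per content ID, the files a per-content-id sweep would mark expired."""
    expired = {}
    for content_id, live_files in live_files_by_content_id.items():
        # A plain set stands in for the bloom filter mentioned in the report.
        live_filter = set(live_files)
        # Everything listed in the shared location but absent from this
        # content ID's own filter is treated as expired.
        expired[content_id] = {f for f in files_in_location if f not in live_filter}
    return expired


# Two content IDs that share one base location (illustrative paths).
live = {
    "cid-old": {"data/a.parquet"},
    "cid-new": {"data/b.parquet", "data/c.parquet"},
}
listing = {"data/a.parquet", "data/b.parquet", "data/c.parquet"}

expired = sweep_per_content_id(live, listing)

# deferred-deletes acts on the union of everything any sweep marked.
deferred = set().union(*expired.values())
print(sorted(expired["cid-old"]))
print(deferred == listing)
```

In this model the sweep for `cid-old` marks both files of `cid-new` as expired, and the union handed to deferred-deletes covers every file in the location, which matches the total loss seen in practice.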

How to reproduce it

  1. Have Iceberg tables managed by Nessie with active Flink streaming jobs continuously committing snapshots
  2. Run nessie-gc.jar mark-live with -c P3D (3-day cutoff)
  3. Run nessie-gc.jar sweep --defer-deletes
  4. Run nessie-gc.jar deferred-deletes

Nessie server type (docker/uber-jar/built from source) and version

nessie-gc 0.107.5

Client type (Ex: UI/Spark/pynessie ...) and version

No response

Impact

After deferred-deletes completed:

  • All Iceberg tables became unreadable (NoSuchKeyException on data files and metadata)
  • Flink streaming jobs failed with file-not-found errors
  • Downstream query engines (Doris) returned S3 404 errors on all tables

Expected behavior

The sweep phase should aggregate live files across ALL content IDs that share a base location before determining which files are orphaned. A file should only be considered expired if it is not referenced by ANY live content version for that base location.
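A rough sketch of the aggregation I have in mind, assuming live sets can be grouped by base location before listing files. The function name and the input shapes are hypothetical and only illustrate the idea; they do not reflect how Nessie GC stores live-content-sets.

```python
from collections import defaultdict

# Sketch of the expected behavior. All names and data shapes are
# hypothetical, not Nessie GC APIs.

def sweep_per_base_location(live_sets, listings):
    """Compute orphaned files per base location.

    live_sets: iterable of (content_id, base_location, live_files) tuples.
    listings:  dict mapping base_location to the set of files found there.
    Returns a dict mapping base_location to files that no live content ID references.
    """
    live_by_location = defaultdict(set)
    for _content_id, base_location, live_files in live_sets:
        # Merge the live files of every content ID that shares this location.
        live_by_location[base_location] |= set(live_files)
    return {
        location: set(files) - live_by_location[location]
        for location, files in listings.items()
    }


live_sets = [
    ("cid-old", "s3://bucket/tbl", {"data/a.parquet"}),
    ("cid-new", "s3://bucket/tbl", {"data/b.parquet", "data/c.parquet"}),
]
listings = {
    "s3://bucket/tbl": {
        "data/a.parquet",
        "data/b.parquet",
        "data/c.parquet",
        "data/orphan.parquet",
    }
}

orphans = sweep_per_base_location(live_sets, listings)
print(orphans)
```

With the live sets merged per location, only the file that no content ID references is reported as orphaned.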

Workaround

Using -c NONE (default, no cutoff) avoids the issue because all commits are considered live, so all content versions' files are protected. However, this means old snapshots and their data files are never cleaned up, which defeats the purpose of running GC.

Additional finding

The data loss only affected tables with upsert (equality delete / copy-on-write) enabled. Append-only tables were unaffected.

This is consistent with the root cause: upsert tables rewrite data files on updates, so old snapshots reference different files than new snapshots for the same base location. When the sweep processes an old content ID, its bloom filter does not contain the rewritten files from newer snapshots, so it marks them as expired.

Append-only tables are safe because each snapshot only adds files — older snapshots reference a subset of the current files, so no content ID's sweep will mark current files as expired.

Environment

  • nessie-gc 0.107.5
  • Iceberg tables with S3FileIO
  • Flink streaming jobs continuously writing to tables
  • PostgreSQL as JDBC backend for GC live-content-sets
  • Tested on AWS S3
