[Feature Request] [Kernel] During Active-AddFile-Log-Replay do not pass the RemoveFile to the parquet (checkpoint) reader #4102

Closed
@scottsand-db

Description

ActiveAddFilesIterator.java is the class in Kernel responsible for replaying the delta log and figuring out which AddFiles are indeed active at the given version of the table (i.e. they have not been logically deleted or "tombstoned" by a RemoveFile).

See here for an explanation and summary of the reverse-log-replay logic implemented.

Note that we only look at RemoveFiles that come from Delta commit (.json) files. We do not look at any RemoveFiles from checkpoint (parquet) files. The reason: when we look at a given AddFile X and want to determine whether X is still present in a version of the table, we need to cover two cases.

  1. X was read from a json file. Then there may have been a RemoveFile later (also in a json) that removed it. Hence, we must keep track of RemoveFiles from json files.
  2. X was read from a checkpoint parquet file. Well, if X was written to the checkpoint file, then it was by definition active at that version of the table. Note that X still could be deleted by a RemoveFile later in a .json, just like in the case above, but there is certainly no RemoveFile in the checkpoint parquet file that removed it.

This means that: we do not need to read any RemoveFiles when we read checkpoint parquet files during active-add-file-log-replay.
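
To make the two cases above concrete, here is a minimal sketch of reverse log replay. It is not the actual `ActiveAddFilesIterator` code: the `Action` class, the `fromCheckpoint` flag, and keying files by path alone are simplifying assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model; not the Kernel ActiveAddFilesIterator API.
final class ActiveAddFilesSketch {
    enum Kind { ADD, REMOVE }

    static final class Action {
        final Kind kind;
        final String path;
        final boolean fromCheckpoint; // true if read from a checkpoint parquet file

        Action(Kind kind, String path, boolean fromCheckpoint) {
            this.kind = kind;
            this.path = path;
            this.fromCheckpoint = fromCheckpoint;
        }
    }

    /** Returns the active AddFile paths. Actions must be ordered newest first. */
    static List<String> activeAddFiles(List<Action> newestFirst) {
        Set<String> tombstones = new HashSet<>(); // paths removed by a newer .json commit
        Set<String> seenAdds = new HashSet<>();   // paths already seen in a newer AddFile
        List<String> active = new ArrayList<>();
        for (Action a : newestFirst) {
            if (a.kind == Kind.REMOVE) {
                // Only RemoveFiles from .json commits can tombstone an AddFile.
                // RemoveFiles stored in a checkpoint are ignored entirely.
                if (!a.fromCheckpoint) {
                    tombstones.add(a.path);
                }
                continue;
            }
            boolean removedLater = tombstones.contains(a.path);
            boolean newerAddSeen = !seenAdds.add(a.path);
            if (!removedLater && !newerAddSeen) {
                active.add(a.path);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        List<Action> log = List.of(
            new Action(Kind.REMOVE, "a.parquet", false), // commit 2 (.json)
            new Action(Kind.ADD, "c.parquet", false),    // commit 2 (.json)
            new Action(Kind.ADD, "a.parquet", true),     // checkpoint at version 1
            new Action(Kind.ADD, "b.parquet", true),     // checkpoint at version 1
            new Action(Kind.REMOVE, "z.parquet", true)); // checkpoint tombstone, ignored
        System.out.println(activeAddFiles(log));
    }
}
```

In this sketch the checkpoint's own RemoveFile for `z.parquet` never affects the result, so dropping it from the input yields the same list of active files (`c.parquet` and `b.parquet`).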

The feature request: avoid passing in the RemoveFile as part of the read schema to the parquet reader, here during active-add-file-log-replay.
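
As a rough illustration of the request, the read schema could be chosen per file type. The sketch below is an assumption-laden outline, not Kernel code: `LogFileType` and `readColumns` are hypothetical names, and the schema is reduced to a list of top-level action column names (`add`, `remove`) rather than a real Kernel schema object.

```java
import java.util.List;

// Hypothetical helper; names and types are illustrative, not part of Kernel.
final class ReplayReadSchemaSketch {
    enum LogFileType { COMMIT_JSON, CHECKPOINT_PARQUET }

    /** Top-level action columns to request during active-add-file-log-replay. */
    static List<String> readColumns(LogFileType type) {
        switch (type) {
            case COMMIT_JSON:
                // RemoveFiles in commits may tombstone older AddFiles.
                return List.of("add", "remove");
            case CHECKPOINT_PARQUET:
                // RemoveFiles in a checkpoint cannot affect the result, so skip the column.
                return List.of("add");
            default:
                throw new IllegalArgumentException("Unknown log file type: " + type);
        }
    }
}
```

Because parquet is a columnar format, a column left out of the read schema does not have to be read or decoded, which is where the expected saving would come from.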

The expected result here is: better performance when reading checkpoint files during active-add-file-log-replay.
