Description
`ActiveAddFilesIterator.java` is the class in Kernel responsible for replaying the Delta log and determining which `AddFile`s are active at a given version of the table (i.e. they have not been logically deleted, or "tombstoned", by a `RemoveFile`).
See here for an explanation and summary of the reverse-log-replay logic implemented.
Note that we only look at the `RemoveFile`s that are from Delta commit (`.json`) files. We do not look at any `RemoveFile`s from checkpoint (`parquet`) files. This is because, if we are looking at a given `AddFile` `X` and want to determine whether `X` is still present in a version of the table, we need to cover two cases:
1. `X` was read from a `json` file. Then there may have been a `RemoveFile` later (also in a `json` file) that removed it. Hence, we must keep track of `RemoveFile`s from `json` files.
2. `X` was read from a checkpoint `parquet` file. If `X` was written to the checkpoint file, then it was by definition active at that version of the table. Note that `X` could still be deleted by a `RemoveFile` later in a `.json` file, just like in the case above, but there is certainly no `RemoveFile` in the checkpoint `parquet` file that removed it.
This means that we do not need to read any `RemoveFile`s when we read checkpoint `parquet` files during active-add-file-log-replay.
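The reasoning above can be illustrated with a small, self-contained sketch of reverse log replay. This is not the Kernel implementation: the `Action`, `Kind`, and `Source` types and the `activeAddFiles` method are hypothetical names made up for this example, and files are keyed by path only, which simplifies what the real replay logic tracks. It only shows that tombstones need to be collected from commit `json` files and that any `RemoveFile` coming from a checkpoint can be ignored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Kernel code: all names here are illustrative assumptions.
class ReverseReplaySketch {
    enum Kind { ADD, REMOVE }
    enum Source { COMMIT_JSON, CHECKPOINT_PARQUET }

    static final class Action {
        final Kind kind;
        final String path;
        final Source source;

        Action(Kind kind, String path, Source source) {
            this.kind = kind;
            this.path = path;
            this.source = source;
        }
    }

    /** Expects actions ordered from the newest log file to the oldest. */
    static List<String> activeAddFiles(List<Action> actionsNewestFirst) {
        Set<String> tombstones = new HashSet<>(); // paths removed by a later commit
        Set<String> seenAdds = new HashSet<>();   // paths already emitted as active
        List<String> active = new ArrayList<>();

        for (Action a : actionsNewestFirst) {
            if (a.kind == Kind.REMOVE) {
                // Only removes from commit json files can tombstone an older add.
                // A remove read from a checkpoint never affects the result, which
                // is why it does not need to be read in the first place.
                if (a.source == Source.COMMIT_JSON) {
                    tombstones.add(a.path);
                }
            } else if (!tombstones.contains(a.path) && seenAdds.add(a.path)) {
                active.add(a.path);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        List<Action> log = List.of(
            new Action(Kind.REMOVE, "a.parquet", Source.COMMIT_JSON),     // newest commit
            new Action(Kind.ADD, "c.parquet", Source.COMMIT_JSON),
            new Action(Kind.ADD, "a.parquet", Source.CHECKPOINT_PARQUET), // checkpoint
            new Action(Kind.ADD, "b.parquet", Source.CHECKPOINT_PARQUET),
            new Action(Kind.REMOVE, "z.parquet", Source.CHECKPOINT_PARQUET));
        System.out.println(activeAddFiles(log)); // prints [c.parquet, b.parquet]
    }
}
```

In this example `a.parquet` is dropped because a later commit removed it, while the `RemoveFile` for `z.parquet` stored in the checkpoint has no effect on the output.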
The feature request: avoid passing in the `RemoveFile` as part of the read schema to the parquet reader, here during active-add-file-log-replay.
The expected result here is: better performance when reading checkpoint files during active-add-file-log-replay.
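As a rough sketch of what the request amounts to, the read schema could be chosen per log file type. The `readColumnsFor` helper below is hypothetical and does not reflect the Kernel API or its schema types. It assumes the file type can be told apart by extension and uses the top-level action column names `add` and `remove`.

```java
import java.util.List;

// Hypothetical sketch, not Kernel code: illustrates choosing a narrower
// read schema for checkpoint files than for commit files.
class ReadSchemaSketch {
    /** Returns the top-level action columns to request for a given log file. */
    static List<String> readColumnsFor(String logFileName) {
        if (logFileName.endsWith(".parquet")) {
            // Checkpoint file: removes are never needed for active-add-file replay.
            return List.of("add");
        }
        // Commit json file: removes are needed to tombstone older adds.
        return List.of("add", "remove");
    }
}
```

With a columnar format such as parquet, leaving `remove` out of the requested columns lets the reader skip that column's data, which is where the expected performance gain would come from.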