Description
I'm looking to start work on proper handling of delete files in table scans and so I'd like to open an issue to discuss some of the design decisions.
A core tenet of our approach so far has been to ensure that the tasks produced by the file plan are small, independent and self-contained, so that they can be easily distributed in architectures where the service that generates the file plan could be on a different machine to the service(s) that perform the file reads.
The `FileScanTask` struct represents these individual units of work at present. Currently, though, its shape is focussed on data files and it does not cater for including information on delete files that are produced by the scan. Here's how it looks now, for reference:
`iceberg-rust/crates/iceberg/src/scan.rs`, lines 859 to 886 at `cde35ab`
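For readers who don't want to follow the permalink, the struct is roughly the shape sketched below. The field names and types here are paraphrased from memory rather than copied verbatim, so treat them as approximate:

```rust
use iceberg::expr::BoundPredicate;
use iceberg::spec::{DataContentType, DataFileFormat, SchemaRef};

/// Approximate shape of the current FileScanTask (paraphrased, not verbatim).
pub struct FileScanTask {
    /// Byte range of the data file covered by this task.
    pub start: u64,
    pub length: u64,
    pub record_count: Option<u64>,

    /// The single data file this task reads.
    pub data_file_path: String,
    pub data_file_content: DataContentType,
    pub data_file_format: DataFileFormat,

    /// Projection and residual filter applied while reading.
    pub schema: SchemaRef,
    pub project_field_ids: Vec<i32>,
    pub predicate: Option<BoundPredicate>,
}
```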
In order to properly process delete files as part of executing a scan task, executors will now need to load any applicable delete files along with the data file that they are processing. I'll outline what happens now, then follow that with my proposed approach.
Current TableScan Synopsis
The current structure pushes all manifest file entries from the manifest list into a stream, which we then process concurrently in order to retrieve their associated manifests. Once retrieved, each manifest has each of its manifest entries extracted and pushed onto a channel so that they can be processed in parallel. Each entry is embedded inside a context object that contains the relevant information needed to process the manifest entry. Tokio tasks listening to the channel then execute `TableScan::process_manifest_entry` on these objects, where we filter out any entries that do not match the scan filter predicate.
At this point, a `FileScanTask` is created for each of those entries that match the scan predicate. The `FileScanTask`s are then pushed into a channel that produces the stream of `FileScanTask`s that is returned to the original caller of `plan_files`.
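To make that flow concrete, here is a heavily simplified, self-contained sketch of the channel-based pipeline. The types and the channel size are placeholders, not the real `ManifestEntryContext` or `FileScanTask`:

```rust
use futures::{channel::mpsc, SinkExt, StreamExt};

// Placeholder standing in for the real context object; the real one also
// carries the bound predicate, schema, projected field ids, etc.
struct ManifestEntryContext {
    data_file_path: String,
}

struct FileScanTask {
    data_file_path: String,
}

fn matches_scan_predicate(_ctx: &ManifestEntryContext) -> bool {
    // The real code evaluates the scan filter against the entry's metadata.
    true
}

#[tokio::main]
async fn main() {
    // Channel backing the FileScanTask stream returned to the caller.
    let (task_tx, task_rx) = mpsc::channel::<FileScanTask>(32);

    // Pretend these came from fetching the manifests named in the manifest list.
    let entries = vec![
        ManifestEntryContext { data_file_path: "data/a.parquet".into() },
        ManifestEntryContext { data_file_path: "data/b.parquet".into() },
    ];

    // Each manifest entry is processed on its own task; matching entries
    // become FileScanTasks pushed into the output channel.
    for ctx in entries {
        let mut tx = task_tx.clone();
        tokio::spawn(async move {
            if matches_scan_predicate(&ctx) {
                let _ = tx
                    .send(FileScanTask { data_file_path: ctx.data_file_path })
                    .await;
            }
        });
    }
    drop(task_tx); // the stream ends once every producer has finished

    // The caller of plan_files consumes the resulting stream.
    let tasks: Vec<_> = task_rx.collect().await;
    for t in &tasks {
        println!("scan task for {}", t.data_file_path);
    }
}
```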
Changes to TableScan
FileScanTask
Each `FileScanTask` represents a scan to be performed on a single data file. However, multiple delete files may need to be applied to any one data file. Additionally, the scope of applicability of a delete file is any data file within the same partition as the delete file - i.e. the same delete file may need to be applied to multiple data files. Thus an executor needs to know not just the data file that it is processing, but all of the delete files that are applicable to that data file.
The first part of the set of changes that I'm proposing is to refactor `FileScanTask` so that it represents a single data file and zero or more delete files; a combined sketch of the resulting struct follows the list below.
- The `data_file_content` property would be removed - each task is implicitly about a file of type `Data`.
- A new struct, `DeleteFileEntry`, would be added. It would look something like this:

  ```rust
  struct DeleteFileEntry {
      path: String,
      format: DataFileFormat,
  }
  ```

- A `delete_files` property of type `Vec<DeleteFileEntry>` would be added to `FileScanTask` to represent the delete files that are applicable to its data file.
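Putting those three points together, the refactored task might end up looking roughly like this. Only `DeleteFileEntry` and `delete_files` come from the proposal above; the remaining fields are paraphrased from the current struct and are approximate:

```rust
use iceberg::expr::BoundPredicate;
use iceberg::spec::{DataFileFormat, SchemaRef};

/// A delete file that must be applied when reading the task's data file
/// (as proposed above).
pub struct DeleteFileEntry {
    pub path: String,
    pub format: DataFileFormat,
}

/// Sketch of the refactored task: one data file plus the delete files that
/// apply to it. `data_file_content` is gone; the file is implicitly `Data`.
pub struct FileScanTask {
    pub start: u64,
    pub length: u64,
    pub record_count: Option<u64>,

    pub data_file_path: String,
    pub data_file_format: DataFileFormat,

    pub schema: SchemaRef,
    pub project_field_ids: Vec<i32>,
    pub predicate: Option<BoundPredicate>,

    /// Zero or more delete files applicable to this data file.
    pub delete_files: Vec<DeleteFileEntry>,
}
```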
TableScan::plan_files and associated methods
We need to update this logic to ensure that we can properly populate this new `delete_files` property. Each `ManifestEntryContext` will need the list of delete files so that, if the manifest entry that it encapsulates passes the filtering steps, it can populate the new `delete_files` property when it constructs the `FileScanTask`.
A naive approach may be to simply build a list of all of the delete files referred to by the top-level manifest list and give references to this list to all `ManifestEntryContext`s so that, if any delete files are present, all of them are included in every `FileScanTask`. This would be a good first step - code that works inefficiently is better than code that does not work at all! It would also permit work to proceed on the execution side.
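A minimal sketch of that naive sharing, assuming a hypothetical `delete_files` field on a trimmed-down `ManifestEntryContext` (the real context carries much more state):

```rust
use std::sync::Arc;

#[derive(Clone)]
struct DeleteFileEntry {
    path: String,
}

// Hypothetical, trimmed-down context; the real ManifestEntryContext also
// carries the schema, bound predicates, projected field ids, etc.
struct ManifestEntryContext {
    data_file_path: String,
    // Shared, immutable list of every delete file in the snapshot; cloning
    // the Arc is cheap, so every context can hold a reference to the same list.
    delete_files: Arc<Vec<DeleteFileEntry>>,
}

struct FileScanTask {
    data_file_path: String,
    delete_files: Vec<DeleteFileEntry>,
}

impl ManifestEntryContext {
    fn into_file_scan_task(self) -> FileScanTask {
        FileScanTask {
            data_file_path: self.data_file_path,
            // Naive first step: attach *all* delete files to every task.
            delete_files: self.delete_files.as_ref().clone(),
        }
    }
}
```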
Improvements could then be made to refine this approach, filtering out inapplicable delete files so that only the relevant ones go into each `FileScanTask`'s `delete_files` property.
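One such refinement could filter by partition before constructing the task, along the lines of the sketch below. The partition and spec-id fields here are hypothetical stand-ins for whatever the manifest entry and delete file metadata actually expose:

```rust
// Hypothetical metadata kept alongside each delete file so it can be
// scoped to a partition.
#[derive(Clone, PartialEq)]
struct PartitionKey {
    spec_id: i32,
    // Serialized partition values; real code would use the spec's Struct/Literal types.
    values: Vec<String>,
}

#[derive(Clone)]
struct DeleteFileEntry {
    path: String,
    partition: Option<PartitionKey>, // None => applies to the whole table
}

/// Keep only the delete files whose partition matches the data file's,
/// plus any global (unpartitioned) deletes.
fn applicable_deletes(
    data_partition: &PartitionKey,
    all_deletes: &[DeleteFileEntry],
) -> Vec<DeleteFileEntry> {
    all_deletes
        .iter()
        .filter(|d| match &d.partition {
            None => true,
            Some(p) => p == data_partition,
        })
        .cloned()
        .collect()
}
```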
How does this sound so far, @liurenjie1024, @Xuanwo, @ZENOTME, @Fokko?