Scan Delete Support Part 4: Delete File Loading; Skeleton for Processing #982

Open
wants to merge 4 commits into
base: main

Conversation

Contributor

@sdd sdd commented Feb 21, 2025

Extends the DeleteFileManager introduced in #950 to include loading of delete files, storage and retrieval of parsed delete files from shared state, and the outline for how parsing will connect up to this new work.

Issue: #630

@sdd sdd force-pushed the feat/delete-fila-manager-loading branch 5 times, most recently from edb1d27 to 8e90bdd Compare February 23, 2025 14:55
@sdd sdd marked this pull request as ready for review February 26, 2025 09:20
@sdd sdd force-pushed the feat/delete-fila-manager-loading branch 4 times, most recently from ec8e7c1 to 06f0df5 Compare March 5, 2025 19:53
Contributor

@jonathanc-n jonathanc-n left a comment

This is nice, will look at the parsed records next.

struct DeleteFileManagerState {
// delete vectors and positional deletes get merged when loaded into a single delete vector
// per data file
delete_vectors: HashMap<String, RoaringTreemap>,
Contributor

I think a function should be included for enabling deletion vectors for the future when a property is added.

Contributor Author

Sorry @jonathanc-n, I'm a bit confused as to what you mean, could you explain?
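For readers following this thread: the comment on the `delete_vectors` field quoted above says that positional deletes are merged, at load time, into a single delete vector per data file. A minimal sketch of that idea is below. It is not the PR's code: `BTreeSet<u64>` stands in for `RoaringTreemap` so the example needs no external crates, and `PositionalDelete`, `merge_positional_deletes` and `is_deleted` are hypothetical names chosen for illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for a per-file delete vector: the set of deleted row positions.
/// The PR uses `RoaringTreemap`; `BTreeSet<u64>` is an assumption made here
/// only to keep the sketch dependency-free.
type DeleteVector = BTreeSet<u64>;

/// Hypothetical positional-delete record: a data file path and a row position.
struct PositionalDelete {
    file_path: String,
    pos: u64,
}

/// Merge positional deletes into one delete vector per data file.
fn merge_positional_deletes(
    delete_vectors: &mut HashMap<String, DeleteVector>,
    deletes: impl IntoIterator<Item = PositionalDelete>,
) {
    for d in deletes {
        // Create the vector for this data file on first use, then record the position.
        delete_vectors.entry(d.file_path).or_default().insert(d.pos);
    }
}

/// Returns true if the row at `pos` in `file_path` has been marked deleted.
fn is_deleted(
    delete_vectors: &HashMap<String, DeleteVector>,
    file_path: &str,
    pos: u64,
) -> bool {
    delete_vectors
        .get(file_path)
        .map_or(false, |dv| dv.contains(&pos))
}
```

Keying the map by data file path means a scan task only needs a single lookup to find every deleted position for the file it is reading, regardless of how many delete files contributed to it.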

@sdd sdd force-pushed the feat/delete-fila-manager-loading branch 6 times, most recently from 5530bc3 to e997fc6 Compare March 31, 2025 17:27
Contributor Author

sdd commented Apr 3, 2025

@liurenjie1024, @Xuanwo, @Fokko - this is ready for re-review, if you could take a look that would be great!

@sdd sdd force-pushed the feat/delete-fila-manager-loading branch from e997fc6 to 056e73f Compare April 3, 2025 07:28
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @sdd for this PR. There are some missing points in the current design. Also, I would suggest not putting too much into DeleteFileManager. I think DeleteFileManager should act more like a delete loader, which manages the I/O and caching of record batches. The actual filtering could be delegated to a DeleteFilter, WDYT? I think a good reference implementation is Java's DeleteFilter, see https://github.com/apache/iceberg/blob/af8e3f5a40f4f36bbe1d868146749e2341471586/data/src/main/java/org/apache/iceberg/data/DeleteFilter.java#L50
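To make the suggested split concrete, here is a rough sketch of a loader that owns I/O and caching, with filtering delegated to a separate type. This is only an illustration of the separation of concerns, not the design proposed in the PR or the Java implementation: `DeleteLoader`, `DeleteFilter`, `load` and `live_rows` are hypothetical names, the `read` closure stands in for real file I/O, and the cache holds parsed row positions rather than record batches.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical loader: owns I/O and caching of parsed delete files,
/// keyed by delete file path.
struct DeleteLoader {
    cache: HashMap<String, BTreeSet<u64>>,
    /// Number of cache misses, kept only to make the caching visible.
    loads: usize,
}

impl DeleteLoader {
    fn new() -> Self {
        DeleteLoader { cache: HashMap::new(), loads: 0 }
    }

    /// Return the parsed positions for `path`, calling `read` only on a cache miss.
    /// `read` is an assumption standing in for opening and parsing a delete file.
    fn load(&mut self, path: &str, read: impl FnOnce(&str) -> Vec<u64>) -> &BTreeSet<u64> {
        if !self.cache.contains_key(path) {
            self.loads += 1;
            let parsed: BTreeSet<u64> = read(path).into_iter().collect();
            self.cache.insert(path.to_string(), parsed);
        }
        &self.cache[path]
    }
}

/// Hypothetical filter: pure logic over already-loaded deletes, no I/O.
struct DeleteFilter<'a> {
    deleted: &'a BTreeSet<u64>,
}

impl DeleteFilter<'_> {
    /// Row positions in `0..num_rows` that have not been deleted.
    fn live_rows(&self, num_rows: u64) -> Vec<u64> {
        (0..num_rows).filter(|p| !self.deleted.contains(p)).collect()
    }
}
```

With this shape, the loader can be shared across scan tasks while each task builds its own cheap filter over the cached data.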

/// Parses record batch streams from individual equality delete files
///
/// Returns an unbound Predicate for each batch stream
async fn parse_equality_deletes_record_batch_stream(
Contributor

Transforming eq deletes to a predicate may be inefficient for Arrow. In theory, an eq delete filter is just a null-safe anti join, so calling an Arrow kernel may be more efficient.

Contributor Author

Seeing as this is around efficiency rather than correctness, would you be ok with deferring a more efficient implementation to a follow-up PR? I'd prefer to have some correct support for delete files sooner and then refine it to perform better afterwards, if that is ok with you.
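As background for this exchange, the "null-safe anti join" view of equality deletes can be sketched in plain Rust: a data row is dropped when its equality-column values match any delete row, with NULL treated as equal to NULL. This is not an Arrow kernel and not the PR's implementation; the `Key` alias, the `anti_join` function and the use of `i64` columns are assumptions made purely for illustration.

```rust
use std::collections::HashSet;

/// One row's values for the equality columns; `None` stands for SQL NULL.
type Key = Vec<Option<i64>>;

/// Keep only the data rows whose key does not appear among the delete rows.
/// `Option`'s `Eq` and `Hash` treat `None` as equal to `None`, which is what
/// makes the comparison null-safe here.
fn anti_join(data: Vec<Key>, deletes: &[Key]) -> Vec<Key> {
    // Build the lookup set once, then probe it per data row.
    let delete_set: HashSet<Key> = deletes.iter().cloned().collect();
    data.into_iter()
        .filter(|row| !delete_set.contains(row))
        .collect()
}
```

The hash-set probe costs roughly constant time per data row, whereas a predicate built as a disjunction over every delete row grows with the number of deletes, which is the efficiency concern raised in the review comment.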

Contributor Author

sdd commented Apr 14, 2025

Thanks for the review @liurenjie1024 - much appreciated. Will come back with a revised design.

@sdd sdd force-pushed the feat/delete-fila-manager-loading branch 2 times, most recently from bd33aa5 to 39a26ab Compare April 17, 2025 06:39
@sdd sdd force-pushed the feat/delete-fila-manager-loading branch 2 times, most recently from 58b0f07 to 5739a46 Compare April 23, 2025 20:51
@sdd sdd force-pushed the feat/delete-fila-manager-loading branch from 5739a46 to 52cf8b9 Compare April 23, 2025 21:07
Contributor Author

sdd commented Apr 23, 2025

Back to you @liurenjie1024 - I've made the changes around missing functionality. There is still the open question of whether you are OK to defer the structural / performance changes to a follow-up, so that we can make more incremental progress rather than adding yet more changes to this PR.

3 participants