Skip to content

fix: ignore temp log entries #3423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

corwinjoy
Copy link

Description

Fix delta-rs methods so that delta_log entries in subdirectories, representing pending but uncommited transactions, are ignored.

Related Issue(s)

Closes load_with_datetime

Documentation

As per the bug report only root level delta files should be included:

According to the delta protocol:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
"Delta files are stored as JSON in a directory at the root of the table named _delta_log, and together with checkpoints make up the log of all changes that have occurred to a table."

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label May 10, 2025
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@corwinjoy corwinjoy changed the title Ignore temp log entries fix: Ignore temp log entries May 10, 2025
@@ -480,6 +480,7 @@ async fn list_log_files_with_checkpoint(
let version_prefix = format!("{:020}", cp.version);
let start_from = log_root.child(version_prefix.as_str());

// QUESTION: DOES THIS NEED TO BE FILTERED TO ELIMINATE temp subdirs?
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function, plus the one below may need to be fixed as well to ignore temp subdirectories but I could use some feedback here + pointers to the associated unit test.

@@ -0,0 +1,2 @@
{"commitInfo":{"timestamp":1587968626637,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"readVersion":4,"isBlindAppend":true}}
{"add":{"path":"part-00000-c1777d7d-89d9-4790-b38a-6ee7e24456b1-c001.snappy.parquet","partitionValues":{},"size":262,"modificationTime":1587968626600,"dataChange":true}}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pending operation in the .tmp subfolder + parquet file below.
Essentially just copied from existing transactions.

@corwinjoy
Copy link
Author

@adamreeve @EnricoMi

@corwinjoy corwinjoy changed the title fix: Ignore temp log entries fix: ignore temp log entries May 10, 2025
corwinjoy added 5 commits May 9, 2025 18:03
Upgrade load_with_datetime to ignore any uncommited deltas in any subdirectory of delta_log.

Signed-off-by: Corwin Joy <[email protected]>
@corwinjoy corwinjoy force-pushed the ignore_temp_log_entries branch from 9ad4e09 to c319ae4 Compare May 10, 2025 01:03
@EnricoMi
Copy link

What about ignoring any . file (not only .tmp) unless the file is specifically supported (e.g. .*.crc)?

Prefixing a file with a . expresses "please ignore".

@adamreeve
Copy link
Contributor

What about ignoring any . file (not only .tmp) unless the file is specifically supported (e.g. .*.crc)?

This fix will already ignore any files or directories beginning with ., as it excludes files in any subdirectories regardless of the subdirectory name, and the existing code skips any files where the name doesn't match the log file name format.

I wonder if it would make sense for object-store to support a non-recursive list operation as a longer-term fix? This seems like it would be a pretty common operation, and that way this operation could be optimised in object store implementations where possible, rather than needing to filter on the client side. Although for this scenario in Delta Lake I guess there won't often be many nested results to filter out.

@@ -13,6 +15,11 @@ async fn time_travel_by_ds() {
("00000000000000000002.json", "2020-05-03T22:47:31-07:00"),
("00000000000000000003.json", "2020-05-04T22:47:31-07:00"),
("00000000000000000004.json", "2020-05-05T22:47:31-07:00"),
// Final file is uncommitted by Spark and is in a .tmp subdir

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this fix ignores any sub-directory, the test should also contain a testcase of a sub-directory not starting with a dot.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need a second test case? The name of the subdirectory does not matter, any and all files in subdirectories are ignored. I only used the name .tmp to clarify what is going on / better match what spark does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

delta-rs includes pending versions written by spark
3 participants