-
Notifications
You must be signed in to change notification settings - Fork 475
fix: ignore temp log entries #3423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
@@ -480,6 +480,7 @@ async fn list_log_files_with_checkpoint( | |||
let version_prefix = format!("{:020}", cp.version); | |||
let start_from = log_root.child(version_prefix.as_str()); | |||
|
|||
// QUESTION: DOES THIS NEED TO BE FILTERED TO ELIMINATE temp subdirs? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function, plus the one below may need to be fixed as well to ignore temp subdirectories but I could use some feedback here + pointers to the associated unit test.
@@ -0,0 +1,2 @@ | |||
{"commitInfo":{"timestamp":1587968626637,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"readVersion":4,"isBlindAppend":true}} | |||
{"add":{"path":"part-00000-c1777d7d-89d9-4790-b38a-6ee7e24456b1-c001.snappy.parquet","partitionValues":{},"size":262,"modificationTime":1587968626600,"dataChange":true}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the pending operation in the .tmp subfolder + parquet file below.
Essentially just copied from existing transactions.
Upgrade load_with_datetime to ignore any uncommited deltas in any subdirectory of delta_log. Signed-off-by: Corwin Joy <[email protected]>
…tion. Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
Signed-off-by: Corwin Joy <[email protected]>
9ad4e09
to
c319ae4
Compare
What about ignoring any Prefixing a file with a |
This fix will already ignore any files or directories beginning with I wonder if it would make sense for |
@@ -13,6 +15,11 @@ async fn time_travel_by_ds() { | |||
("00000000000000000002.json", "2020-05-03T22:47:31-07:00"), | |||
("00000000000000000003.json", "2020-05-04T22:47:31-07:00"), | |||
("00000000000000000004.json", "2020-05-05T22:47:31-07:00"), | |||
// Final file is uncommitted by Spark and is in a .tmp subdir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this fix ignores any sub-directory, the test should also contain a testcase of a sub-directory not starting with a dot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need a second test case? The name of the subdirectory does not matter, any and all files in subdirectories are ignored. I only used the name .tmp
to clarify what is going on / better match what spark does.
Co-authored-by: Adam Reeve <[email protected]>
Co-authored-by: Adam Reeve <[email protected]>
Description
Fix delta-rs methods so that delta_log entries in subdirectories, representing pending but uncommited transactions, are ignored.
Related Issue(s)
Closes load_with_datetime
Documentation
As per the bug report only root level delta files should be included:
According to the delta protocol:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
"Delta files are stored as JSON in a directory at the root of the table named _delta_log, and together with checkpoints make up the log of all changes that have occurred to a table."