-
Couldn't load subscription status.
- Fork 537
chore: modified add_actions_table to stream #3836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…d since 0.22.4 Signed-off-by: R. Tyler Croy <[email protected]>
This makes things a little cleaner when reviewing this code and preparing for refactors Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Florian VALEYE <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Add convenient methods to set table description and name through the Python API. Signed-off-by: Florian VALEYE <[email protected]>
…ation in Rust Signed-off-by: Florian VALEYE <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
# Description Now if the user has permissions to do writes (and actually does a write) we will request a write permission first instead of just read only permissions. When this fails we will go back to the normal path of requesting a read-only cred. # Related Issue(s) None I'm aware of. # Documentation https://docs.databricks.com/api/workspace/temporarytablecredentials/generatetemporarytablecredentials --------- Signed-off-by: Stephen Carman <[email protected]>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3836 +/- ##
===========================================
- Coverage 74.37% 24.98% -49.39%
===========================================
Files 147 120 -27
Lines 39670 20270 -19400
Branches 39670 20270 -19400
===========================================
- Hits 29503 5065 -24438
- Misses 8777 14834 +6057
+ Partials 1390 371 -1019 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
python/src/lib.rs
Outdated
| let stream = snapshot.add_actions_table(flatten); | ||
|
|
||
| // Collect batches from stream | ||
| let batches: Vec<RecordBatch> = rt() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This action already materializes all batches into memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed that.
I did make some changes to give BoxStream a 'static lifetime parameter and tried to have it stream instead of collecting it all.
… reading Signed-off-by: Florian Valeye <[email protected]>
Signed-off-by: Florian Valeye <[email protected]>
# Description Slight change in the find_files by only cloning the string for the path (from this [PR](#3826)) # Benchmark - Before optimization: Time: 668.73 µs Clones all fields: path, partition_values HashMap, stats strings, etc. - After optimization: Time: 366.62 µs Only clones the path String once Moves the Add struct instead of deep cloning Signed-off-by: Florian Valeye <[email protected]>
This change prevents race conditions where concurrent writer's uncommitted files could be deleted before the transaction is committed. Now files that are not referenced in the log but are younger than the retention period will be protected from deletion during vacuum operations in Full mode. Added test to verify the behavior of protecting recent uncommitted files. Signed-off-by: Manish Sogiyawar <[email protected]> Signed-off-by: zen-zap <[email protected]>
Signed-off-by: zen-zap <[email protected]>
Signed-off-by: zen-zap <[email protected]>
…hon bindings Signed-off-by: zen-zap <[email protected]>
# Description To better understand performance in the `delta-rs` crate, I added additional tracing to capture more detailed debug-level performance information. Python now uses `OpenTelemetry` to collect tracing data emitted from Rust. With this change, we gain true end-to-end visibility: Python spans can serve as parents of Rust spans (and vice versa), ensuring a continuous trace across both runtimes. # Related Issue(s) - close #3641 # Documentation - https://docs.rs/tracing/latest/tracing/ --------- Signed-off-by: Florian Valeye <[email protected]> Co-authored-by: Ion Koutsouris <[email protected]>
#3840) # Description This redoes the merge-based benchmark in crates/benchmark, replacing it with `divan` as a real harness combined with adding a script that can be used for profiling. # Related Issue(s) Closes #3839 # Documentation Documentation is included in the updated README --------- Signed-off-by: Abhi Agarwal <[email protected]>
# Description The description of the main changes of your pull request # Related Issue(s) <!--- For example: - closes #106 ---> # Documentation <!--- Share links to useful documentation --->
crates/core/src/table/state.rs
Outdated
| let files = self.files.clone(); | ||
| stream::iter(files) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this PR is safe.
Sure, the 'static lifetime helps avoid the lifetime issues, but at the cost of essentially cloning the table's file state at "a point in time". This means that if you get the record batch stream, then perform operations on the delta table (which may delete a file, say a compact) and then you try to consume the stream, you could be reading a file that got invalidated.
Someone should double-check my logic, but a 'static lifetime is a bit scary unless you can prove it really does live for the duration of the program.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that these are log files, it's pretty rare they get deleted, but it still would get stale table state if another action added some files. Maybe that's the right abstraction on the python side, but definitely not the rust side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I could take a snapshot like - get the current version of the delta table and start the stream from that? If I tie it to the version, it should be okay I think..
Do you have any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only real way to do it in a safe way would be to have the BoxStream lifetime be 'a which is tied into &'a self, so the boxstream borrows from self.
pub fn add_actions_table<'a>(
&'a self,
flatten: bool,
) -> BoxStream<'a, DeltaResult<arrow::record_batch::RecordBatch>> {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it, in theory, the self.files parameter should never change per snapshot so perhaps this is safe, but it's not enforced in the type system. On the python side though, this race condition absolutely does exist, though I'm not sure what the proper semantics are anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think version pinning would help dealing with the race condition on rust side. The python side uses this so that should get fixed. Let me know if you have any ideas about this. Thanks!
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: zen-zap <[email protected]>
8bfa9a0 to
a09edaa
Compare
Description
Changed
add_actions_tableto return BoxStream incrates/core/src/table/state.rs. This helps by making sure we don't load everything into memory all at once. AddedBoxStreamToReaderAdapterinpython/src/reader.rsto convert the BoxStreams into RecordBatchReader. Modified the python bindings and tests to use.read_all()to fetch them all at once.Related Issue(s)
Some tests fail I think. I got this:
438 passed, 4 skipped, 2 xfailed, 5 xpassed, 22 warnings in 14.72s
add_actions_tableonto stream #3811