@zen-zap zen-zap commented Oct 11, 2025

Description

Changed add_actions_table in crates/core/src/table/state.rs to return a BoxStream, so callers no longer have to load every batch into memory at once. Added BoxStreamToReaderAdapter in python/src/reader.rs to convert a BoxStream into a RecordBatchReader. Modified the Python bindings and tests to call .read_all() where they need all batches at once.

Related Issue(s)

Some tests fail I think. I got this:
438 passed, 4 skipped, 2 xfailed, 5 xpassed, 22 warnings in 14.72s

rtyler and others added 30 commits May 17, 2025 14:49
This makes things a little cleaner when reviewing this code and
preparing for refactors

Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
Signed-off-by: Ion Koutsouris <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Add convenient methods to set table description and name through the Python API.

Signed-off-by: Florian VALEYE <[email protected]>
@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Oct 11, 2025
# Description
Now if the user has permissions to do writes (and actually does a write)
we will request a write permission first instead of just read only
permissions. When this fails we will go back to the normal path of
requesting a read-only cred.
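
The fallback order described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this change: `Access`, `Credential`, `request_credential` and the `fetch` closure are hypothetical names standing in for whatever the real client exposes.

```rust
/// Hypothetical access levels; the names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    ReadWrite,
    ReadOnly,
}

/// Hypothetical credential handle returned by the (assumed) credential endpoint.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Credential {
    pub access: Access,
    pub token: String,
}

/// Ask for a read-write credential first when a write is intended.
/// If that request fails, fall back to the read-only request.
pub fn request_credential<F>(wants_write: bool, mut fetch: F) -> Result<Credential, String>
where
    F: FnMut(Access) -> Result<Credential, String>,
{
    if wants_write {
        if let Ok(cred) = fetch(Access::ReadWrite) {
            return Ok(cred);
        }
        // The write request was rejected: continue on the read-only path.
    }
    fetch(Access::ReadOnly)
}
```

Passing the request as a closure keeps the ordering logic testable without a network call; a caller that never writes skips the read-write attempt entirely.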

# Related Issue(s)
None I'm aware of.

# Documentation

https://docs.databricks.com/api/workspace/temporarytablecredentials/generatetemporarytablecredentials

---------

Signed-off-by: Stephen Carman <[email protected]>
codecov bot commented Oct 11, 2025

Codecov Report

❌ Patch coverage is 0% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 24.98%. Comparing base (f1727c9) to head (b05a9ed).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/core/src/table/state.rs | 0.00% | 37 Missing ⚠️ |
| crates/core/src/delta_datafusion/find_files.rs | 0.00% | 5 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (f1727c9) and HEAD (b05a9ed).

HEAD has 3 uploads less than BASE

| Flag | BASE (f1727c9) | HEAD (b05a9ed) |
|---|---|---|
| | 8 | 5 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3836       +/-   ##
===========================================
- Coverage   74.37%   24.98%   -49.39%     
===========================================
  Files         147      120       -27     
  Lines       39670    20270    -19400     
  Branches    39670    20270    -19400     
===========================================
- Hits        29503     5065    -24438     
- Misses       8777    14834     +6057     
+ Partials     1390      371     -1019     


let stream = snapshot.add_actions_table(flatten);

// Collect batches from stream
let batches: Vec<RecordBatch> = rt()
Collaborator

This action already materializes all batches into memory

Author

Sorry, I missed that.

I did make some changes to give BoxStream a 'static lifetime parameter and tried to have it stream instead of collecting it all.
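
To make the reviewer's point concrete, here is a minimal sketch of the difference between collecting and consuming incrementally. It is an assumption-laden analogue: a std `Iterator` stands in for the async stream, and `Batch`, `batches`, `total_rows_collected` and `total_rows_streamed` are hypothetical names, not part of the crate.

```rust
/// Stand-in for an Arrow record batch; only the row count matters here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Batch {
    pub rows: usize,
}

/// Lazily yields `n` batches, one per call to `next()`.
/// A std `Iterator` is used as a synchronous analogue of an async stream.
pub fn batches(n: usize) -> impl Iterator<Item = Batch> {
    (0..n).map(|i| Batch { rows: i + 1 })
}

/// Eager: every batch is alive at once inside the `Vec`.
pub fn total_rows_collected(n: usize) -> usize {
    let all: Vec<Batch> = batches(n).collect();
    all.iter().map(|b| b.rows).sum()
}

/// Incremental: each batch is dropped before the next one is produced.
pub fn total_rows_streamed(n: usize) -> usize {
    let mut total = 0;
    for batch in batches(n) {
        total += batch.rows;
    }
    total
}
```

Both functions return the same total; only the second keeps peak memory at roughly one batch, which is the property a streaming return type is meant to preserve all the way to the caller.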

@zen-zap zen-zap requested a review from ion-elgreco October 11, 2025 15:49
fvaleye and others added 5 commits October 12, 2025 20:26
# Description
Slight change in the find_files by only cloning the string for the path
(from this [PR](#3826))

# Benchmark
- Before optimization:
Time: 668.73 µs
Clones all fields: path, partition_values HashMap, stats strings, etc.

- After optimization:
Time: 366.62 µs
Only clones the path String once
Moves the Add struct instead of deep cloning
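
The before and after can be sketched roughly as follows. This is an illustration of the idea only: `Add` here is a simplified stand-in with assumed fields, and the two `index_by_path_*` functions are hypothetical, not the actual find_files code.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an add action; the fields are illustrative.
#[derive(Debug, Clone)]
pub struct Add {
    pub path: String,
    pub partition_values: HashMap<String, String>,
    pub stats: Option<String>,
}

/// Deep-clones every action: path, partition map and stats are all copied.
pub fn index_by_path_cloning(adds: &[Add]) -> HashMap<String, Add> {
    adds.iter().map(|a| (a.path.clone(), a.clone())).collect()
}

/// Takes ownership instead: only the path `String` is cloned (for the key),
/// and the rest of each `Add` is moved into the map.
pub fn index_by_path_moving(adds: Vec<Add>) -> HashMap<String, Add> {
    adds.into_iter().map(|a| (a.path.clone(), a)).collect()
}
```

The moving variant requires the caller to give up the `Vec<Add>`, which is the trade-off that makes the single clone possible.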

Signed-off-by: Florian Valeye <[email protected]>
This change prevents race conditions where concurrent writer's uncommitted
files could be deleted before the transaction is committed. Now files that
are not referenced in the log but are younger than the retention period
will be protected from deletion during vacuum operations in Full mode.

Added test to verify the behavior of protecting recent uncommitted files.
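
The protection rule can be written down as a small predicate. This is a simplified sketch under stated assumptions: `StoredFile` and `deletable` are hypothetical names, tombstone handling is left out, and the real vacuum logic may apply further conditions.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// A file found in table storage; `age` is the time since it was last modified.
pub struct StoredFile {
    pub path: String,
    pub age: Duration,
}

/// Returns the paths a full vacuum may delete under the rule described above:
/// a file is kept if the log references it, or if it is younger than the
/// retention period (it may belong to a writer that has not committed yet).
pub fn deletable<'a>(
    files: &'a [StoredFile],
    referenced: &HashSet<String>,
    retention: Duration,
) -> Vec<&'a str> {
    files
        .iter()
        .filter(|f| !referenced.contains(&f.path))
        .filter(|f| f.age >= retention)
        .map(|f| f.path.as_str())
        .collect()
}
```

With a seven-day retention, an unreferenced file written an hour ago is kept, while an unreferenced file written ten days ago is returned for deletion.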

Signed-off-by: Manish Sogiyawar <[email protected]>
Signed-off-by: zen-zap <[email protected]>
zen-zap and others added 4 commits October 13, 2025 00:09
# Description
To better understand performance in the `delta-rs` crate, I added
additional tracing to capture more detailed debug-level performance
information.

Python now uses `OpenTelemetry` to collect tracing data emitted from
Rust.
With this change, we gain true end-to-end visibility: Python spans can
serve as parents of Rust spans (and vice versa), ensuring a continuous
trace across both runtimes.

# Related Issue(s)
-  close #3641

# Documentation
- https://docs.rs/tracing/latest/tracing/

---------

Signed-off-by: Florian Valeye <[email protected]>
Co-authored-by: Ion Koutsouris <[email protected]>
#3840)

# Description

This redoes the merge-based benchmark in crates/benchmark, replacing it
with `divan` as a real harness combined with adding a script that can be
used for profiling.

# Related Issue(s)

Closes #3839 

# Documentation

Documentation is included in the updated README

---------

Signed-off-by: Abhi Agarwal <[email protected]>
Comment on lines 381 to 382
let files = self.files.clone();
stream::iter(files)
Contributor

I don't think this PR is safe.

Sure, the 'static lifetime helps avoid the lifetime issues, but at the cost of essentially cloning the table's file state at "a point in time". This means that if you get the record batch stream, then perform operations on the delta table (which may delete a file, say, a compaction), and then try to consume the stream, you could be reading a file that got invalidated.

Someone should double-check my logic, but a 'static lifetime is a bit scary unless you can prove it really does live for the duration of the program.

Contributor

I guess that these are log files, it's pretty rare they get deleted, but it still would get stale table state if another action added some files. Maybe that's the right abstraction on the python side, but definitely not the rust side.
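
The staleness concern above can be reproduced in a few lines. This is a toy sketch, not the crate's types: `Table`, `files_owned` and `stale_read` are hypothetical, and a std iterator stands in for the stream.

```rust
/// Toy table state; `files` stands in for the snapshot's file list.
pub struct Table {
    pub files: Vec<String>,
}

impl Table {
    /// Owning ('static) iterator: clones the file list at call time, so later
    /// changes to the table are not visible through it.
    pub fn files_owned(&self) -> Box<dyn Iterator<Item = String> + 'static> {
        Box::new(self.files.clone().into_iter())
    }
}

/// Creates the iterator, mutates the table, then consumes the iterator.
/// Returns (what the iterator yielded, what the table holds now).
pub fn stale_read() -> (Vec<String>, Vec<String>) {
    let mut table = Table {
        files: vec!["a.parquet".to_string(), "b.parquet".to_string()],
    };
    let iter = table.files_owned();
    // For example, a compaction replaces both files while the iterator is pending.
    table.files = vec!["compacted.parquet".to_string()];
    let seen: Vec<String> = iter.collect();
    (seen, table.files)
}
```

Because the iterator owns its data, the compiler accepts the mutation in the middle, and the consumer ends up with the two old paths while the table lists only the compacted file.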

Author

@zen-zap zen-zap Oct 13, 2025

Maybe I could take a snapshot: get the current version of the delta table and start the stream from that? If I tie it to the version, it should be okay, I think.

Do you have any suggestions?

Contributor

@abhiaagarwal abhiaagarwal Oct 13, 2025

The only real way to do it in a safe way would be to have the BoxStream lifetime be 'a which is tied into &'a self, so the boxstream borrows from self.

pub fn add_actions_table<'a>(
    &'a self,
    flatten: bool,
) -> BoxStream<'a, DeltaResult<arrow::record_batch::RecordBatch>> {
    ...
}
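
As a rough illustration of what the borrowed signature buys, the sketch below uses a std iterator in place of the stream. `Snapshot`, `file_paths`, `replace_files` and `read_then_update` are hypothetical names chosen for this example, not the crate's API.

```rust
pub struct Snapshot {
    files: Vec<String>,
}

impl Snapshot {
    pub fn new(files: Vec<String>) -> Self {
        Snapshot { files }
    }

    /// The returned iterator borrows `self` for `'a`, so nothing is cloned up front.
    pub fn file_paths<'a>(&'a self) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        Box::new(self.files.iter().map(|f| f.as_str()))
    }

    /// Needs exclusive access, which the borrow checker refuses while an
    /// iterator from `file_paths` is still alive.
    pub fn replace_files(&mut self, files: Vec<String>) {
        self.files = files;
    }
}

pub fn read_then_update() -> Vec<String> {
    let mut snapshot = Snapshot::new(vec!["a.parquet".to_string()]);
    let seen: Vec<String> = {
        let paths = snapshot.file_paths();
        // snapshot.replace_files(vec![]); // rejected here (E0502): `snapshot` is still borrowed
        paths.map(String::from).collect()
    };
    // The borrow has ended, so mutation is allowed again.
    snapshot.replace_files(vec!["b.parquet".to_string()]);
    seen
}
```

The cost is ergonomic: a borrowed stream cannot outlive the snapshot, which is awkward to hand across an FFI boundary such as the Python bindings.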

Contributor

Now that I think about it, in theory, the self.files parameter should never change per snapshot so perhaps this is safe, but it's not enforced in the type system. On the python side though, this race condition absolutely does exist, though I'm not sure what the proper semantics are anyways

Author

I think version pinning would help deal with the race condition on the Rust side. The Python side uses this, so it should get fixed as well. Let me know if you have any ideas about this. Thanks!
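
One possible reading of "version pinning" is sketched below, purely as an assumption about what is meant: each version's state is immutable and shared behind an `Arc`, a commit publishes a new snapshot, and a reader holds on to the version it started from. `VersionedSnapshot`, `VersionedTable` and `PinnedFiles` are hypothetical names, and a std iterator again stands in for the stream.

```rust
use std::sync::Arc;

/// Immutable state of the table at one version.
pub struct VersionedSnapshot {
    pub version: u64,
    pub files: Vec<String>,
}

pub struct VersionedTable {
    current: Arc<VersionedSnapshot>,
}

impl VersionedTable {
    pub fn new(files: Vec<String>) -> Self {
        VersionedTable {
            current: Arc::new(VersionedSnapshot { version: 0, files }),
        }
    }

    /// Commits never mutate an existing snapshot; they publish a new one.
    pub fn commit(&mut self, files: Vec<String>) {
        let next = self.current.version + 1;
        self.current = Arc::new(VersionedSnapshot { version: next, files });
    }

    pub fn version(&self) -> u64 {
        self.current.version
    }

    /// Pins the current snapshot. Cloning the `Arc` copies a pointer, not the file list.
    pub fn pin(&self) -> PinnedFiles {
        PinnedFiles { snapshot: Arc::clone(&self.current), next: 0 }
    }
}

/// Owning iterator over the files of one pinned version.
pub struct PinnedFiles {
    snapshot: Arc<VersionedSnapshot>,
    next: usize,
}

impl PinnedFiles {
    pub fn version(&self) -> u64 {
        self.snapshot.version
    }
}

impl Iterator for PinnedFiles {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let item = self.snapshot.files.get(self.next).cloned();
        self.next += 1;
        item
    }
}
```

A reader created before a commit keeps seeing version 0 and can compare its `version()` with the table's to detect that it is behind. Note that this only keeps the metadata consistent; it does not by itself stop the underlying files from being removed by another operation.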

@zen-zap zen-zap marked this pull request as draft October 14, 2025 08:23
@zen-zap zen-zap force-pushed the main branch 2 times, most recently from 8bfa9a0 to a09edaa Compare October 15, 2025 06:57
@github-actions github-actions bot added ci delta-inspect documentation Improvements or additions to documentation proofs labels Oct 15, 2025
@zen-zap zen-zap closed this Oct 15, 2025
Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate ci delta-inspect documentation Improvements or additions to documentation proofs

Development

Successfully merging this pull request may close these issues.

Move add_actions_table onto stream