Skip to content

Conversation

@itamarst
Copy link

@itamarst itamarst commented Jul 7, 2025

Previously the files would be loaded on a conflict even if the user requested they not be loaded, now they're never loaded.

This is a optimization, reducing the performance for loading unneeded files when in append mode.

For my reproducer benchmark (see #3528), this speeds things up by about 30%.

IMPORTANT: It is not clear to me if this is semantically correct! On the one hand, if you're doing append only, I don't see why you'd need to load the files, this shouldn't be any different than e.g. restarting from scratch by reopening the database. On the other hand, maybe this code path is used in other situations beyond appends?

Are there other tests I should write?

Related Issue(s)

Fixes #3528

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jul 7, 2025
@itamarst itamarst marked this pull request as ready for review July 7, 2025 18:33
@codecov
Copy link

codecov bot commented Jul 8, 2025

Codecov Report

Attention: Patch coverage is 82.53968% with 11 lines in your changes missing coverage. Please review.

Project coverage is 74.48%. Comparing base (0372bff) to head (53ba1e8).

Files with missing lines Patch % Lines
crates/core/src/kernel/snapshot/mod.rs 86.04% 5 Missing and 1 partial ⚠️
python/src/lib.rs 0.00% 3 Missing ⚠️
crates/core/src/lib.rs 60.00% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3580      +/-   ##
==========================================
- Coverage   74.51%   74.48%   -0.03%     
==========================================
  Files         146      146              
  Lines       44875    44898      +23     
  Branches    44875    44898      +23     
==========================================
+ Hits        33437    33441       +4     
- Misses       9245     9250       +5     
- Partials     2193     2207      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ion-elgreco
Copy link
Collaborator

@itamarst we can put this to the test ofcourse whether this is semantically is correct. Add some tests that does concurrent overwrites with appends and then concurrent schema evolutions. It should reject the append write

@roeap feel free to correct me if my above statement is wrong :)

@itamarst
Copy link
Author

itamarst commented Jul 8, 2025

It's difficult for concurrent tests to be a thorough test of correctness, since there are many possible interleaving execution paths.

@rtyler
Copy link
Member

rtyler commented Jul 8, 2025

@itamarst I think I should be able to come up with a test which doesn't necessarily require the concurrency that your linked issue has. If you're okay with this pull request sitting open for a few days, I will summon the brainpower and come up with something 😄

@itamarst
Copy link
Author

itamarst commented Jul 8, 2025

Of course, appreciate your feedback!

@rtyler rtyler self-assigned this Jul 10, 2025
@rtyler rtyler marked this pull request as draft July 10, 2025 04:26
Previously the files would be loaded on a conflict, now they're never loaded.

Signed-off-by: Itamar Turner-Trauring <[email protected]>
@rtyler rtyler force-pushed the append-should-not-conflict branch from 7fdd69f to 7a464f4 Compare July 14, 2025 19:32
@rtyler rtyler marked this pull request as ready for review July 14, 2025 19:32
@rtyler rtyler enabled auto-merge July 14, 2025 19:32
…e read

This makes it easier to understand whether or not the function has
actually _read_ the ReplayStream

Signed-off-by: R. Tyler Croy <[email protected]>
@rtyler rtyler force-pushed the append-should-not-conflict branch from 7a464f4 to 53ba1e8 Compare July 14, 2025 19:41
@rtyler rtyler requested a review from wjones127 as a code owner July 14, 2025 19:41
@github-actions github-actions bot added the binding/python Issues for the Python package label Jul 14, 2025
@roeap
Copy link
Collaborator

roeap commented Jul 14, 2025

@itamarst @rtyler - Currently I am working on take 3 of getting log replay via kernel in, which touches a lot of the same fields as this PR.

More generally speaking, the require_files flag as it is used today should not exist 😆. Or maybe the flqg should exist, but not the way we treat it. I.e. a snapshot without files is the LazySnapshot, so rather than emulating the lazy snapshot via optional files on the eager one, we would ideally just use the lazy one ...

However DeltaTableState is not quite ready yet to handle both snapshots. And while there is a plan, I do not have a timeline yet. All I can say is that the third time is teh most promising attempt yet 😆.

Comment on lines +467 to +470
if self.files.is_none() {
self.process_visitors(visitors)?;
return Ok(read_data);
}
Copy link
Collaborator

@roeap roeap Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may actually break things in unexpected ways. Specifically process_visitors does nothing without doing a replay. Its main use case right now is idempotent writes (Txn actions ...).

However also not sure how this would work with require_files right now, we may just have an existing bug ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roeap anything is possible so sure it might break things but we have no tests to prove otherwise 🤷

@itamarst
Copy link
Author

Using LazySnapshot when you actually want laziness does sound better, yeah 😁

So the main goal here from my perspective is just "allow scaling appends". So a different approach is to maybe... reopen the database on a conflict in "append" mode, so the code sticks to the fast path? Which current abstraction layers may not allow or may just not make sense. I'll take a look later. And I guess worst case there's the hopefully successful refactor.

@rtyler rtyler marked this pull request as draft July 14, 2025 23:35
auto-merge was automatically disabled July 14, 2025 23:35

Pull request was converted to draft

@itamarst
Copy link
Author

itamarst commented Jul 24, 2025

So looking at the code, the alternative solution of "reopen the database on conflict" doesn't seem to feasible without a bunch of potentially difficult refactoring. The place you'd want to do it is PreparedCommit but that just has a dyn reference, and not even necessarily to the table. So there's a bunch of abstraction boundaries in the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

without_files=True doesn't apply as soon as version conflicts happen, leading to performance hit

5 participants