@fvaleye
Collaborator

@fvaleye fvaleye commented Oct 9, 2025

Description

Following this PR, I took more time to investigate how to optimize the JSON parsing in get_actions().
I added a benchmark that measures the performance gain by comparing the baseline against the new implementation, which uses Deserializer::from_slice from serde_json.

Benchmark results

Scope: the get_actions() method only
My hardware: MacBook Pro M1
Performance boost: 13-27%, depending on JSON complexity

Performance Results (150 samples, 10s measurement)

  1. Simple Actions (1000 add actions)
    Baseline: ~360-398 µs (2.51-2.77 Melem/s)
    New version: ~311-312 µs (3.20-3.21 Melem/s)
    Improvement: ~17-18% faster

  2. With Stats (1000 actions with stats)
    Baseline: ~940 µs (1.06 Melem/s)
    New version: ~688-717 µs (1.39-1.45 Melem/s)
    Improvement: ~26-27% faster

  3. Full Complexity (1000 complex actions)
    Baseline: ~1.41 ms (708-710 Kelem/s)
    New version: ~1.22 ms (817-819 Kelem/s)
    Improvement: ~13-15% faster
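For readers who want to check the figures above, here is a small sketch of the arithmetic behind them. The helper names `melem_per_sec` and `percent_faster` are made up for illustration and are not part of the benchmark code in this PR; the sketch assumes that "X% faster" means the relative reduction in wall-clock time.

```rust
/// Throughput in millions of elements per second.
/// Elements per microsecond is numerically the same as Melem/s.
fn melem_per_sec(elements: u64, micros: f64) -> f64 {
    elements as f64 / micros
}

/// Relative reduction in run time, in percent
/// (assumed meaning of "faster" in the results above).
fn percent_faster(baseline_micros: f64, new_micros: f64) -> f64 {
    (baseline_micros - new_micros) / baseline_micros * 100.0
}
```

For example, 1000 actions in 940 µs gives about 1.06 Melem/s, and going from 940 µs to 688 µs is a reduction of roughly 26.8%, which lines up with the "With Stats" numbers.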

Changes

  • Leverage the streaming Deserializer::from_slice()
  • Avoid allocating intermediate String objects for each line
  • Pass &bytes::Bytes to avoid atomic ref-counting overhead
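To illustrate the second bullet, here is a std-only sketch (not the code in this PR, and it does not parse JSON) that contrasts copying each line of a commit log into its own String with borrowing sub-slices of the original buffer. The function names are hypothetical.

```rust
/// Baseline-style: validate UTF-8, then copy every non-empty line
/// into a freshly allocated String before a parser would see it.
fn lines_as_strings(bytes: &[u8]) -> Result<Vec<String>, std::str::Utf8Error> {
    let text = std::str::from_utf8(bytes)?;
    Ok(text
        .lines()
        .filter(|line| !line.is_empty())
        .map(str::to_owned) // one heap allocation per line
        .collect())
}

/// Borrowing style: yield sub-slices that point into the original
/// buffer, so no per-line allocation takes place.
fn lines_as_slices<'a>(bytes: &'a [u8]) -> impl Iterator<Item = &'a [u8]> + 'a {
    bytes
        .split(|byte| *byte == b'\n')
        .filter(|line| !line.is_empty())
}
```

With serde_json, the streaming equivalent is `serde_json::Deserializer::from_slice(bytes).into_iter::<T>()`, which reads consecutive JSON values directly from the byte slice instead of from per-line strings.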

Side note: I will remove get_actions_baseline(), but we could keep the benchmark.

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Oct 9, 2025
@fvaleye fvaleye force-pushed the performance/json-parsing branch from 4c40147 to 994ffca Compare October 9, 2025 09:28
codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 79.06977% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.29%. Comparing base (c24396c) to head (04f45e2).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/core/src/operations/load_cdf.rs | 25.00% | 0 Missing and 3 partials ⚠️ |
| crates/core/src/operations/write/mod.rs | 0.00% | 0 Missing and 3 partials ⚠️ |
| crates/core/src/kernel/snapshot/mod.rs | 50.00% | 0 Missing and 1 partial ⚠️ |
| ...es/core/src/kernel/transaction/conflict_checker.rs | 0.00% | 0 Missing and 1 partial ⚠️ |
| crates/core/src/logstore/mod.rs | 96.66% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3830      +/-   ##
==========================================
- Coverage   74.30%   74.29%   -0.01%     
==========================================
  Files         147      147              
  Lines       39701    39695       -6     
  Branches    39701    39695       -6     
==========================================
- Hits        29500    29493       -7     
- Misses       8810     8812       +2     
+ Partials     1391     1390       -1     


@fvaleye fvaleye force-pushed the performance/json-parsing branch from 994ffca to d50077b Compare October 9, 2025 09:44
Comment on lines 567 to 570
  pub async fn get_actions(
      version: i64,
-     commit_log_bytes: bytes::Bytes,
+     commit_log_bytes: &bytes::Bytes,
  ) -> Result<Vec<Action>, DeltaTableError> {
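As a rough illustration of why a borrowed parameter can save reference-count traffic, the sketch below uses `Arc<[u8]>` from the standard library as a stand-in for `bytes::Bytes`, whose clones of a shared buffer likewise adjust an atomic counter. It assumes callers still need the buffer after the call and would therefore have to clone it for a by-value signature; the PR does not show the call sites, and the function names here are hypothetical.

```rust
use std::sync::Arc;

/// By-value signature: the function owns one handle to the buffer.
/// Returns the number of live handles it observes.
fn handles_seen_by_value(buf: Arc<[u8]>) -> usize {
    Arc::strong_count(&buf)
}

/// Borrowing signature: no new handle is created for the call.
fn handles_seen_by_ref(buf: &Arc<[u8]>) -> usize {
    Arc::strong_count(buf)
}
```

A caller that keeps its own handle has to write `handles_seen_by_value(buf.clone())`, which performs an atomic increment on the way in and a decrement when the clone is dropped, whereas `handles_seen_by_ref(&buf)` touches the counter only to read it.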
Contributor

I wonder why this function is async. There's nothing async inside of it. Not your fault as the base function was also async, but probably some legacy tech debt? I can imagine there was a world where this function took an async bytes stream instead of all the bytes.

Contributor

Per git blame, this function was implemented as async two years ago, even though it has never contained any async code.

Collaborator Author

@fvaleye fvaleye Oct 9, 2025

Yes, I kept the function async.
Removing async would be a minor breaking change, as it would also require removing .await from the callers.
Let's see what @roeap and @rtyler think about this!
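To make the breaking-change point concrete, here is a std-only sketch with hypothetical function names (not the real get_actions). An `async fn` with no `.await` inside still returns a future, so every caller has to `.await` it or drive it with an executor; a plain function returns the value directly. The tiny `block_on` below is an illustrative helper that busy-polls and is only suitable for futures that never suspend.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; sufficient for futures that are ready at once.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for illustration: polls the future until it is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(output) = fut.as_mut().poll(&mut cx) {
            return output;
        }
    }
}

/// Async in signature only: the body never awaits anything.
async fn count_actions_async(bytes: &[u8]) -> usize {
    bytes.split(|b| *b == b'\n').filter(|l| !l.is_empty()).count()
}

/// Same body as a plain function: callers drop the `.await`.
fn count_actions_sync(bytes: &[u8]) -> usize {
    bytes.split(|b| *b == b'\n').filter(|l| !l.is_empty()).count()
}
```

Both functions compute the same result, but `count_actions_async(log)` on its own only builds a future. Switching the signature from the first form to the second therefore means touching every call site, which is why it counts as a (minor) breaking change.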

Collaborator

More generally speaking, I see most call sites disappearing in the short term, since log replay now produces record batches that we extract data from (i.e. LogFileView et al.), avoiding copies whenever possible.

The exception is the call in commit_infos.

Since we are passing in Bytes, I see no reason why we should be doing IO in this function, and with that also little reason for it to be async ... maybe in a follow-up we can make it sync.

Collaborator

Also, let's see when we have fully kernelized conflict resolution. There might be a few surprises lurking 😆.

Collaborator Author

> Also, let's see when we have fully kernelized conflict resolution. There might be a few surprises lurking 😆.

Yay!
Let's keep it like this for now and make it sync later.
I will create an issue for tracking this need.

@fvaleye fvaleye merged commit 08ad211 into delta-io:main Oct 11, 2025
26 checks passed
