Spark DataSource V2 read and write benchmarks? #13955

geserdugarov · 2025-09-22T03:31:42Z

Spark DataSource V2 was integrated in RFC-38. However, there were multiple issues with advertising a Hudi table as V2 without actually implementing certain APIs, and with using a custom relation rule to fall back to the V1 API. As a result, the current implementation of HoodieCatalog and Spark3DefaultSource returns a V1Table instead of HoodieInternalV2Table in order to address performance regressions.

Performance issues were not revealed in the initial PR because there was no proper benchmarking for such changes. Therefore, to restart this work, it is important to decide first how to benchmark the changes. Among other things, DataSource V1 allows custom logic, such as the use of Hudi indexes, which is not straightforward to implement in DataSource V2, so the benchmarking scenarios need to cover cases like this.
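
As a possible starting point for that discussion, here is a minimal sketch of a timing harness that compares named scenarios using warm-up runs followed by measured runs. It is only an illustration, not an existing Hudi or Spark benchmark utility: `time_runs`, `compare`, and the placeholder workloads are hypothetical names, and in a real setup each workload would be replaced by an actual Spark action against a Hudi table (for example a read through the V1 path versus a read through the V2 path, with and without index-based lookups).

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(action: Callable[[], object], warmup: int = 2, runs: int = 5) -> List[float]:
    """Call `action` `warmup` times untimed, then return wall-clock seconds for `runs` timed calls."""
    for _ in range(warmup):
        action()  # warm-up: lets caches and JIT-like effects settle before measuring
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


def compare(
    scenarios: Dict[str, Callable[[], object]], warmup: int = 2, runs: int = 5
) -> Dict[str, Dict[str, float]]:
    """Time every named scenario and summarise it by median and minimum seconds."""
    report = {}
    for name, action in scenarios.items():
        timings = time_runs(action, warmup, runs)
        report[name] = {
            "median_s": statistics.median(timings),
            "min_s": min(timings),
        }
    return report


if __name__ == "__main__":
    # Hypothetical stand-ins for real workloads. In an actual benchmark each
    # lambda would trigger a Spark job, e.g. a full scan or a filtered read
    # that can (V1) or cannot (V2) use a Hudi index.
    scenarios = {
        "v1_read_placeholder": lambda: sum(range(200_000)),
        "v2_read_placeholder": lambda: sum(range(400_000)),
    }
    for name, stats in compare(scenarios).items():
        print(f"{name}: median={stats['median_s']:.6f}s min={stats['min_s']:.6f}s")
```

Reporting the median alongside the minimum is a deliberate choice in this sketch: the median is less sensitive to a single slow run, while the minimum hints at the best-case cost of each path.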

If anybody has already gone down this path, please share your insights. Any suggestions about scenarios that should be considered are also welcome.

vinothchandar · 2025-09-24T00:24:42Z

Collaborator

@leesf tagging you in case you have some old context to add/capture here.
