Spark DataSource V2 read and write benchmarks? #13955
geserdugarov started this conversation in General Discussions
Replies: 1 comment
-
@leesf tagging you in case you have some old context to add/capture here.
-
Integration of Spark DataSource V2 was done in RFC-38. However, there were multiple issues with advertising a Hudi table as V2 without actually implementing certain APIs, and with using a custom relation rule to fall back to the V1 API. As a result, to address performance regressions, the current implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a `V1Table` instead of `HoodieInternalV2Table`.
Performance issues were not revealed in the initial PR because such changes were not properly benchmarked. Therefore, to restart this work, it is important to decide first how to benchmark the changes. Among other things, DataSource V1 allows custom logic, such as the use of Hudi indexes, which is not straightforward to implement in DataSource V2, so cases like this need to be covered by the benchmarking scenarios.
If anybody has already gone down this path, please share your insights. Any suggestions about scenarios that should be considered are also welcome.
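As a starting point for discussion, here is a minimal sketch of a timing harness that could compare the same scenario across the V1 and V2 paths. It is not Hudi's or Spark's benchmarking tooling: the function names `time_runs` and `compare` are hypothetical, and the two workloads are placeholders that only stand in for real read or write actions.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(action: Callable[[], object], warmup: int = 2, runs: int = 5) -> List[float]:
    """Call `action` `warmup` times untimed, then return wall-clock seconds
    for each of `runs` timed calls."""
    for _ in range(warmup):
        action()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return samples


def compare(
    scenarios: Dict[str, Callable[[], object]], warmup: int = 2, runs: int = 5
) -> Dict[str, Dict[str, float]]:
    """Time every named scenario and summarize it as median/min/max seconds."""
    report = {}
    for name, action in scenarios.items():
        samples = time_runs(action, warmup, runs)
        report[name] = {
            "median_s": statistics.median(samples),
            "min_s": min(samples),
            "max_s": max(samples),
        }
    return report


if __name__ == "__main__":
    # Placeholder workloads (assumption): in a real benchmark each callable
    # would trigger a Spark action against a Hudi table through the V1 or
    # the V2 code path.
    scenarios = {
        "v1_read_placeholder": lambda: sum(range(200_000)),
        "v2_read_placeholder": lambda: sum(range(100_000)),
    }
    for name, stats in compare(scenarios).items():
        print(f"{name}: median={stats['median_s']:.6f}s "
              f"min={stats['min_s']:.6f}s max={stats['max_s']:.6f}s")
```

In an actual run, each callable would presumably wrap something like `spark.read.format("hudi").load(path).count()` with the table resolved through one path or the other, and the scenario list would include cases where V1 benefits from custom logic such as Hudi indexes, for example a lookup filtered on the record key. Reporting the median over several runs after warmup is one simple way to reduce noise from JIT compilation and caching.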