Description
Feature request
Which Delta project/connector is this regarding?
- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
Overview
The Delta writer in Spark is inefficient when multiple Spark clusters write to the same table concurrently, even when the table is in append-only mode.
Motivation
Running multiple EMR Spark clusters to process big data is a common practice. When a Delta Lake table is written from multiple clusters at the same time, write performance should match that of a single writer, especially when the table is in append-only mode.
Further details
In a concurrent write scenario, when Delta Lake sees a conflict, it downloads all the data files committed by the other cluster, refreshes the current table snapshot, and then commits the new data files generated by the current cluster. The issue gets worse as the number of concurrent writers increases: most of the time, each writer is simply downloading data written by the others, and the streaming micro-batches do not move fast enough.
While I understand the reason for refreshing the table snapshot with other writers' changes (ACID guarantees), this should not be required when the table is in append-only mode. In that case, we only need to check that the schema has not been changed by other writers; there is no need to download the actual data.
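To make the request concrete, here is a minimal sketch of the kind of check I have in mind. It is not Delta Lake's actual conflict checker. It assumes the winning commit is available as newline-delimited JSON actions (the `add`, `metaData`, `protocol`, and `commitInfo` action names come from the Delta transaction log format), and the function name `can_blind_append_retry` and the set of blocking actions are hypothetical choices for illustration.

```python
import json
from typing import Iterable

# Hypothetical rule: for an append-only writer, only actions that change the
# table's schema or protocol force a full refresh. "add" actions from other
# writers are ignored, so their data files are never opened.
BLOCKING_ACTIONS = {"metaData", "protocol"}


def can_blind_append_retry(winning_commit_lines: Iterable[str]) -> bool:
    """Return True if a pending append could be retried against the new
    table version by inspecting only the winning commit's log entries."""
    for line in winning_commit_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the commit file
        action = json.loads(line)  # one action per line, keyed by its type
        if not BLOCKING_ACTIONS.isdisjoint(action):
            return False  # schema or protocol changed: fall back to a full check
    return True
```

With a rule like this, a writer that loses the commit race would read only the small JSON commit entries of the winner and retry immediately unless a `metaData` or `protocol` action appears.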
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- [ ] Yes. I can contribute this feature independently.
- [ ] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- [x] No. I cannot contribute this feature at this time.