[Feature Request][Spark] DeltaLake Concurrent Writes are Inefficient

## Feature request

#### Which Delta project/connector is this regarding?


- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

### Overview

The Delta writer in Spark is inefficient when multiple Spark clusters write to the same table concurrently, even when the table is in append-only mode.


### Motivation



Running multiple EMR Spark clusters to process big data is a common practice. When a Delta Lake table is written from multiple clusters at the same time, write performance should stay the same as with a single writer, especially when the table is in append-only mode.

### Further details



In a concurrent write scenario, when Delta Lake sees a conflict, it downloads all the data files committed by the other cluster, refreshes the current table snapshot, and then commits the new data files generated by the current cluster. The issue gets worse as the number of concurrent writers increases: most of the time, each writer is simply downloading data written by the others, and the streaming micro-batch does not move fast enough.

While I understand the reason for refreshing the table snapshot with changes from other writers (such as ACID guarantees), this is not required when the table is in append-only mode. In that case, we only need to check that the schema written by other writers is unchanged; there is no need to download the actual data.
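To make the request concrete, here is a rough sketch of the check I have in mind. It is plain Python and not Delta Lake code: `Commit`, `ConflictError`, and `resolve_append_conflict` are hypothetical names used only for illustration, and the schema is modeled as a simple tuple of (column, type) pairs. The idea is that an append-only writer that lost the commit race only compares schemas from the winning commits and then retries at the next version, without touching any data files.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# Hypothetical schema representation: ordered (column name, type) pairs.
Schema = Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class Commit:
    """Hypothetical summary of a commit that another writer landed first."""
    version: int
    schema: Schema


class ConflictError(Exception):
    """Raised when the append cannot be retried safely."""


def resolve_append_conflict(read_version: int,
                            my_schema: Schema,
                            winning_commits: Iterable[Commit]) -> int:
    """Return the version at which an append-only writer can retry.

    Only commit metadata (the schema) is inspected; no data files
    written by other clusters are read.
    """
    next_version = read_version + 1
    for commit in sorted(winning_commits, key=lambda c: c.version):
        if commit.version <= read_version:
            continue  # already part of the snapshot this writer read
        if commit.schema != my_schema:
            raise ConflictError(
                f"schema changed in version {commit.version}; cannot append blindly"
            )
        next_version = commit.version + 1
    return next_version


if __name__ == "__main__":
    schema = (("id", "long"), ("value", "string"))
    winners = [Commit(6, schema), Commit(7, schema)]
    # The writer read version 5; two other appends landed in the meantime.
    print(resolve_append_conflict(5, schema, winners))
```

In this sketch the cost of a retry depends only on the number of intervening commits, not on the volume of data they wrote, which is the behavior I would expect for append-only tables.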


### Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

- [ ] Yes. I can contribute this feature independently.
- [ ] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- [x] No. I cannot contribute this feature at this time.

[Feature Request][Spark] DeltaLake Concurrent Writes are Inefficient #5385

Feature request

Which Delta project/connector is this regarding?

Overview

Motivation

Further details

Willingness to contribute

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Feature Request][Spark] DeltaLake Concurrent Writes are Inefficient #5385

Description

Feature request

Which Delta project/connector is this regarding?

Overview

Motivation

Further details

Willingness to contribute

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions