[Feature Request][Spark] DeltaLake Concurrent Writes are Inefficient #5385

@aravinds03

Feature request

Which Delta project/connector is this regarding?

  • [x] Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

The Delta writer in Spark is inefficient when multiple Spark clusters write to the same table in append-only mode.

Motivation

Running multiple EMR Spark clusters to process big data is a common practice. When a Delta Lake (DL) table is written from multiple clusters at the same time, write performance should not degrade, especially when the table is in append-only mode.

Further details

In a concurrent-write scenario, when DL detects a conflict, it downloads all the data files committed by the other cluster, refreshes the current table snapshot, and then commits the new data files generated by the current cluster. The problem gets worse as the number of concurrent writers increases, because writers spend most of their time downloading data written by others, and the streaming micro-batches do not progress fast enough.
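To illustrate why this scales badly, here is a toy model in plain Python. It is not Delta Lake code: `commits_refreshed_in_round` is a hypothetical helper, and the model assumes every writer starts from the same snapshot, commits one batch, and that commits are serialized so each later writer must first catch up on the commits that landed before it.

```python
def commits_refreshed_in_round(num_writers: int) -> int:
    """Count foreign commits that writers must catch up on in one round.

    Assumptions (toy model only): all writers read table version 0,
    each commits exactly one batch, and commits land one after another.
    """
    table_version = 0      # latest committed version of the table
    total_refreshed = 0    # foreign commits processed across all writers

    for _writer in range(num_writers):
        read_version = 0   # snapshot this writer prepared its batch against
        # Conflict: the table moved on since the snapshot was taken, so the
        # writer refreshes past every commit it has not seen before retrying.
        while read_version != table_version:
            total_refreshed += table_version - read_version
            read_version = table_version
        table_version += 1  # the commit succeeds

    return total_refreshed


for n in (1, 2, 4, 8):
    print(n, "writers ->", commits_refreshed_in_round(n), "foreign commits refreshed")
```

Under these assumptions the refresh work grows as n(n-1)/2 per round, so doubling the number of writers roughly quadruples the catch-up work.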

While I understand why the table snapshot is refreshed with other writers' commits (for example, to preserve ACID guarantees), this is not required when the table is in append-only mode. In that case, we only need to check that other writers have not changed the schema; there is no need to download the actual data.
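A minimal sketch of the kind of check I have in mind, in plain Python. It is not an existing Delta Lake API: `append_conflict_is_benign` is a hypothetical name, and it assumes each intervening commit is available as the newline-delimited JSON text of its log entry, with `add`, `remove`, `metaData` (including `schemaString`), and `protocol` actions as described in the Delta protocol. It only inspects the log entries and never touches data files.

```python
import json


def append_conflict_is_benign(intervening_commits, expected_schema):
    """Return True if the other writers only appended with an unchanged schema.

    intervening_commits: list of strings, one per commit, each holding
        newline-delimited JSON actions (assumed input format).
    expected_schema: the schemaString the current writer prepared against.
    """
    for commit_text in intervening_commits:
        for line in commit_text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            # Removed files or a protocol change mean this is not a plain append.
            if "remove" in action or "protocol" in action:
                return False
            meta = action.get("metaData")
            # A metadata action is only acceptable if the schema is unchanged.
            if meta is not None and meta.get("schemaString") != expected_schema:
                return False
    return True


schema = '{"type":"struct","fields":[]}'
plain_append = '{"commitInfo":{"operation":"WRITE"}}\n{"add":{"path":"part-0001.parquet"}}'
schema_change = json.dumps({"metaData": {"schemaString": "something else"}})

print(append_conflict_is_benign([plain_append], schema))                 # True
print(append_conflict_is_benign([plain_append, schema_change], schema))  # False
```

If the check passes, the writer could retry its commit at the next version without refreshing anything else; if it fails, it would fall back to the existing conflict handling.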

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • [x] No. I cannot contribute this feature at this time.

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
