@skoschik
Contributor

Feature or Bugfix

Feature

Detail

This PR introduces wr.s3.to_deltalake_streaming, a helper that allows users to write large datasets to Delta Lake on S3 in a single atomic commit.

The existing wr.s3.to_deltalake writes one Delta version per call.
When data must be processed or loaded in chunks (e.g., due to memory limits), each chunk needs its own call, which creates one version per chunk and complicates time travel.

The new to_deltalake_streaming function:

  • Accepts an iterable or generator of Pandas DataFrames.
  • Streams data through Arrow’s RecordBatchReader into one Delta transaction.
  • Supports partitioning, schema overwrite/merge, and S3/DynamoDB locking.
  • Produces exactly one Delta version per run, even for multi-chunk writes.

Example

import awswrangler as wr
import pandas as pd


def generate_data():
    # Yield ten chunks of 1,000 rows each instead of building one large DataFrame.
    for i in range(10):
        yield pd.DataFrame({"id": range(i * 1000, (i + 1) * 1000), "value": [i] * 1000})


wr.s3.to_deltalake_streaming(
    dfs=generate_data(),
    path="s3://bucket/delta-table/",
    mode="overwrite",
    s3_allow_unsafe_rename=True,
    # lock_dynamodb_table="delta-locks",  # optional
)

Result: one Delta commit per rebuild, instead of one version per chunk.

Relates

Enhances Delta Lake integration for S3 single-commit writes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: f02f895
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: f02f895
  • Result: FAILED
  • Build Logs (available for 30 days)


@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: ec1bc7e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 530620f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: ec1bc7e
  • Result: FAILED
  • Build Logs (available for 30 days)


@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 530620f
  • Result: FAILED
  • Build Logs (available for 30 days)


@kukushking
Contributor

Thanks @skoschik! There are some minor linting errors, and it would be great to have a test case for the streaming write; otherwise, looks great!
