fix: preserve column names with spaces in wr.redshift.copy() by hirenkumar-n-dholariya · Pull Request #3298 · aws/aws-sdk-pandas

hirenkumar-n-dholariya · 2026-04-10T18:44:27Z

Problem

wr.redshift.copy() silently renames columns with spaces (e.g. "my col" → "my_col")
because the internal s3.to_parquet call defaults to pyarrow flavor='spark',
which sanitizes column names.
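
The renaming can be reproduced without touching S3 or Redshift. The sketch below is a stdlib-only approximation of the Spark-flavor sanitization rule. The character class is an assumption about which characters pyarrow treats as Spark-incompatible, and `spark_sanitize` and `renamed_columns` are hypothetical helper names, not part of pyarrow or awswrangler.

```python
import re
from typing import Dict, Iterable

# Assumption: characters replaced with "_" when the Spark flavor is active.
_SPARK_DISALLOWED = re.compile(r"[ ,;{}()\n\t=]")


def spark_sanitize(name: str) -> str:
    """Approximate the Spark-flavor renaming of a single column name."""
    return _SPARK_DISALLOWED.sub("_", name)


def renamed_columns(columns: Iterable[str]) -> Dict[str, str]:
    """Map each column that would change to its sanitized name."""
    return {c: spark_sanitize(c) for c in columns if spark_sanitize(c) != c}


print(renamed_columns(["id", "my col", "price (usd)"]))
# {'my col': 'my_col', 'price (usd)': 'price__usd_'}
```

Running a check like this before calling copy() shows which DataFrame columns would no longer match the Redshift table schema.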

Fix

Explicitly pass pyarrow_additional_kwargs={"flavor": None} in the internal
s3.to_parquet call to preserve original column names.

Fixes #3293

kukushking · 2026-04-10T18:47:15Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: fdccae4
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

kukushking · 2026-04-10T19:02:54Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: fdccae4
Result: SUCCEEDED
Build Logs (available for 30 days)

hirenkumar-dholariya · 2026-04-10T21:36:08Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: fdccae4
Result: FAILED
Build Logs (available for 30 days)

@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback!
The CI failure is unrelated to this fix. The GitHubDistributedCodeBuild failure is caused by a pre-existing incompatibility between modin==0.37.1 and pandas==3.0.1 — pandas.read_gbq was removed in pandas 3.x, which causes an AttributeError when loading modin.

The GitHubCodeBuild (non-distributed) pipeline passed successfully.
This issue exists independently in the main branch and is not introduced by this PR.

hirenkumar-n-dholariya · 2026-04-10T22:00:32Z

@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback! The CI failure is unrelated to this fix. The GitHubDistributedCodeBuild failure is caused by a pre-existing incompatibility between modin==0.37.1 and pandas==3.0.1 — pandas.read_gbq was removed in pandas 3.x, which causes an AttributeError when loading modin.

The GitHubCodeBuild (non-distributed) pipeline passed successfully. This issue exists independently in the main branch and is not introduced by this PR.

hirenkumar-n-dholariya · 2026-04-27T20:49:43Z

@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback! The CI failure is unrelated to this fix. The GitHubDistributedCodeBuild failure is caused by a pre-existing incompatibility between modin==0.37.1 and pandas==3.0.1 — pandas.read_gbq was removed in pandas 3.x, which causes an AttributeError when loading modin.
The GitHubCodeBuild (non-distributed) pipeline passed successfully. This issue exists independently in the main branch and is not introduced by this PR.

Hi @kukushking,
I hope you are doing well. Could you please take a look at the comments above and share your feedback or approval on the PR? Thank you in advance for your time.

kukushking · 2026-05-04T17:14:18Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: 857326e
Result: FAILED
Build Logs (available for 30 days)

hirenkumar-n-dholariya · 2026-05-04T17:15:43Z

@kukushking Could you please take a look at this PR when you get a chance?

Quick summary:

The fix passes flavor=None to the internal s3.to_parquet call to preserve column names with spaces in wr.redshift.copy()
The GitHubCodeBuild pipeline passed
The GitHubDistributedCodeBuild failure is a pre-existing issue caused by an incompatibility between modin==0.37.1 and pandas==3.0.1 (pandas.read_gbq was removed in pandas 3.x), so it is unrelated to this fix

Would appreciate your review!

kukushking · 2026-05-04T17:27:14Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
Commit ID: 857326e
Result: SUCCEEDED
Build Logs (available for 30 days)

kukushking · 2026-05-08T14:43:22Z

Hi @hirenkumar-n-dholariya, yes, the failure you are referring to is pre-existing and nothing to worry about.

Regarding your change: this is a breaking change to the default behavior, and any current Redshift user would be impacted. We will consider it for the next major version.

hirenkumar-n-dholariya · 2026-05-08T16:15:36Z

@kukushking Thank you for the feedback! That's a fair point about the breaking change concern.

Would it make sense to make the behavior configurable via an optional parameter, so existing users are not impacted by default?

For example:

def copy(
    df,
    # ... existing parameters unchanged ...
    sanitize_column_names: bool = True,  # default preserves backward compatibility
):
    ...

This way:

Existing users are unaffected (default = True keeps current behavior)
Users who need to preserve column names with spaces can opt in with sanitize_column_names=False

Happy to implement this if it sounds like a good direction.
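
As a minimal sketch of the proposal above, the flag could be translated into pyarrow keyword arguments before the internal write. The helper name `pyarrow_kwargs_for` is hypothetical and is not taken from the awswrangler codebase. It assumes, as the PR description states, that an explicit flavor=None is what disables the renaming.

```python
from typing import Any, Dict


def pyarrow_kwargs_for(sanitize_column_names: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: translate the proposed flag into pyarrow kwargs."""
    if sanitize_column_names:
        # Pass nothing extra so the writer keeps its current default flavor.
        return {}
    # An explicit None switches the Spark flavor, and its renaming, off.
    return {"flavor": None}


print(pyarrow_kwargs_for())       # {}
print(pyarrow_kwargs_for(False))  # {'flavor': None}
```

Because the default branch adds no arguments at all, callers who never pass the flag would see no change in behavior.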

kukushking · 2026-05-09T00:10:30Z

@hirenkumar-n-dholariya yes, that would make sense! Please also consider adding a test case for the new behavior. Thank you!

Problem: wr.redshift.copy() internally calls s3.to_parquet(), which defaults to pyarrow flavor='spark'. This causes column names with spaces to be silently renamed (e.g. "my col" → "my_col"), leading to a mismatch between the DataFrame schema and the Redshift table schema.

Solution: Add an optional sanitize_column_names parameter (default=True) to wr.redshift.copy() that controls whether pyarrow sanitizes column names.
- sanitize_column_names=True (default): preserves existing behavior, column names are sanitized for backward compatibility.
- sanitize_column_names=False: passes flavor=None to the internal s3.to_parquet() call, preserving original column names including spaces.

This is a non-breaking change: existing users are unaffected since the default value maintains the current behavior.

Changes:
- Added sanitize_column_names: bool = True parameter to copy()
- Updated pyarrow_additional_kwargs in s3.to_parquet() call accordingly
- Added docstring for the new parameter
- Added test case for sanitize_column_names=False behavior

Fixes aws#3293

test: add test for sanitize_column_names=False in wr.redshift.copy()

style: fix ruff formatting - remove trailing whitespace

style: fix ruff formatting - remove trailing whitespace in _write.py

style: fix ruff formatting in test_redshift.py

fix: add required blank line in docstring for ruff D410/D411

fix: remove trailing whitespace in sanitize_column_names docstring

hirenkumar-n-dholariya · 2026-05-26T22:32:36Z

@kukushking Both files are now formatted correctly.
The sanitize_column_names parameter has been added with default=True for backward compatibility, a test case has been added, and the ruff formatting is fixed.

Could you please review and share your approval or feedback so we can proceed with the merge?

kukushking

Taking a second look at the current API, I think it would actually be better to expose pyarrow_additional_kwargs directly instead of adding another flag (sanitize_column_names) as proposed in this PR. Since s3_additional_kwargs is already exposed, we would be following the same pattern. This gives users an escape hatch for any pyarrow option and gives you the tools to cover your use case.

Let me know what you think.
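
One way to picture the pass-through approach is shown below. It is a sketch under assumptions, not the library's actual code: caller-supplied kwargs win, and the 'spark' default is applied only when the caller did not mention flavor at all. `resolve_pyarrow_kwargs` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def resolve_pyarrow_kwargs(user_kwargs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical merge of caller kwargs with a default Spark flavor."""
    resolved = dict(user_kwargs or {})  # copy so the caller's dict is not mutated
    # Use a membership test rather than .get(): an explicit flavor=None
    # must survive instead of being overwritten by the default.
    if "flavor" not in resolved:
        resolved["flavor"] = "spark"
    return resolved


print(resolve_pyarrow_kwargs())                  # {'flavor': 'spark'}
print(resolve_pyarrow_kwargs({"flavor": None}))  # {'flavor': None}
```

The membership test is the important design choice here, since `None` is itself the value that turns sanitization off.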

hirenkumar-n-dholariya · 2026-06-11T12:04:01Z

@kukushking Thanks for the suggestion! I have updated the implementation to expose pyarrow_additional_kwargs directly instead of sanitize_column_names, which is more flexible and consistent with the existing s3_additional_kwargs pattern.

I have updated the test file as well. Please review, and approve and merge the PR if everything looks good.
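
With that change, a caller who needs the original column names might write something like the following. This is a usage sketch only: it assumes copy() accepts pyarrow_additional_kwargs as described in this thread, the wrapper name `copy_preserving_column_names` is hypothetical, and the call needs a live Redshift connection and an S3 staging path to actually run.

```python
from typing import Any


def copy_preserving_column_names(df: Any, con: Any, path: str, table: str, schema: str) -> None:
    """Load df into Redshift without the Spark-flavor column renaming (sketch)."""
    # Imported lazily so this module can be imported without awswrangler installed.
    import awswrangler as wr

    wr.redshift.copy(
        df=df,
        path=path,    # S3 staging prefix supplied by the caller
        con=con,      # an open Redshift connection
        table=table,
        schema=schema,
        # Assumption: parameter exposed as described in this PR.
        pyarrow_additional_kwargs={"flavor": None},
    )
```

Callers who omit pyarrow_additional_kwargs keep the existing behavior, which is the backward-compatibility point raised in the review.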

hirenkumar-dholariya · 2026-06-16T10:44:48Z

Thank you so much @kukushking!
Sincerely appreciate your guidance, help and approval on this.

fdccae4  fix: preserve column names with spaces in wr.redshift.copy()
         Passes flavor=None to internal s3.to_parquet call to prevent pyarrow spark flavor from sanitizing column names (spaces → underscores). Fixes aws#3293

857326e  Merge branch 'main' into hirenkumar-n-dholariya-fix/redshift-copy-col…umn-space-rename

hirenkumar-n-dholariya mentioned this pull request May 4, 2026

wr.redshift.copy() silently renames columns with spaces due to pyarrow defaulting to flavor='spark' in internal s3.to_parquet call #3293

Closed

reachbujji6 approved these changes May 9, 2026

View reviewed changes

hirenkumar-n-dholariya added 7 commits May 26, 2026 17:49

5e3bd43  test: add test for sanitize_column_names=False in wr.redshift.copy()
61c5448  style: fix ruff formatting - remove trailing whitespace
a3f885f  style: fix ruff formatting - remove trailing whitespace in _write.py
5ec45f0  style: fix ruff formatting in test_redshift.py
c6d80da  fix: add required blank line in docstring for ruff D410/D411
df76692  fix: remove trailing whitespace in sanitize_column_names docstring
1dbb5f6  Merge branch 'main' into hirenkumar-n-dholariya-fix/redshift-copy-col…umn-space-rename

hirenkumar-n-dholariya requested a review from reachbujji6 May 29, 2026 16:25

kukushking requested changes Jun 9, 2026

View reviewed changes

kukushking and others added 4 commits June 9, 2026 17:00

dc01574  Merge branch 'main' into hirenkumar-n-dholariya-fix/redshift-copy-col…umn-space-rename
f32c92e  fix: replace sanitize_column_names with pyarrow_additional_kwargs
1e52bdc  fix: update test to use pyarrow_additional_kwargs instead of sanitize…_column_names
e35ca98  Merge branch 'main' into hirenkumar-n-dholariya-fix/redshift-copy-col…umn-space-rename

hirenkumar-n-dholariya requested a review from kukushking June 11, 2026 21:52

ef50816  Merge branch 'main' into hirenkumar-n-dholariya-fix/redshift-copy-col…umn-space-rename

kukushking approved these changes Jun 16, 2026

View reviewed changes

5ae4d4d  Merge branch 'main' into hirenkumar-n-dholariya-fix/redshift-copy-col…umn-space-rename

kukushking merged commit 785b815 into aws:main Jun 16, 2026
28 checks passed

4 participants
