fix: preserve column names with spaces in wr.redshift.copy() #3298

Merged
kukushking merged 16 commits into aws:main from hirenkumar-n-dholariya:hirenkumar-n-dholariya-fix/redshift-copy-column-space-rename on Jun 16, 2026

Conversation

@hirenkumar-n-dholariya

Contributor

Problem

wr.redshift.copy() silently renames columns with spaces (e.g. "my col" → "my_col")
because the internal s3.to_parquet call defaults to pyarrow flavor='spark',
which sanitizes column names.
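
A small pre-flight check can show which column names would be affected before calling wr.redshift.copy(). The sketch below is not pyarrow's code. The character set in UNSAFE_CHARS is an assumption modeled on the Spark-compatibility renaming described above (this PR only confirms the space case), and both helper names are hypothetical. Verify the behavior against your installed pyarrow version.

```python
import re
from typing import Dict, Iterable

# Assumption: characters replaced with "_" when Spark-style sanitization
# is applied. Only the space case is confirmed by this PR's description.
UNSAFE_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitized_name(name: str) -> str:
    """Return the name as it would look after Spark-style sanitization."""
    return UNSAFE_CHARS.sub("_", name)


def columns_that_would_change(columns: Iterable[str]) -> Dict[str, str]:
    """Map each affected column name to its sanitized form."""
    return {c: sanitized_name(c) for c in columns if sanitized_name(c) != c}


if __name__ == "__main__":
    # Report the columns whose names would not survive the round trip.
    print(columns_that_would_change(["my col", "id", "price (usd)"]))
```

Running a check like this on df.columns before loading makes the rename visible instead of silent.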

Fix

Explicitly pass pyarrow_additional_kwargs={"flavor": None} in the internal
s3.to_parquet call to preserve original column names.

Fixes #3293

Passes flavor=None to internal s3.to_parquet call to prevent pyarrow spark flavor from sanitizing column names (spaces → underscores). Fixes aws#3293
@kukushking

Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: fdccae4
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@kukushking

Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: fdccae4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@hirenkumar-dholariya

hirenkumar-dholariya commented Apr 10, 2026

@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback!
The CI failure is unrelated to this fix. The GitHubDistributedCodeBuild failure is caused by a pre-existing incompatibility between modin==0.37.1 and pandas==3.0.1: pandas.read_gbq was removed in pandas 3.x, which causes an AttributeError when loading modin.

The GitHubCodeBuild (non-distributed) pipeline passed successfully.
This issue exists independently in the main branch and is not introduced by this PR.

@hirenkumar-n-dholariya

Contributor Author

Hi @kukushking,
I hope you are doing well. Could you please take a look at the comments above and share your feedback or approval on this PR? Thank you in advance for your time.

@kukushking

Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 857326e
  • Result: FAILED
  • Build Logs (available for 30 days)

@hirenkumar-n-dholariya

Contributor Author

@kukushking Could you please take a look at this PR when you get a chance?

Quick summary:

  • The fix passes flavor=None to the internal s3.to_parquet call to preserve column names with spaces in wr.redshift.copy()
  • The GitHubCodeBuild pipeline passed
  • The GitHubDistributedCodeBuild failure is a pre-existing issue caused by the incompatibility of modin==0.37.1 with pandas==3.0.1 (pandas.read_gbq was removed in pandas 3.x) and is unrelated to this fix

Would appreciate your review!

@kukushking

Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 857326e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@kukushking

Collaborator

Hi @hirenkumar-n-dholariya, yes, the failure you are referring to is pre-existing and nothing to worry about.

As for your change, it alters the default behavior, which makes it a breaking change that would affect every current Redshift user. We will consider it for the next major version.

@hirenkumar-n-dholariya

Contributor Author

@kukushking Thank you for the feedback! That's a fair point about the breaking change concern.

Would it make sense to make the behavior configurable via an optional parameter, so existing users are not impacted by default?

For example:

def copy(
    df,
    # ... existing parameters unchanged ...
    sanitize_column_names: bool = True,  # default keeps the current behavior
):
    ...

This way:

  • Existing users are unaffected (default = True keeps current behavior)
  • Users who need to preserve column names with spaces can opt in with sanitize_column_names=False

Happy to implement this if it sounds like a good direction.

@kukushking

Collaborator

@hirenkumar-n-dholariya yes, that would make sense! Please also consider adding a test case that covers the new behavior. Thank you!

Problem:
wr.redshift.copy() internally calls s3.to_parquet() which defaults to
pyarrow flavor='spark'. This causes column names with spaces to be
silently renamed (e.g. "my col" → "my_col"), leading to a mismatch
between the DataFrame schema and the Redshift table schema.

Solution:
Add an optional sanitize_column_names parameter (default=True) to
wr.redshift.copy() that controls whether pyarrow sanitizes column names.

- sanitize_column_names=True (default): preserves existing behavior,
  column names are sanitized for backward compatibility.
- sanitize_column_names=False: passes flavor=None to the internal
  s3.to_parquet() call, preserving original column names including spaces.

This is a non-breaking change — existing users are unaffected since
the default value maintains the current behavior.
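
The mapping from the proposed flag to the pyarrow options can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the helper name pyarrow_kwargs_for is hypothetical, and it assumes that an empty dict lets s3.to_parquet() apply its own default flavor.

```python
from typing import Any, Dict


def pyarrow_kwargs_for(sanitize_column_names: bool = True) -> Dict[str, Any]:
    """Translate the proposed flag into pyarrow_additional_kwargs (sketch)."""
    if sanitize_column_names:
        # Pass nothing extra, so the existing default behavior is kept.
        return {}
    # flavor=None turns off the Spark-style renaming of column names.
    return {"flavor": None}
```

Keeping the default at True means callers who never pass the flag get exactly the arguments they got before.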

Changes:
- Added sanitize_column_names: bool = True parameter to copy()
- Updated pyarrow_additional_kwargs in s3.to_parquet() call accordingly
- Added docstring for the new parameter
- Added test case for sanitize_column_names=False behavior

Fixes aws#3293
test: add test for sanitize_column_names=False in wr.redshift.copy()
style: fix ruff formatting - remove trailing whitespace
style: fix ruff formatting - remove trailing whitespace in _write.py
style: fix ruff formatting in test_redshift.py
fix: add required blank line in docstring for ruff D410/D411
fix: remove trailing whitespace in sanitize_column_names docstring
@hirenkumar-n-dholariya

Contributor Author

@kukushking Both files are now formatted correctly.
The sanitize_column_names parameter has been added with default=True for backward compatibility, a test case has been added, and the ruff formatting issues are fixed.

Could you please review and share your approval or feedback so we can proceed with the merge?

@kukushking kukushking left a comment

Collaborator

Taking a second look at the current API, I think it would actually be better to expose pyarrow_additional_kwargs directly instead of adding another flag (the sanitize_column_names proposed in this PR). s3_additional_kwargs is already exposed, so this would follow the same pattern. It gives users an escape hatch for any pyarrow option and gives you the tools to cover your use case.

Let me know what you think.

@hirenkumar-n-dholariya

Copy link
Copy Markdown
Contributor Author

@kukushking Thanks for the suggestion! I have updated the implementation to expose pyarrow_additional_kwargs directly instead of sanitize_column_names, which is more flexible and consistent with the existing s3_additional_kwargs pattern.

I have updated the test file as well. Please review, and approve and merge the PR if everything looks good.
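
For readers who want to use the pass-through described in this thread, a call might look like the sketch below. It assumes an awswrangler release that includes this PR and that copy() accepts pyarrow_additional_kwargs as described in the review. The helper names, bucket path, table, and schema are hypothetical placeholders.

```python
from typing import Any, Dict


def build_copy_kwargs(table: str, schema: str, path: str) -> Dict[str, Any]:
    """Collect keyword arguments for wr.redshift.copy() (illustrative)."""
    return {
        "table": table,
        "schema": schema,
        "path": path,
        # Per this PR's discussion, flavor=None keeps pyarrow from renaming
        # columns such as "my col" to "my_col".
        "pyarrow_additional_kwargs": {"flavor": None},
    }


def copy_preserving_names(df: Any, con: Any, **copy_kwargs: Any) -> None:
    """Call wr.redshift.copy() with the arguments built above."""
    # Imported lazily so this module loads even without awswrangler installed.
    import awswrangler as wr

    wr.redshift.copy(df=df, con=con, **copy_kwargs)


if __name__ == "__main__":
    # Hypothetical staging path and target table.
    print(build_copy_kwargs("my_table", "public", "s3://my-bucket/stage/"))
```

Because the option is opt-in, callers who pass nothing keep the existing sanitized column names.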

@kukushking kukushking merged commit 785b815 into aws:main Jun 16, 2026
28 checks passed
@hirenkumar-dholariya


Thank you so much @kukushking!
Sincerely appreciate your guidance, help and approval on this.

Development

Successfully merging this pull request may close these issues.

wr.redshift.copy() silently renames columns with spaces due to pyarrow defaulting to flavor='spark' in internal s3.to_parquet call
