fix: preserve column names with spaces in wr.redshift.copy()#3298
Conversation
Passes flavor=None to internal s3.to_parquet call to prevent pyarrow spark flavor from sanitizing column names (spaces → underscores). Fixes aws#3293
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback! The |
@kukushking Could you confirm this failure is pre-existing and unrelated to the fix? Happy to address any other feedback! The GitHubCodeBuild (non-distributed) pipeline passed successfully. |
Hi @kukushking |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
|
@kukushking Could you please take a look at this PR when you get a chance? Quick summary:
Would appreciate your review! |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
|
Hi @hirenkumar-n-dholariya yes - the failure you are referring to is pre-existing, no reason to worry about. With regards to your change, this is a breaking change of default behavior for any current redshift user will be impacted. We will consider this for the next major version. |
|
@kukushking Thank you for the feedback! That's a fair point about the breaking change concern. Would it make sense to make the behavior configurable via an optional parameter, so existing users are not impacted by default? For example: def copy(
df,
...
sanitize_column_names: bool = True, # preserves backward compatibility
):This way:
Happy to implement this if it sounds like a good direction. |
|
@hirenkumar-n-dholariya yes, that would makes sense! Please also consider adding a test case to test the new behavior. Thank you! |
Problem: wr.redshift.copy() internally calls s3.to_parquet() which defaults to pyarrow flavor='spark'. This causes column names with spaces to be silently renamed (e.g. "my col" → "my_col"), leading to a mismatch between the DataFrame schema and the Redshift table schema. Solution: Add an optional sanitize_column_names parameter (default=True) to wr.redshift.copy() that controls whether pyarrow sanitizes column names. - sanitize_column_names=True (default): preserves existing behavior, column names are sanitized for backward compatibility. - sanitize_column_names=False: passes flavor=None to the internal s3.to_parquet() call, preserving original column names including spaces. This is a non-breaking change — existing users are unaffected since the default value maintains the current behavior. Changes: - Added sanitize_column_names: bool = True parameter to copy() - Updated pyarrow_additional_kwargs in s3.to_parquet() call accordingly - Added docstring for the new parameter - Added test case for sanitize_column_names=False behavior Fixes aws#3293
test: add test for sanitize_column_names=False in wr.redshift.copy()
style: fix ruff formatting - remove trailing whitespace
style: fix ruff formatting - remove trailing whitespace in _write.py
style: fix ruff formatting in test_redshift.py
fix: add required blank line in docstring for ruff D410/D411
fix: remove trailing whitespace in sanitize_column_names docstring
|
@kukushking Both files are now formatted correctly. Could you please review, give your approval/feeedback to proceed with merge. |
kukushking
left a comment
There was a problem hiding this comment.
Second look at the current API, I think it would be actually better to expose pyarrow_additional_kwargs directly instead of creating another flag (sanitize_column_names) proposed in this PR. Especially since s3_additional_kwargs is already exposed we would follow suit. This will give users an escape hatch for any pyarrow option and give you tools to cover your use case.
Let me know what you think.
|
@kukushking Thanks for the suggestion! I have updated the implementation to expose I have made changes in test file as well. Please review and help me to approve/merge the PR if everything else looks good. |
|
Thank you so much @kukushking! |
Problem
wr.redshift.copy()silently renames columns with spaces (e.g. "my col" → "my_col")because the internal
s3.to_parquetcall defaults to pyarrowflavor='spark',which sanitizes column names.
Fix
Explicitly pass
pyarrow_additional_kwargs={"flavor": None}in the internals3.to_parquetcall to preserve original column names.Fixes #3293