@fvaleye
Collaborator

@fvaleye fvaleye commented Aug 22, 2025

Description

This PR fixes type inconsistencies in Pandas DataFrames by implementing automatic conversion of object dtype columns containing null values into their correct schema-defined types (when available).

Previously, object dtype columns with null values could remain untyped, leading to mismatches with schema-defined types and causing validation or serialization errors.

Now, these columns are automatically converted, ensuring consistency between DataFrame values and the schema while reducing type-related issues.

Related Issue(s)

Copilot AI review requested due to automatic review settings August 22, 2025 13:29
@fvaleye fvaleye requested a review from ion-elgreco as a code owner August 22, 2025 13:29
@github-actions github-actions bot added the binding/python Issues for the Python package label Aug 22, 2025

This comment was marked as outdated.

@codecov

codecov bot commented Aug 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.37%. Comparing base (ef9e077) to head (dcb808d).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3695   +/-   ##
=======================================
  Coverage   75.36%   75.37%           
=======================================
  Files         144      144           
  Lines       43607    43607           
  Branches    43607    43607           
=======================================
+ Hits        32866    32868    +2     
+ Misses       9133     9131    -2     
  Partials     1608     1608           

@fvaleye fvaleye force-pushed the feature/implement-automatic-conversion-for-pandas-null-types branch 4 times, most recently from 0a75592 to e62e814 Compare August 22, 2025 13:49
@fvaleye fvaleye requested a review from Copilot August 22, 2025 13:50
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements automatic conversion of Pandas DataFrames containing null-typed columns to match their corresponding schema-defined types in Delta Lake. This fixes type inconsistencies that occur when object dtype columns with null values are converted to null types in PyArrow, which can cause validation or serialization errors when appending to existing tables.

  • Modifies the _convert_arro3_schema_to_delta function to accept an optional existing schema parameter for null type conversion
  • Updates the writer to pass the existing table schema when available for type conversion
  • Adds comprehensive test coverage for null column conversion scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
python/deltalake/writer/writer.py | Passes existing table schema to conversion function for null type handling
python/deltalake/writer/_conversion.py | Implements null type conversion logic with recursion prevention
python/tests/test_conversion.py | Adds test cases for null column conversion and edge cases

@fvaleye fvaleye force-pushed the feature/implement-automatic-conversion-for-pandas-null-types branch 6 times, most recently from fd8d945 to 20775be Compare August 22, 2025 14:47
- implement the conversion of object dtype (null) columns when the corresponding type is found in the existing schema
- works recursively for nested types

Signed-off-by: Florian Valeye <[email protected]>
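
The recursive behavior described in the commit message can be sketched in plain Python. Everything below is hypothetical: `Field`, `Struct`, `NULL`, and `fill_null_types` are stand-ins invented for this illustration and are not deltalake or Arrow APIs. The sketch only shows the general idea of adopting a declared type for a null-typed field and descending into nested structs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; not a deltalake or Arrow class."""
    name: str
    dtype: str | Struct


@dataclass(frozen=True)
class Struct:
    """Hypothetical struct type holding an ordered tuple of child fields."""
    fields: tuple[Field, ...]


NULL = "null"  # marker for a column whose type could not be inferred


def fill_null_types(incoming: Struct, existing: Struct | None) -> Struct:
    """Replace null-typed fields with the type declared in `existing`, if any."""
    known = {f.name: f.dtype for f in existing.fields} if existing is not None else {}
    resolved = []
    for field in incoming.fields:
        target = known.get(field.name)
        if field.dtype == NULL and target is not None:
            # Null column with a known counterpart: adopt the declared type.
            resolved.append(Field(field.name, target))
        elif isinstance(field.dtype, Struct):
            # Nested struct: recurse with the matching nested schema, if present.
            nested = target if isinstance(target, Struct) else None
            resolved.append(Field(field.name, fill_null_types(field.dtype, nested)))
        else:
            # Already typed, or null with nothing to match against: keep as-is.
            resolved.append(field)
    return Struct(tuple(resolved))


# Usage with made-up schemas: two null fields, one of them nested.
incoming = Struct((
    Field("id", "long"),
    Field("comment", NULL),
    Field("address", Struct((Field("city", NULL), Field("zip", "string")))),
))
existing = Struct((
    Field("id", "long"),
    Field("comment", "string"),
    Field("address", Struct((Field("city", "string"), Field("zip", "string")))),
))
result = fill_null_types(incoming, existing)
```

When no existing schema is available, the sketch leaves null-typed fields untouched, which mirrors the "when available" wording in the PR description.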
@fvaleye fvaleye force-pushed the feature/implement-automatic-conversion-for-pandas-null-types branch from 20775be to 0f3ed7f Compare August 22, 2025 14:55
@ion-elgreco ion-elgreco self-assigned this Aug 22, 2025
@FrankPortman
Contributor

Thanks for jumping on this so quickly!

def dtype_to_delta_dtype(
    dtype: DataType, field_name: str | None = None
) -> DataType:
    if DataType.is_null(dtype) and existing_schema and field_name:
Collaborator

I'd prefer we also check here with is not None; it's a bit clearer. The rest of the code LGTM!

Collaborator Author

Do you mean:
if dtype and DataType.is_null(dtype) and existing_schema and field_name:

Collaborator

No, rather: field_name is not None and existing_schema is not None
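
For readers following the thread, the difference between the two styles of guard can be shown with a small sketch. The function names `guard_truthy` and `guard_explicit` are made up, and a plain dict stands in for the schema object. Whether the real schema object can ever be falsy is not something this thread establishes; the sketch only shows why the explicit form states the intent more directly.

```python
def guard_truthy(field_name, existing_schema):
    # Truthiness check: also rejects "" and empty containers, not only None.
    return bool(existing_schema and field_name)


def guard_explicit(field_name, existing_schema):
    # Explicit check: rejects only None, which states the intent directly.
    return field_name is not None and existing_schema is not None


# Hypothetical inputs: an empty dict is "present" but falsy.
print(guard_truthy("comment", {}))    # False
print(guard_explicit("comment", {}))  # True
```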

Collaborator Author

I see!
Thanks for your review @ion-elgreco 🙏
Let's merge this after this change!

@fvaleye fvaleye force-pushed the feature/implement-automatic-conversion-for-pandas-null-types branch from 6f0c6ca to dcb808d Compare August 24, 2025 18:59
@fvaleye fvaleye enabled auto-merge (rebase) August 24, 2025 19:00
@fvaleye fvaleye merged commit b89d7a4 into delta-io:main Aug 24, 2025
28 of 29 checks passed
Labels

binding/python Issues for the Python package

Development

Successfully merging this pull request may close these issues.

Python: Automatically convert Pandas null types to valid Delta Lake types in write_deltalake()

3 participants