
feat(pyspark): expose merge_schema option in create_table #11071


Open: NickCrews wants to merge 1 commit into main

Conversation

NickCrews (Contributor, PR author):

Closes #10984

I didn't add any tests because I don't understand what this option does.
One of the requesters will need to explain this to me, and propose a general strategy for testing.

Another option would be to forward all kwargs opaquely as **backend_specific_kwargs, but I went with an explicit named kwarg because that is what is already done for partition_by. I'm open to changing this.
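
For illustration only (this is not code from the PR), the user-facing call this change enables would look roughly like the sketch below. It assumes the backend's existing format and partition_by arguments and a Delta-capable Spark session; the table name and data are placeholders.

    import ibis
    from pyspark.sql import SparkSession

    # Placeholder session and data; merge_schema semantics are still being
    # discussed in this thread, so treat this as a sketch, not a recipe.
    spark = SparkSession.builder.getOrCreate()
    con = ibis.pyspark.connect(spark)

    t = ibis.memtable({"id": [1, 2], "val": ["a", "b"]})
    con.create_table(
        "events",             # hypothetical table name
        t,
        format="delta",
        partition_by=["id"],  # already supported
        merge_schema=True,    # the new kwarg, forwarded as mergeSchema
    )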

github-actions bot added the "pyspark (The Apache PySpark backend)" label on Apr 2, 2025
format=format,
mode=mode,
partitionBy=partition_by,
mergeSchema=merge_schema,
Member (review comment on the diff snippet above):

Does this option make sense without mode="append" (which currently is never used in our API)?

NickCrews (author):

No idea. Hoping that the original requesters will comment.

Another commenter:

I think it will run without being in append mode, but it won't work as intended. I didn't realize append mode was never used in the API. We're eventually trying to implement delta upserts in our kedro pipeline (which uses ibis throughout), but we were trying to append with merge schema as a small step in that direction 😬
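
For context (an illustration, not code from this PR): in plain PySpark with Delta Lake, mergeSchema only has an effect when appending a frame whose schema differs from the existing table; the append fails without it. A minimal sketch, assuming delta-spark is installed and configured:

    from pyspark.sql import SparkSession

    # Assumes a Delta-enabled session (delta-spark on the classpath).
    spark = (
        SparkSession.builder
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    spark.createDataFrame([(1, "a")], ["id", "val"]).write.format(
        "delta"
    ).saveAsTable("t")

    # Appending a frame with an extra column raises an AnalysisException
    # unless mergeSchema is set, which evolves the table schema instead.
    wider = spark.createDataFrame([(2, "b", 3.0)], ["id", "val", "score"])
    (
        wider.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable("t")
    )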

@jakepenzak (Contributor, Apr 9, 2025):

We'd need mode=append for this option to be meaningful.

@NickCrews, as you mentioned, instead of mapping mergeSchema directly, I'd recommend we allow users to pass **kwargs to the create_table method to get mapped to saveAsTable. This is what was done for the to_parquet and to_delta methods (via the save method). If we did that, we could remove the partition_by and format args, although that would be a breaking change for users of the partition_by arg (format would continue to pass through unscathed). In retrospect, I should have done this back in #10850 for more flexibility in create_table and consistency with the to_* methods.

However, we are still left with the question of how to handle mode. I'd recommend letting mode take precedence over overwrite if the user passes it as a kwarg to create_table. That said, passing mode=append alone to create_table would result in behavior similar to the insert method, potentially introducing confusion in API usage, although one would have to understand this functionality in PySpark to begin with. This behavior already exists in PySpark's own methods: under this proposal, create_table would align more closely with saveAsTable, and insert would continue to align with insertInto. I believe this is the cleanest solution and opens up the most flexibility for advanced PySpark users.

If this proposal sounds good, I'd be happy to take a stab at this if you'd prefer to hand this off @NickCrews.
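
A minimal sketch of the shape this proposal describes (the function name and signature here are placeholders, not the backend's actual code): arbitrary kwargs pass straight through to saveAsTable, and an explicit mode kwarg takes precedence over the overwrite flag.

    from pyspark.sql import DataFrame

    def create_table_sketch(
        name: str,
        df: DataFrame,
        *,
        overwrite: bool = False,
        **kwargs,  # e.g. format="delta", partitionBy=["id"], mergeSchema=True
    ) -> None:
        # A user-supplied mode wins over the overwrite flag; otherwise fall
        # back to the writer's default of erroring if the table exists.
        mode = kwargs.pop("mode", "overwrite" if overwrite else "error")
        df.write.saveAsTable(name, mode=mode, **kwargs)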

@NickCrews (author, Apr 10, 2025):

> I'd recommend we allow users to pass **kwargs to the create_table method to get mapped to

Sounds good to me. I'm fine if we are breaking; I would prioritize consistency and ease of use over stability. I don't know how @cpcloud feels about this balance.

> However, we are still left with the question of how to handle mode

How about you submit a PR with your proposed solution, and in the PR description you describe/compare/contrast the alternative behavior. Having something concrete will help me understand this better.

Labels: pyspark (The Apache PySpark backend)
Projects: none yet
Development: merging this pull request may close the issue "feat: add **options to pyspark backend create_table"
4 participants