feat(pyspark): expose merge_schema option in create_table #11071
Open
NickCrews wants to merge 1 commit into ibis-project:main from NickCrews:pyspark-merge-schema
Does this option make sense without `mode="append"` (which currently is never used in our API)?
No idea. Hoping that the original requesters will comment.
I think it will run without being in append mode, but it won't work as intended. I didn't realize append mode was never used in the API. We're eventually trying to implement delta upserts in our kedro pipeline (uses ibis throughout) but were trying to append with merge schema as a small step in that direction 😬
We'd need `mode=append` for this option to be meaningful.

@NickCrews, as you mentioned, instead of mapping `mergeSchema` directly, I'd recommend we allow users to pass `**kwargs` to the `create_table` method to get mapped to `saveAsTable`. This is what was done for the `to_parquet` & `to_delta` methods (via the `save` method). If we did that, we could remove the `partition_by` and `format` args, albeit this would be a breaking change for users of the `partition_by` arg (`format` would continue to pass through unscathed). In retrospect, I should have done this back in #10850 for more flexibility in `create_table` and consistency w/ the `to_*` methods.

However, we are still left with the question of how to handle `mode`. I'd recommend letting `mode` take precedence over `overwrite` if the user passes it as a kwarg to `create_table`. That said, passing `mode=append` alone to `create_table` would result in similar behavior to the `insert` method, potentially introducing confusion in API usage, although one would have to understand this functionality in pyspark to begin with. This behavior already exists in the pyspark methods: in this proposed solution, the `create_table` method would align more closely with the `saveAsTable` method and `insert` would continue to align with the `insertInto` method. I believe this is the cleanest solution & opens up the most flexibility for advanced pyspark users.

If this proposal sounds good, I'd be happy to take a stab at this if you'd prefer to hand this off @NickCrews.
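To make the precedence rule in this proposal concrete, here is a minimal sketch of how `mode` and `overwrite` might be reconciled before the remaining kwargs are forwarded to `saveAsTable`. The helper name `resolve_save_options` is hypothetical and not part of ibis, and the fallback of `"error"` when `overwrite=False` is an assumption about the current default rather than something confirmed in this thread.

```python
from typing import Any


def resolve_save_options(overwrite: bool = False, **kwargs: Any) -> dict[str, Any]:
    """Hypothetical helper: build keyword arguments for DataFrameWriter.saveAsTable."""
    options = dict(kwargs)
    # An explicitly passed mode takes precedence over the overwrite flag.
    mode = options.pop("mode", None)
    if mode is None:
        # Assumed fallback: derive the mode from overwrite when none is given.
        mode = "overwrite" if overwrite else "error"
    return {"mode": mode, **options}


# mode="append" wins even though overwrite=True was also passed.
append_opts = resolve_save_options(overwrite=True, mode="append", mergeSchema="true")
# No mode given, so the overwrite flag decides; other kwargs pass through.
default_opts = resolve_save_options(partitionBy=["year"])
print(append_opts)
print(default_opts)

# With a live Spark session, the result could be forwarded along the lines of:
#   df.write.saveAsTable("my_table", format="delta", **append_opts)
```

Under this sketch, any extra keyword such as `mergeSchema` is passed through untouched as a writer option, which is what would make the append-with-schema-merge use case reachable from `create_table`.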
Sounds good to me, I'm fine if we are breaking, I would prioritize consistency and ease over stability. IDK how @cpcloud feels about this balance.
How about you submit a PR with your proposed solution, and in the PR description you describe/compare/contrast the alternative behavior. Having something concrete will help me understand this better.