-
Notifications
You must be signed in to change notification settings - Fork 637
feat(pyspark): expose merge_schema option in create_table #11071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
format=format, | ||
mode=mode, | ||
partitionBy=partition_by, | ||
mergeSchema=merge_schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this option make sense without mode="append"
(which currently is never used in our API)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea. Hoping that the original requesters will comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will run without being in append mode, but it won't work as intended. I didn't realize append mode was never used in the API. We're eventually trying to implement delta upserts in our kedro pipeline (uses ibis throughout) but were trying to append with merge schema as a small step in that direction 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd need mode=append
for this option to be meaningful.
@NickCrews, as you mentioned, instead of mapping mergeSchema
directly, I'd recommend we allow users to pass **kwargs
to create_table
method to get mapped to saveAsTable
. This is what was done for to_parquet
& to_delta
methods (via save
method). If we did that, we could remove partition_by
and format
args, albeit this would be a breaking change for users of partition_by
arg (format
would continue to pass through unscathed). In retrospect, I should of done this back in #10850 for more flexibility in create_table
and consistency w/ the to_*
methods.
However, we are still left with the question on how to handle mode
. I'd recommend letting mode
take precedence over overwrite
, if the user passes it as a kwarg to create_table
. That said, passing mode=append
alone to create_table
would result in similar behavior to insert
method, potentially introducing confusion in api usage - although, one would have to understand this functionality in pyspark to begin with. This behavior already exists in pyspark methods - in this proposed solution, the create_table
method would align more closely with saveAsTable
method and insert
would continue to align with insertInto
method. I believe this is the cleanest solution & opens up the most flexibility for advanced pyspark users.
If this proposal sounds good, I'd be happy to take a stab at this if you'd prefer to hand this off @NickCrews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend we allow users to pass **kwargs to create_table method to get mapped to
Sounds good to me, I'm fine if we are breaking, I would prioritize consistency and ease over stability. IDK how @cpcloud feels about this balance.
However, we are still left with the question on how to handle mode
How about you submit a PR with your proposed solution, and in the PR description you describe/compare/contrast the alternative behavior. Having something concrete will help me understand this better.
Closes #10984
I didn't add any tests because I don't understand what this option does.
One of the requesters will need to explain this to me, and propose a general strategy for testing.
Another option would be to instead forward all kwargs opaquely as
**backend_specific_kwargs
, but I went with explicitly listing this as a named kwarg because that is what is already done withpartition_by
. But open to change this.