feat(pyspark): expose merge_schema option in create_table #11071

Open · wants to merge 1 commit into main

ibis/backends/pyspark/__init__.py (8 additions & 1 deletion)

```diff
@@ -606,6 +606,7 @@ def create_table(
         overwrite: bool = False,
         format: str = "parquet",
         partition_by: str | list[str] | None = None,
+        merge_schema: bool = False,
     ) -> ir.Table:
         """Create a new table in Spark.
 
@@ -631,6 +632,8 @@ def create_table(
             Format of the table on disk
         partition_by
             Name(s) of partitioning column(s)
+        merge_schema
+            Forwarded to pyspark's `saveAsTable` method as the `mergeSchema` option.
 
         Returns
         -------
@@ -660,7 +663,11 @@
             self._run_pre_execute_hooks(table)
             df = self._session.sql(query)
             df.write.saveAsTable(
-                name, format=format, mode=mode, partitionBy=partition_by
+                name,
+                format=format,
+                mode=mode,
+                partitionBy=partition_by,
+                mergeSchema=merge_schema,
```
Review thread on the `mergeSchema=merge_schema` line:

Member:
Does this option make sense without mode="append" (which currently is never used in our API)?
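
For context, a minimal PySpark sketch (not part of the PR) of where `mergeSchema` matters; the table name is illustrative, and Delta Lake is assumed to be configured, since schema evolution on write is a Delta feature:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame with a column that an existing table does not have yet.
df = spark.createDataFrame([(1, "x")], ["id", "new_col"])

# `saveAsTable` forwards unknown keyword arguments as writer options, so
# `mergeSchema=True` becomes the `mergeSchema` option. It only changes
# behavior when appending to an existing table whose schema differs from `df`.
df.write.saveAsTable(
    "events",          # hypothetical existing table
    format="delta",
    mode="append",
    mergeSchema=True,  # add columns present in `df` but missing from the table
)
```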

Contributor Author (@NickCrews):
No idea. Hoping that the original requesters will comment.

Another commenter:
I think it will run without being in append mode, but it won't work as intended. I didn't realize append mode was never used in the API. We're eventually trying to implement delta upserts in our kedro pipeline (which uses ibis throughout), but we're trying to append with merge schema as a small step in that direction 😬

Contributor (@jakepenzak), Apr 9, 2025:
We'd need mode=append for this option to be meaningful.

@NickCrews, as you mentioned, instead of mapping `mergeSchema` directly, I'd recommend we allow users to pass `**kwargs` to the `create_table` method to get mapped to `saveAsTable`. This is what was done for the `to_parquet` & `to_delta` methods (via the `save` method). If we did that, we could remove the `partition_by` and `format` args, albeit this would be a breaking change for users of the `partition_by` arg (`format` would continue to pass through unscathed). In retrospect, I should have done this back in #10850 for more flexibility in `create_table` and consistency with the `to_*` methods.

However, we are still left with the question of how to handle `mode`. I'd recommend letting `mode` take precedence over `overwrite` if the user passes it as a kwarg to `create_table`. That said, passing `mode="append"` alone to `create_table` would result in behavior similar to the `insert` method, potentially introducing confusion in API usage, although one would have to understand this functionality in pyspark to begin with. This behavior already exists in the pyspark methods: in this proposed solution, the `create_table` method would align more closely with the `saveAsTable` method, and `insert` would continue to align with the `insertInto` method. I believe this is the cleanest solution and opens up the most flexibility for advanced pyspark users.

If this proposal sounds good, I'd be happy to take a stab at this if you'd prefer to hand this off @NickCrews.
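
A rough sketch of the forwarding pattern proposed above; the helper, its name, and the `mode`-over-`overwrite` default are hypothetical, not existing ibis API:

```python
from typing import Any

from pyspark.sql import DataFrame


def _write_table(
    df: DataFrame, name: str, *, overwrite: bool = False, **kwargs: Any
) -> None:
    # Hypothetical helper: keyword arguments not handled explicitly
    # (format, partitionBy, mergeSchema, ...) are forwarded verbatim to
    # `saveAsTable`, and an explicit `mode` kwarg takes precedence over
    # the `overwrite` flag.
    mode = kwargs.pop("mode", "overwrite" if overwrite else "error")
    df.write.saveAsTable(name, mode=mode, **kwargs)


# Example usage (assuming `df` is a Spark DataFrame):
# _write_table(df, "events", mode="append", format="delta", mergeSchema=True)
```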

Contributor Author (@NickCrews), Apr 10, 2025:
> I'd recommend we allow users to pass `**kwargs` to the `create_table` method to get mapped to

Sounds good to me. I'm fine if we are breaking; I would prioritize consistency and ease over stability. IDK how @cpcloud feels about this balance.

> However, we are still left with the question of how to handle `mode`

How about you submit a PR with your proposed solution, and in the PR description you describe/compare/contrast the alternative behavior. Having something concrete will help me understand this better.

```diff
             )
         elif schema is not None:
             schema = ibis.schema(schema)
```