feat(tsql): SPLIT_PART function and conversion to PARSENAME in tsql #4211
Conversation
Hey @daihuynh, thanks for the contribution! Have you checked other dialects as well? When introducing a new expression such as …
Hi @VaggelisD, I've just checked the Snowflake doc about this function and don't see that any of those parameters are optional. I think Snowflake and the other Spark-family engines just use the function directly. However, I'm happy to update the PR if any SQL engine has a different implementation.
@daihuynh From what I remember there are other dialects that have …
All good @VaggelisD, I'll put unit tests in those dialects and update the expression's requirements if there are any variants.
I took a better look at the PR and I'm not sure if I agree with it:
- SPLIT_PART is a function that is common across various engines and is used to choose a part after splitting any string by any delimiter.
- PARSENAME seems T-SQL specific and is used to split an object name (e.g. a table name such as db.schema.foo), so the string must be (1) separated by dots and (2) have a max part count of 4.
So, representing PARSENAME as an exp.SplitPart node seems like a hacky solution and could only work for an extreme subset of it. Could we use T-SQL's STRING_SPLIT here?
Please let me know if I'm misunderstanding your approach; also, feel free to share any context/examples regarding the need for this change.
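To make that semantic gap concrete, here is a minimal Python sketch (plain helper functions, not sqlglot code; the names just mirror the SQL functions and their documented behaviour) of how the two functions index parts:

```python
def split_part(value: str, delimiter: str, part: int) -> str:
    # SPLIT_PART: any delimiter, 1-based index counted from the LEFT.
    parts = value.split(delimiter)
    return parts[part - 1] if 1 <= part <= len(parts) else ""


def parsename(value: str, part: int):
    # PARSENAME: dot-separated object names only, at most 4 parts,
    # 1-based index counted from the RIGHT; NULL (None) otherwise.
    parts = value.split(".")
    if len(parts) > 4 or not 1 <= part <= len(parts):
        return None
    return parts[-part]


print(split_part("1.2.3", ".", 1))  # '1' -> first part from the left
print(parsename("1.2.3", 1))        # '3' -> first part from the right
print(parsename("1.2.3", 3))        # '1' -> matches SPLIT_PART(..., 1) only because there are exactly 3 parts
```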
Hi @VaggelisD, the context behind this PR is that my team's data modeller uses PARSENAME in many scripts when he works with source databases. His approach is ETL. The reason I didn't choose STRING_SPLIT in T-SQL for this PR is that it returns the split values as rows, and picking a value at a given index takes extra steps, which would change the structure of the original SQL query. Here is an example.
Hope this gives you more context.
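A rough sketch of the kind of STRING_SPLIT rewrite being described, shown as a T-SQL string; the ordinal argument is an assumption here, since it only exists on newer SQL Server / Azure SQL versions:

```python
# Hypothetical STRING_SPLIT-based equivalent of SPLIT_PART('1.2.3', '.', 1).
# The parts come back as rows, so selecting a single part needs a query and an
# ordinal filter -- the "extra steps" mentioned above. The third
# (enable_ordinal) argument is only available on SQL Server 2022+ / Azure SQL.
TSQL_STRING_SPLIT_REWRITE = """
SELECT value
FROM STRING_SPLIT('1.2.3', '.', 1)
WHERE ordinal = 1
"""
```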
@VaggelisD Here is the reference table for the SPLIT_PART function across the databases supported in sqlglot. The first 3 parameters (string, string, integer) are required in all cases. Apache Drill supports 4 parameters, but the final parameter is optional.
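As a sketch of how those three required parameters might map onto a sqlglot expression (the exact declaration in the merged PR may differ; this follows sqlglot's usual Func conventions and is an assumption):

```python
from sqlglot import exp


# Hypothetical declaration: three required arguments, mirroring
# SPLIT_PART(<string>, <delimiter>, <part_index>) across dialects.
class SplitPart(exp.Func):
    arg_types = {"this": True, "delimiter": True, "part_index": True}
```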
Hey @daihuynh,
Thanks for providing the context & the research. I can now see how your use case could be added as a best-effort transpilation attempt, but I'll also wait for the other maintainers to decide if this approach is good to go.
In the meantime, I did a pass over the PR and left a few comments.
Hey @daihuynh, could you please let us know if you still have the bandwidth to work on this PR? Otherwise, we can take it to the finish line.
Hi @VaggelisD, I just got back from a public holiday. I've pushed enhanced code with @georgesittas' suggestions and yours. Hopefully it is better this time.
Ah I see, thank you for the immediate response, will go over the changes shortly.
@daihuynh Thanks for the PR! I will merge it and simplify it a bit wherever possible.
Thanks @VaggelisD! I learned something new in this PR.
This PR introduces the SPLIT_PART function used in Spark and Databricks, as well as the corresponding conversion to PARSENAME in T-SQL.
In Spark/Databricks:
SELECT SPLIT_PART('1.2.3', '.', 1)
becomes this in T-SQL:
SELECT PARSENAME('1.2.3', 3)
and vice versa.
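With this PR in place, the round trip can be exercised through sqlglot's transpile API; the output strings in the comments below are taken from the examples above and are expected rather than verified results:

```python
import sqlglot

# Spark -> T-SQL: SPLIT_PART becomes PARSENAME (index counted from the right).
print(sqlglot.transpile("SELECT SPLIT_PART('1.2.3', '.', 1)", read="spark", write="tsql"))
# Expected (per the example above): ["SELECT PARSENAME('1.2.3', 3)"]

# T-SQL -> Spark: PARSENAME converts back to SPLIT_PART.
print(sqlglot.transpile("SELECT PARSENAME('1.2.3', 3)", read="tsql", write="spark"))
# Expected (per the example above): ["SELECT SPLIT_PART('1.2.3', '.', 1)"]
```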