
feat(tsql): SPLIT_PART function and conversion to PARSENAME in tsql #4211

Open
wants to merge 2 commits into main

Conversation

daihuynh commented Oct 4, 2024

This PR introduces the SPLIT_PART function used in Spark and Databricks, as well as the corresponding conversion to and from PARSENAME in T-SQL.

In Spark/Databricks:

SELECT SPLIT_PART('1.2.3', '.', 1)

becomes this in T-SQL:

SELECT PARSENAME('1.2.3', 3)

and vice versa.
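For reference, this is how the conversion can be exercised through SQLGlot's public API once the PR is in. This is a minimal sketch; the exact rendered output depends on the final implementation:

import sqlglot

# Spark's SPLIT_PART counts parts from the left; T-SQL's PARSENAME counts
# from the right, so part 1 of the 3-part literal maps to PARSENAME part 3.
print(sqlglot.transpile("SELECT SPLIT_PART('1.2.3', '.', 1)", read="spark", write="tsql"))
# expected: ["SELECT PARSENAME('1.2.3', 3)"]

print(sqlglot.transpile("SELECT PARSENAME('1.2.3', 3)", read="tsql", write="databricks"))
# expected: ["SELECT SPLIT_PART('1.2.3', '.', 1)"]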

VaggelisD (Collaborator)

Hey @daihuynh, thanks for the contribution!

Have you checked other dialects as well? Introducing a new expression such as exp.SplitPart impacts all SQLGlot dialects, as SPLIT_PART(...) is now parsed "implicitly". This is a problem if all arguments are required in Spark but some other dialect, e.g. Snowflake, treats them as optional.
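For context, new functions in SQLGlot are declared as exp.Func subclasses whose arg_types flags mark each argument as required or optional. A minimal sketch (the arg names here are illustrative, not necessarily the PR's):

from sqlglot import expressions as exp

class SplitPart(exp.Func):
    # True = required, False = optional. If some dialect allowed the part
    # number to be omitted, its flag would have to be False here, and every
    # dialect's generator would need to cope with the missing argument.
    arg_types = {"this": True, "delimiter": True, "part_index": True}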

daihuynh (Author) commented Oct 4, 2024

Hi @VaggelisD,

I've just checked the Snowflake docs for this function and don't see that any of its arguments are optional. I think Snowflake and other Spark-family engines just use the function directly. However, I'm happy to update the PR if any SQL engine has a different implementation.
https://docs.snowflake.com/en/sql-reference/functions/split_part

VaggelisD (Collaborator)

@daihuynh From what I remember, other dialects have SPLIT_PART too, like DuckDB. Could you please do a pass over the important dialects (Postgres, Redshift, BigQuery, DuckDB, MySQL, Presto/Trino, etc.) to make sure that their signatures are equivalent? That means semantics, arg count, whether all args are required like in Spark, and so on.

daihuynh (Author) commented Oct 4, 2024

All good @VaggelisD, I'll add unit tests for those dialects and update the expression's requirements if there are any variants.

VaggelisD (Collaborator) left a comment

I took a closer look at the PR and I'm not sure I agree with it:

  • SPLIT_PART is a function that is common across various engines and is used to pick a part after splitting any string by any delimiter.
  • PARSENAME seems T-SQL specific and is used to split an object name (e.g. a table name such as db.schema.foo), so the string must (1) be separated by dots and (2) have a max part count of 4.

So, representing PARSENAME as an exp.SplitPart node seems like a hacky solution and would only cover an extreme subset of it. Could we use T-SQL's STRING_SPLIT here?

Please let me know if I'm misunderstanding your approach; also, feel free to share any context/examples regarding the need for this change.

daihuynh (Author) commented Oct 4, 2024

Hi @VaggelisD,

The context behind this PR is that my team's data modeller uses PARSENAME in many scripts when he works with source databases; his approach is ETL. The actual cloud data platform uses ELT (load first, transform later), so I need to transpile those scripts to the Databricks dialect to run the transformations. Currently PARSENAME passes through unchanged and (unsurprisingly) doesn't work in Databricks. Instead of finding and replacing it with SPLIT_PART in a text editor, I chose to make this PR.

With this PR:

  • From T-SQL to other dialects that support SPLIT_PART: PARSENAME becomes SPLIT_PART with '.' as the delimiter and a reversed part number (see the sketch below).
  • Vice versa: SPLIT_PART with a '.' delimiter can be converted to PARSENAME; otherwise it stays as-is and requires a user-defined function in T-SQL. I agree that this is an extreme use case of SPLIT_PART.
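For concreteness, a minimal sketch of the index reversal this relies on (the function name and scope here are mine, purely illustrative). Note it only works when the part count of the string is known, e.g. for a literal:

def reversed_part_number(value: str, split_part_index: int) -> int:
    # SPLIT_PART counts parts from the left, PARSENAME from the right, so
    # for a dot-separated string with a known part count the two indices
    # mirror each other.
    part_count = value.count(".") + 1
    return part_count - split_part_index + 1

# SPLIT_PART('1.2.3', '.', 1) corresponds to PARSENAME('1.2.3', 3)
assert reversed_part_number("1.2.3", 1) == 3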

The reason I didn't choose STRING_SPLIT in T-SQL is that it returns the split values as rows, so picking the value at an index takes extra steps, which would change the structure of the original SQL query. Here is an example:

WITH CTE_SplitValues AS (
  SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS idx, value
  FROM STRING_SPLIT('1.2.3.4', '.')
)
SELECT value
FROM CTE_SplitValues
WHERE idx = 1

Hope this gives you more context.

daihuynh (Author) commented Oct 4, 2024

@VaggelisD Here is the reference table for the SPLIT_PART function across the databases supported by SQLGlot. The first 3 parameters (string, string, integer) are required in all cases. Apache Drill supports 4 parameters, but the final one is optional.

DB | Function | Reference
Athena | Same as Trino/Presto |
BigQuery | Not supported |
ClickHouse | Not supported / no equivalent function |
Databricks | split_part(str, delim, partNum) | Link
Doris | VARCHAR split_part(VARCHAR content, VARCHAR delimiter, INT field) | Link
Drill | SPLIT_PART(string, delimiter, start[, end]) | Link
DuckDB | split_part(string, separator, index) | Link
Hive | Not supported |
MySQL | SUBSTRING_INDEX(str, delim, count) | Link
Oracle | Not supported |
PostgreSQL | split_part(string text, delimiter text, field int) → text | Link
Presto | split_part(string, delimiter, index) → varchar | Link
PRQL | Not supported |
Redshift | SPLIT_PART(string, delimiter, position) | Link
RisingWave | split_part(input_string, delimiter_string, int_n) → varchar | Link
Snowflake | SPLIT_PART(string, delimiter, partNumber) | Link
Spark | SPLIT_PART(src, delimiter, partNumber) | Link
SQLite | Not supported |
StarRocks | VARCHAR split_part(VARCHAR content, VARCHAR delimiter, INT field) | Link
Tableau | SPLIT(string, delimiter, token number) | Link
Teradata | [TD_SYSFNLIB.]STRTOK(instring, delimiter, tokennum) | Link
Trino | split_part(string, delimiter, index) → varchar | Link
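A quick way to sanity-check how each dialect currently round-trips the function with SQLGlot (output depends on the installed version):

import sqlglot

SQL = "SELECT SPLIT_PART('1.2.3', '.', 1)"
for dialect in ("postgres", "redshift", "duckdb", "snowflake", "presto", "spark", "tsql"):
    # Read and write in the same dialect to see whether the signature survives.
    print(dialect, sqlglot.transpile(SQL, read=dialect, write=dialect))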
