
feat(tsql): SPLIT_PART function and conversion to PARSENAME in tsql #4211

Open
wants to merge 2 commits into main

Conversation

daihuynh commented Oct 4, 2024

This PR introduces the SPLIT_PART function used in Spark and Databricks, as well as the corresponding conversion to and from PARSENAME in T-SQL.

In Spark/Databricks:

SELECT SPLIT_PART('1.2.3', '.', 1)

becomes this in T-SQL:

SELECT PARSENAME('1.2.3', 3)

and vice versa.
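For reference, this is how the conversion can be exercised through SQLGlot's public API once the PR is in. This is a minimal sketch; the exact rendered output depends on the final implementation:

import sqlglot

# Spark's SPLIT_PART counts parts from the left; T-SQL's PARSENAME counts
# from the right, so part 1 of the 3-part literal maps to PARSENAME part 3.
print(sqlglot.transpile("SELECT SPLIT_PART('1.2.3', '.', 1)", read="spark", write="tsql"))
# expected: ["SELECT PARSENAME('1.2.3', 3)"]

print(sqlglot.transpile("SELECT PARSENAME('1.2.3', 3)", read="tsql", write="databricks"))
# expected: ["SELECT SPLIT_PART('1.2.3', '.', 1)"]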

VaggelisD (Collaborator)

Hey @daihuynh, thanks for the contribution!

Have you checked other dialects as well? Introducing a new expression such as exp.SplitPart impacts all SQLGlot dialects, as SPLIT_PART(...) is now parsed "implicitly". This is a problem if all arguments are required in Spark but some other dialect, e.g. Snowflake, treats them as optional.
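For context, new functions in SQLGlot are declared as exp.Func subclasses whose arg_types flags mark each argument as required or optional. A minimal sketch (the arg names here are illustrative, not necessarily the PR's):

from sqlglot import expressions as exp

class SplitPart(exp.Func):
    # True = required, False = optional. If some dialect allowed the part
    # number to be omitted, its flag would have to be False here, and every
    # dialect's generator would need to cope with the missing argument.
    arg_types = {"this": True, "delimiter": True, "part_index": True}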

daihuynh (Author) commented Oct 4, 2024

Hi @VaggelisD,

I've just checked the Snowflake docs for this function and don't see that any of its arguments are optional. I think Snowflake and other Spark-family engines just use the function directly. However, I'm happy to update the PR if any SQL engine has a different implementation.
https://docs.snowflake.com/en/sql-reference/functions/split_part

VaggelisD (Collaborator)

@daihuynh From what I remember, other dialects have SPLIT_PART too, like DuckDB. Could you please do a pass over the important dialects (Postgres, Redshift, BigQuery, DuckDB, MySQL, Presto/Trino, etc.) to make sure that their signatures are equivalent? That means semantics, arg count, whether all args are required like in Spark, and so on.

daihuynh (Author) commented Oct 4, 2024

All good @VaggelisD, I'll add unit tests for those dialects and update the expression's requirements if there are any variants.

VaggelisD (Collaborator) left a comment

I took a closer look at the PR and I'm not sure I agree with it:

  • SPLIT_PART is a function that is common across various engines and is used to pick a part after splitting any string by any delimiter.
  • PARSENAME seems T-SQL specific and is used to split an object name (e.g. a table name such as db.schema.foo), so the string must (1) be separated by dots and (2) have a max part count of 4.

So, representing PARSENAME as an exp.SplitPart node seems like a hacky solution and would only cover an extreme subset of it. Could we use T-SQL's STRING_SPLIT here?

Please let me know if I'm misunderstanding your approach; also, feel free to share any context/examples regarding the need for this change.

daihuynh (Author) commented Oct 4, 2024

Hi @VaggelisD,

The context behind this PR is that my team's data modeller uses PARSENAME in many scripts when he works with source databases; his approach is ETL. The actual cloud data platform uses ELT (load first, transform later), so I need to transpile those scripts to the Databricks dialect to run the transformations. Currently PARSENAME passes through unchanged and (unsurprisingly) doesn't work in Databricks. Instead of finding and replacing it with SPLIT_PART in a text editor, I chose to make this PR.

With this PR:

  • From T-SQL to other dialects that support SPLIT_PART: PARSENAME becomes SPLIT_PART with '.' as the delimiter and a reversed part number (see the sketch below).
  • Vice versa: SPLIT_PART with a '.' delimiter can be converted to PARSENAME; otherwise it stays as-is and requires a user-defined function in T-SQL. I agree that this is an extreme use case of SPLIT_PART.
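For concreteness, a minimal sketch of the index reversal this relies on (the function name and scope here are mine, purely illustrative). Note it only works when the part count of the string is known, e.g. for a literal:

def reversed_part_number(value: str, split_part_index: int) -> int:
    # SPLIT_PART counts parts from the left, PARSENAME from the right, so
    # for a dot-separated string with a known part count the two indices
    # mirror each other.
    part_count = value.count(".") + 1
    return part_count - split_part_index + 1

# SPLIT_PART('1.2.3', '.', 1) corresponds to PARSENAME('1.2.3', 3)
assert reversed_part_number("1.2.3", 1) == 3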

The reason I didn't choose STRING_SPLIT in T-SQL is that it returns the split values as rows, so picking the value at an index takes extra steps, which would change the structure of the original SQL query. Here is an example:

WITH CTE_SplitValues AS (
  SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS idx, value
  FROM STRING_SPLIT('1.2.3.4', '.')
)
SELECT value
FROM CTE_SplitValues
WHERE idx = 1

Hope this gives you more context.

daihuynh (Author) commented Oct 4, 2024

@VaggelisD Here is the reference table for the SPLIT_PART function across the databases supported by SQLGlot. The first 3 parameters (string, string, integer) are required in all cases. Apache Drill supports 4 parameters, but the final one is optional.

DB | Function | Reference
Athena | Same as Trino/Presto |
BigQuery | Not supported |
ClickHouse | Not supported / no equivalent function |
Databricks | split_part(str, delim, partNum) | Link
Doris | VARCHAR split_part(VARCHAR content, VARCHAR delimiter, INT field) | Link
Drill | SPLIT_PART(string, delimiter, start[, end]) | Link
DuckDB | split_part(string, separator, index) | Link
Hive | Not supported |
MySQL | SUBSTRING_INDEX(str, delim, count) | Link
Oracle | Not supported |
PostgreSQL | split_part(string text, delimiter text, field int) → text | Link
Presto | split_part(string, delimiter, index) → varchar | Link
PRQL | Not supported |
Redshift | SPLIT_PART(string, delimiter, position) | Link
RisingWave | split_part(input_string, delimiter_string, int_n) → varchar | Link
Snowflake | SPLIT_PART(string, delimiter, partNumber) | Link
Spark | SPLIT_PART(src, delimiter, partNumber) | Link
SQLite | Not supported |
StarRocks | VARCHAR split_part(VARCHAR content, VARCHAR delimiter, INT field) | Link
Tableau | SPLIT(string, delimiter, token number) | Link
Teradata | [TD_SYSFNLIB.]STRTOK(instring, delimiter, tokennum) | Link
Trino | split_part(string, delimiter, index) → varchar | Link
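A quick way to sanity-check how each dialect currently round-trips the function with SQLGlot (output depends on the installed version):

import sqlglot

SQL = "SELECT SPLIT_PART('1.2.3', '.', 1)"
for dialect in ("postgres", "redshift", "duckdb", "snowflake", "presto", "spark", "tsql"):
    # Read and write in the same dialect to see whether the signature survives.
    print(dialect, sqlglot.transpile(SQL, read=dialect, write=dialect))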
