feat: add split_part, chr, and translate string functions #826
base: main
9bd7bfe to 0cb547e
Add three new string manipulation functions to the extensions:
- split_part: Split a string using a delimiter and return the nth substring (1-indexed). Returns empty string if field index exceeds available substrings.
- chr: Convert an integer codepoint to its corresponding Unicode character. Behavior undefined for invalid Unicode scalar values.
- translate: Character-by-character replacement similar to Unix tr command. Maps characters from 'from' string to corresponding positions in 'to' string, removing extra characters if 'to' is shorter.
name: split_part
description: >-
  Split a string using a delimiter and return the `field`-th substring (starting at 1). If `field`
  is larger than the number of substrings, an empty string is returned.
Trino returns null if the field is larger than the number of substrings, but in this regard it appears to be the odd one out as a number of other systems return the empty string:
- https://www.postgresql.org/docs/17/functions-string.html
- https://docs.databricks.com/gcp/en/sql/language-manual/functions/split_part
- https://docs.snowflake.com/en/sql-reference/functions/split_part
- https://datafusion.apache.org/user-guide/sql/scalar_functions.html#split-part
What might also be good here would be to add an option for this. Something like:
options:
  empty_result:
    values: [ EMPTY_STRING, NULL ]
I'm not super attached to the empty_result name. There's probably something better we could use.
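To make the two behaviours concrete, here is a minimal Python sketch of `split_part` with the proposed option. The `empty_result` parameter simply mirrors the option name suggested above, which is not final; the handling of an empty delimiter and of `field < 1` are assumptions of this sketch, not something the PR specifies.

```python
from typing import Optional


def split_part(text: str, delimiter: str, field: int,
               empty_result: str = "EMPTY_STRING") -> Optional[str]:
    """Return the field-th (1-indexed) substring of text split on delimiter."""
    if field < 1:
        # Assumption: this sketch only covers positive indexes.
        raise ValueError("field must be >= 1 in this sketch")
    if delimiter == "":
        # Assumption: an empty delimiter leaves the input as a single part.
        parts = [text]
    else:
        parts = text.split(delimiter)
    if field <= len(parts):
        return parts[field - 1]
    # Out of range: the (hypothetical) option picks between the two
    # behaviours seen across engines.
    return "" if empty_result == "EMPTY_STRING" else None


print(repr(split_part("a,b,c", ",", 2)))
print(repr(split_part("a,b,c", ",", 5)))
print(repr(split_part("a,b,c", ",", 5, empty_result="NULL")))
```

With `EMPTY_STRING` the out-of-range call yields an empty string, matching the description in this PR; with `NULL` it yields `None`, matching what is described for Trino.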
- value: "varchar<L2>"
  name: "delimiter"
- value: i32
  name: "field"
minor: position or index might be a clearer name for this.
name: chr
description: >-
  Return a single character whose codepoint is the specified integer. Behaviour is undefined if
  the `codepoint` does not correspond to a valid Unicode scalar value.
Instead of having undefined behaviour by default we could add an option like:
options:
  on_invalid_codepoint:
    values: [ ERROR, NULL ]
to make it configurable. If it's not set, it's still undefined but we can at least provide some way to pin the behaviour.
A portability concern that we might need to address as well is the range of allowed codepoints.
For example, SELECT chr(300); fails in Postgres with
Query Error: requested character too large for encoding: 300
but succeeds in DataFusion and Trino, returning
Ĭ
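As a rough illustration of how the option could pin the behaviour, here is a small Python sketch. The name `chr_scalar` is hypothetical, and `on_invalid_codepoint` mirrors the option proposed above. It treats any value outside the Unicode scalar range (0 to 0x10FFFF, excluding surrogates 0xD800 to 0xDFFF) as invalid; engine-specific limits such as the Postgres encoding restriction are not modelled.

```python
from typing import Optional


def chr_scalar(codepoint: int,
               on_invalid_codepoint: str = "ERROR") -> Optional[str]:
    """Return the character for codepoint, or apply the invalid-input policy."""
    # A Unicode scalar value is any code point except the surrogate range.
    is_scalar = (0 <= codepoint <= 0x10FFFF
                 and not 0xD800 <= codepoint <= 0xDFFF)
    if is_scalar:
        return chr(codepoint)
    if on_invalid_codepoint == "NULL":
        return None
    raise ValueError(f"invalid Unicode scalar value: {codepoint}")


print(chr_scalar(300))
print(chr_scalar(0xD800, on_invalid_codepoint="NULL"))
```

Leaving the option unset would still mean undefined behaviour in the spec; the default of `ERROR` here is only a choice made for the sketch.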
Add three new string scalar functions to the extensions:
- split_part: Split a string using a delimiter and return the nth substring (1-indexed). Returns empty string if field index exceeds available substrings.
- chr: Convert an integer codepoint to its corresponding Unicode character. Behavior undefined for invalid Unicode scalar values.
- translate: Character-by-character replacement similar to Unix tr command. Maps characters from 'from' string to corresponding positions in 'to' string, removing extra characters if 'to' is shorter.

These functions are common to several databases/engines (e.g., Postgres, Trino, DataFusion), so I think they warrant being part of the default catalog instead of relying on extensions.
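For reference, the `translate` semantics described above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the extension's definition; the parameter names `from_chars` and `to_chars` are placeholders, and letting the first occurrence win when a character is repeated in the 'from' string is an assumption.

```python
def translate(text: str, from_chars: str, to_chars: str) -> str:
    """Replace each character found in from_chars with its counterpart in to_chars."""
    mapping = {}
    for i, ch in enumerate(from_chars):
        if ch in mapping:
            # Assumption: the first occurrence of a repeated character wins.
            continue
        # Characters with no counterpart in to_chars are marked for removal.
        mapping[ch] = to_chars[i] if i < len(to_chars) else None

    out = []
    for ch in text:
        if ch not in mapping:
            out.append(ch)           # untouched character
        elif mapping[ch] is not None:
            out.append(mapping[ch])  # replaced character
        # else: dropped because 'to' is shorter than 'from'
    return "".join(out)


print(translate("12345", "143", "ax"))
```

Here `1` maps to `a`, `4` maps to `x`, and `3` has no counterpart in the 'to' string, so it is removed.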