feat: add split_part, chr, and translate string functions #826

bvolpato · 2025-07-15T00:58:14Z

Add three new string scalar functions to the extensions:

- split_part: Split a string using a delimiter and return the nth substring (1-indexed). Returns empty string if field index exceeds available substrings.
- chr: Convert an integer codepoint to its corresponding Unicode character. Behavior undefined for invalid Unicode scalar values.
- translate: Character-by-character replacement similar to Unix tr command. Maps characters from 'from' string to corresponding positions in 'to' string, removing extra characters if 'to' is shorter.
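
To make the edge cases concrete, here is a minimal Python sketch of the semantics described for split_part and translate. It is only an illustration, not the extension definition: the handling of an empty delimiter, of `field < 1`, and of repeated characters in `from` are assumptions that the description above does not specify.

```python
def split_part(text: str, delimiter: str, field: int) -> str:
    """Return the field-th (1-indexed) piece of text split on delimiter."""
    if field < 1:
        # Assumption: non-positive fields are rejected in this sketch.
        raise ValueError("field must be >= 1")
    # Assumption: an empty delimiter leaves the input as a single piece.
    parts = text.split(delimiter) if delimiter else [text]
    # Past the last piece, the description says to return an empty string.
    return parts[field - 1] if field <= len(parts) else ""


def translate(text: str, from_chars: str, to_chars: str) -> str:
    """Map each character in from_chars to the one at the same position in to_chars."""
    mapping = {}
    for i, ch in enumerate(from_chars):
        if ch in mapping:
            continue  # Assumption: the first occurrence in from_chars wins.
        # Characters with no counterpart in to_chars are marked for removal.
        mapping[ch] = to_chars[i] if i < len(to_chars) else None
    out = []
    for ch in text:
        if ch not in mapping:
            out.append(ch)           # unmapped characters pass through
        elif mapping[ch] is not None:
            out.append(mapping[ch])  # mapped characters are replaced
        # characters mapped to None are dropped
    return "".join(out)
```

For instance, `translate("12345", "143", "ax")` maps `1` to `a` and `4` to `x`, and removes `3` because `to` is shorter than `from`.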

These functions are common to several databases/engines (e.g., Postgres, Trino, DataFusion), so I think they warrant being part of the default catalog instead of relying on extensions.

vbarua · 2025-07-15T14:48:56Z

extensions/functions_string.yaml

+    name: split_part
+    description: >-
+      Split a string using a delimiter and return the `field`-th substring (starting at 1). If `field`
+      is larger than the number of substrings, an empty string is returned.


Trino returns null if the field is larger than the number of substrings, but in this regard it appears to be the odd one out as a number of other systems return the empty string:

https://www.postgresql.org/docs/17/functions-string.html

https://docs.databricks.com/gcp/en/sql/language-manual/functions/split_part

https://docs.snowflake.com/en/sql-reference/functions/split_part

https://datafusion.apache.org/user-guide/sql/scalar_functions.html#split-part

What might be good here too would be to set an option for this. Something like

    options:
      empty_result:
        values: [ EMPTY_STRING, NULL ]

I'm not super attached to the empty_result name. There's probably something better we could use.
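
As a rough illustration of what a consumer would do with such an option, here is a Python sketch. The `EmptyResult` enum and the `split_part_with_option` helper are hypothetical names that simply mirror the option values proposed above; they are not part of any existing API.

```python
from enum import Enum
from typing import Optional


class EmptyResult(Enum):
    """Hypothetical mirror of the proposed empty_result option values."""
    EMPTY_STRING = "EMPTY_STRING"
    NULL = "NULL"


def split_part_with_option(
    text: str,
    delimiter: str,
    field: int,
    empty_result: EmptyResult = EmptyResult.EMPTY_STRING,
) -> Optional[str]:
    """split_part whose out-of-range result is pinned by an option."""
    if field < 1:
        # Assumption: non-positive fields are rejected in this sketch.
        raise ValueError("field must be >= 1")
    # Assumption: an empty delimiter leaves the input as a single piece.
    parts = text.split(delimiter) if delimiter else [text]
    if field <= len(parts):
        return parts[field - 1]
    # Out of range: empty string (Postgres-style) or None (Trino-style NULL).
    return "" if empty_result is EmptyResult.EMPTY_STRING else None
```

With the default, `split_part_with_option("a,b,c", ",", 5)` gives the empty string; passing `EmptyResult.NULL` gives `None` instead.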

vbarua · 2025-07-15T15:43:21Z

extensions/functions_string.yaml

+          - value: "varchar<L2>"
+            name: "delimiter"
+          - value: i32
+            name: "field"


minor: position or index might be a clearer name for this.

vbarua · 2025-07-15T20:41:00Z

extensions/functions_string.yaml

+    name: chr
+    description: >-
+      Return a single character whose codepoint is the specified integer. Behaviour is undefined if
+      the `codepoint` does not correspond to a valid Unicode scalar value.


Instead of having undefined behaviour by default we could add an option like:

    options:
      on_invalid_codepoint:
        values: [ ERROR, NULL ]

to make it configurable. If it's not set, it's still undefined but we can at least provide some way to pin the behaviour.

A portability concern that we might need to address as well is the range of allowed codepoints.

For example, SELECT chr(300); fails in Postgres with

Query Error: requested character too large for encoding: 300

but succeeds in DataFusion and Trino, returning

Ĭ
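
To sketch how the suggested option could pin the behaviour, here is a small Python example. `OnInvalidCodepoint` and `chr_checked` are hypothetical names mirroring the option proposed above. The sketch assumes "valid" means a Unicode scalar value (0 to 0x10FFFF, excluding the surrogate range 0xD800 to 0xDFFF) and does not model engine- or encoding-specific range limits such as the Postgres error shown above.

```python
from enum import Enum
from typing import Optional


class OnInvalidCodepoint(Enum):
    """Hypothetical mirror of the proposed on_invalid_codepoint option values."""
    ERROR = "ERROR"
    NULL = "NULL"


def is_unicode_scalar(codepoint: int) -> bool:
    """True for code points in 0..0x10FFFF that are not surrogates."""
    return 0 <= codepoint <= 0x10FFFF and not (0xD800 <= codepoint <= 0xDFFF)


def chr_checked(
    codepoint: int,
    on_invalid: OnInvalidCodepoint = OnInvalidCodepoint.ERROR,
) -> Optional[str]:
    """chr whose behaviour on invalid input is pinned by an option."""
    if is_unicode_scalar(codepoint):
        return chr(codepoint)
    if on_invalid is OnInvalidCodepoint.NULL:
        return None  # stands in for SQL NULL
    raise ValueError(f"invalid Unicode scalar value: {codepoint}")
```

Here `chr_checked(300)` returns `Ĭ` (U+012C), while `chr_checked(0xD800)` raises by default and returns `None` when `OnInvalidCodepoint.NULL` is passed.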

bvolpato requested review from EpsilonPrime, cpcloud, jacques-n, vbarua and westonpace as code owners July 15, 2025 00:58

bvolpato force-pushed the str-funcs branch 2 times, most recently from 9bd7bfe to 0cb547e Compare July 15, 2025 01:21

bvolpato force-pushed the str-funcs branch from 0b3cdf5 to 3bb2724 Compare July 15, 2025 01:56

vbarua reviewed Jul 15, 2025
