@bvolpato
Member

Add three new string scalar functions to the extensions:

  • split_part: Split a string using a delimiter and return the nth substring (1-indexed). Returns empty string if field index exceeds available substrings.
  • chr: Convert an integer codepoint to its corresponding Unicode character. Behavior undefined for invalid Unicode scalar values.
  • translate: Character-by-character replacement similar to Unix tr command. Maps characters from 'from' string to corresponding positions in 'to' string, removing extra characters if 'to' is shorter.

These functions are common to several databases/engines (e.g., Postgres, Trino, DataFusion), so I think they warrant being part of the default catalog instead of relying on extensions.
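
To make the translate semantics concrete, here is a minimal Python sketch of the behaviour described above. It is an illustration only, not the Substrait reference implementation. The parameter names `from_chars` and `to_chars` are hypothetical, and letting the first occurrence of a duplicated character in 'from' win is an assumption of this sketch that the description does not specify.

```python
def translate(s: str, from_chars: str, to_chars: str) -> str:
    """Replace each character of s found in from_chars with the character
    at the same position in to_chars; delete it if to_chars is too short."""
    mapping = {}
    for i, ch in enumerate(from_chars):
        # Assumption: if a character repeats in from_chars, keep the first mapping.
        # None marks "no counterpart in to_chars", i.e. the character is removed.
        mapping.setdefault(ch, to_chars[i] if i < len(to_chars) else None)

    out = []
    for ch in s:
        if ch not in mapping:
            out.append(ch)           # untouched character
        elif mapping[ch] is not None:
            out.append(mapping[ch])  # replaced character
        # else: mapped to None, so the character is dropped
    return "".join(out)


print(translate("12345", "143", "ax"))  # 1 -> a, 4 -> x, 3 removed
```

With the inputs above, '1' and '4' are replaced and '3' is removed because 'to' has no third character.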

@bvolpato bvolpato force-pushed the str-funcs branch 2 times, most recently from 9bd7bfe to 0cb547e Compare July 15, 2025 01:21
name: split_part
description: >-
  Split a string using a delimiter and return the `field`-th substring (starting at 1). If `field`
  is larger than the number of substrings, an empty string is returned.
Member

Trino returns null if the field is larger than the number of substrings, but in this regard it appears to be the odd one out, as a number of other systems return the empty string.

What might be good here too would be to set an option for this. Something like:

        options:
          empty_result:
            values: [ EMPTY_STRING, NULL ]

I'm not super attached to the empty_result name. There's probably something better we could use.
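
As a rough illustration of how such an option could behave, here is a small Python sketch. It is not taken from the PR: the `empty_result` keyword mirrors the option name suggested above, rejecting `field < 1` is an assumption, and treating an empty delimiter as "no split" is also an assumption, since the description does not cover either case.

```python
from typing import Optional


def split_part(s: str, delimiter: str, field: int,
               empty_result: str = "EMPTY_STRING") -> Optional[str]:
    """Return the field-th (1-indexed) substring of s split on delimiter."""
    if empty_result not in ("EMPTY_STRING", "NULL"):
        raise ValueError(f"unknown empty_result option: {empty_result}")
    if field < 1:
        # Assumption: non-positive indexes are rejected in this sketch.
        raise ValueError("field must be >= 1")

    # Assumption: an empty delimiter leaves the input as a single substring.
    parts = [s] if delimiter == "" else s.split(delimiter)

    if field > len(parts):
        # The option pins what happens when the index runs past the end.
        return "" if empty_result == "EMPTY_STRING" else None
    return parts[field - 1]


print(split_part("a,b,c", ",", 2))                # in range
print(repr(split_part("a,b,c", ",", 5)))          # out of range, empty string
print(repr(split_part("a,b,c", ",", 5, "NULL")))  # out of range, null
```

The only difference between the two option values is the out-of-range branch, which keeps the in-range behaviour identical across engines.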

- value: "varchar<L2>"
  name: "delimiter"
- value: i32
  name: "field"
Member

minor: position or index might be a clearer name for this.

name: chr
description: >-
  Return a single character whose codepoint is the specified integer. Behaviour is undefined if
  the `codepoint` does not correspond to a valid Unicode scalar value.
Member

Instead of having undefined behaviour by default we could add an option like:

        options:
          on_invalid_codepoint:
            values: [ ERROR, NULL ]

to make it configurable. If it's not set, it's still undefined but we can at least provide some way to pin the behaviour.

A portability concern that we might need to address as well is the range of allowed codepoints.

For example, SELECT chr(300); fails in Postgres with

Query Error: requested character too large for encoding: 300

but succeeds in DataFusion and Trino, returning

Ĭ
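
A minimal Python sketch of what pinning the behaviour could look like, assuming the `on_invalid_codepoint` option proposed above. The function name `chr_fn`, the helper `is_unicode_scalar`, and the choice of ERROR as the default are all hypothetical. The sketch only checks for Unicode scalar values; it does not model the encoding-dependent range limit seen in Postgres.

```python
from typing import Optional


def is_unicode_scalar(cp: int) -> bool:
    """True for U+0000..U+10FFFF, excluding the surrogate range U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def chr_fn(codepoint: int, on_invalid_codepoint: str = "ERROR") -> Optional[str]:
    """Return the character for codepoint, with configurable handling of invalid input."""
    if on_invalid_codepoint not in ("ERROR", "NULL"):
        raise ValueError(f"unknown option value: {on_invalid_codepoint}")
    if not is_unicode_scalar(codepoint):
        if on_invalid_codepoint == "NULL":
            return None
        raise ValueError(f"invalid Unicode scalar value: {codepoint}")
    return chr(codepoint)


print(chr_fn(300))                   # U+012C
print(chr_fn(0xD800, "NULL"))        # surrogate, mapped to null
```

Codepoint 300 is a valid scalar value, so both option settings return the same character; the option only matters for surrogates and out-of-range integers.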
