feat: add split_part, chr, and translate string functions #826
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1558,6 +1558,60 @@ scalar_functions: | |
| dotall: | ||
| values: [ DOTALL_DISABLED, DOTALL_ENABLED ] | ||
| return: "List<string>" | ||
| - | ||
| name: split_part | ||
| description: >- | ||
| Split a string using a delimiter and return the `field`-th substring (starting at 1). If `field` | ||
| is larger than the number of substrings, an empty string is returned. | ||
| impls: | ||
| - args: | ||
| - value: "varchar<L1>" | ||
| name: "input" | ||
| - value: "varchar<L2>" | ||
| name: "delimiter" | ||
| - value: i32 | ||
| name: "field" | ||
|
minor: |
||
| return: "varchar<L1>" | ||
| - args: | ||
| - value: "string" | ||
| name: "input" | ||
| - value: "string" | ||
| name: "delimiter" | ||
| - value: i32 | ||
| name: "field" | ||
| return: "string" | ||
| - | ||
| name: chr | ||
| description: >- | ||
| Return a single character whose codepoint is the specified integer. Behaviour is undefined if | ||
| the `codepoint` does not correspond to a valid Unicode scalar value. | ||
|
Instead of having undefined behaviour by default we could add an option like: to make it configurable. If it's not set, it's still undefined but we can at least provide some way to pin the behaviour. A portability concern that we might need to address as well is the range of allowed codepoints. For example but succeeds in DataFusion and Trino |
||
| impls: | ||
| - args: | ||
| - value: i64 | ||
| name: "codepoint" | ||
| return: "string" | ||
| - | ||
| name: translate | ||
| description: >- | ||
| Replace each occurrence of characters from `from` with the corresponding character in `to`. | ||
| If `to` is shorter than `from`, extra characters are removed from the result. Similar to the Unix `tr` command. | ||
| impls: | ||
| - args: | ||
| - value: "varchar<L1>" | ||
| name: "input" | ||
| - value: "varchar<L2>" | ||
| name: "from" | ||
| - value: "varchar<L3>" | ||
| name: "to" | ||
| return: "varchar<L1>" | ||
| - args: | ||
| - value: "string" | ||
| name: "input" | ||
| - value: "string" | ||
| name: "from" | ||
| - value: "string" | ||
| name: "to" | ||
| return: "string" | ||
|
|
||
| aggregate_functions: | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,16 +1,16 @@ | ||
| { | ||
| "registry": { | ||
| "dependency_count": 13, | ||
| "extension_count": 13, | ||
| "function_count": 165, | ||
| "num_aggregate_functions": 29, | ||
| "num_scalar_functions": 158, | ||
| "num_window_functions": 11, | ||
| "num_function_overloads": 517 | ||
| }, | ||
| "coverage": { | ||
| "total_test_count": 1086, | ||
| "num_function_variants": 517, | ||
| "num_covered_function_variants": 229 | ||
| } | ||
| "registry": { | ||
| "dependency_count": 13, | ||
| "extension_count": 13, | ||
| "function_count": 165, | ||
| "num_aggregate_functions": 29, | ||
| "num_scalar_functions": 158, | ||
| "num_window_functions": 11, | ||
| "num_function_overloads": 517 | ||
| }, | ||
| "coverage": { | ||
| "total_test_count": 1164, | ||
| "num_function_variants": 532, | ||
| "num_covered_function_variants": 242 | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| ### SUBSTRAIT_SCALAR_TEST: v1.0 | ||
| ### SUBSTRAIT_INCLUDE: '/extensions/functions_string.yaml' | ||
|
|
||
| # basic: Basic examples without any special cases | ||
| chr(65::i64) = 'A'::str | ||
| chr(97::i64) = 'a'::str | ||
| chr(48::i64) = '0'::str | ||
| chr(8364::i64) = '€'::str | ||
| chr(128512::i64) = '😀'::str | ||
|
|
||
| # null_input: Examples with null as input | ||
| chr(null::i64) = null::str |
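
To make the `chr` semantics concrete, here is a minimal Python sketch that mirrors the test cases above. The function name `chr_codepoint` is hypothetical, and raising an error for invalid codepoints is only an assumption: the YAML description leaves that case undefined, and engines may differ on which codepoints they accept.

```python
from typing import Optional


def chr_codepoint(codepoint: Optional[int]) -> Optional[str]:
    """Return the single character for `codepoint`, or None for a null input."""
    if codepoint is None:
        # Mirrors the null_input test group: null in, null out.
        return None
    in_range = 0 <= codepoint <= 0x10FFFF
    is_surrogate = 0xD800 <= codepoint <= 0xDFFF
    if not in_range or is_surrogate:
        # Assumption: the spec says behaviour is undefined here; this sketch
        # chooses to raise rather than return a value.
        raise ValueError(f"{codepoint} is not a valid Unicode scalar value")
    return chr(codepoint)


print(chr_codepoint(65))      # A
print(chr_codepoint(8364))    # €
print(chr_codepoint(128512))  # 😀
```

The explicit surrogate check is needed because Python's built-in `chr` accepts surrogate codepoints, which are not Unicode scalar values.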
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| ### SUBSTRAIT_SCALAR_TEST: v1.0 | ||
| ### SUBSTRAIT_INCLUDE: '/extensions/functions_string.yaml' | ||
|
|
||
| # basic: Basic examples, no special cases | ||
| split_part('abc,def,ghi'::str, ','::str, 1::i32) = 'abc'::str | ||
| split_part('abc,def,ghi'::str, ','::str, 2::i32) = 'def'::str | ||
| split_part('abc,def,ghi'::str, ','::str, 3::i32) = 'ghi'::str | ||
| split_part('abc,def,ghi'::str, ','::str, 4::i32) = ''::str | ||
| split_part('a|b|c|d'::str, '|'::str, 1::i32) = 'a'::str | ||
| split_part('a|b|c|d'::str, '|'::str, 2::i32) = 'b'::str | ||
| split_part('a|b|c|d'::str, '|'::str, 3::i32) = 'c'::str | ||
| split_part('a|b|c|d'::str, '|'::str, 4::i32) = 'd'::str | ||
| split_part('a|b|c|d'::str, '|'::str, 5::i32) = ''::str | ||
| split_part('hello world test'::str, ' '::str, 1::i32) = 'hello'::str | ||
| split_part('hello world test'::str, ' '::str, 2::i32) = 'world'::str | ||
| split_part('hello world test'::str, ' '::str, 3::i32) = 'test'::str | ||
|
|
||
| # basic_delimiters: Basic examples with multi-character delimiters | ||
| split_part('abc~@~def~@~ghi'::str, '~@~'::str, 1::i32) = 'abc'::str | ||
| split_part('abc~@~def~@~ghi'::str, '~@~'::str, 2::i32) = 'def'::str | ||
| split_part('abc~@~def~@~ghi'::str, '~@~'::str, 3::i32) = 'ghi'::str | ||
| split_part('abc~@~def~@~ghi'::str, '~@~'::str, 4::i32) = ''::str | ||
|
|
||
| # missing_delimiter: Examples where the delimiter is not present | ||
| split_part('abc'::str, ','::str, 1::i32) = 'abc'::str | ||
| split_part('abc'::str, ','::str, 2::i32) = ''::str | ||
|
|
||
| # null_input: Examples with null as input | ||
| split_part(null::str, ','::str, 1::i32) = null::str | ||
| split_part('abc,def'::str, null::str, 1::i32) = null::str | ||
| split_part('abc,def'::str, ','::str, null::i32) = null::str |
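
The `split_part` behaviour exercised by these tests can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reference implementation: the handling of `field < 1` and of an empty delimiter is not defined in the YAML description, so the sketch simply rejects both.

```python
from typing import Optional


def split_part(input_str: Optional[str],
               delimiter: Optional[str],
               field: Optional[int]) -> Optional[str]:
    """Return the `field`-th (1-based) piece of `input_str` split on `delimiter`."""
    if input_str is None or delimiter is None or field is None:
        # Mirrors the null_input test group: any null argument yields null.
        return None
    if field < 1 or delimiter == "":
        # Assumption: these cases are not covered by the description or tests.
        raise ValueError("field must be >= 1 and delimiter must be non-empty")
    parts = input_str.split(delimiter)
    # Out-of-range fields return the empty string, as the description states.
    return parts[field - 1] if field <= len(parts) else ""


print(split_part("abc,def,ghi", ",", 2))          # def
print(split_part("abc~@~def~@~ghi", "~@~", 3))    # ghi
print(repr(split_part("abc", ",", 2)))            # ''
```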
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| ### SUBSTRAIT_SCALAR_TEST: v1.0 | ||
| ### SUBSTRAIT_INCLUDE: '/extensions/functions_string.yaml' | ||
|
|
||
| # basic: Basic examples without any special cases | ||
| translate('banana'::str, 'an'::str, 'oy'::str) = 'boyoyo'::str | ||
| translate('Hello World!'::str, ' !'::str, 'x'::str) = 'HelloxWorld'::str | ||
|
|
||
| # removal: Examples where the replacement string is shorter than the source string, resulting in removal | ||
| translate('hello'::str, 'aeiou'::str, ''::str) = 'hll'::str | ||
| translate('aabbcc'::str, 'abc'::str, 'a'::str) = 'aa'::str | ||
|
|
||
| # null_input: Examples with null as input | ||
| translate(null::str, 'a'::str, 'b'::str) = null::str | ||
| translate('hello'::str, null::str, 'b'::str) = null::str | ||
| translate('hello'::str, 'l'::str, null::str) = null::str | ||
|
|
||
| # unicode: Examples with unicode characters | ||
| translate('àéà'::str, 'à'::str, 'a'::str) = 'aéa'::str |
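
As a cross-check of the `translate` description (characters in `from` without a counterpart in `to` are removed), the following Python sketch implements that rule directly. It assumes that characters are compared per Unicode code point and that, if a character appears more than once in `from`, the first occurrence wins; neither point is stated in the YAML description.

```python
from typing import Dict, Optional


def translate(input_str: Optional[str],
              from_chars: Optional[str],
              to_chars: Optional[str]) -> Optional[str]:
    """Map each character of `from_chars` to its counterpart in `to_chars`."""
    if input_str is None or from_chars is None or to_chars is None:
        # Mirrors the null_input test group: any null argument yields null.
        return None
    mapping: Dict[str, str] = {}
    for index, char in enumerate(from_chars):
        if char in mapping:
            continue  # Assumption: the first occurrence in `from` wins.
        # Characters beyond the end of `to_chars` map to "", i.e. are removed.
        mapping[char] = to_chars[index] if index < len(to_chars) else ""
    # Characters not listed in `from_chars` pass through unchanged.
    return "".join(mapping.get(char, char) for char in input_str)


print(translate("banana", "an", "oy"))        # boyoyo
print(translate("Hello World!", " !", "x"))   # HelloxWorld
print(translate("hello", "aeiou", ""))        # hll
```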
Trino returns null if the field is larger than the number of substrings, but in this regard it appears to be the odd one out as a number of other systems return the empty string:
What might be good here too would be to set an option for this. Something like
I'm not super attached to the `empty_result` name. There's probably something better we could use.
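
To picture how such an option might behave, here is a rough Python sketch. Only the option name `empty_result` comes from the comment above; the enum values `EMPTY_STRING` and `NULL`, the default, and the function name `split_part_with_option` are all hypothetical.

```python
from enum import Enum
from typing import Optional


class EmptyResult(Enum):
    # Hypothetical option values; the review comment does not list them.
    EMPTY_STRING = "EMPTY_STRING"  # behaviour currently described in the YAML
    NULL = "NULL"                  # behaviour attributed to Trino above


def split_part_with_option(
    input_str: Optional[str],
    delimiter: Optional[str],
    field: Optional[int],
    empty_result: EmptyResult = EmptyResult.EMPTY_STRING,
) -> Optional[str]:
    """split_part whose out-of-range result is pinned by `empty_result`."""
    if input_str is None or delimiter is None or field is None:
        return None
    if field < 1 or delimiter == "":
        # Assumption: these cases are outside the scope of the proposed option.
        raise ValueError("field must be >= 1 and delimiter must be non-empty")
    parts = input_str.split(delimiter)
    if field <= len(parts):
        return parts[field - 1]
    # Out of range: the option decides between "" and null.
    return "" if empty_result is EmptyResult.EMPTY_STRING else None


print(repr(split_part_with_option("abc,def", ",", 3)))                    # ''
print(repr(split_part_with_option("abc,def", ",", 3, EmptyResult.NULL)))  # None
```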