[FEA] Accelerate conversion from arrow::StringViewType to arrow::StringType in libcudf interop #15298

Open
@GregoryKimball

Description

Is your feature request related to a problem? Please describe.
The Arrow 15 specification includes a definition of "arrow::StringViewType" - an alternate representation of the "arrow::StringType". You may find "String view" also referred to as Umbra string or prefix string.

A string view column consists of two parts:

  1. A column of 16-byte fixed-width elements. The first 4 bytes contain the string size
  • If size <= 12, then the string is stored inline in the remaining 12 bytes (short string optimization)
  • If size > 12, then the string is stored separately in the second part. In the Arrow format, the remaining 12 bytes are 4 bytes for the first 4 chars of the string + 4 bytes for a data buffer index + 4 bytes for an offset into that buffer (pointer-based variants such as Umbra store an 8-byte pointer instead of the index and offset)
  2. One or more buffers of characters storing the long (out-of-line) strings
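
To make the layout concrete, here is a minimal host-side C++ sketch of a 16-byte view element. It assumes the Arrow columnar format's out-of-line encoding (4-byte prefix, 4-byte buffer index, 4-byte offset) rather than a raw pointer. The names `view16`, `make_view`, and `read_view` are hypothetical and are not part of libcudf or Arrow C++.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical 16-byte view element (illustration only).
struct view16 {
  std::int32_t size;  // string length in bytes
  char payload[12];   // inline: the string bytes
                      // out-of-line: prefix[4], buffer index (int32), offset (int32)
};
static_assert(sizeof(view16) == 16, "a view must be exactly 16 bytes");

constexpr std::int32_t inline_limit = 12;

// Build a view for `s`; long strings are appended to buffers[buffer_index].
view16 make_view(std::string_view s, std::int32_t buffer_index,
                 std::vector<std::vector<char>>& buffers)
{
  view16 v{};  // zero-initialize so unused payload bytes are 0
  v.size = static_cast<std::int32_t>(s.size());
  if (v.size <= inline_limit) {
    if (v.size > 0) { std::memcpy(v.payload, s.data(), s.size()); }
  } else {
    assert(buffer_index >= 0 &&
           static_cast<std::size_t>(buffer_index) < buffers.size());
    auto& data        = buffers[buffer_index];
    auto const offset = static_cast<std::int32_t>(data.size());
    data.insert(data.end(), s.begin(), s.end());      // the full string is stored
    std::memcpy(v.payload, s.data(), 4);              // 4-byte prefix
    std::memcpy(v.payload + 4, &buffer_index, 4);
    std::memcpy(v.payload + 8, &offset, 4);
  }
  return v;
}

// Materialize the string a view refers to.
std::string read_view(view16 const& v, std::vector<std::vector<char>> const& buffers)
{
  if (v.size <= inline_limit) { return std::string(v.payload, v.size); }
  std::int32_t buffer_index = 0;
  std::int32_t offset       = 0;
  std::memcpy(&buffer_index, v.payload + 4, 4);
  std::memcpy(&offset, v.payload + 8, 4);
  return std::string(buffers[buffer_index].data() + offset, v.size);
}
```

Note that a string of exactly 12 bytes still fits inline, so only strings of 13 bytes or more touch the character buffers.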

String view type enables some performance optimizations:

  • ability to slice strings (e.g. left(10)) in place without a copy
  • ability to replace with smaller strings (e.g. replace("aa", "a")) in place without a copy
  • inlined strings can be written in any order and without knowing the column size
  • better memory access patterns for the first 4 bytes (e.g. startswith("a"))
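
As an illustration of the last point, the sketch below checks a short prefix (at most 4 bytes) using only the 16-byte view element, so the character buffers are never read. It assumes the first 4 string bytes sit directly after the size field for both inline and out-of-line strings, as in the Arrow format. `string_view_element` and `starts_with_short` are hypothetical names, not libcudf APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical 16-byte view element (illustration only).
struct string_view_element {
  std::int32_t size;  // string length in bytes
  char payload[12];   // bytes 0..3 hold the first chars of the string in both encodings
};

// Returns true if the viewed string starts with `prefix` (prefix.size() <= 4).
// Only the view element is touched; no character buffer is dereferenced.
bool starts_with_short(string_view_element const& v, std::string_view prefix)
{
  assert(prefix.size() <= 4);
  if (prefix.empty()) { return true; }
  if (static_cast<std::size_t>(v.size) < prefix.size()) { return false; }
  return std::memcmp(v.payload, prefix.data(), prefix.size()) == 0;
}
```

Longer prefixes would still need a fallback that reads the out-of-line data, but a mismatch in the first 4 bytes can reject most rows early.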

Describe the solution you'd like
Let's add interop support for string view in from_arrow, with CUDA C++ code that accepts string views and converts them to libcudf strings columns. We may also want to add string view compatibility to to_arrow, so we can hand off libcudf strings columns to host libraries that expect string views. We should be able to write CUDA C++ code to efficiently transform arrow::StringViewType buffers into arrow::StringType buffers.
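
One possible shape for that transform is a two-pass algorithm: read the sizes and scan them into offsets, then gather the bytes into one contiguous chars buffer. The sketch below is a host-side C++ illustration, not the proposed libcudf implementation. It assumes the Arrow out-of-line encoding (prefix, buffer index, offset) and 32-bit offsets, and `string_view_element`, `strings_buffers`, and `views_to_strings` are hypothetical names. On the device, the scan would presumably map to a parallel prefix sum and the copy loop to one thread (or more) per string.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <vector>

// Hypothetical 16-byte view element (illustration only).
struct string_view_element {
  std::int32_t size;  // string length in bytes
  char payload[12];   // inline bytes, or prefix[4] + buffer index + offset
};

// StringType-style output: n + 1 offsets plus one contiguous chars buffer.
struct strings_buffers {
  std::vector<std::int32_t> offsets;
  std::vector<char> chars;
};

strings_buffers views_to_strings(std::vector<string_view_element> const& views,
                                 std::vector<std::vector<char>> const& data_buffers)
{
  strings_buffers out;
  out.offsets.assign(views.size() + 1, 0);

  // Pass 1: write each size at offsets[i + 1], then prefix-sum in place.
  for (std::size_t i = 0; i < views.size(); ++i) { out.offsets[i + 1] = views[i].size; }
  std::partial_sum(out.offsets.begin(), out.offsets.end(), out.offsets.begin());
  out.chars.resize(static_cast<std::size_t>(out.offsets.back()));

  // Pass 2: copy each string to its slot; rows are independent of one another.
  for (std::size_t i = 0; i < views.size(); ++i) {
    auto const& v = views[i];
    if (v.size == 0) { continue; }
    char const* src = v.payload;  // inline case
    if (v.size > 12) {            // out-of-line case
      std::int32_t buffer_index = 0;
      std::int32_t offset       = 0;
      std::memcpy(&buffer_index, v.payload + 4, 4);
      std::memcpy(&offset, v.payload + 8, 4);
      assert(buffer_index >= 0 &&
             static_cast<std::size_t>(buffer_index) < data_buffers.size());
      src = data_buffers[buffer_index].data() + offset;
    }
    std::memcpy(out.chars.data() + out.offsets[i], src, static_cast<std::size_t>(v.size));
  }
  return out;
}
```

This sketch ignores null masks and would overflow 32-bit offsets for very large columns, both of which a real conversion would need to handle.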

Describe alternatives you've considered
Force libcudf users to convert their string views into strings on the host before passing the data to the device.

Additional context
Velox supports a string view type (ref1, ref2), Polars has switched to a string view representation, and DuckDB supports string view.

We may choose to investigate using string views in libcudf at some point, but for the foreseeable future string view refactoring will be lower priority than supporting large strings and improving performance with long strings.

Labels: Spark (Functionality that helps Spark RAPIDS), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.)
