Try just `SEXP`, not `struct str_info` #2119

DavisVaughan · 2025-12-10T21:13:36Z

It has a high memory cost (24 extra bytes per string over just SEXP) and it isn't clear that it gains us that much unless there is a high number of passes, i.e. in the case of a long common prefix, which is somewhat rare

It has a high memory cost (24 extra bytes per string over just `SEXP`) and it isn't clear that it gains us that much unless there is a high number of passes, i.e. in the case of a long common prefix, which is somewhat rare

DavisVaughan · 2025-12-12T21:17:27Z

As shown by some of the "weird scenarios", not caching the pointers/sizes and calling CHAR() or Rf_length() every time was not a viable option (and it turned out that we actually don't need to cache the sizes at all)

# "weird" scenarios

# 100 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
#
# This is where you really feel the pain of not having cached sizes and pointers
# and is probably enough to make me want them again
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 20

    n_prefixes <- 10
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  61.2ms  61.6ms     16.2      124MB    64.9 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 268.2ms 276.4ms      3.55     544MB     4.08
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 284.2ms 289.2ms      3.39     467MB     3.90
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 266.2ms 272.7ms      3.55     334MB     4.96
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 260.4ms 266.9ms      3.61     315MB     5.05
#> 6 r-lib/vctrs@feature/n… vctrs:::v… 220.6ms 227.4ms      4.37     238MB     4.59
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 726.2ms 730.2ms      1.37     238MB     9.13

# - 100 unique strings each with a suffix 10-100 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 100

    n_prefixes <- 10
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  61.5ms  62.3ms     16.1      124MB    64.2 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 255.8ms 259.6ms      3.79     544MB     4.35
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 266.7ms 269.8ms      3.62     467MB     4.16
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 246.8ms   252ms      3.82     334MB     5.34
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 244.3ms 253.6ms      3.85     315MB     5.38
#> 6 r-lib/vctrs@feature/n… vctrs:::v… 203.5ms 208.3ms      4.74     238MB     4.98
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 714.1ms 718.3ms      1.39     238MB     9.28

# 1,000,000 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# single common prefix
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 1000000
    min_length <- 10
    max_length <- 20

    n_prefixes <- 1
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                  expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs          vctrs:::v… 481.69ms 484.85ms     1.97      277MB    2.17 
#> 2 r-lib/vctrs@feature… vctrs:::v…    1.67s    1.68s     0.592     544MB    0.621
#> 3 r-lib/vctrs@feature… vctrs:::v…    1.72s    1.75s     0.558     467MB    0.754
#> 4 r-lib/vctrs@feature… vctrs:::v…    1.68s    1.69s     0.592     334MB    6.80 
#> 5 r-lib/vctrs@feature… vctrs:::v…    1.74s    1.75s     0.572     315MB    6.58 
#> 6 r-lib/vctrs@feature… vctrs:::v…    1.26s    1.28s     0.777     238MB    0.816
#> 7 r-lib/vctrs@feature… vctrs:::v…    3.66s    4.33s     0.245     238MB    0.257

Remove str_info

d2f511f

It has a high memory cost (24 extra bytes per string over just `SEXP`) and it isn't clear that it gains us that much unless there is a high number of passes, i.e. in the case of a long common prefix, which is somewhat rare

DavisVaughan mentioned this pull request Dec 12, 2025

Remove TRUELENGTH() backed character radix sort #2118

Merged

DavisVaughan closed this Dec 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Try just `SEXP`, not `struct str_info` #2119

Try just `SEXP`, not `struct str_info` #2119

Uh oh!

DavisVaughan commented Dec 10, 2025

Uh oh!

DavisVaughan commented Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Try just SEXP, not struct str_info #2119

Try just SEXP, not struct str_info #2119

Uh oh!

Conversation

DavisVaughan commented Dec 10, 2025

Uh oh!

DavisVaughan commented Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Try just `SEXP`, not `struct str_info` #2119

Try just `SEXP`, not `struct str_info` #2119