DavisVaughan (Member) commented Dec 10, 2025

This PR removes vec_order_radix()'s reliance on TRUELENGTH() and SET_TRUELENGTH() on CHARSXPs, which is non-API and has always been a bit hacky.

This PR is the culmination of iterative improvements made in #2120, #2121, #2122, and #2123.

PR #2119 was an additional failed experiment.

Reverse dependency checks (including dplyr revdeps) look good ("e77dc1e0-48c1-40bb-b416-6b54f080af8f").


The way this previously worked was:

  • Get x, a character vector
  • Use TRUELENGTH() tricks to isolate x_unique, the unique elements of x
  • Radix order x_unique, which is typically (but not always) a smaller set of strings (consider data frame columns of strings)
  • Use TRUELENGTH() tricks to create an x_integer_proxy of x with values from the order of x_unique
  • Radix order x_integer_proxy, which generally uses the very fast counting sort method for integers
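For reference, the "very fast counting sort method for integers" in that final step works roughly like this. This is a generic standalone sketch in C assuming values in `[0, range)`, not the vctrs code:

```c
#include <assert.h>
#include <stdlib.h>

/* Stable counting sort returning an ordering: one pass to histogram the
   values, a prefix sum to turn counts into bucket offsets, and one pass
   to place each index. O(n + range) time, no comparisons. */
static void int_order_counting(const int* x, int n, int range, int* out) {
  int* counts = calloc((size_t)range + 1, sizeof(int));
  for (int i = 0; i < n; ++i) {
    counts[x[i] + 1]++; /* histogram, shifted by one for the prefix sum */
  }
  for (int v = 0; v < range; ++v) {
    counts[v + 1] += counts[v]; /* counts[v] is now the start of bucket v */
  }
  for (int i = 0; i < n; ++i) {
    out[counts[x[i]]++] = i; /* stable: equal values keep original order */
  }
  free(counts);
}
```

This is why the integer proxy step was so cheap: once strings are mapped to small integers, ordering them is linear.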

There were really two possible paths forward:

  • Try and replace the "free" hash map we got from base R's global string pool with our own. This would roughly keep the original algorithm, but we'd replace the TRUELENGTH() steps with our own hash map.
  • Totally give up on the hash map based approach, since it is no longer "free". Instead, just do a character radix order of x directly.

I have opted for approach 2, so the algorithm is now:

  • Get x, a character vector
  • Radix order x

It was reasonably straightforward to switch to "just" radix ordering x since we already had a string radix ordering algorithm for x_unique, but it did require adding NA handling to chr_order_radix(), because previously that was handled by the integer ordering algorithm.
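For illustration, here is a minimal standalone sketch of MSD (most significant digit) string radix ordering with up-front `NA` handling, in the spirit of the new algorithm but not the actual vctrs implementation. `NULL` stands in for `NA_STRING`, and the real code has many more concerns (direction, `na_last`, encoding, small-chunk insertion sort):

```c
#include <assert.h>
#include <string.h>

/* Recursively order the indices in `idx` by byte `pass` of each string.
   The terminator byte (0) buckets shorter strings first. `aux` is a
   scratch buffer shared across recursion levels. */
static void chr_msd_radix(const char** x, int* idx, int* aux, int n, int pass) {
  if (n <= 1) return;
  int counts[257] = {0};
  for (int i = 0; i < n; ++i) {
    counts[(unsigned char)x[idx[i]][pass] + 1]++;
  }
  int ends[256];
  for (int b = 0; b < 256; ++b) {
    counts[b + 1] += counts[b]; /* counts[b] is now the start of bucket b */
    ends[b] = counts[b + 1];
  }
  for (int i = 0; i < n; ++i) {
    aux[counts[(unsigned char)x[idx[i]][pass]]++] = idx[i];
  }
  memcpy(idx, aux, (size_t)n * sizeof(int));
  /* Recurse into each non-terminator bucket on the next byte; strings
     that ended at this pass (bucket 0) are already in final position. */
  int start = ends[0];
  for (int b = 1; b < 256; ++b) {
    chr_msd_radix(x, idx + start, aux, ends[b] - start, pass + 1);
    start = ends[b];
  }
}

/* Order `x`, treating NULL as NA. NAs are partitioned to the front
   before the radix pass, mirroring the "handle NAs up front" idea. */
static void chr_order(const char** x, int* idx, int* aux, int n) {
  int n_na = 0;
  for (int i = 0; i < n; ++i) if (x[i] == NULL) idx[n_na++] = i;
  int j = n_na;
  for (int i = 0; i < n; ++i) if (x[i] != NULL) idx[j++] = i;
  chr_msd_radix(x, idx + n_na, aux, n - n_na, 0);
}
```

Note how each pass only recurses into buckets that still have bytes left, which is also why long shared prefixes are the worst case: every pass over a shared prefix puts all strings into a single bucket and makes no progress toward separating them.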

The hard part was then optimizing the hell out of the character radix ordering algorithm on two fronts:

  • Memory
  • Performance

Without a whole bunch of optimization work, radix ordering all of x was going to be a massive performance and memory penalty vs the old TRUELENGTH() approach.

Over the course of several iterations, I believe I have optimized it well enough that the performance hit is only ~10-30% in most common cases (a large vector with few groups, i.e. a typical character column in data science work). In the worst of the worst cases, typically when strings share a very long common prefix that we have to work through (URLs are sometimes an example of this), we are up to 3x slower. But that is still orders of magnitude faster than the next fastest sorting method for strings (the shell sort in base R's order()), so I am fine with this.

Memory wise, we do use more memory than we used to because we are sorting the whole vector, not just the uniques. But we actually now use exactly the same amount of memory as radix ordering a vector of doubles (both double and const char* are 8 bytes), so I think this memory increase is both unavoidable and totally reasonable.
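The equal-footing claim comes down to pointer width. A tiny check, assuming a typical 64-bit platform:

```c
#include <assert.h>
#include <stdio.h>

/* A string is ordered through its `const char*`, which on 64-bit
   platforms is the same width as a double, so the per-element scratch
   cost of the two radix orders matches. */
static void print_widths(void) {
  printf("sizeof(double)      = %zu\n", sizeof(double));
  printf("sizeof(const char*) = %zu\n", sizeof(const char*));
}
```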


I spent a lot of time doing some fancy individual optimizations, and I'd like to document them for future me, so I'm going to use the rest of the space below to do that.

Additionally, the Details section below contains various iterative benchmarks, run across:

  • Main vctrs with the TRUELENGTH() approach
  • The 5 iterative improvement branches
  • The 1 failed experiment idea

Rows 1 and 6 in each of these are the most important, as they show where we started and where we ended up.

I've also rerun the benchmarks with just those rows and included them in full in a comment below, but the iterative comparison was also nice for me to track.

Details
# "typical" scenarios

# - old version does better when there are few uniques. with many uniques,
#   new version does better because it doesn't sort twice
# - old version does better with few uniques that have very long common prefixes

# 5 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 5
    min_length <- 10
    max_length <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                     expression    min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                   <bch:expr> <bch:> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs             vctrs:::v… 63.9ms  65.3ms     15.2      124MB     15.2
#> 2 r-lib/vctrs@feature/no… vctrs:::v… 90.8ms  97.9ms      9.83     544MB     11.3
#> 3 r-lib/vctrs@feature/no… vctrs:::v… 94.3ms 118.6ms      8.75     467MB     14.4
#> 4 r-lib/vctrs@feature/no… vctrs:::v… 80.8ms 106.1ms     10.2      334MB     17.3
#> 5 r-lib/vctrs@feature/no… vctrs:::v… 81.3ms    84ms     11.6      315MB     12.8
#> 6 r-lib/vctrs@feature/no… vctrs:::v… 72.3ms  74.4ms     13.0      238MB     14.3
#> 7 r-lib/vctrs@feature/no… vctrs:::v… 81.3ms    84ms     11.3      238MB     13.0

# 5 unique strings each 10-20 characters in length,
# mix in some `NA`s, ~20%
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 5
    min_length <- 10
    max_length <- 20

    w <- sample(
      c(
        stringi::stri_rand_strings(
          n = n_unique,
          length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
        ),
        NA
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  96.8ms  97.7ms     10.2      124MB    10.2 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 114.6ms 121.3ms      7.98     544MB     9.17
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 115.9ms 139.6ms      7.36     467MB    12.1 
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 105.1ms 126.8ms      8.29     334MB    14.1 
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 107.4ms 109.7ms      8.88     315MB     9.77
#> 6 r-lib/vctrs@feature/n… vctrs:::v…  98.5ms 102.6ms      9.36     238MB    10.8 
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 104.8ms 107.6ms      8.92     238MB    10.3

# 100 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  60.4ms  61.8ms     16.1      124MB    16.1 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 117.4ms 125.7ms      7.69     544MB     8.85
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 131.4ms 154.1ms      6.69     467MB    11.0 
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 116.6ms 138.3ms      7.61     334MB    12.9 
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 110.7ms   115ms      8.45     315MB     9.30
#> 6 r-lib/vctrs@feature/n… vctrs:::v…  95.8ms  98.3ms      9.96     238MB    11.0 
#> 7 r-lib/vctrs@feature/n… vctrs:::v…   119ms 121.4ms      7.94     238MB     9.13

# 100,000 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100000
    min_length <- 10
    max_length <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                       expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                     <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs               vctrs:::v… 151ms  154ms      6.25     243MB     7.18
#> 2 r-lib/vctrs@feature/no-t… vctrs:::v… 160ms  171ms      5.68     544MB     6.53
#> 3 r-lib/vctrs@feature/no-t… vctrs:::v… 180ms  210ms      4.89     467MB     8.81
#> 4 r-lib/vctrs@feature/no-t… vctrs:::v… 158ms  165ms      5.74     334MB     7.75
#> 5 r-lib/vctrs@feature/no-t… vctrs:::v… 153ms  156ms      6.26     315MB     6.89
#> 6 r-lib/vctrs@feature/no-t… vctrs:::v… 132ms  136ms      7.09     238MB     7.80
#> 7 r-lib/vctrs@feature/no-t… vctrs:::v… 186ms  191ms      5.10     238MB     5.86

# 10,000 unique strings each 10-20 characters in length,
# total size of 100,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 100000000
    n_unique <- 10000
    min_length <- 10
    max_length <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                  expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs          vctrs:::v… 713.15ms 722.67ms     1.34     1.21GB    1.41 
#> 2 r-lib/vctrs@feature… vctrs:::v…    1.67s    1.84s     0.535    5.31GB    0.615
#> 3 r-lib/vctrs@feature… vctrs:::v…    1.75s    1.83s     0.552    4.56GB    0.829
#> 4 r-lib/vctrs@feature… vctrs:::v…    1.52s    1.62s     0.624    3.26GB    0.999
#> 5 r-lib/vctrs@feature… vctrs:::v…    1.47s    1.57s     0.648    3.07GB    1.04 
#> 6 r-lib/vctrs@feature… vctrs:::v…    1.25s    1.28s     0.769    2.33GB    0.884
#> 7 r-lib/vctrs@feature… vctrs:::v…    1.74s    1.76s     0.565    2.33GB    0.621

# 10,000 unique strings each 1-100 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 10000
    min_length <- 1
    max_length <- 100

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  68.7ms  70.1ms     14.2      124MB    14.2 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 145.2ms 150.8ms      6.42     544MB     7.39
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 160.3ms 187.6ms      5.48     467MB     9.04
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 141.1ms 166.2ms      6.23     334MB    10.6 
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 139.1ms 142.7ms      6.91     315MB     7.60
#> 6 r-lib/vctrs@feature/n… vctrs:::v… 121.1ms 122.6ms      7.98     238MB     8.78
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 167.3ms 169.5ms      5.74     238MB     6.60

# 10,000 unique strings each 100 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 10000
    min_length <- 100
    max_length <- 100

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  68.2ms  69.4ms     14.1      124MB    14.1 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 145.8ms 152.7ms      6.36     544MB     7.32
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 159.9ms 187.4ms      5.50     467MB     9.08
#> 4 r-lib/vctrs@feature/n… vctrs:::v…   141ms   168ms      6.24     334MB    10.6 
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 139.5ms 142.9ms      6.89     315MB     7.58
#> 6 r-lib/vctrs@feature/n… vctrs:::v… 119.9ms 123.5ms      7.96     238MB     8.76
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 166.2ms 169.2ms      5.74     238MB     6.60
# "weird" scenarios

# 100 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
#
# This is where you really feel the pain of not having cached sizes and pointers
# and is probably enough to make me want them again
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 20

    n_prefixes <- 10
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  61.2ms  61.6ms     16.2      124MB    64.9 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 268.2ms 276.4ms      3.55     544MB     4.08
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 284.2ms 289.2ms      3.39     467MB     3.90
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 266.2ms 272.7ms      3.55     334MB     4.96
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 260.4ms 266.9ms      3.61     315MB     5.05
#> 6 r-lib/vctrs@feature/n… vctrs:::v… 220.6ms 227.4ms      4.37     238MB     4.59
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 726.2ms 730.2ms      1.37     238MB     9.13

# 100 unique strings each with a suffix 10-100 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 100

    n_prefixes <- 10
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  61.5ms  62.3ms     16.1      124MB    64.2 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 255.8ms 259.6ms      3.79     544MB     4.35
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 266.7ms 269.8ms      3.62     467MB     4.16
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 246.8ms   252ms      3.82     334MB     5.34
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 244.3ms 253.6ms      3.85     315MB     5.38
#> 6 r-lib/vctrs@feature/n… vctrs:::v… 203.5ms 208.3ms      4.74     238MB     4.98
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 714.1ms 718.3ms      1.39     238MB     9.28

# 1,000,000 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# single common prefix
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 1000000
    min_length <- 10
    max_length <- 20

    n_prefixes <- 1
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                  expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs          vctrs:::v… 481.69ms 484.85ms     1.97      277MB    2.17 
#> 2 r-lib/vctrs@feature… vctrs:::v…    1.67s    1.68s     0.592     544MB    0.621
#> 3 r-lib/vctrs@feature… vctrs:::v…    1.72s    1.75s     0.558     467MB    0.754
#> 4 r-lib/vctrs@feature… vctrs:::v…    1.68s    1.69s     0.592     334MB    6.80 
#> 5 r-lib/vctrs@feature… vctrs:::v…    1.74s    1.75s     0.572     315MB    6.58 
#> 6 r-lib/vctrs@feature… vctrs:::v…    1.26s    1.28s     0.777     238MB    0.816
#> 7 r-lib/vctrs@feature… vctrs:::v…    3.66s    4.33s     0.245     238MB    0.257

DavisVaughan and others added 9 commits December 9, 2025 15:10
Still need to properly support `NA`, `decreasing`, and `na_last`, but this is a major part of it
* Split `struct str_info` into individual components

The `struct` forces an 8 byte alignment, so it costs us 24 bytes per struct when it should really only cost 20 bytes. Since we need 2 vectors of these, that's 8 wasted bytes per string that we can save by splitting them back up, at the cost of more bookkeeping.

* Track `NA`ness rather than whole `SEXP` (#2121)

* Track `NA`ness rather than whole `SEXP`

For 14 bytes less memory per string

* Go back to post check loading

* Handle `NA`s up front (#2122)

* Handle `NA`s up front

Avoiding the need for `p_x_string_nas_aux` altogether, and nicely simplifying and speeding up the radix and insertion paths

* Remove `p_lazy_x_string_nas` entirely

By having the `SEXP` be available in both `chr_order()` and `chr_order_chunk()`

* Split extraction and order rearrangement

* Simplify aux placement code

* Remove `x_string_sizes` as well (#2123)

* Remove size related vectors

* Advance `p_x[i]` in place, rather than accessing with `[pass]`

This keeps `p_x[i]` up to date after each pass, always pointing to the next byte.

This didn't seem to have much of a performance impact, but it nicely cleans up `chr_insertion_order()` and `chr_all_same_byte()` because they no longer need to worry about the `pass`; the pointers are always advanced to `pass` already.
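The struct-splitting rationale above (24 bytes vs 20) can be checked with a quick `sizeof` experiment. The layout here is a hypothetical illustration on a 64-bit platform; the field names and types are guesses, not the actual vctrs definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: the 8-byte pointer forces 8-byte struct
   alignment, so 20 bytes of fields get padded up to 24 per element
   when stored in an array of structs. */
struct str_info {
  const char* p_string; /* 8 bytes */
  int size;             /* 4 bytes */
  int pass;             /* 4 bytes */
  int na;               /* 4 bytes, then 4 bytes of tail padding */
};

/* Split into parallel arrays, each element costs exactly 20 bytes. */
enum { PER_ELEMENT_SPLIT = sizeof(const char*) + 3 * sizeof(int) };
```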
DavisVaughan (Member, Author) commented:
I used a rather nice fuzzing script that created data frames of character vectors with various random structures, varying:

  • Number of rows (0 -> 1 million)
  • Number of columns (1 -> 10)
  • Whether NAs are included (yes/no)
  • Length of individual strings (0 -> 100)
  • Number of unique strings to pull from (0 -> 10000)

I then ran this for many iterations, comparing against base R to ensure we get the right result.

This helped me feel quite confident about the results.

# Fuzzing
fuzzing <- FALSE

if (fuzzing) {
  # Force radix method for character comparisons
  base_order <- function(x, na.last = TRUE, decreasing = FALSE) {
    if (is.data.frame(x)) {
      x <- unname(x)
    } else {
      x <- list(x)
    }

    args <- list(
      na.last = na.last,
      decreasing = decreasing,
      method = "radix"
    )

    args <- c(x, args)

    rlang::exec("order", !!!args)
  }

  for (i in 1:10000) {
    print(paste("iteration", i))

    n_rows <- sample(0:1000000, size = 1)
    n_cols <- sample(1:10, size = 1)

    cols <- replicate(n = n_cols, simplify = FALSE, {
      n_strings <- sample(0:10000, size = 1)
      string_size <- sample(0:100, n_strings, TRUE)
      optional_na <- if (runif(1) > .5) NA else NULL
      strings <- c(
        if (n_strings == 0L) {
          # stringi bug with `n_strings = 0, string_size = integer()`
          character()
        } else {
          stringi::stri_rand_strings(n_strings, string_size)
        },
        optional_na
      )
      sample(strings, n_rows, TRUE)
    })

    names(cols) <- as.character(seq_along(cols))
    df <- new_data_frame(cols, n = n_rows)

    direction <- sample(c("asc", "desc"), 1)
    na_value <- sample(c("largest", "smallest"), 1)

    if (direction == "asc") {
      if (na_value == "largest") {
        na.last <- TRUE
        decreasing <- FALSE
      } else {
        na.last <- FALSE
        decreasing <- FALSE
      }
    } else {
      if (na_value == "largest") {
        na.last <- FALSE
        decreasing <- TRUE
      } else {
        na.last <- TRUE
        decreasing <- TRUE
      }
    }

    r_order <- base_order(df, na.last = na.last, decreasing = decreasing)
    vctrs_order <- vctrs:::vec_order_radix(
      df,
      direction = direction,
      na_value = na_value
    )

    is_identical <- identical(
      r_order,
      vctrs_order
    )

    if (!is_identical) {
      options <- list(direction = direction, na_value = na_value)
      out <- list(r = r_order, vctrs = vctrs_order, options = options)
      saveRDS(out, file = "failure.rds")
      stop("not identical, check saved rds files")
    }
  }
}

DavisVaughan (Member, Author) commented:
Benchmark of various common and weird scenarios comparing r-lib/vctrs@main against this PR:

# "typical" scenarios

# 5 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 5
    min_length <- 10
    max_length <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                      expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                    <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs              vctrs:::v… 64.8ms 65.9ms      15.1     124MB     15.1
#> 2 r-lib/vctrs@feature/no-… vctrs:::v… 73.4ms 75.2ms      12.8     238MB     14.1

# 5 unique strings each 10-20 characters in length,
# mix in some `NA`s, ~20%
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 5
    min_length <- 10
    max_length <- 20

    w <- sample(
      c(
        stringi::stri_rand_strings(
          n = n_unique,
          length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
        ),
        NA
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                     expression    min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                   <bch:expr> <bch:> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs             vctrs:::v… 97.4ms  98.3ms     10.2      124MB     10.2
#> 2 r-lib/vctrs@feature/no… vctrs:::v… 98.9ms 102.4ms      9.37     238MB     10.8

# 100 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                      expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                    <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs              vctrs:::v… 61.1ms 61.9ms     16.1      124MB     16.1
#> 2 r-lib/vctrs@feature/no-… vctrs:::v… 96.5ms 99.1ms      9.89     238MB     10.9

# 100,000 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100000
    min_length <- 10
    max_length <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                       expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                     <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs               vctrs:::v… 151ms  153ms      6.27     243MB     7.21
#> 2 r-lib/vctrs@feature/no-t… vctrs:::v… 133ms  137ms      7.03     238MB     7.74

# 10,000 unique strings each 10-20 characters in length,
# total size of 100,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 100000000
    n_unique <- 10000
    min_length <- 10
    max_length <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                  expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs          vctrs:::v… 702.22ms 716.12ms     1.36     1.21GB    1.43 
#> 2 r-lib/vctrs@feature… vctrs:::v…    1.25s    1.28s     0.771    2.33GB    0.887

# 10,000 unique strings each 1-100 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 10000
    min_length <- 1
    max_length <- 100

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  68.6ms  70.3ms     14.2      124MB    14.2 
#> 2 r-lib/vctrs@feature/n… vctrs:::v…   119ms 121.6ms      8.07     238MB     8.87

# 10,000 unique strings each 100 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 10000
    min_length <- 100
    max_length <- 100

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  68.1ms  68.7ms     14.5      124MB    14.5 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 119.8ms   122ms      7.95     238MB     8.75

# "weird" scenarios

# 100 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
#
# (This is where you really feel the pain of not having cached sizes and pointers,
#  which was another option we considered)
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 20

    n_prefixes <- 10
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  60.6ms  61.3ms     16.3      124MB    65.3 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 219.1ms 225.2ms      4.41     238MB     4.63

# 100 unique strings each with a suffix 10-100 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 100

    n_prefixes <- 10
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                      expression   min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                    <bch:expr> <bch> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs              vctrs:::v…  60ms  60.7ms     16.5      124MB    65.9 
#> 2 r-lib/vctrs@feature/no-… vctrs:::v… 202ms 207.5ms      4.72     238MB     4.95

# 1,000,000 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# single common prefix
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 1000000
    min_length <- 10
    max_length <- 20

    n_prefixes <- 1
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#>   pkg                   expression      min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                 <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs           vctrs:::v… 476.44ms 489.5ms     1.94      277MB    2.14 
#> 2 r-lib/vctrs@feature/… vctrs:::v…    1.25s   1.26s     0.785     238MB    0.825

@DavisVaughan
Member Author

DavisVaughan commented Dec 12, 2025

Now to talk about some optimizations I made along the way

Storing const char*, not SEXP

In the radix code we need repeated access to the const char* underlying each SEXP via CHAR(). This is very expensive because we hit each element multiple times.

It becomes critical that we cache the const char* somewhere (as demonstrated by the failed #2119 experiment), but caching has a high memory requirement of 8 bytes per element.

The key here is to store only the const char* pointers in our working memory, where previously we'd store the CHARSXP SEXP values. This would historically have been hard, because as soon as you swap the CHARSXP for its underlying const char*, you lose access to its length and its NA-ness, which are properties of the SEXP. But with the 'NA "up front" handling' trick and the 'Don't store the sizes' trick, we no longer need either of these!

So we can store the faster-to-work-with const char* for the same memory cost as what we were doing before, storing the SEXP string elements.

NA "up front" handling

Done in #2122

An incredibly useful optimization I worked my way towards was handling NA "up front" rather than inside the radix ordering code. This keeps the radix ordering code much simpler, and since it's a very hot recursive loop, it runs faster when there are fewer things to check.

It also means we don't need to "track" NAs in any way, which is great because the radix code doesn't have access to that info: it holds an array of const char* instead of a vector of CHARSXP SEXPs, so it can't determine whether an element is missing.

The key is that if you have a string vector like this

# Vector size 4
c("x", "y", NA, "a")

# Initial ordering vector before any sorting
c(1, 2, 3, 4)

If NAs go at the start of the vector, then you drop the NAs from the vector and push their ordering values to the start of the ordering vector

# Vector size 3
c("x", "y", "a")

c(3, 1, 2, 4)
#    ^ treat this as `p_o[0]` to align with the new shorter `x` vector 

If NAs go at the end of the vector, then you drop the NAs from the vector and push their ordering values to the end of the ordering vector

# Vector size 3
c("x", "y", "a")

c(1, 2, 4, 3)
# ^ still treat this as `p_o[0]`, but note the new size is 3 so we never touch the
# last element where we put the `NA` ordering

Don't store the sizes

Done in #2123

Another black magic trick is that we can also get away without storing the string sizes, i.e. the result of calling Rf_length() on a CHARSXP.

Previously, we'd need to know the string_size to know when to stop indexing into it. For example, "ab" has a string_size = 2, so if we also had "abcd" with a max_string_size = 4, then on pass = 2 (C indexed) we had a check that would ensure that we didn't index OOB on "ab" by checking that pass < string_size before indexing into the string.

HOWEVER! Note that all R strings are really just C strings with a nul terminator. i.e. R always allocates an extra 1 byte for the nul terminator, but it never reports that to you in its Rf_length() size:
https://github.com/wch/r-source/blob/17e2bae0a990586ca01fd62664522f2e252886ca/src/main/memory.c#L2759
https://github.com/wch/r-source/blob/17e2bae0a990586ca01fd62664522f2e252886ca/src/main/memory.c#L2974

But it is always there! And it is not undefined behavior to index into the nul terminator of a const char*.

Also note that an R string can never have a nul terminator in it. i.e. of the 256 possible bytes in a char array, 0, the nul terminator, can never show up in an R string, because it is reserved for the end of the string. It is simply impossible to make an R string with one of these anywhere else in it.

We can use this fact! As we radix sort "ab" alongside "abcd", note we are really radix sorting "ab\0" and "abcd\0", with max_string_size=4. So on pass = 2, for "ab\0" we extract the \0 as a 0 byte and it gets placed into the 0 bucket during the histogram pass!

Then as we create subgroups based on these byte buckets, chr_all_same() notices that everything in bucket 0 is the exact same string. They have to be, because the only way to get into bucket 0 is to be on the nul terminator, and we know all elements up to this pass are also the same. So then we early exit rather than recursing to the next pass and indexing past the nul terminator, which would be undefined behavior.

chr_all_same()

In the subgroup loop, we have a new chr_all_same() check that loops through the elements of the upcoming subgroup and checks if all of the const char* pointers are equal. If so, we know this subgroup is composed of a single string, and we process it immediately.

This is a particularly important helper because it also lets us early exit on "short" strings, i.e. with c("abc", "a", "a") the max_string_size = 3, and after the 2nd pass we can group c("a", "a") into the same subgroup. Rather than recursing a 3rd time on this subgroup, which would index outside the bounds of the "a" string, chr_all_same() lets us finish the subgroup instead.

chr_all_same_byte()

The integer and double algorithms have a nice check after the histogram where they see if all of the bytes in this pass went into the same bucket. If so, we don't learn anything from this pass and move to the next one.

In the string algorithm, we move this check before the histogramming for two reasons:

  • It is faster, because strings often have a long common prefix, and this lets us jump past it pretty quickly
  • It helps avoid stack issues when there is a long common prefix, because the stack array `r_ssize p_counts[UINT8_MAX_SIZE] = { 0 };` isn't allocated for passes that we skip due to this.

@DavisVaughan DavisVaughan merged commit 58f2c2b into main Dec 15, 2025
13 checks passed
@DavisVaughan DavisVaughan deleted the feature/no-truelength-order branch December 15, 2025 14:29