Skip to content

Conversation

@DavisVaughan
Copy link
Member

It has a high memory cost (24 extra bytes per string over just SEXP) and it isn't clear that it gains us that much unless there is a high number of passes, i.e. in the case of a long common prefix, which is somewhat rare

It has a high memory cost (24 extra bytes per string over just `SEXP`) and it isn't clear that it gains us that much unless there is a high number of passes, i.e. in the case of a long common prefix, which is somewhat rare
@DavisVaughan
Copy link
Member Author

As shown by some of the "weird scenarios", not caching the pointers/sizes and calling CHAR() or Rf_length() every time was not a viable option (and it turned out that we actually don't need to cache the sizes at all)

# "weird" scenarios

# 100 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
#
# This is where you really feel the pain of not having cached sizes and pointers
# and is probably enough to make me want them again
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 20

    n_prefixes <- 10
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  61.2ms  61.6ms     16.2      124MB    64.9 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 268.2ms 276.4ms      3.55     544MB     4.08
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 284.2ms 289.2ms      3.39     467MB     3.90
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 266.2ms 272.7ms      3.55     334MB     4.96
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 260.4ms 266.9ms      3.61     315MB     5.05
#> 6 r-lib/vctrs@feature/n… vctrs:::v… 220.6ms 227.4ms      4.37     238MB     4.59
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 726.2ms 730.2ms      1.37     238MB     9.13

# - 100 unique strings each with a suffix 10-100 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 100

    n_prefixes <- 10
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                    expression     min  median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs            vctrs:::v…  61.5ms  62.3ms     16.1      124MB    64.2 
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 255.8ms 259.6ms      3.79     544MB     4.35
#> 3 r-lib/vctrs@feature/n… vctrs:::v… 266.7ms 269.8ms      3.62     467MB     4.16
#> 4 r-lib/vctrs@feature/n… vctrs:::v… 246.8ms   252ms      3.82     334MB     5.34
#> 5 r-lib/vctrs@feature/n… vctrs:::v… 244.3ms 253.6ms      3.85     315MB     5.38
#> 6 r-lib/vctrs@feature/n… vctrs:::v… 203.5ms 208.3ms      4.74     238MB     4.98
#> 7 r-lib/vctrs@feature/n… vctrs:::v… 714.1ms 718.3ms      1.39     238MB     9.28

# 1,000,000 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# single common prefix
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order",
    "r-lib/vctrs@feature/no-truelength-order-split-info",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling",
    "r-lib/vctrs@feature/no-truelength-order-split-info-and-bool-nas-and-early-na-handling-and-no-sizes",
    "r-lib/vctrs@feature/no-truelength-order-less-memory"
  ),
  {
    set.seed(123)

    total_size <- 10000000
    n_unique <- 1000000
    min_length <- 10
    max_length <- 20

    n_prefixes <- 1
    prefix_size <- 20

    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )

    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)

    w <- paste0(prefixes, w)

    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 7 × 7
#>   pkg                  expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <chr>                <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 r-lib/vctrs          vctrs:::v… 481.69ms 484.85ms     1.97      277MB    2.17 
#> 2 r-lib/vctrs@feature… vctrs:::v…    1.67s    1.68s     0.592     544MB    0.621
#> 3 r-lib/vctrs@feature… vctrs:::v…    1.72s    1.75s     0.558     467MB    0.754
#> 4 r-lib/vctrs@feature… vctrs:::v…    1.68s    1.69s     0.592     334MB    6.80 
#> 5 r-lib/vctrs@feature… vctrs:::v…    1.74s    1.75s     0.572     315MB    6.58 
#> 6 r-lib/vctrs@feature… vctrs:::v…    1.26s    1.28s     0.777     238MB    0.816
#> 7 r-lib/vctrs@feature… vctrs:::v…    3.66s    4.33s     0.245     238MB    0.257

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants