Remove TRUELENGTH() backed character radix sort
#2118
Still need to properly support `NA`, `decreasing`, and `na_last`, but this is a major part of it
* Split `struct str_info` into individual components. The `struct` forces an 8 byte alignment, so it costs us 24 bytes for this struct, when it should really only cost 20 bytes. Since we need 2 vectors of these, that's 8 wasted bytes per string that we can save by splitting them back up, at the cost of more bookkeeping (a layout sketch follows this list).
* Track `NA`ness rather than the whole `SEXP` (#2121), for 14 bytes less memory per string.
  * Go back to post check loading.
* Handle `NA`s up front (#2122), avoiding the need for `p_x_string_nas_aux` altogether, and nicely simplifying and speeding up the radix and insertion paths.
  * Remove `p_lazy_x_string_nas` entirely, by having the `SEXP` be available in both `chr_order()` and `chr_order_chunk()`.
  * Split extraction and order rearrangement.
  * Simplify aux placement code.
* Remove `x_string_sizes` as well (#2123).
  * Remove size related vectors.
  * Advance `p_x[i]` in place, rather than accessing with `[pass]`. This keeps `p_x[i]` up to date after each pass, always pointing to the next byte. It didn't seem to have much of a performance impact, but it nicely cleans up `chr_insertion_order()` and `chr_all_same_byte()`, because they no longer need to worry about the `pass`: the pointers are always advanced to `pass` already.
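To make the alignment arithmetic in the first item concrete, here is a minimal C sketch. The field names are hypothetical stand-ins (with a `void*` in place of the `SEXP`), not the actual vctrs definitions:

```c
#include <stdio.h>

// Hypothetical stand-in for `struct str_info`: the 8-byte pointers force
// 8-byte alignment on the whole struct, so 20 bytes of payload pad to 24.
struct str_info {
  const char* p_string; // 8 bytes: pointer to the string's bytes
  void* string;         // 8 bytes: stand-in for the CHARSXP SEXP
  int size;             // 4 bytes: string length, then 4 bytes of padding
};

int main(void) {
  // Splitting the struct into parallel arrays (one per field) stores
  // exactly 20 bytes per element, with no per-element padding.
  printf("struct: %zu bytes per element\n", sizeof(struct str_info)); // 24
  printf("split:  %zu bytes per element\n",
         sizeof(const char*) + sizeof(void*) + sizeof(int));          // 20
  return 0;
}
```

Since two such vectors are needed, splitting recovers the 8 bytes per string the first commit describes.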
I used a rather nice fuzzing script that created data frames of character vectors with various random structures, varying the number of rows and columns, the number and sizes of each column's candidate strings, whether an `NA` was mixed in, and the requested `direction` and `na_value`. I then ran this for lots of iterations, comparing against base R to ensure we get the right result. This helped me feel quite confident about the results.

```r
# Fuzzing
fuzzing <- FALSE

if (fuzzing) {
  # Force radix method for character comparisons
  base_order <- function(x, na.last = TRUE, decreasing = FALSE) {
    if (is.data.frame(x)) {
      x <- unname(x)
    } else {
      x <- list(x)
    }

    args <- list(
      na.last = na.last,
      decreasing = decreasing,
      method = "radix"
    )
    args <- c(x, args)

    rlang::exec("order", !!!args)
  }

  for (i in 1:10000) {
    print(paste("iteration", i))

    n_rows <- sample(0:1000000, size = 1)
    n_cols <- sample(1:10, size = 1)

    cols <- replicate(n = n_cols, simplify = FALSE, {
      n_strings <- sample(0:10000, size = 1)
      string_size <- sample(0:100, n_strings, TRUE)
      optional_na <- if (runif(1) > .5) NA else NULL

      strings <- c(
        if (n_strings == 0L) {
          # stringi bug with `n_strings = 0, string_size = integer()`
          character()
        } else {
          stringi::stri_rand_strings(n_strings, string_size)
        },
        optional_na
      )

      sample(strings, n_rows, TRUE)
    })

    names(cols) <- as.character(seq_along(cols))
    df <- vctrs::new_data_frame(cols, n = n_rows)

    direction <- sample(c("asc", "desc"), 1)
    na_value <- sample(c("largest", "smallest"), 1)

    if (direction == "asc") {
      if (na_value == "largest") {
        na.last <- TRUE
        decreasing <- FALSE
      } else {
        na.last <- FALSE
        decreasing <- FALSE
      }
    } else {
      if (na_value == "largest") {
        na.last <- FALSE
        decreasing <- TRUE
      } else {
        na.last <- TRUE
        decreasing <- TRUE
      }
    }

    r_order <- base_order(df, na.last = na.last, decreasing = decreasing)
    vctrs_order <- vctrs:::vec_order_radix(
      df,
      direction = direction,
      na_value = na_value
    )

    is_identical <- identical(
      r_order,
      vctrs_order
    )

    if (!is_identical) {
      options <- list(direction = direction, na_value = na_value)
      out <- list(r = r_order, vctrs = vctrs_order, options = options)
      saveRDS(out, file = "failure.rds")
      stop("not identical, check saved rds files")
    }
  }
}
```
Benchmarks of various common and weird scenarios, comparing `r-lib/vctrs` with this PR's branch, `r-lib/vctrs@feature/no-truelength-order`:

```r
# "typical" scenarios
# 5 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 5
    min_length <- 10
    max_length <- 20
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 64.8ms 65.9ms 15.1 124MB 15.1
#> 2 r-lib/vctrs@feature/no-… vctrs:::v… 73.4ms 75.2ms 12.8 238MB 14.1
# 5 unique strings each 10-20 characters in length,
# mix in some `NA`s, ~20%
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 5
    min_length <- 10
    max_length <- 20
    w <- sample(
      c(
        stringi::stri_rand_strings(
          n = n_unique,
          length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
        ),
        NA
      ),
      size = total_size,
      replace = TRUE
    )
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch:> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 97.4ms 98.3ms 10.2 124MB 10.2
#> 2 r-lib/vctrs@feature/no… vctrs:::v… 98.9ms 102.4ms 9.37 238MB 10.8
# 100 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 20
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 61.1ms 61.9ms 16.1 124MB 16.1
#> 2 r-lib/vctrs@feature/no-… vctrs:::v… 96.5ms 99.1ms 9.89 238MB 10.9
# 100,000 unique strings each 10-20 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 100000
    min_length <- 10
    max_length <- 20
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 151ms 153ms 6.27 243MB 7.21
#> 2 r-lib/vctrs@feature/no-t… vctrs:::v… 133ms 137ms 7.03 238MB 7.74
# 10,000 unique strings each 10-20 characters in length,
# total size of 100,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 100000000
    n_unique <- 10000
    min_length <- 10
    max_length <- 20
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 702.22ms 716.12ms 1.36 1.21GB 1.43
#> 2 r-lib/vctrs@feature… vctrs:::v… 1.25s 1.28s 0.771 2.33GB 0.887
# 10,000 unique strings each 1-100 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 10000
    min_length <- 1
    max_length <- 100
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 68.6ms 70.3ms 14.2 124MB 14.2
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 119ms 121.6ms 8.07 238MB 8.87
# 10,000 unique strings each 100 characters in length,
# total size of 10,000,000
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 10000
    min_length <- 100
    max_length <- 100
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 68.1ms 68.7ms 14.5 124MB 14.5
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 119.8ms 122ms 7.95 238MB 8.75
# "weird" scenarios
# 100 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
#
# (This is where you really feel the pain of not having cached sizes and pointers,
# which was another option we considered)
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 20
    n_prefixes <- 10
    prefix_size <- 20
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)
    w <- paste0(prefixes, w)
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 60.6ms 61.3ms 16.3 124MB 65.3
#> 2 r-lib/vctrs@feature/n… vctrs:::v… 219.1ms 225.2ms 4.41 238MB 4.63
# 100 unique strings each with a suffix 10-100 characters in length,
# total size of 10,000,000
# 10 long common prefixes are pasted on the front, which makes more unique
# strings (typically around 1000) and creates strings with long prefixes that
# we try and "skip" through
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 100
    min_length <- 10
    max_length <- 100
    n_prefixes <- 10
    prefix_size <- 20
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)
    w <- paste0(prefixes, w)
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 60ms 60.7ms 16.5 124MB 65.9
#> 2 r-lib/vctrs@feature/no-… vctrs:::v… 202ms 207.5ms 4.72 238MB 4.95
# 1,000,000 unique strings each with a suffix 10-20 characters in length,
# total size of 10,000,000
# single common prefix
cross::bench_versions(
  pkgs = c(
    "r-lib/vctrs",
    "r-lib/vctrs@feature/no-truelength-order"
  ),
  {
    set.seed(123)
    total_size <- 10000000
    n_unique <- 1000000
    min_length <- 10
    max_length <- 20
    n_prefixes <- 1
    prefix_size <- 20
    w <- sample(
      stringi::stri_rand_strings(
        n = n_unique,
        length = sample(seq(min_length, max_length), n_unique, replace = TRUE)
      ),
      size = total_size,
      replace = TRUE
    )
    prefixes <- stringi::stri_rand_strings(
      n = n_prefixes,
      length = prefix_size
    )
    prefixes <- sample(prefixes, size = total_size, replace = TRUE)
    w <- paste0(prefixes, w)
    bench::mark(
      vctrs:::vec_order_radix(w),
      iterations = 20
    )
  }
)
#> # A tibble: 2 × 7
#> pkg expression min median `itr/sec` mem_alloc `gc/sec`
#> <chr> <bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 r-lib/vctrs vctrs:::v… 476.44ms 489.5ms 1.94 277MB 2.14
#> 2 r-lib/vctrs@feature/… vctrs:::v… 1.25s 1.26s 0.785 238MB 0.825
```
Now to talk about some optimizations I made along the way.
This PR removes `vec_order_radix()`'s reliance on `TRUELENGTH()` and `SET_TRUELENGTH()` on `CHARSXP`s, which is non-API and has always been a bit hacky. This is a culmination of work in this PR, along with iterative improvements in #2120, #2121, #2122, and #2123.
PR #2119 was an additional failed experiment.
Reverse dependency checks (including dplyr revdeps) look good.
"e77dc1e0-48c1-40bb-b416-6b54f080af8f"The way this previously worked was:
* Start with `x`, a character vector
* Use `TRUELENGTH()` tricks to isolate `x_unique`, the unique elements of `x`
* Radix order `x_unique`, which is typically, but not always, a smaller set of strings (consider data frame columns of strings)
* Use more `TRUELENGTH()` tricks to create an `x_integer_proxy` of `x` with values from the order of `x_unique`
* Order `x_integer_proxy`, which generally uses the very fast counting sort method for integers
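For context, here is a rough sketch of the kind of `TRUELENGTH()` trick involved. This is a heavily simplified, hypothetical helper rather than the real vctrs code (which also saves and restores any pre-existing truelengths). It works because `CHARSXP`s are interned in R's global string cache, so marking one marks every occurrence of that string at once:

```c
#include <Rinternals.h>

// Collect the unique CHARSXPs of `x` by stashing a marker on each one the
// first time it is seen. `p_unique` needs room for `Rf_xlength(x)` entries.
static R_xlen_t mark_uniques(SEXP x, SEXP* p_unique) {
  R_xlen_t n = Rf_xlength(x);
  R_xlen_t n_unique = 0;

  for (R_xlen_t i = 0; i < n; ++i) {
    SEXP elt = STRING_ELT(x, i);

    if (TRUELENGTH(elt) == 0) {
      // Unseen string: record it and tag it with a (negative) marker
      p_unique[n_unique] = elt;
      SET_TRUELENGTH(elt, -(n_unique + 1));
      ++n_unique;
    }
  }

  // After radix ordering `p_unique`, the truelengths get rewritten with
  // each string's sorted rank, so `-TRUELENGTH(STRING_ELT(x, i))` becomes
  // the integer proxy for element `i`. Every truelength must then be
  // carefully reset to 0.
  return n_unique;
}
```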
There were really two possible paths forward:

1. Replace the `TRUELENGTH()` steps with our own hash map.
2. Radix order `x` directly.

I have opted for approach 2, so the algorithm is now:
* Start with `x`, a character vector
* Radix order `x` directly

It was reasonably straightforward to switch to "just" radix ordering
`x`, since we already had a string radix ordering algorithm for `x_unique`, but it did require adding `NA` handling to `chr_order_radix()`, because previously that was handled by the integer ordering algorithm.

The hard part was then optimizing the hell out of the character radix ordering algorithm on two fronts: performance and memory.
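For reference, here is a minimal sketch of the core byte-wise MSB radix pass everything above is built around. This is my own illustration under simplifying assumptions (plain C strings with the `NA`s already partitioned out up front, per #2122, and sorting the pointers themselves rather than producing an order vector), not the actual `chr_order_radix()`:

```c
#include <stddef.h>
#include <string.h>

// Order `n` NUL-terminated strings (no NAs) by the byte at `pass`, then
// recurse on each bucket at `pass + 1`. `p_aux` is scratch space of the
// same length. Byte-wise comparison gives C-locale ordering.
static void chr_radix_order(const char** p_x, const char** p_aux,
                            size_t n, size_t pass) {
  if (n < 2) {
    return;
  }

  size_t counts[256] = { 0 };

  // Count bucket sizes. Strings of length `pass` contribute their NUL
  // terminator, so "finished" strings land in bucket 0 and sort first.
  for (size_t i = 0; i < n; ++i) {
    counts[(unsigned char) p_x[i][pass]]++;
  }

  // Exclusive prefix sum: turn counts into bucket start offsets.
  size_t offsets[256];
  size_t offset = 0;
  for (int byte = 0; byte < 256; ++byte) {
    offsets[byte] = offset;
    offset += counts[byte];
  }

  // Stable scatter into the aux buffer, then copy back. (The PR instead
  // advances each `p_x[i]` past the consumed byte, so later passes never
  // need to index with `[pass]` at all.)
  for (size_t i = 0; i < n; ++i) {
    p_aux[offsets[(unsigned char) p_x[i][pass]]++] = p_x[i];
  }
  memcpy(p_x, p_aux, n * sizeof(const char*));

  // Bucket 0 holds strings that ended at `pass`; they are fully ordered,
  // so only recurse into the byte buckets.
  size_t start = counts[0];
  for (int byte = 1; byte < 256; ++byte) {
    chr_radix_order(p_x + start, p_aux + start, counts[byte], pass + 1);
    start += counts[byte];
  }
}
```

The long-common-prefix worst case is easy to see here: when all strings share their leading bytes, every element lands in the same bucket pass after pass, so each prefix byte costs a full counting pass over the chunk.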
Without a whole bunch of optimization work, radix ordering all of `x` was going to be a massive performance and memory penalty vs the old `TRUELENGTH()` approach.

Over the course of several iterations, I believe I have optimized it well enough that the performance hit is only ~10-30% in most common cases (large vector, few groups, i.e. a typical character column in data science work). In the worst of the worst cases, typically when strings share some very long common prefix that we have to work through (URLs are sometimes an example of this), we are up to 3x slower. But this is still orders of magnitude faster than the next fastest sorting method on strings (shell sort in base R's `order()`), so I am fine with this.

Memory-wise, we do use more memory than we used to, because we are sorting the whole vector, not just the uniques. But we actually now use exactly the same amount of memory as radix ordering a vector of doubles (both `double` and `const char*` are 8 bytes), so I think this memory increase is both unavoidable and totally reasonable.
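That pointer-size equivalence is a quick sanity check, assuming the usual 64-bit platforms these benchmarks ran on:

```c
#include <stdio.h>

int main(void) {
  // Radix ordering n strings through their `const char*` keys takes the
  // same per-element working memory as radix ordering n doubles.
  printf("double:      %zu bytes\n", sizeof(double));      // 8
  printf("const char*: %zu bytes\n", sizeof(const char*)); // 8
  return 0;
}
```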
I spent a lot of time doing some fancy individual optimizations, and I'd like to document them for future me, so I'm going to use the rest of the space below to do that.

Additionally, in the Details section below are various iterative benchmarks, run across:
* The old `TRUELENGTH()` approach
* This PR (#2118)
* #2120
* #2121
* #2122
* #2123

Rows 1 and 6 in each of these are the most important, as they show where we started and where we ended up.
I've also rerun the benchmarks with just those rows and included them in full in a comment below, but the iterative comparison was also nice for me to track.
Details