Description
The data.table::rbindlist() function merges a list of data.frames very efficiently and is faster than do.call(rbind, ...). It also has a fill parameter that fills missing columns with NA. We could use this function in rbindFill() to improve performance when only data.frames are merged (rbindlist() does not work for matrix or DataFrame). Of course, this would add another dependency to the MsCoreUtils package.
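As a minimal sketch of the behaviour described above (the two example data.frames are made up for illustration, and the data.table package is assumed to be installed), rbindlist() with use.names = TRUE and fill = TRUE matches columns by name and pads columns missing from an element with NA:

```r
## Two small data.frames with partially overlapping columns (illustrative only).
a <- data.frame(int = 1:2, char = c("A", "A"))
b <- data.frame(char = "B", num = 12.3)

## use.names = TRUE matches columns by name; fill = TRUE adds columns that are
## missing in an element and fills them with NA.
res <- data.table::rbindlist(list(a, b), use.names = TRUE, fill = TRUE)

## rbindlist() returns a data.table; convert back to a plain data.frame.
res <- as.data.frame(res)
res
```

The result has the union of all columns (int, char, num) and three rows, with NA where a column was absent from the input.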
The performance of the original rbindFill() is compared below with a version that uses rbindlist() for data.frames. The function is essentially identical to MsCoreUtils::rbindFill(), except that it uses rbindlist() to merge the elements if all elements in l are data.frames.
#' rbindFill version using rbindlist
rbindFill2 <- function(...) {
    l <- list(...)
    if (length(l) == 1L && is.list(l[[1L]]))
        l <- l[[1L]]
    cnms <- c("matrix", "data.frame", "DataFrame", "DFrame")
    if (inherits(l, cnms)) # just one single object given as input, do nothing
        return(l)
    cls <- vapply(l, inherits, integer(length(cnms)), what = cnms, which = TRUE)
    rownames(cls) <- cnms
    if (any(!as.logical(colSums(cls))))
        stop("'rbindFill' just works for ", paste(cnms, collapse = ", "))
    ## Use data.table::rbindlist if all elements are data.frame
    if (!any(cls["data.frame", ] == 0))
        return(as.data.frame(data.table::rbindlist(l, use.names = TRUE,
                                                   fill = TRUE,
                                                   ignore.attr = TRUE)))
    ## convert matrix to data.frame for easier and equal subsetting and class
    ## determination
    isMatrix <- as.logical(cls["matrix", ])
    l[isMatrix] <- lapply(l[isMatrix], as.data.frame)
    ## vapply1c() is the character vapply helper from MsCoreUtils
    allcl <- unlist(
        lapply(l, function(ll) {
            vapply1c(ll, function(lll) class(lll)[1L], USE.NAMES = TRUE)
        })
    )
    allnms <- unique(names(allcl))
    allcl <- allcl[allnms]
    for (i in seq_along(l)) {
        diffcn <- setdiff(allnms, names(l[[i]]))
        if (length(diffcn))
            l[[i]][, diffcn] <- lapply(allcl[diffcn], as, object = NA)
    }
    r <- do.call(rbind, l)
    ## if we had just matrices as input we need to convert our temporary
    ## data.frame back to a matrix
    if (all(isMatrix))
        r <- as.matrix(r)
    r
}
Performance comparison
Few long tables
Columns need to be added/filled:
library(MsCoreUtils)
library(microbenchmark)
a <- data.frame(int = 1:1000, char = rep("A", 1000), log = rep(FALSE, 1000))
b <- data.frame(char = rep("B", 500), num = rep(12.3, 500), int = 1:500)
microbenchmark(rbindFill(a, b),
               rbindFill2(a, b))
## Unit: microseconds
##              expr     min       lq     mean  median       uq      max neval cld
##   rbindFill(a, b) 371.514 397.5955 448.2049 412.510 435.7465 1498.204   100   a
##  rbindFill2(a, b)  90.546 115.5990 142.9418 141.526 150.1040  372.130   100   b
Same columns present:
## Same columns
a <- a[, c("int", "char")]
b <- b[, c("int", "char")]
microbenchmark(rbindFill(a, b),
               rbindFill2(a, b))
## Unit: microseconds
##              expr     min       lq     mean   median      uq     max neval cld
##   rbindFill(a, b) 189.338 208.5640 229.8524 223.4650 238.047 462.087   100   a
##  rbindFill2(a, b)  87.225  98.4645 117.7004 111.9315 126.058 326.144   100   b
A modest performance gain for a few long tables: roughly 3x faster (median) if column filling is required, only about 2x if not.
Many small tables
This is closer to the use case in Spectra-related code.
## Performance. many small tables
a <- data.frame(int = 1:10, char = rep("A", 10), log = rep(FALSE, 10))
b <- data.frame(char = rep("D", 15), int = 1:15, num = rep(1.1, 15))
ab <- list(a, b)
set.seed(123)
## 100
l <- ab[sample(1:2, 100, replace = TRUE)]
microbenchmark(
    rbindFill(l),
    rbindFill2(l)
)
## Unit: microseconds
##           expr      min       lq      mean   median       uq       max neval cld
##   rbindFill(l) 8778.081 8848.230 9227.7791 8882.896 9115.406 12520.302   100   a
##  rbindFill2(l)  222.449  246.168  305.0429  285.874  320.118  1922.416   100   b
## 30x faster
## 500
l <- ab[sample(1:2, 500, replace = TRUE)]
microbenchmark(
    rbindFill(l),
    rbindFill2(l)
)
## Unit: microseconds
##           expr      min         lq       mean     median         uq        max neval cld
##   rbindFill(l) 42848.02 43808.9410 46841.6679 45196.9415 47320.9375 147111.882   100   a
##  rbindFill2(l)   768.51   815.9855   954.9411   905.7235   935.6075   2820.481   100   b
## 50x faster
## 1000
l <- ab[sample(1:2, 1000, replace = TRUE)]
microbenchmark(
    rbindFill(l),
    rbindFill2(l)
)
## Unit: milliseconds
##           expr       min        lq      mean    median         uq        max neval cld
##   rbindFill(l) 86.983247 91.242365 100.82433 98.708399 105.161711 205.098656   100   a
##  rbindFill2(l)  1.449975  1.502471   1.63093  1.604261   1.641351   2.801352   100   b
## 60x faster
## 10000
l <- ab[sample(1:2, 10000, replace = TRUE)]
microbenchmark(
    rbindFill(l),
    rbindFill2(l)
)
## Unit: milliseconds
##           expr       min        lq       mean     median         uq       max neval cld
##   rbindFill(l) 900.41956 948.91321 1033.17948 1030.97903 1092.76613 1328.7885   100   a
##  rbindFill2(l)  13.60664  14.14598   22.22862   15.85146   17.51346  155.2194   100   b
## 64x faster
A considerable performance improvement, in particular when many data.frames are merged. The gain is about half as large (i.e., up to 30x) if no column filling is needed.
Summary
For cases where only data.frames are merged, using data.table's rbindlist() would improve performance. The question is whether this gain is large enough to justify making MsCoreUtils depend on the data.table package. Any thoughts, @lgatto @sgibb?
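One possible way to soften the dependency question, sketched here purely as an assumption and not as existing MsCoreUtils code, would be to treat data.table as an optional package and check for it at run time with requireNamespace(). The helper name .rbind_df_list is hypothetical, and the base R fallback below is a simplified stand-in for the current rbindFill() logic (it fills with plain NA rather than class-specific NA values):

```r
## Hypothetical helper (not part of MsCoreUtils): bind a list of data.frames,
## using data.table::rbindlist() only if data.table is installed.
.rbind_df_list <- function(l) {
    if (requireNamespace("data.table", quietly = TRUE))
        return(as.data.frame(
            data.table::rbindlist(l, use.names = TRUE, fill = TRUE)))
    ## Simplified base R fallback: add missing columns as NA, bring all
    ## elements into the same column order, then rbind.
    allnms <- unique(unlist(lapply(l, names)))
    l <- lapply(l, function(z) {
        for (nm in setdiff(allnms, names(z)))
            z[[nm]] <- rep(NA, nrow(z))
        z[, allnms, drop = FALSE]
    })
    do.call(rbind, l)
}

## Usage with two illustrative data.frames.
a <- data.frame(int = 1:2, char = c("A", "A"))
b <- data.frame(char = "B", num = 12.3)
res <- .rbind_df_list(list(a, b))
```

With this approach data.table could be listed under Suggests instead of Imports, at the cost of performance depending on what the user has installed.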