Use data.table::rbindlist() for data.frame to improve performance of rbindFill() #140

@jorainer

The data.table::rbindlist() function merges a list of data.frames very efficiently and is faster than do.call(rbind, ...). It also has a fill parameter that allows missing columns to be filled. We could use this function in rbindFill() to improve performance when only data.frames are merged (rbindlist() does not work for matrix or DataFrame). Of course, this would add another dependency to the MsCoreUtils package.
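
As a small illustration of the fill behaviour described above, the following sketch binds two data.frames that share only one column. The objects x and y are toy data made up for this example, and it assumes that the data.table package is installed.

```r
library(data.table)

## Two toy data.frames (hypothetical data) sharing only the "char" column.
x <- data.frame(int = 1:2, char = c("a", "b"))
y <- data.frame(char = "c", num = 1.5)

## use.names = TRUE matches columns by name; fill = TRUE adds columns that
## are missing in one of the inputs and fills them with NA.
res <- as.data.frame(rbindlist(list(x, y), use.names = TRUE, fill = TRUE))
res
```

rbindlist() returns a data.table, hence the as.data.frame() call to get a plain data.frame back.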

Below, the performance of the original rbindFill() is compared with a version that uses rbindlist() for data.frame input. The new function, rbindFill2(), is essentially identical to MsCoreUtils::rbindFill(), except that it uses rbindlist() to merge the elements of l if all of them are data.frames.

#' rbindFill version using rbindlist
rbindFill2 <- function(...) {
    l <- list(...)

    if (length(l) == 1L && is.list(l[[1L]]))
        l <- l[[1L]]

    cnms <- c("matrix", "data.frame", "DataFrame", "DFrame")

    if (inherits(l, cnms))  # just one single object given as input, do nothing
        return(l)

    cls <- vapply(l, inherits, integer(length(cnms)), what = cnms, which = TRUE)
    rownames(cls) <- cnms

    if (any(!as.logical(colSums(cls))))
        stop("'rbindFill' just works for ", paste(cnms, collapse = ", "))

    ## Use data.table::rbindlist if all elements are data.frame
    if (!any(cls["data.frame", ] == 0))
        return(as.data.frame(data.table::rbindlist(l, use.names = TRUE,
                                                   fill = TRUE,
                                                   ignore.attr = TRUE)))

    ## convert matrix to data.frame for easier and equal subsetting and class
    ## determination
    isMatrix <- as.logical(cls["matrix",])
    l[isMatrix] <- lapply(l[isMatrix], as.data.frame)

    allcl <- unlist(
        lapply(l, function(ll) {
            vapply1c(ll, function(lll)class(lll)[1L], USE.NAMES = TRUE)
        })
    )
    allnms <- unique(names(allcl))
    allcl <- allcl[allnms]

    for (i in seq_along(l)) {
        diffcn <- setdiff(allnms, names(l[[i]]))
        if (length(diffcn))
            l[[i]][, diffcn] <- lapply(allcl[diffcn], as, object = NA)
    }
    r <- do.call(rbind, l)

    ## if we had just matrices as input we need to convert our temporary
    ## data.frame back to a matrix
    if (all(isMatrix))
        r <- as.matrix(r)
    r
}

Performance comparison

Few long tables

With need to add/fill columns:

library(MsCoreUtils)
library(microbenchmark)
a <- data.frame(int = 1:1000, char = rep("A", 1000), log = rep(FALSE, 1000))
b <- data.frame(char = rep("B", 500), num = rep(12.3, 500), int = 1:500)
microbenchmark(rbindFill(a, b),
               rbindFill2(a, b))
## Unit: microseconds
##              expr     min       lq     mean  median       uq      max neval cld
##   rbindFill(a, b) 371.514 397.5955 448.2049 412.510 435.7465 1498.204   100  a
##  rbindFill2(a, b)  90.546 115.5990 142.9418 141.526 150.1040  372.130   100   b

With the same columns present:

## Same columns
a <- a[, c("int", "char")]
b <- b[, c("int", "char")]
microbenchmark(rbindFill(a, b),
               rbindFill2(a, b))
## Unit: microseconds
##              expr     min       lq     mean   median      uq     max neval cld
##   rbindFill(a, b) 189.338 208.5640 229.8524 223.4650 238.047 462.087   100  a
##  rbindFill2(a, b)  87.225  98.4645 117.7004 111.9315 126.058 326.144   100   b

A modest performance gain: about 3x faster (median) if column filling is required, and only about 2x if not.

Many small tables

This is closer to the use case in Spectra-related code.

## Performance. many small tables
a <- data.frame(int = 1:10, char = rep("A", 10), log = rep(FALSE, 10))
b <- data.frame(char = rep("D", 15), int = 1:15, num = rep(1.1, 15))
ab <- list(a, b)
set.seed(123)

## 100
l <- ab[sample(1:2, 100, replace = TRUE)]
microbenchmark(
    rbindFill(l),
    rbindFill2(l)
)
## Unit: microseconds
##           expr      min       lq      mean   median       uq       max neval cld
##   rbindFill(l) 8778.081 8848.230 9227.7791 8882.896 9115.406 12520.302   100  a
##  rbindFill2(l)  222.449  246.168  305.0429  285.874  320.118  1922.416   100   b
## 30x faster

## 500
l <- ab[sample(1:2, 500, replace = TRUE)]
microbenchmark(
    rbindFill(l),
    rbindFill2(l)
)
## Unit: microseconds
##           expr      min         lq       mean     median         uq        max neval cld
##   rbindFill(l) 42848.02 43808.9410 46841.6679 45196.9415 47320.9375 147111.882   100  a
##  rbindFill2(l)   768.51   815.9855   954.9411   905.7235   935.6075   2820.481   100   b
## 50x faster

## 1000
l <- ab[sample(1:2, 1000, replace = TRUE)]
microbenchmark(
    rbindFill(l),
    rbindFill2(l)
)
## Unit: milliseconds
##           expr       min        lq      mean    median         uq        max neval cld
##   rbindFill(l) 86.983247 91.242365 100.82433 98.708399 105.161711 205.098656   100  a
##  rbindFill2(l)  1.449975  1.502471   1.63093  1.604261   1.641351   2.801352   100   b
## 60x faster

## 10000
l <- ab[sample(1:2, 10000, replace = TRUE)]
microbenchmark(
    rbindFill(l),
    rbindFill2(l)
)
## Unit: milliseconds
##           expr       min        lq       mean     median         uq       max neval cld
##   rbindFill(l) 900.41956 948.91321 1033.17948 1030.97903 1092.76613 1328.7885   100  a
##  rbindFill2(l)  13.60664  14.14598   22.22862   15.85146   17.51346  155.2194   100   b
## 64x faster

Considerable performance improvement, in particular if many data.frames are merged. Performance gain is about half (i.e., up to 30x) if no column filling is needed.
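
The speed-up factors quoted in the benchmark comments can be cross-checked from the reported medians. The sketch below only redoes that arithmetic with the median values copied from the outputs above; the variable names are chosen for this example.

```r
## Median timings copied from the microbenchmark outputs above.
## The 100 and 500 runs are in microseconds, the 1000 and 10000 runs in
## milliseconds; the ratio within each run is unit-free.
med_rbindFill  <- c(n100 = 8882.896, n500 = 45196.9415,
                    n1000 = 98.708399, n10000 = 1030.97903)
med_rbindFill2 <- c(n100 = 285.874, n500 = 905.7235,
                    n1000 = 1.604261, n10000 = 15.85146)

## Speed-up of rbindFill2() over rbindFill() per list length.
speedup <- med_rbindFill / med_rbindFill2
round(speedup)  # roughly 31, 50, 62 and 65
```

Based on the medians, the gain grows with the number of data.frames in the list and levels off at around 60x to 65x.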

Summary

For cases where only data.frames are merged, using rbindlist() would improve the performance. The question is whether this gain is large enough to justify making MsCoreUtils depend on the data.table package. Any thoughts @lgatto @sgibb ?
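
One option that is not part of the proposal above would be a soft dependency: use data.table only if it is installed and fall back to base R otherwise. The sketch below illustrates the idea under that assumption; the helper name bindDataFrames() is hypothetical and the fallback is a simplified stand-in, not the actual rbindFill() code.

```r
## Hypothetical helper: bind a list of data.frames, filling missing columns.
## Uses data.table::rbindlist() when data.table is installed (e.g. listed
## under Suggests) and a plain base R fallback otherwise.
bindDataFrames <- function(l) {
    if (requireNamespace("data.table", quietly = TRUE))
        return(as.data.frame(
            data.table::rbindlist(l, use.names = TRUE, fill = TRUE)))
    ## Fallback: add missing columns as NA, align column order, then rbind.
    allnms <- unique(unlist(lapply(l, names)))
    l <- lapply(l, function(d) {
        miss <- setdiff(allnms, names(d))
        if (length(miss))
            d[miss] <- NA
        d[allnms]
    })
    do.call(rbind, l)
}

## Toy input (hypothetical data).
x <- data.frame(int = 1:2, char = c("a", "b"))
y <- data.frame(char = "c", num = 1.5)
res <- bindDataFrames(list(x, y))
```

With this approach users without data.table would simply not get the speed-up, at the cost of maintaining and testing two code paths.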
