-
Notifications
You must be signed in to change notification settings - Fork 3
Description
Hi,
I'm interested in calculating the values of a so-called empirical cumulative distribution function (ECDF) for every row of a SVT_SparseMatrix object. This kind of ECDF calculation is used in the GSVA package in the first step of the GSVA algorithm. The current version of GSVA implements this calculation for dense base R matrices and sparse dgCMatrix objects. When the input is a sparse dgCMatrix, GSVA can calculate two different types of outputs: a dense output, as if one would coerce the input to a dense matrix and calculate the ECDF for every value of every row, or a sparse output, where GSVA calculates the ECDF for the nonzero values (only) of every row. It would be great if the SparseArray package would offer those two kinds of calculations for SVT_SparseMatrix objects.
ECDF values are calculated in two steps: (1) build the ECDF; and (2) evaluate the ECDF at the values of interest. I'm interested in evaluating the ECDF at the same values that I use to build it. The ECDF is calculated by first sorting the numerical values, then tabulating their occurrences, and finally calculating from the lowest to the highest value the running fraction of values smaller or equal than each value. The base R ecdf() function does this, e.g.:
x <- c(4, 3, 2, 2)
f <- ecdf(x)
f
Empirical CDF
Call: ecdf(x)
x[1:3] = 2, 3, 4
f(3)
[1] 0.75
f(x)
f(x)
[1] 1.00 0.75 0.50 0.50In this toy example, the 75% of values are equal or below 3, 100% of values are equal or below 4, and 50% of values are equal or below 2. Instead of using the base R ecdf() function, we can also do this calculation as follows:
vals <- sort(unique(x))
mt <- match(x, vals)
tab <- tabulate(mt, nbins=length(vals))
ecdfvals <- cumsum(tab) / length(x)
ecdfvals[mt]
[1] 1.00 0.75 0.50 0.50We can do this calculation for every row of a dense matrix with the following function:
## using base R 'ecdf()' function
## input and output are a dense matrix
ecdfvals_dense_to_dense <- function(x) t(apply(x, 1, function(rx) ecdf(rx)(rx)))If the input matrix is a sparse dgCMatrix and the output is dense, this can be implemented as follows:
## input a sparse dgCMatrix, output a dense matrix
## ecdf values are calculated on both zero and nonzero values
ecdfvals_sparse_to_dense <- function(x) {
stopifnot(is(x, "dgCMatrix")) ## QC
res <- matrix(0, nrow(x), ncol(x))
nc <- ncol(x)
for (i in 1:nrow(x)) {
rx <- x[i, , drop=FALSE] ## drop=FALSE to keep w/ dgCMatrix
vals <- unique(rx@x) ## fetch unique nonzero values
if (nnzero(rx) < nc) ## if there is at least one zero
vals <- c(0, vals) ## add it to the unique values
vals <- sort(vals) ## sort unique values
mt <- match(rx@x, vals) ## tabulate unique nonzero values
tab <- tabulate(mt, nbins=length(vals))
whz <- which(tab == 0) ## add counts of zero values
if (nnzero(rx) < nc) ## if present
tab[whz] <- nc - nnzero(rx)
ecdfvals <- cumsum(tab)/nc ## calculate ECDF values and put
if (nnzero(rx) < nc) ## them where they belong to
res[i, ] <- rep(ecdfvals[whz], nc)
res[i, which(diff(rx@p) > 0)] <- ecdfvals[mt]
res
}
res
}The following is a comparison of the performance of these two functions, which give the same result, and their current C implementations in the GSVA package:
library(Matrix)
library(GSVA)
nr <- 500
nc <- 1000
p <- nr*nc
fnz <- 0.05
x <- numeric(p)
x[sample(1:p, size=ceiling(fnz*p), replace=FALSE)] <- rnorm(ceiling(fnz*p))
m <- matrix(x, nrow=nr, ncol=nc)
M <- Matrix(m, sparse=TRUE)
bench::mark(d2d=ecdfvals_dense_to_dense(m),
d2d_c=GSVA:::.ecdfvals_dense_to_dense(m, verbose=FALSE),
s2d=ecdfvals_sparse_to_dense(M),
s2d_c=GSVA:::.ecdfvals_sparse_to_dense(M, verbose=FALSE))
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 d2d 79.47ms 87.37ms 11.2 40.13MB 5.60 6 3
2 d2d_c 19.07ms 24.56ms 38.0 25.04MB 13.3 20 7
3 s2d 783.16ms 783.16ms 1.28 1.9GB 35.8 1 28
4 s2d_c 4.55ms 5.83ms 133. 5.41MB 9.90 67 5
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
# time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. While in the dense-input and dense-output ("dense_to_dense") case, the C implementation is "only" about four times faster and consumes about 40% less memory, in the sparse-input and dense-output ("sparse_to_dense") case, the C implementation is two orders of magnitude more performant in terms of both speed and memory consumption.
An R implementation for the sparse-input and sparse-output could be the following:
## input and output are a sparse dgCMatrix
## ecdf values are calculated on nonzero values only
ecdfvals_sparse_to_sparse <- function(x) {
stopifnot(is(x, "dgCMatrix")) ## QC
nc <- ncol(x)
for (i in 1:nrow(x)) {
rx <- x[i, , drop=FALSE] ## drop=FALSE to keep w/ dgCMatrix
vals <- unique(rx@x) ## fetch unique nonzero values
vals <- sort(vals) ## sort unique values
mt <- match(rx@x, vals) ## tabulate unique nonzero values
tab <- tabulate(mt, nbins=length(vals))
## calculate ECDF values and put
ecdfvals <- cumsum(tab)/nnzero(rx)
## them where they belong to
x[i, which(diff(rx@p) > 0)] <- ecdfvals[mt]
}
x
}The benchmarking of this R implementation and the current C implementation in GSVA is as follows:
bench::mark(s2s=ecdfvals_sparse_to_sparse(M),
s2s_c=GSVA:::.ecdfvals_sparse_to_sparse(M, verbose=FALSE))
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 s2s 282.35ms 500.34ms 2.00 174.68MB 5.00 2 5
2 s2s_c 3.45ms 4.04ms 216. 1.83MB 3.97 109 2
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
# time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. So, again the C implementation is two orders of magnitude more performant than the R implementation in both speed and memory consumption. In the C implementation, when the input is a sparse dgCMatrix, a RsparseMatrix copy of the input is created first, and then used to efficiently go through the rows.
If you consider that such calculations for SVT_SparseMatrix objects don't belong to this package, I'd appreciate if you could sketch me how could I implement this efficiently for SVT_SparseMatrix objects.