Row-wise ECDF on SVT_SparseMatrix

Hi,

I'm interested in calculating the values of a so-called empirical cumulative distribution function ([ECDF](https://en.wikipedia.org/wiki/Empirical_distribution_function)) for every row of a `SVT_SparseMatrix` object. This kind of ECDF calculation is used in the [GSVA](https://bioconductor.org/packages/GSVA) package in the first step of the [GSVA algorithm](https://doi.org/10.1186/1471-2105-14-7). The current version of GSVA implements this calculation for dense base R matrices and sparse `dgCMatrix` objects. When the input is a sparse `dgCMatrix`, GSVA can calculate two different types of outputs: a dense output, as if one would coerce the input to a dense matrix and calculate the ECDF for every value of every row, or a sparse output, where GSVA calculates the ECDF for the nonzero values (only) of every row. It would be great if the SparseArray package would offer those two kinds of calculations for `SVT_SparseMatrix` objects.

ECDF values are calculated in two steps: (1) build the ECDF; and (2) evaluate the ECDF at the values of interest. I'm interested in evaluating the ECDF at the same values that I use to build it. The ECDF is calculated by first sorting the numerical values, then tabulating their occurrences, and finally calculating from the lowest to the highest value the running fraction of values smaller or equal than each value. The base R `ecdf()` function does this, e.g.:

```r
x <- c(4, 3, 2, 2)
f <- ecdf(x)
f
Empirical CDF 
Call: ecdf(x)
 x[1:3] =      2,      3,      4
f(3)
[1] 0.75
f(x)
f(x)
[1] 1.00 0.75 0.50 0.50
```
In this toy example, the 75% of values are equal or below 3, 100% of values are equal or below 4, and 50% of values are equal or below 2. Instead of using the base R `ecdf()` function, we can also do this calculation as follows:

```r
vals <- sort(unique(x))
mt <- match(x, vals)
tab <- tabulate(mt, nbins=length(vals))
ecdfvals <- cumsum(tab) / length(x)
ecdfvals[mt]
[1] 1.00 0.75 0.50 0.50
```
We can do this calculation for every row of a dense matrix with the following function:

```r
## using base R 'ecdf()' function
## input and output are a dense matrix
ecdfvals_dense_to_dense <- function(x) t(apply(x, 1, function(rx) ecdf(rx)(rx)))
```
If the input matrix is a sparse `dgCMatrix` and the output is dense, this can be implemented as follows:

```r
## input a sparse dgCMatrix, output a dense matrix
## ecdf values are calculated on both zero and nonzero values
ecdfvals_sparse_to_dense <- function(x) {
    stopifnot(is(x, "dgCMatrix")) ## QC
    res <- matrix(0, nrow(x), ncol(x))
    nc <- ncol(x)
    for (i in 1:nrow(x)) {
        rx <- x[i, , drop=FALSE]   ## drop=FALSE to keep w/ dgCMatrix
        vals <- unique(rx@x)       ## fetch unique nonzero values
        if (nnzero(rx) < nc)       ## if there is at least one zero
            vals <- c(0, vals)     ## add it to the unique values
        vals <- sort(vals)         ## sort unique values
        mt <- match(rx@x, vals)    ## tabulate unique nonzero values
        tab <- tabulate(mt, nbins=length(vals))
        whz <- which(tab == 0)     ## add counts of zero values
        if (nnzero(rx) < nc)       ## if present
            tab[whz] <- nc - nnzero(rx)
        ecdfvals <- cumsum(tab)/nc ## calculate ECDF values and put
        if (nnzero(rx) < nc)       ## them where they belong to
            res[i, ] <- rep(ecdfvals[whz], nc)
        res[i, which(diff(rx@p) > 0)] <- ecdfvals[mt]
        res
    }
    res
}
```
The following is a comparison of the performance of these two functions, which give the same result, and their current C implementations in the GSVA package:

```r
library(Matrix)
library(GSVA)

nr <- 500
nc <- 1000
p <- nr*nc
fnz <- 0.05

x <- numeric(p)
x[sample(1:p, size=ceiling(fnz*p), replace=FALSE)] <- rnorm(ceiling(fnz*p))
m <- matrix(x, nrow=nr, ncol=nc)
M <- Matrix(m, sparse=TRUE)

bench::mark(d2d=ecdfvals_dense_to_dense(m),
            d2d_c=GSVA:::.ecdfvals_dense_to_dense(m, verbose=FALSE),
            s2d=ecdfvals_sparse_to_dense(M),
            s2d_c=GSVA:::.ecdfvals_sparse_to_dense(M, verbose=FALSE))
# A tibble: 4 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 d2d         79.47ms  87.37ms     11.2    40.13MB     5.60     6     3
2 d2d_c       19.07ms  24.56ms     38.0    25.04MB    13.3     20     7
3 s2d        783.16ms 783.16ms      1.28     1.9GB    35.8      1    28
4 s2d_c        4.55ms   5.83ms    133.      5.41MB     9.90    67     5
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 
```
While in the dense-input and dense-output ("dense_to_dense") case, the C implementation is "only" about four times faster and consumes about 40% less memory, in the sparse-input and dense-output ("sparse_to_dense") case, the C implementation is two orders of magnitude more performant in terms of both speed and memory consumption.

An R implementation for the sparse-input and sparse-output could be the following:

```r
## input and output are a sparse dgCMatrix
## ecdf values are calculated on nonzero values only
ecdfvals_sparse_to_sparse <- function(x) {
    stopifnot(is(x, "dgCMatrix")) ## QC
    nc <- ncol(x)
    for (i in 1:nrow(x)) {
        rx <- x[i, , drop=FALSE]   ## drop=FALSE to keep w/ dgCMatrix
        vals <- unique(rx@x)       ## fetch unique nonzero values
        vals <- sort(vals)         ## sort unique values
        mt <- match(rx@x, vals)    ## tabulate unique nonzero values
        tab <- tabulate(mt, nbins=length(vals))
                                   ## calculate ECDF values and put
        ecdfvals <- cumsum(tab)/nnzero(rx)
                                   ## them where they belong to
        x[i, which(diff(rx@p) > 0)] <- ecdfvals[mt]
    }
    x
}
```
The benchmarking of this R implementation and the current C implementation in GSVA is as follows:

```r
bench::mark(s2s=ecdfvals_sparse_to_sparse(M),
            s2s_c=GSVA:::.ecdfvals_sparse_to_sparse(M, verbose=FALSE))
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 s2s        282.35ms 500.34ms      2.00  174.68MB     5.00     2     5
2 s2s_c        3.45ms   4.04ms    216.      1.83MB     3.97   109     2
# ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 
```
So, again the C implementation is two orders of magnitude more performant than the R implementation in both speed and memory consumption. In the C implementation, when the input is a sparse `dgCMatrix`, a `RsparseMatrix` copy of the input is created first, and then used to efficiently go through the rows.

If you consider that such calculations for `SVT_SparseMatrix` objects don't belong to this package, I'd appreciate if you could sketch me how could I implement this efficiently for `SVT_SparseMatrix` objects.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Row-wise ECDF on SVT_SparseMatrix #20

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Row-wise ECDF on SVT_SparseMatrix #20

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions