NOTICE: While this R package was based on the ssGSEA2.0 repository, neither perform single-sample Gene Set Enrichment Analysis (ssGSEA) as originally described by Barbie, et al. (Barbie et al. 2009). They are instead modifications of pre-ranked GSEA that calculate the enrichment score (ES) differently and support testing directional gene sets (details below). The package and fast-ssGSEA name will be corrected in the near future.
fast.ssgsea is an R package (R Core Team 2024) for
a highly optimized variant of pre-ranked Gene Set Enrichment Analysis
(GSEA) (Subramanian et al. 2005). Unlike
standard GSEA, fast-ssGSEA is capable of testing gene sets where each
gene has an expected direction of change (up- or down-regulation;
indicated by appending a “;u” or “;d” to the end of every gene in a set)
from a prior experiment. fast-ssGSEA is based on Post-Translational
Modification Signature Enrichment Analysis (PTM-SEA) (Krug et al.
2019), and it borrows optimization techniques
from the simple implementation of Fast Gene Set Enrichment Analysis
(FGSEA-simple) (Korotkevich et al. 2021).
Also, while the enrichment score (ES) for standard GSEA is the most
extreme value of the running sum (see any GSEA paper for details), the
ES for fast-ssGSEA is the sum of all values of the running sum. This
change allows the ES to be computed much more quickly, making
fast-ssGSEA 1-2 orders of magnitude faster than FGSEA-simple (see
runtime measurements below). However, fast-ssGSEA it not able to
calculate arbitrarily small p-values like FGSEA-multilevel (Korotkevich
et al. 2021).
The primary function, fast_ssgsea, accepts a numeric matrix with genes
or other molecules as rows and either samples, contrasts, or some other
meaningful representation of the data as columns. The values in each
column must be approximately symmetric around zero, with more extreme
values indicating greater importance. A named list of gene sets (more
generally, molecular signatures) is also required. Other arguments
control the behavior of fast-ssGSEA, and they are described in the
function documentation.
The package also contains a read_gmt function, which reads a Gene
Matrix Transposed (GMT) file to construct a named list of gene sets for
use with fast_ssgsea.
R version 4.0.0 or greater is required to install fast.ssgsea.
It may be possible to get fast.ssgsea to work with older versions of R
by cloning the repository, changing the minimum R version in the
DESCRIPTION (e.g., to >= 3.6.0), and rebuilding the package, but users
should do so at their own risk.
A macOS binary is provided in the latest
release. Users looking to
build and install the development version of fast.ssgsea must have the
Xcode developer tools from Apple and a FORTRAN compiler installed. See
https://mac.r-project.org/tools/ for instructions.
No Windows binary is available, so
Rtools must be
installed to compile C and C++ code. Then, the development version of
fast.ssgsea can be installed with the code below.
Most Linux distributions come pre-packaged with tools to compile C and
C++ code, so no extra work needs to be done. Users can install the
development version of fast.ssgsea on Linux by running the code below.
The development version of fast.ssgsea can be installed with
# install.packages("pak")
pak::pak("pnnl/fast.ssgsea")or
# install.packages("devtools")
devtools::install_github("pnnl/fast.ssgsea")We will simulate a matrix with 10,000 genes as rows and one column. These simulated values represent signed gene-level statistics that might be produced from a package such as limma.
We will also generate 20,000 gene sets by randomly sampling between 5 and 1,000 genes.
n_genes <- 10000L # number of genes
genes <- paste0("gene", seq_len(n_genes))
# Simulate named vector of gene-level values
set.seed(9001L)
stats <- rnorm(n = n_genes)
names(stats) <- genes
# Simulate list of gene sets
n_sets <- 20000L # number of gene sets
min_size <- 5L # size of smallest gene set
max_size <- 1000L # size of largest gene set
# Set sizes are uniformly distributed between min_size and max_size
size_range <- max_size - min_size + 1L
n_reps <- ceiling(n_sets / size_range)
set_sizes <- rep(max_size:min_size, times = n_reps)[seq_len(n_sets)]
gene_sets <- lapply(seq_len(n_sets), function(i) {
set.seed(i)
sample(x = genes, size = set_sizes[i])
})
names(gene_sets) <- paste0("set", seq_along(gene_sets))This shows the runtime of fast_ssgsea on an AMD Ryzen 5 7600X CPU with
a clock speed of 4.7 GHz. A total of 100,000 permutations were used to
calculate P-values and normalized enrichment scores (NES).
library(fast.ssgsea)
# Runtime (in seconds)
system.time({
res <- fast_ssgsea(
stats = stats,
gene_sets = gene_sets,
alpha = 1,
nperm = 1e5L,
min_size = min_size,
seed = 0L
)
})## user system elapsed
## 1.367 0.088 1.374
str(res)## 'data.frame': 20000 obs. of 8 variables:
## $ set : chr "set18791" "set2830" "set19084" "set18223" ...
## $ set_size : int 138 163 841 706 801 87 503 409 320 450 ...
## $ ES : num -1866 1584 698 759 709 ...
## $ NES : num -5.34 4.78 4.66 4.67 4.62 ...
## $ n_same_sign : int 49235 51107 52907 52784 52813 50461 52351 51847 51728 47859 ...
## $ n_as_extreme: int 1 3 8 9 12 12 16 19 19 19 ...
## $ p_value : num 4.06e-05 7.83e-05 1.70e-04 1.89e-04 2.46e-04 ...
## $ adj_p_value : num 0.783 0.783 0.836 0.836 0.836 ...
print(sessionInfo(), locale = FALSE, tzone = FALSE)## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Linux Mint 22.1
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dqrng_0.4.1 fast.ssgsea_0.1.0.9027
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 collapse_2.1.3 fastmap_1.2.0 xfun_0.53
## [5] knitr_1.50 parallel_4.5.2 htmltools_0.5.8.1 rmarkdown_2.29
## [9] cli_3.6.5 data.table_1.17.8 compiler_4.5.2 rstudioapi_0.17.1
## [13] tools_4.5.2 evaluate_1.0.5 Rcpp_1.1.0 yaml_2.3.10
## [17] rlang_1.1.6
Tests were performed on a desktop computer with an AMD Ryzen 5 7600X CPU
running at 4.7 GHz, single threaded, to measure the runtime of
fast-ssGSEA (fast.ssgsea::fast_ssgsea) and FGSEA-simple
(fgsea::fgseaSimple). Different combinations of the number of gene
sets, maximum gene set size, and the number of permutations were tested
in a random order (3 replicates each) to minimize the influence of
previous runs. The R scripts and data are available in the simulation/
folder.
Like fast-ssGSEA, FGSEA-simple relies purely on the number of permutations to calculate p-values, which limits how small they can become. While FGSEA-simple is meant to be run with a smaller number of permutations and followed up by FGSEA-multilevel (the method capable of calculating arbitrarily small p-values), these results serve to illustrate the extreme difference in runtime between the two approaches. This difference is largely the result of changes to how the ES is defined.
Barbie, David A., Pablo Tamayo, Jesse S. Boehm, So Young Kim, Susan E. Moody, Ian F. Dunn, Anna C. Schinzel, et al. 2009. “Systematic RNA Interference Reveals That Oncogenic KRAS-Driven Cancers Require TBK1.” Nature 462 (7269): 108–12. https://doi.org/10.1038/nature08460.
Korotkevich, Gennady, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N. Artyomov, and Alexey Sergushichev. 2021. “Fast Gene Set Enrichment Analysis.” bioRxiv. https://doi.org/10.1101/060012.
Krug, Karsten, Philipp Mertins, Bin Zhang, Peter Hornbeck, Rajesh Raju, Rushdy Ahmad, Matthew Szucs, et al. 2019. “A Curated Resource for Phosphosite-Specific Signature Analysis.” Molecular & Cellular Proteomics 18 (3): 576–93. https://doi.org/10.1074/mcp.TIR118.000943.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

