Commit ecb9c6f

Remove old data, figures
Remove old data and figures, simplify README text, and add timing results for Fast Gene Set Enrichment Analysis (FGSEA).
1 parent daf96de commit ecb9c6f

24 files changed: 196 additions and 1,287 deletions
README.Rmd

Lines changed: 11 additions & 14 deletions
@@ -19,12 +19,13 @@ knitr::opts_chunk$set(
1919

2020
<!-- badges: start -->
2121
[![R-CMD-check](https://github.com/pnnl/fast.ssgsea/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/pnnl/fast.ssgsea/actions/workflows/R-CMD-check.yaml)
22-
[![DOI](https://zenodo.org/badge/394311897.svg)](https://doi.org/10.5281/zenodo.16783102)
2322
<!-- badges: end -->
2423

25-
`fast.ssgsea` is an R package [@R-core-team] for fast Single-Sample Gene Set Enrichment Analysis (ssGSEA) and Post-Translational Modification Signature Enrichment Analysis (PTM-SEA) [@barbie-systematic-2009; @krug-curated-2019].
24+
`fast.ssgsea` is an R package [@R-core-team] for fast gene permutation Gene Set Enrichment Analysis (GSEA) and Post-Translational Modification Signature Enrichment Analysis (PTM-SEA) [@subramanian-gene-2005; @krug-curated-2019].
2625

27-
The primary function, `fast_ssgsea`, accepts a numeric matrix with genes or other molecules as rows and either samples, contrasts, or some other meaningful representation of the data as columns. A named list of gene sets (more generally, molecular signatures) is also required. Other arguments control the behavior of ssGSEA/PTM-SEA, and they are described in the function documentation.
26+
**NOTE:** Support for directional databases, such as PTMsigDB, is broken starting with version 0.1.0.9018. Until this is fixed, PTM-SEA is not supported.
27+
28+
The primary function, `fast_ssgsea`, accepts a numeric matrix with genes or other molecules as rows and either samples, contrasts, or some other meaningful representation of the data as columns. A named list of gene sets (more generally, molecular signatures) is also required. Other arguments control the behavior of GSEA/PTM-SEA, and they are described in the function documentation.
2829

2930
The package also contains a `read_gmt` function, which reads a Gene Matrix Transposed (GMT) file to construct a named list of gene sets for use with `fast_ssgsea`.
3031

@@ -56,11 +57,11 @@ pak::pak("pnnl/fast.ssgsea")
5657

5758
### Simulate Data
5859

59-
We will simulate a matrix with 10,000 genes as rows and 100 samples as columns. Then, we generate 20,000 gene sets by randomly sampling between 10 and 500 genes from the matrix row names.
60+
We will simulate a matrix with 10,000 genes as rows and one column. Then, we generate 20,000 gene sets by randomly sampling between 5 and 1,000 genes.
6061

6162
```{r simulate-data}
6263
n_genes <- 10000L # number of genes
63-
n_samples <- 100L # number of samples
64+
n_samples <- 1L # number of samples (>= 1)
6465
genes <- paste0("gene", seq_len(n_genes))
6566
samples <- paste0("sample", seq_len(n_samples))
6667
@@ -91,7 +92,7 @@ names(gene_sets) <- paste0("set", seq_along(gene_sets))
9192

9293
### Runtime and Results
9394

94-
This shows the runtime of `fast_ssgsea` running on an AMD Ryzen 5 7600X CPU with a clock speed of 4.7 GHz.
95+
This shows the runtime of `fast_ssgsea` running on an AMD Ryzen 5 7600X CPU with a clock speed of 4.7 GHz. A total of 10,000 permutations were used to calculate p-values and normalized enrichment scores (NES).
9596

9697
```{r time-results}
9798
library(fast.ssgsea)
@@ -102,11 +103,8 @@ system.time({
102103
X = X,
103104
gene_sets = gene_sets,
104105
alpha = 1,
105-
nperm = 1000L,
106-
batch_size = 1000L,
107-
adjust_globally = FALSE,
106+
nperm = 10000L, # default is 1000
108107
min_size = min_size,
109-
sort = TRUE,
110108
seed = 0L
111109
)
112110
})
@@ -123,15 +121,14 @@ print(sessionInfo(), locale = FALSE, tzone = FALSE)
123121

124122
## Performance
125123

126-
The `fast.ssgsea` R package utilizes linear algebra and ideas from Fast Gene Set Enrichment Analysis [@korotkevich-fast-2021] to greatly reduce the runtime of gene permutation GSEA and PTM-SEA.
124+
The `fast.ssgsea` R package utilizes linear algebra and ideas from Fast Gene Set Enrichment Analysis [@korotkevich-fast-2021] to greatly reduce the runtime.
127125

128-
Tests were performed on a desktop computer with an AMD Ryzen 5 7600X CPU (6 cores, 12 threads) at 4.7 GHz. Different combinations of the number of samples, gene sets, maximum gene set size, number of permutations, and value of the $\alpha$ parameter (the weighting exponent) were tested in a random order (3 replicates each) to minimize the influence of previous runs.
126+
Tests were performed on a desktop computer with an AMD Ryzen 5 7600X CPU (6 cores, 12 threads) at 4.7 GHz. Different combinations of the number of gene sets, maximum gene set size, number of permutations, and value of the $\alpha$ parameter (the weighting exponent) were tested in a random order (3 replicates each) to minimize the influence of previous runs.
129127

130128
```{r, echo=FALSE}
131-
fig_cap <- "Runtime of fast_ssgsea with A) 1,000 or B) 10,000 permutations."
129+
fig_cap <- "Runtime of fast_ssgsea with A) 10,000, B) 100,000, or C) 1,000,000 permutations."
132130
```
133131

134-
135132
```{r, echo=FALSE, fig.cap=fig_cap}
136133
knitr::include_graphics("./man/figures/README-figure-1.png")
137134
```
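
The hunks above skip the body of the `simulate-data` chunk, so the exact simulation code is not visible in this commit. The following is a minimal base R sketch of what the README text describes (10,000 genes, one column, 20,000 gene sets of 5 to 1,000 genes). The use of `rnorm`, the `n_sets` and `set_sizes` names, and the seed are assumptions made for illustration, not the package author's code.

```r
set.seed(0) # assumed seed, for reproducibility of this sketch only

n_genes <- 10000L # number of genes
n_samples <- 1L   # number of samples (>= 1)
n_sets <- 20000L  # number of gene sets (hypothetical variable name)

genes <- paste0("gene", seq_len(n_genes))
samples <- paste0("sample", seq_len(n_samples))

# Numeric matrix with genes as rows and samples as columns.
# Standard normal values are an assumption; the README does not
# state the distribution in the lines shown here.
X <- matrix(
  rnorm(n_genes * n_samples),
  nrow = n_genes,
  ncol = n_samples,
  dimnames = list(genes, samples)
)

# Draw a size between 5 and 1,000 for each set, then sample that
# many genes without replacement from the matrix row names.
set_sizes <- sample(5:1000, size = n_sets, replace = TRUE)
gene_sets <- lapply(set_sizes, function(k) sample(genes, size = k))
names(gene_sets) <- paste0("set", seq_along(gene_sets))
```

The result is a named list of character vectors, which matches the "named list of gene sets" input that the README says `fast_ssgsea` requires.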

README.md

Lines changed: 45 additions & 42 deletions
@@ -16,20 +16,23 @@
1616
<!-- badges: start -->
1717

1818
[![R-CMD-check](https://github.com/pnnl/fast.ssgsea/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/pnnl/fast.ssgsea/actions/workflows/R-CMD-check.yaml)
19-
[![DOI](https://zenodo.org/badge/394311897.svg)](https://doi.org/10.5281/zenodo.16783102)
2019
<!-- badges: end -->
2120

2221
`fast.ssgsea` is an R package ([R Core Team 2024](#ref-R-core-team)) for
23-
fast Single-Sample Gene Set Enrichment Analysis (ssGSEA) and
22+
fast gene permutation Gene Set Enrichment Analysis (GSEA) and
2423
Post-Translational Modification Signature Enrichment Analysis (PTM-SEA)
25-
([Barbie et al. 2009](#ref-barbie-systematic-2009); [Krug et al.
24+
([Subramanian et al. 2005](#ref-subramanian-gene-2005); [Krug et al.
2625
2019](#ref-krug-curated-2019)).
2726

27+
**NOTE:** Support for directional databases, such as PTMsigDB, is broken
28+
starting with version 0.1.0.9018. Until this is fixed, PTM-SEA is not
29+
supported.
30+
2831
The primary function, `fast_ssgsea`, accepts a numeric matrix with genes
2932
or other molecules as rows and either samples, contrasts, or some other
3033
meaningful representation of the data as columns. A named list of gene
3134
sets (more generally, molecular signatures) is also required. Other
32-
arguments control the behavior of ssGSEA/PTM-SEA, and they are described
35+
arguments control the behavior of GSEA/PTM-SEA, and they are described
3336
in the function documentation.
3437

3538
The package also contains a `read_gmt` function, which reads a Gene
@@ -72,13 +75,13 @@ pak::pak("pnnl/fast.ssgsea")
7275

7376
### Simulate Data
7477

75-
We will simulate a matrix with 10,000 genes as rows and 100 samples as
76-
columns. Then, we generate 20,000 gene sets by randomly sampling between
77-
10 and 500 genes from the matrix row names.
78+
We will simulate a matrix with 10,000 genes as rows and one column.
79+
Then, we generate 20,000 gene sets by randomly sampling between 5 and
80+
1,000 genes.
7881

7982
``` r
8083
n_genes <- 10000L # number of genes
81-
n_samples <- 100L # number of samples
84+
n_samples <- 1L # number of samples (>= 1)
8285
genes <- paste0("gene", seq_len(n_genes))
8386
samples <- paste0("sample", seq_len(n_samples))
8487

@@ -110,7 +113,8 @@ names(gene_sets) <- paste0("set", seq_along(gene_sets))
110113
### Runtime and Results
111114

112115
This shows the runtime of `fast_ssgsea` running on an AMD Ryzen 5 7600X
113-
CPU with a clock speed of 4.7 GHz.
116+
CPU with a clock speed of 4.7 GHz. A total of 10,000 permutations were
117+
used to calculate p-values and normalized enrichment scores (NES).
114118

115119
``` r
116120
library(fast.ssgsea)
@@ -121,33 +125,30 @@ system.time({
121125
X = X,
122126
gene_sets = gene_sets,
123127
alpha = 1,
124-
nperm = 1000L,
125-
batch_size = 1000L,
126-
adjust_globally = FALSE,
128+
nperm = 10000L, # default is 1000
127129
min_size = min_size,
128-
sort = TRUE,
129130
seed = 0L
130131
)
131132
})
132133
```
133134

134135
## user system elapsed
135-
## 15.572 1.352 9.120
136+
## 2.655 0.820 3.001
136137

137138
``` r
138139
str(res)
139140
```
140141

141-
## 'data.frame': 2000000 obs. of 9 variables:
142-
## $ sample : Factor w/ 100 levels "sample1","sample2",..: 1 1 1 1 1 1 1 1 1 1 ...
143-
## $ set : chr "set4576" "set12526" "set11427" "set9645" ...
144-
## $ set_size : int 409 427 530 320 320 977 519 517 511 841 ...
145-
## $ ES : num 929 861 693 1043 898 ...
146-
## $ NES : num 4.4 4.13 3.72 4.22 3.64 ...
147-
## $ n_same_sign : int 544 539 536 534 534 525 521 521 521 520 ...
148-
## $ n_as_extreme: int 0 0 0 0 0 0 0 0 0 0 ...
149-
## $ p_value : num 0.00183 0.00185 0.00186 0.00187 0.00187 ...
150-
## $ adj_p_value : num 0.838 0.838 0.838 0.838 0.838 ...
142+
## 'data.frame': 20000 obs. of 9 variables:
143+
## $ sample : Factor w/ 1 level "sample1": 1 1 1 1 1 1 1 1 1 1 ...
144+
## $ set : chr "set5945" "set18791" "set19084" "set16136" ...
145+
## $ set_size : int 36 138 841 801 45 749 761 450 706 163 ...
146+
## $ ES : num 2688 -1866 698 709 2333 ...
147+
## $ NES : num 3.9 -5.33 4.65 4.61 3.8 ...
148+
## $ n_same_sign : int 5049 4962 5226 5210 5058 4799 4784 4771 5200 5080 ...
149+
## $ n_as_extreme: int 0 0 1 1 1 1 1 1 2 2 ...
150+
## $ p_value : num 0.000198 0.000201 0.000383 0.000384 0.000395 ...
151+
## $ adj_p_value : num 0.937 0.937 0.937 0.937 0.937 ...
151152

152153
### Session Information
153154

@@ -167,7 +168,7 @@ print(sessionInfo(), locale = FALSE, tzone = FALSE)
167168
## [1] stats graphics grDevices utils datasets methods base
168169
##
169170
## other attached packages:
170-
## [1] fast.ssgsea_0.1.0.9017
171+
## [1] fast.ssgsea_0.1.0.9018
171172
##
172173
## loaded via a namespace (and not attached):
173174
## [1] dqrng_0.4.1 digest_0.6.37 RcppArmadillo_15.0.2-2
@@ -182,22 +183,22 @@ print(sessionInfo(), locale = FALSE, tzone = FALSE)
182183

183184
The `fast.ssgsea` R package utilizes linear algebra and ideas from Fast
184185
Gene Set Enrichment Analysis ([Korotkevich et al.
185-
2021](#ref-korotkevich-fast-2021)) to greatly reduce the runtime of gene
186-
permutation GSEA and PTM-SEA.
186+
2021](#ref-korotkevich-fast-2021)) to greatly reduce the runtime.
187187

188188
Tests were performed on a desktop computer with an AMD Ryzen 5 7600X CPU
189189
(6 cores, 12 threads) at 4.7 GHz. Different combinations of the number
190-
of samples, gene sets, maximum gene set size, number of permutations,
191-
and value of the $\alpha$ parameter (the weighting exponent) were tested
192-
in a random order (3 replicates each) to minimize the influence of
193-
previous runs.
190+
of gene sets, maximum gene set size, number of permutations, and value
191+
of the $\alpha$ parameter (the weighting exponent) were tested in a
192+
random order (3 replicates each) to minimize the influence of previous
193+
runs.
194194

195195
<div class="figure" style="text-align: center">
196196

197-
<img src="./man/figures/README-figure-1.png" alt="Runtime of fast_ssgsea with A) 1,000 or B) 10,000 permutations." width="749" />
197+
<img src="./man/figures/README-figure-1.png" alt="Runtime of fast_ssgsea with A) 10,000, B) 100,000, or C) 1,000,000 permutations." width="648" />
198198
<p class="caption">
199199

200-
Runtime of fast_ssgsea with A) 1,000 or B) 10,000 permutations.
200+
Runtime of fast_ssgsea with A) 10,000, B) 100,000, or C) 1,000,000
201+
permutations.
201202
</p>
202203

203204
</div>
@@ -207,15 +208,6 @@ Runtime of fast_ssgsea with A) 1,000 or B) 10,000 permutations.
207208
<div id="refs" class="references csl-bib-body hanging-indent"
208209
entry-spacing="0">
209210

210-
<div id="ref-barbie-systematic-2009" class="csl-entry">
211-
212-
Barbie, David A., Pablo Tamayo, Jesse S. Boehm, So Young Kim, Susan E.
213-
Moody, Ian F. Dunn, Anna C. Schinzel, et al. 2009. “Systematic RNA
214-
Interference Reveals That Oncogenic KRAS-Driven Cancers Require TBK1.”
215-
*Nature* 462 (7269): 108–12. <https://doi.org/10.1038/nature08460>.
216-
217-
</div>
218-
219211
<div id="ref-korotkevich-fast-2021" class="csl-entry">
220212

221213
Korotkevich, Gennady, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim
@@ -241,4 +233,15 @@ Computing*. Vienna, Austria: R Foundation for Statistical Computing.
241233

242234
</div>
243235

236+
<div id="ref-subramanian-gene-2005" class="csl-entry">
237+
238+
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee,
239+
Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005.
240+
“Gene Set Enrichment Analysis: A Knowledge-Based Approach for
241+
Interpreting Genome-Wide Expression Profiles.” *Proceedings of the
242+
National Academy of Sciences* 102 (43): 15545–50.
243+
<https://doi.org/10.1073/pnas.0506580102>.
244+
245+
</div>
246+
244247
</div>
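
The `str(res)` output in the README.md hunk above lists `n_same_sign`, `n_as_extreme`, and `p_value` columns. The printed values are consistent with the common permutation p-value estimate `(n_as_extreme + 1) / (n_same_sign + 1)`. This relationship is inferred from the numbers shown in the diff, not confirmed from the package source, so treat the sketch below as an assumption. The data frame `res_head` is a hypothetical stand-in built from the first five rows of the new output.

```r
# First five rows of the new str(res) output, copied from the diff.
res_head <- data.frame(
  n_same_sign = c(5049L, 4962L, 5226L, 5210L, 5058L),
  n_as_extreme = c(0L, 0L, 1L, 1L, 1L)
)

# Assumed formula: add one to the numerator and the denominator so
# that the estimated p-value can never be exactly zero.
res_head$p_value <- (res_head$n_as_extreme + 1) /
  (res_head$n_same_sign + 1)

signif(res_head$p_value, 3)
# 0.000198 0.000201 0.000383 0.000384 0.000395
```

These match the `p_value` entries printed in the README. The same formula also reproduces the first value in the removed output, where 0 of 544 same-sign permutations were as extreme: `1 / 545` rounds to 0.00183.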

man/figures/README-figure-1.png

26.7 KB

references.bib

Lines changed: 13 additions & 58 deletions
@@ -1,18 +1,18 @@
1-
@article{barbie-systematic-2009,
2-
title = {Systematic {RNA} interference reveals that oncogenic {KRAS}-driven cancers require {TBK1}},
3-
volume = {462},
4-
copyright = {http://www.springer.com/tdm},
5-
issn = {0028-0836, 1476-4687},
6-
url = {https://www.nature.com/articles/nature08460},
7-
doi = {10.1038/nature08460},
1+
@article{subramanian-gene-2005,
2+
title = {Gene set enrichment analysis: {A} knowledge-based approach for interpreting genome-wide expression profiles},
3+
volume = {102},
4+
issn = {0027-8424, 1091-6490},
5+
shorttitle = {Gene set enrichment analysis},
6+
url = {https://pnas.org/doi/full/10.1073/pnas.0506580102},
7+
doi = {10.1073/pnas.0506580102},
88
language = {en},
9-
number = {7269},
9+
number = {43},
1010
urldate = {2025-01-17},
11-
journal = {Nature},
12-
author = {Barbie, David A. and Tamayo, Pablo and Boehm, Jesse S. and Kim, So Young and Moody, Susan E. and Dunn, Ian F. and Schinzel, Anna C. and Sandy, Peter and Meylan, Etienne and Scholl, Claudia and Fröhling, Stefan and Chan, Edmond M. and Sos, Martin L. and Michel, Kathrin and Mermel, Craig and Silver, Serena J. and Weir, Barbara A. and Reiling, Jan H. and Sheng, Qing and Gupta, Piyush B. and Wadlow, Raymond C. and Le, Hanh and Hoersch, Sebastian and Wittner, Ben S. and Ramaswamy, Sridhar and Livingston, David M. and Sabatini, David M. and Meyerson, Matthew and Thomas, Roman K. and Lander, Eric S. and Mesirov, Jill P. and Root, David E. and Gilliland, D. Gary and Jacks, Tyler and Hahn, William C.},
13-
month = nov,
14-
year = {2009},
15-
pages = {108--112},
11+
journal = {Proceedings of the National Academy of Sciences},
12+
author = {Subramanian, Aravind and Tamayo, Pablo and Mootha, Vamsi K. and Mukherjee, Sayan and Ebert, Benjamin L. and Gillette, Michael A. and Paulovich, Amanda and Pomeroy, Scott L. and Golub, Todd R. and Lander, Eric S. and Mesirov, Jill P.},
13+
month = oct,
14+
year = {2005},
15+
pages = {15545--15550},
1616
}
1717

1818
@article{krug-curated-2019,
@@ -52,48 +52,3 @@ @Manual{R-core-team
5252
year = {2024},
5353
url = {https://www.R-project.org/},
5454
}
55-
56-
@inproceedings{openblas-1,
57-
author={Xianyi, Zhang and Qian, Wang and Yunquan, Zhang},
58-
booktitle={2012 IEEE 18th International Conference on Parallel and Distributed Systems},
59-
title={Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor},
60-
year={2012},
61-
volume={},
62-
number={},
63-
pages={684-691},
64-
doi={10.1109/ICPADS.2012.97},
65-
}
66-
67-
@inproceedings{openblas-2,
68-
author = {Wang, Qian and Zhang, Xianyi and Zhang, Yunquan and Yi, Qing},
69-
title = {AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs},
70-
year = {2013},
71-
isbn = {9781450323789},
72-
publisher = {Association for Computing Machinery},
73-
address = {New York, NY, USA},
74-
url = {https://doi.org/10.1145/2503210.2503219},
75-
doi = {10.1145/2503210.2503219},
76-
booktitle = {Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis},
77-
articleno = {25},
78-
numpages = {12},
79-
location = {Denver, Colorado},
80-
series = {SC '13},
81-
}
82-
83-
@article{blas,
84-
author = {Lawson, C. L. and Hanson, R. J. and Kincaid, D. R. and Krogh, F. T.},
85-
title = {Basic Linear Algebra Subprograms for {Fortran} Usage},
86-
year = {1979},
87-
issue_date = {Sept. 1979},
88-
publisher = {Association for Computing Machinery},
89-
address = {New York, NY, USA},
90-
volume = {5},
91-
number = {3},
92-
issn = {0098-3500},
93-
url = {https://doi.org/10.1145/355841.355847},
94-
doi = {10.1145/355841.355847},
95-
journal = {ACM Trans. Math. Softw.},
96-
month = sep,
97-
pages = {308–323},
98-
numpages = {16}
99-
}
1.23 KB
Binary file not shown.
-683 Bytes
Binary file not shown.
1.2 KB
Binary file not shown.
-1.51 KB
Binary file not shown.
-1.09 KB
Binary file not shown.
-1.47 KB
Binary file not shown.
