Commit c1d62bc

Clean up documentation
1 parent 3210843 commit c1d62bc

3 files changed

+80
-48
lines changed

README.md

Lines changed: 21 additions & 5 deletions
@@ -5,13 +5,29 @@
55
[![Travis Build Status](https://travis-ci.com/igordot/msigdbr.svg?branch=master)](https://travis-ci.com/igordot/msigdbr)
66
[![codecov](https://codecov.io/gh/igordot/msigdbr/branch/master/graph/badge.svg)](https://codecov.io/gh/igordot/msigdbr)
77

8+
## Overview
9+
810
The `msigdbr` R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:
911

10-
* in an R-friendly tidy format (a data frame in a "long" format with one gene per row)
11-
* for multiple frequently studied model organisms (human, mouse, rat, pig, zebrafish, fly, yeast, etc.)
12-
* as both gene symbols and NCBI/Entrez Gene IDs (for better compatibility with pathway enrichment tools)
13-
* that can be used in a script without requiring additional external files
12+
* in an R-friendly tidy/long format with one gene per row
13+
* for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
14+
* as both gene symbols and NCBI/Entrez Gene IDs for better compatibility with pathway enrichment tools
15+
* that can be installed and loaded as a package without requiring additional external files
16+
17+
## Installation
18+
19+
The package can be installed from [CRAN](https://cran.r-project.org/package=msigdbr).
20+
21+
```{r}
22+
install.packages("msigdbr")
23+
```
24+
25+
## Usage
1426

15-
The package is available on [CRAN](https://cran.r-project.org/package=msigdbr).
27+
The package data can be accessed using the `msigdbr()` function, which returns a data frame of gene sets and their member genes. For example, you can retrieve mouse genes from the C2 (curated) CGP (chemical and genetic perturbations) gene sets.
1628

29+
```{r}
30+
genesets = msigdbr(species = "Mus musculus", category = "C2", subcategory = "CGP")
31+
```
1732

33+
Check the [documentation website](https://igordot.github.io/msigdbr) for more information.

_pkgdown.yml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,21 @@
1+
url: https://igordot.github.io/msigdbr
2+
13
template:
24
params:
35
bootswatch: readable
6+
ganalytics: UA-130263433-2
7+
8+
navbar:
9+
structure:
10+
left: []
11+
right: [home, intro, reference, articles, news, github]
12+
components:
13+
home: ~
14+
articles: ~
15+
intro:
16+
text: Get started
17+
href: articles/msigdbr-intro.html
18+
19+
home:
20+
title: MSigDB gene sets R package
21+
strip_header: true

vignettes/msigdbr-intro.Rmd

Lines changed: 41 additions & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,10 @@
11
---
2-
title: "Introduction to the msigdbr package"
2+
title: "Introduction to msigdbr"
33
output:
44
rmarkdown::html_vignette:
55
keep_md: true
66
vignette: >
7-
%\VignetteIndexEntry{Introduction to the msigdbr package}
7+
%\VignetteIndexEntry{Introduction to msigdbr}
88
%\VignetteEngine{knitr::rmarkdown}
99
%\VignetteEncoding{UTF-8}
1010
---
@@ -16,21 +16,23 @@ knitr::opts_chunk$set(
1616
)
1717
# increase the screen width
1818
options(width = 90)
19-
# reduce the minimum number of characters for the tibble column titles
20-
options(pillar.min_title_chars = 8)
19+
# reduce the minimum number of characters for the tibble column titles (default: 15)
20+
options(pillar.min_title_chars = 10)
21+
# increase the maximum number of rows printed (default: 20)
22+
options(tibble.print_max = 25)
2123
```
2224

2325
## Overview
2426

25-
Performing pathway analysis is a common task in genomics and there are many available software tools, many of which are R-based.
26-
Depending on the tool, it may be necessary to import the pathways into R, translate genes to the appropriate species, convert between symbols and IDs, and format the object in the required way.
27+
Pathway analysis is a common task in genomics research and there are many available R-based software tools.
28+
Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.
2729

2830
The `msigdbr` R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:
2931

30-
* in an R-friendly tidy format (a data frame in a "long" format with one gene per row)
31-
* for multiple frequently studied model organisms (human, mouse, rat, pig, zebrafish, fly, yeast, etc.)
32-
* as both gene symbols and NCBI/Entrez Gene IDs (for better compatibility with pathway enrichment tools)
33-
* that can be used in a script without requiring additional external files
32+
* in an R-friendly tidy/long format with one gene per row
33+
* for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
34+
* as both gene symbols and NCBI/Entrez Gene IDs for better compatibility with pathway enrichment tools
35+
* that can be installed and loaded as a package without requiring additional external files
3436

3537
Please be aware that the homologs were computationally predicted for distinct genes.
3638
The full pathways may not be well conserved across species.
@@ -51,40 +53,24 @@ Load package.
5153
library(msigdbr)
5254
```
5355

54-
Retrieve the gene sets data frame. In this example, for the hallmark collection.
55-
56-
```{r get-human-h}
57-
h_gene_sets = msigdbr(species = "Homo sapiens", category = "H")
58-
head(h_gene_sets)
59-
```
60-
61-
There is a helper function to show the available species.
62-
63-
```{r show-species}
64-
msigdbr_show_species()
65-
```
66-
6756
All gene sets in the database can be retrieved without specifying a collection/category.
6857

6958
```{r get-human-all}
70-
all_gene_sets = msigdbr(species = "Homo sapiens")
59+
all_gene_sets = msigdbr(species = "Mus musculus")
7160
head(all_gene_sets)
7261
```
7362

74-
The `msigdbr()` function output is a data frame and can be manipulated using more standard methods.
63+
There is a helper function to show the available species.
7564

76-
```{r get-human-h-filter}
77-
all_gene_sets %>%
78-
dplyr::filter(gs_cat == "H") %>%
79-
head()
65+
```{r species}
66+
msigdbr_species()
8067
```
8168

82-
Check the available collections and sub-collections.
69+
You can retrieve data for a specific collection, such as the hallmark gene sets.
8370

84-
```{r show-collections}
85-
all_gene_sets %>%
86-
dplyr::distinct(gs_cat, gs_subcat) %>%
87-
dplyr::arrange(gs_cat, gs_subcat)
71+
```{r get-human-h}
72+
h_gene_sets = msigdbr(species = "Mus musculus", category = "H")
73+
head(h_gene_sets)
8874
```
8975

9076
Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.
@@ -94,18 +80,32 @@ cgp_gene_sets = msigdbr(species = "Mus musculus", category = "C2", subcategory =
9480
head(cgp_gene_sets)
9581
```
9682

83+
There is a helper function to show the available collections.
84+
85+
```{r collections}
86+
msigdbr_collections()
87+
```
88+
89+
The `msigdbr()` function output is a data frame and can be manipulated using more standard methods.
90+
91+
```{r get-human-h-filter}
92+
all_gene_sets %>%
93+
dplyr::filter(gs_cat == "H") %>%
94+
head()
95+
```
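
Because the output is an ordinary data frame, the same kind of filtering also works without `dplyr`. The following is a minimal sketch on a toy data frame that only mimics the `gs_cat`, `gs_subcat`, `gs_name`, and `gene_symbol` columns; the gene set names and gene symbols are made up for illustration and are not real MSigDB content.

```r
# Toy stand-in for the msigdbr() output (values are made up for illustration)
all_gene_sets <- data.frame(
  gs_cat = c("H", "H", "C2", "C2"),
  gs_subcat = c("", "", "CGP", "CP"),
  gs_name = c("HALLMARK_A", "HALLMARK_A", "CGP_SET_B", "CP_SET_C"),
  gene_symbol = c("Gene1", "Gene2", "Gene1", "Gene3"),
  stringsAsFactors = FALSE
)

# Base R equivalent of dplyr::filter(gs_cat == "H")
h_gene_sets <- all_gene_sets[all_gene_sets$gs_cat == "H", ]

# Distinct collection/sub-collection pairs present in the data, sorted
collections <- unique(all_gene_sets[, c("gs_cat", "gs_subcat")])
collections <- collections[order(collections$gs_cat, collections$gs_subcat), ]

# Number of rows (genes) in each gene set
genes_per_set <- table(all_gene_sets$gs_name)
```

With real `msigdbr()` output the same expressions should apply unchanged, since they rely only on the column names.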
96+
9797
## Pathway enrichment analysis
9898

99-
The `msigdbr` gene sets data frame can be used with many pathway analysis packages.
99+
The `msigdbr` output can be used with various popular pathway analysis packages.
100100

101-
Use the gene sets data frame for `clusterProfiler` (for genes as Entrez Gene IDs).
101+
Use the gene sets data frame for `clusterProfiler` with genes as Entrez Gene IDs.
102102

103103
```{r cp-entrez, eval=FALSE}
104104
msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, entrez_gene) %>% as.data.frame()
105105
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)
106106
```
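
The chunk above assumes that `msigdbr_df` already exists. As a rough sketch of the reshaping involved, the example below builds a toy `msigdbr_df` (made-up set names and IDs, not real MSigDB content) and derives both a two-column term-to-gene table and a named list of gene vectors, which is the input shape some other enrichment tools expect.

```r
# Toy stand-in for the msigdbr() output (values are made up for illustration)
msigdbr_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Gene1", "Gene2", "Gene2", "Gene3", "Gene4"),
  entrez_gene = c(101L, 102L, 102L, 103L, 104L),
  stringsAsFactors = FALSE
)

# Two-column term-to-gene table (gene set name, gene ID), duplicates dropped
msigdbr_t2g <- unique(msigdbr_df[, c("gs_name", "entrez_gene")])

# Named list with one character vector of gene symbols per gene set
msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
```

The list form trades the tidy layout for direct lookup by gene set name, for example `msigdbr_list[["SET_B"]]`.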
107107

108-
Use the gene sets data frame for `clusterProfiler` (for genes as gene symbols).
108+
Use the gene sets data frame for `clusterProfiler` with genes as gene symbols.
109109

110110
```{r cp-symbols, eval=FALSE}
111111
msigdbr_t2g = msigdbr_df %>% dplyr::select(gs_name, gene_symbol) %>% as.data.frame()
@@ -138,8 +138,8 @@ You can check the installed version with `packageVersion("msigdbr")`.
138138

139139
Yes.
140140
You can then import the GMT files (with `getGmt()` from the `GSEABase` package, for example).
141-
The GMTs only include the human genes, even for gene sets generated from mouse data.
142-
If you are not working with human data, you then have to convert the MSigDB genes to your organism or your genes to human.
141+
The GMTs only include the human genes, even for gene sets generated from mouse experiments.
142+
If you are working with non-human data, you then have to convert the MSigDB genes to your organism or your genes to human.
143143

144144
**Can I convert between human and mouse genes just by adjusting gene capitalization?**
145145

@@ -156,14 +156,12 @@ You may still end up with dozens of homologs for some genes, so additional clean
156156
There are a few other resources that provide some of the functionality and served as an inspiration for this package.
157157
[Ge Lab Gene Set Files](http://ge-lab.org/#/data) has GMT files for many species.
158158
[WEHI](http://bioinf.wehi.edu.au/software/MSigDB/) provides MSigDB gene sets in R format for human and mouse, but the genes are provided only as Entrez IDs and each collection is a separate file.
159-
[MSigDF](https://github.com/stephenturner/msigdf) is based on the WEHI resource, so it provides the same data, but converted to a more tidyverse-friendly data frame.
160-
When `msigdbr` was initially released, all of them were multiple releases behind the latest version of MSigDB, so they are possibly no longer maintained.
159+
[MSigDF](https://github.com/stephenturner/msigdf) is based on the WEHI resource, but is converted to a more tidyverse-friendly data frame.
160+
When `msigdbr` was initially released, these were multiple releases behind the latest version of MSigDB, so they may not be actively maintained.
161161

162162
## Details
163163

164164
The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software.
165165

166166
Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam, and ZFIN.
167167
For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
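
The selection rule can be illustrated with a small sketch. This is not the package's actual code: the `candidates` table, its column names, and the support counts are hypothetical, and keeping the first candidate on a tie is an assumption, since the text does not say how ties are resolved.

```r
# Hypothetical candidate orthologs with the number of supporting databases
candidates <- data.frame(
  human_symbol = c("GENE1", "GENE1", "GENE2", "GENE2", "GENE2"),
  ortholog_symbol = c("Gene1a", "Gene1b", "Gene2a", "Gene2b", "Gene2c"),
  num_sources = c(3L, 9L, 5L, 2L, 7L),
  stringsAsFactors = FALSE
)

# Sort so the best-supported candidate comes first within each human gene
ordered <- candidates[order(candidates$human_symbol, -candidates$num_sources), ]

# Keep only the first (best-supported) candidate per human gene
best <- ordered[!duplicated(ordered$human_symbol), ]
```

Here `best` keeps `Gene1b` for `GENE1` and `Gene2c` for `GENE2`, the candidates with the highest support counts.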
168-
169-
