Transforms RNA-sequencing data into actionable clinical insights with automated reports.
Documentation | umccr.github.io/RNAsum
RNAsum is an R package that integrates whole-genome sequencing (WGS)
and whole-transcriptome sequencing (WTS) data to generate comprehensive,
interactive HTML reports for cancer patient samples.
RNAsum can be installed using one of the following two methods.
RNAsum depends on pdftools, which requires system-level libraries
(poppler, cairo, etc.) to be installed before installing the R package.
System dependencies installation
Ubuntu/Debian:
sudo apt-get install libpoppler-cpp-dev libharfbuzz-dev libfribidi-dev \
libfreetype6-dev libcairo2-dev libpango1.0-devmacOS:
brew install popplerHPC/Cluster (without root):
If you do not have root access (e.g., on a cluster), creating a fresh Conda environment is the most reliable way to provide necessary system libraries:
conda create -n rnasum_env -c conda-forge -c bioconda \
r-base=4.1 poppler harfbuzz fribidi freetype pkg-config \
cairo openssl pango make gxx_linux-64
conda activate rnasum_envOnce system dependencies are met, you can install the package directly from GitHub from within R console.
# 1. Increase timeout to prevent download failure for RNAsum.data
options(timeout = 600)
# 2. Install via remotes
if (!require("remotes")) install.packages("remotes")
remotes::install_github("umccr/RNAsum")Conda package is available from the Anaconda umccr channel:
conda create -n rnasum -c umccr -c conda-forge -c bioconda r-rnasum
conda activate rnasumThe pipeline consists of five main components.
- WTS data collection: ingests per-gene read counts and gene fusions.
- Reference integration: normalises against reference cohorts.
- WGS data integration: links genomic alterations with expression data.
- Knowledge enrichment: annotates with clinically relevant databases.
- Report generation: prioritises findings and creates interactive visualizations.
Detailed workflow documentation
Add RNAsum to PATH environment variable.
rnasum_cli=$(Rscript -e 'cat(system.file("cli", package="RNAsum"))')
ln -sf "$rnasum_cli/rnasum.R" "$rnasum_cli/rnasum"
export PATH="$rnasum_cli:$PATH"rnasum --versionWhen comparing clinical RNA-seq samples to TCGA reference cohorts, batch effects can significantly impact expression rankings due to protocol differences (e.g., ribo-depletion vs poly-A selection, different library preparation methods).
RNAsum provides the --batch_rm parameter to address these differences:
When to use --batch_rm:
- Clinical samples use different RNA-seq protocols than TCGA
- An internal reference cohort was processed with the same protocol
- Expression rankings appear systematically skewed
Alternative approaches:
- Use tissue-matched internal reference cohorts when available
- Consider protocol-specific thresholds for expression classification
- Validate findings with orthogonal methods when possible
| Option | Description | Default |
|---|---|---|
--sample_name |
Sample identifier | Required |
--dataset |
TCGA reference cohort | PANCAN |
--batch_rm |
Remove batch effects | FALSE |
--salmon |
Salmon quantification file | - |
--kallisto |
Kallisto abundance file | - |
--arriba_tsv |
Arriba fusion detection output | - |
--pcgr_tiers_tsv |
PCGR variant calls (tier 1-4) | - |
--cn_gene_tsv |
Copy number by gene | - |
--filter |
Filter low-expressed genes | TRUE |
Run rnasum --help to get complete list of options.
For format and minimal content of input files (e.g. --pcgr_tiers_tsv,
--cn_gene_tsv, --sv_tsv), see Input file
formats.
Note: human reference genome GRCh38 (Ensembl based annotation version 105) is used for gene annotation by default. GRCh37 is no longer supported.
Test data: in /inst/rawdata/test_data folder of the GitHub repo
Runtime: < 15 minutes (16GB RAM, 1 CPU)
Comprehensive reporting, in which WGS-based findings are used as a primary source for expression profile prioritisation.
cd $rnasum_cli
rnasum \
--sample_name test_sample_WTS \
--dataset TEST \
--salmon "$PWD/../rawdata/test_data/dragen/TEST.quant.genes.sf" \
--arriba_pdf "$PWD/../rawdata/test_data/dragen/arriba/fusions.pdf" \
--arriba_tsv "$PWD/../rawdata/test_data/dragen/arriba/fusions.tsv" \
--dragen_fusions "$PWD/../rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final" \
--pcgr_tiers_tsv "$PWD/../rawdata/test_data/small_variants/TEST-snvs_indels.tiers.tsv" \
--cn_gene_tsv "$PWD/../rawdata/test_data/copy_number/TEST.cnv.gene.tsv" \
--sv_tsv "$PWD/../rawdata/test_data/structural/TEST-sv.tsv" \
--report_dir "$PWD/../rawdata/test_data/RNAsum" \
--save_tables FALSE \
--filter TRUEThe HTML report test_sample_WTS.RNAsum.html will be created in the
inst/rawdata/test_data/dragen/RNAsum folder.
Basic reporting including information about detected gene fusions and expression levels of key genes.
cd $rnasum_cli
rnasum \
--sample_name test_sample_WTS \
--dataset TEST \
--salmon "$PWD/../rawdata/test_data/dragen/TEST.quant.genes.sf" \
--arriba_pdf "$PWD/../rawdata/test_data/dragen/arriba/fusions.pdf" \
--arriba_tsv "$PWD/../rawdata/test_data/dragen/arriba/fusions.tsv" \
--report_dir "$PWD/../rawdata/test_data/RNAsum" \
--save_tables FALSE \
--filter TRUEThe HTML report test_sample_WTS.RNAsum.html will be created in the
inst/rawdata/test_data/dragen/RNAsum folder.
Assess potential batch effects between the clinical sample and TCGA reference data before running the main analysis. This is particularly important when using different RNA-seq protocols.
The batch assessment functions (assess_batch_effects(),
quick_batch_check()) are exported by RNAsum starting from the current
release, so once the package is installed/upgraded user can call them
directly after library(RNAsum) — no source() is needed.
If user cannot upgrade and are working from a local clone of the source tree, the user can instead source the file directly from the repository:
# Only needed when working from a local source clone and not from an installed RNAsum
# source("R/batch_assessment.R")library(RNAsum)
# Load test sample data (Salmon gene quantification)
test_file <- system.file("rawdata/test_data/dragen/TEST.quant.genes.sf", package = "RNAsum")
sample_data <- read.delim(test_file)
sample_tpm <- setNames(sample_data$TPM, sample_data$Name)
# Get reference data
ref_paths <- get_refdata(dataset = "TEST", batch_rm = FALSE)
ref_counts <- utils::read.table(gzfile(ref_paths$ext_ref$counts),
header = TRUE, sep = "\t", row.names = NULL)
ref_matrix <- as.matrix(ref_counts[, -1])
rownames(ref_matrix) <- ref_counts[[1]]
# Quick assessment using most variable genes (default)
batch_results <- assess_batch_effects(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "top_n",
n_genes = 2000,
output_dir = "batch_assessment_results"
)Assessment interpretation:
- PCA distance percentile > 95%: High batch effect risk - consider
--batch_rm - Median correlation < 0.7: Substantial protocol differences detected
- Extreme Z-scores > 10%: Expression ranking issues likely
Detailed gene set options and examples
RNAsum provides three gene set options for targeted batch assessment.
Option 1: Top variable genes (default)
# Standard assessment using most variable genes (recommended for general use)
batch_results <- assess_batch_effects(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "top_n", # Uses most variable genes
n_genes = 2000, # Number of genes to analyze
protocol_clinical = "ribo-depletion",
protocol_reference = "TCGA_poly-A",
output_dir = "batch_assessment_topn"
)Option 2: Cancer gene set
# Combined cancer genes database (1315 unique genes)
batch_cancer_combined <- assess_batch_effects(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "cancer_genes",
cancer_gene_source = "combined", # Combined UMCCR + OncoKB databases
protocol_clinical = "ribo-depletion",
protocol_reference = "TCGA_poly-A",
output_dir = "batch_assessment_cancer_combined"
)
# Quick assessment with cancer genes
quick_batch_check(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "cancer_genes",
cancer_gene_source = "combined",
save_plots = TRUE
)Option 3: Custom gene set
# User-defined gene sets (genes must match sample data format)
# Note: Ensure gene IDs match data format (Ensembl IDs vs gene symbols)
# Method 1: Direct Ensembl ID specification
custom_genes_direct <- c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618") # TP53, BRCA1, BRCA2
batch_custom_direct <- assess_batch_effects(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "custom",
gene_subset = custom_genes_direct,
protocol_clinical = "ribo-depletion",
protocol_reference = "TCGA_poly-A",
output_dir = "batch_assessment_custom_direct"
)
# Method 2: Use cancer genes with automatic conversion (recommended)
batch_cancer_symbols <- assess_batch_effects(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "cancer_genes", # Enhanced: handles symbols automatically
cancer_gene_source = "combined", # 1315 genes with auto-conversion
protocol_clinical = "ribo-depletion",
protocol_reference = "TCGA_poly-A",
output_dir = "batch_assessment_cancer_auto"
)For cancer gene set analysis, use the enhanced built-in functionality with automatic format conversion:
- No external files needed - Cancer gene databases and format conversion are built-in
- Automatic gene ID conversion - Handles Ensembl IDs ↔ gene symbols transparently
- Three cancer databases available - UMCCR (1248), OncoKB (1019), Combined (1315 genes)
- Smart format detection - Automatically detects and converts incompatible formats
Complete workflow example
Running all three options with TEST data
# Streamlined batch assessment workflow using enhanced functionality
library(RNAsum)
# Load test sample data
test_file <- system.file("rawdata/test_data/dragen/TEST.quant.genes.sf", package = "RNAsum")
sample_data <- read.delim(test_file)
sample_tpm <- setNames(sample_data$TPM, sample_data$Name)
# Get reference data
ref_paths <- get_refdata(dataset = "TEST", batch_rm = FALSE)
ref_counts <- utils::read.table(gzfile(ref_paths$ext_ref$counts),
header = TRUE, sep = "\t", row.names = NULL)
ref_matrix <- as.matrix(ref_counts[, -1])
rownames(ref_matrix) <- ref_counts[[1]]
# OPTION 1: Top variable genes
batch_topn <- assess_batch_effects(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "top_n",
n_genes = 2000,
protocol_clinical = "ribo-depletion",
protocol_reference = "TCGA_poly-A"
)
# OPTION 2: Cancer genes
batch_cancer <- assess_batch_effects(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "cancer_genes",
cancer_gene_source = "combined",
protocol_clinical = "ribo-depletion",
protocol_reference = "TCGA_poly-A"
)
# OPTION 3: Custom genes
custom_ensembl <- c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618") # TP53, BRCA1, BRCA2
batch_custom <- assess_batch_effects(
sample_data = sample_tpm,
reference_data = ref_matrix,
gene_set_type = "custom",
gene_subset = custom_ensembl,
protocol_clinical = "ribo-depletion",
protocol_reference = "TCGA_poly-A"
)
# Compare results
cat("=== Enhanced Batch Assessment Results ===\n")
cat("Top genes analysis: ", round(batch_topn$pca_results$distance_percentile * 100, 1), "% PCA distance percentile\n")
cat("Cancer genes analysis: ", round(batch_cancer$pca_results$distance_percentile * 100, 1), "% PCA distance percentile\n")
cat("Custom genes analysis: ", round(batch_custom$pca_results$distance_percentile * 100, 1), "% PCA distance percentile\n")RNAsum generates an interactive HTML report with the following core
sections:
- Findings summary: summary of genes listed across various report sections
- Mutated genes: expression of genes with somatic mutations (requires WGS)
- Fusion genes: detected gene fusions with functional annotations
- Structural variants: expression of genes located within structural variants (requires WGS)
- CN altered genes: expression in CN-gained/lost regions (requires WGS)
- Cancer genes: expression of cancer-associated genes
RNAsum includes 33 TCGA cancer type cohorts for comparative analysis:
| Cancer Type | Dataset Code | Samples |
|---|---|---|
| Pan-Cancer | PANCAN |
330 |
| Breast Invasive Carcinoma | BRCA |
300 |
| Lung Adenocarcinoma | LUAD |
300 |
| Pancreatic Adenocarcinoma | PAAD |
150 |
See the complete TCGA projects summary table.
| Resource | Link |
|---|---|
| Full documentation | umccr.github.io/RNAsum |
| Workflow details | Workflow details |
| Batch assessment guide | Batch assessment |
| Report structure | Report structure |
| TCGA datasets | TCGA projects summary |
We welcome contributions! Please see our Code of Conduct and contribution guidelines.
Found a bug or have a feature request? Open an issue.
If you use RNAsum please cite:
Kanwal S, Marzec J, Diakumis P, Hofmann O, Grimmond S (2024). “RNAsum: An R package to comprehensively post-process, summarise and visualise genomics and transcriptomics data.” version 1.1.0, https://umccr.github.io/RNAsum/
A BibTeX entry for LaTeX users is
@Unpublished{,
title = {RNAsum: An R package to comprehensively post-process, summarise and visualise genomics and transcriptomics data},
author = {Sehrish Kanwal and Jacek Marzec and Peter Diakumis and Oliver Hofmann and Sean Grimmond},
year = {2024},
note = {version 1.1.0},
url = {https://umccr.github.io/RNAsum/},
}
