GBM-TxEval is a computational tool designed to evaluate transcriptional treatment responses in glioblastoma (GBM). It processes longitudinal gene expression data to calculate therapy-induced log2 fold changes, performs gene set enrichment analysis, and projects the results into principal component space. This enables stratification of samples into "Up" and "Down" responder subtypes, supporting the investigation of resistance mechanisms and selection of optimal models for preclinical drug evaluation.
R/utils_io.R: File loading utilities.R/utils_processing.R: Core computation functions (normalization, filtering, log2FC, FGSEA).server/: Modular server logic for each step.ui/: Modular UI components for each step.global.R: Global configuration: packages, gene length tables, gene sets, PCA loadings, etc.app.R: Main entry point combining all modules into a seamless workflow.
The analysis pipeline is divided into six modular steps:
The first step involves uploading two input files in CSV format: sample metadata and a gene expression matrix. These inputs are essential for downstream processing and must conform to specific format requirements.
The metadata file must contain one row per sample and include the following mandatory columns:
Sample_ID: Unique identifier for each sample.Donor_ID: Identifier linking paired samples (e.g., untreated/treated) from the same donor.Model: Label indicating the glioma model used.Condition: Treatment status of the sample.
The Condition column supports flexible input formats. It is case-insensitive and can accept the following values:
"Untreated","Primary", or"0"→ interpreted as Untreated"Treated","Recurrent", or"1"→ interpreted as Treated
Optional metadata columns can also be included for annotation and filtering purposes in later steps.
The expression matrix must be in CSV format and meet the following constraints:
- It must include exactly one non-numeric column, which is strictly required to be the first column. This column represents the gene identifier (e.g., Ensembl ID or HGNC symbol).
- All remaining columns must be numeric and correspond to sample names.
- Sample names (column headers) must match the values in the
Sample_IDcolumn of the metadata file. Partial overlap is permitted; unmatched samples will be excluded from analysis.
Upon upload, the app attempts to infer the type of expression data (Counts, FPKM, or TPM) using the following heuristic, applied to a random subset of two columns:
- If all values are integers, or if the proportion of non-integer values is between 5% and 80%, the maximum value is below 20,000, and all values are non-negative, the matrix is classified as Counts (this includes both raw and expected counts).
- If over 80% of values are non-integers, the maximum value exceeds 10,000, and all values are non-negative, it is classified as FPKM.
- If the sum of values in both sampled columns is approximately 1e6 (within a ±1e5 tolerance), it is classified as TPM.
- If these conditions are not met, or if fewer than 2 samples are available, the expression type is labeled Unknown.
The detected expression type can be overridden manually by the user.
This step ensures proper alignment between the sample metadata and the expression matrix. The right panel provides a preview of the uploaded metadata, allowing users to map column names to their respective roles:
Sample_IDDonor_IDModelCondition(e.g., untreated vs. treated)
Users can also specify any number of optional metadata columns (referred to as "extra information"). These columns may include additional contextual details such as sample origin, treatment regimen, or sequencing batch. Selected extra information columns will:
- Be preserved in the filtered metadata file available for download in Step 4.
- Appear in the final visualization as part of the interactive hover text in Step 6.
There is no upper limit to the number of extra columns retained. Including such information is recommended when additional clinical or experimental annotations may aid interpretation.
This step allows users to define one or more metadata-based filtering conditions. Any column in the uploaded metadata can be selected as a filtering variable—common examples include tumor purity, model subtype, treatment timepoint, or sequencing batch.
- Zero or more filtering variables may be selected.
- Filtering columns may be categorical or numeric; however, filtering operates via explicit selection of value levels.
This design supports flexible sample selection for downstream analysis while allowing fine-grained control over cohort composition.
In this step, users specify which values within each selected filtering column should be retained. Filtering is performed using checkbox selection, where each level in a column can be toggled independently.
💡 Tip: For numeric columns (e.g., purity scores), we recommend creating binary classification columns (e.g.,
"HighPurity"= TRUE/FALSE) in advance. This enables efficient selection via checkboxes and avoids the need for custom sliders.
Once filters are applied:
- The updated sample metadata is shown in the right panel for confirmation.
- Both the filtered metadata (
colData.csv) and matched expression matrix (countData.csv) can be downloaded for external use or archival.
In this step, GBM-TxEval performs core analytical operations including expression normalization, low-expression filtering, log2 fold change calculation, PC1 score projection, and gene set enrichment analysis (GSEA). The overall analytical framework follows the strategy described in Tanner et al., Genome Biology 2024.
Two gene filtering strategies are supported:
- Default Filtering (Recommended):
- Expression values below a defined threshold are set to zero.
- Users can choose from preset thresholds or specify a custom value.
- If the input data type is Counts, the filtering is applied before normalization.
- Advanced Filtering:
- Retains genes expressed above the global first quartile (Q1) in at least a user-defined minimum proportion of samples (e.g., 100%, 75%, etc.).
- Designed for experienced users needing adaptive filtering based on data distribution.
The normalization method depends on the input expression type:
- If the uploaded data is TPM or FPKM, the normalization method is fixed (i.e., no transformation is applied).
- If the input is Counts, users can choose to normalize the data using either:
- FPKM: Fragments Per Kilobase of transcript per Million mapped reads
- TPM: Transcripts Per Million
Normalization is performed using uploaded gene length information, and both TPM and FPKM are computed with consistent formulae.
Log2 fold changes are computed on normalized expression values (either TPM or FPKM), not on raw counts. For count-based input, the data are first normalized according to the selected method, and then log2FC is calculated for each donor based on paired untreated vs. treated samples:
log2FC_gene = log2((expr_treated + 0.01) / (expr_untreated + 0.01))
A pseudocount of 0.01 is added to avoid division by zero.
After computing donor-specific log2FC vectors, each vector is projected into a predefined PCA space to calculate a PC1 score, which summarizes the direction and magnitude of the transcriptional response.
Specifically:
-
A reference PC1 rotation vector (PCA loadings) is precomputed from a cohort-wide log2FC matrix derived from paired primary and recurrent glioblastoma bulk RNA-seq data.
-
For each donor, their log2FC vector is centered and projected onto the PC1 axis using the formula:
PC1_score = Σ (log2FC_gene × PC1_loading_gene) -
Only genes present in both the donor’s log2FC vector and the PCA rotation vector are used in the computation.
This PC1 score captures the principal axis of transcriptional variation observed across treatment conditions and serves as a continuous quantitative representation of response magnitude.
To evaluate transcriptional programs affected by treatment, GBM-TxEval performs Gene Set Enrichment Analysis (GSEA) using the FGSEA algorithm. For each donor, normalized log2FC values are ranked and used as input for enrichment testing against a predefined gene set (e.g., JARID2 target genes).
The gene ranking is performed as follows:
ranks = sort(log2FC, decreasing = TRUE)
Enrichment scores (ES) are calculated based on this ranking, with the top pathway's ES retained per donor.
To ensure compatibility between gene sets and input expression data, the application automatically detects the type of gene identifier in the expression matrix:
- If gene IDs match Ensembl identifiers, the Ensembl-based GMT file is used.
- If gene IDs match HGNC symbols, the corresponding symbol-based GMT file is used.
⚠️ Note: This automatic gene ID detection is highly recommended. Manual override is available but intended only for edge cases such as ambiguous gene formats or mismatched annotations.
Once the “Start Processing” button is clicked:
- A progress bar appears, tracking analysis progress across donors.
- The current Donor ID and processing index are shown in real-time (e.g., “Donor ID: D123 [3/14]”).
Upon completion, a preview of the results is displayed and a CSV download option is provided.
Each row in the output corresponds to a donor, containing the following columns:
Donor_ID: Unique donor identifierModel: Glioma model associated with the sample pairPC1_Score: Projection of log2FC vector into the PC1 spaceES: Enrichment Score from FGSEA (for the JARID2 pathway or selected gene set)Responder: Binary classification of transcriptional response (Upif ES ≥ 0;Downotherwise)
The resulting dataset serves as input for the final step: interactive visualization and interpretation.
This final step generates an interactive scatter plot that enables intuitive exploration of transcriptional responses across donors.
Each point in the plot represents a donor and is plotted as:
- X-axis:
PC1_Score— summarizing the global transcriptional shift based on log2 fold changes - Y-axis:
ES— Enrichment Score from FGSEA for the specified gene set - Color and symbol: Encodes
Modeltype
The following result columns are visualized:
Donor_IDModelPC1_ScoreESResponder
- Hover Tooltips: Hovering over a point reveals detailed information, including all retained metadata and extra user-specified columns from earlier steps.
- Customization Options:
Point Size: Adjustable via sliderPoint Opacity: Adjustable via slider
These controls allow users to tune the visual clarity based on dataset size and density.
The resulting interactive plot can be downloaded for presentation or publication. Users are encouraged to annotate or further customize the exported figure using external tools as needed.
Tested on R 4.3.2 with the following package versions:
shiny 1.11.1
shinyjs 2.1.0
plotly 4.11.0
ggplot2 3.5.2
fgsea 1.28.0
data.table 1.17.6
DT 0.33
bslib 0.9.0
Developer: Bo Wang
Affiliation: University of Leeds
This supplementary analysis extends and visualizes the core findings from Tanner et al., Genome Biology 2024.
While the original study highlighted the importance of JARID2 in stratifying glioblastoma patient responses to treatment, the updated analysis presented here broadens this focus to the wider PRC1 and PRC2 complexes.
Using updated RNA-seq processing and gene set annotations, we identify consistent enrichment of multiple PRC1/PRC2-related gene sets, indicating a broader polycomb involvement in patient stratification beyond JARID2.
✨ The analysis and visualization shown here were performed by Bo Wang.
- Bubble size reflects the normalized enrichment score (NES).
- Colour indicates the false discovery rate (FDR).
- Among all tested gene sets, JARID2 still shows the highest NES, consistent with earlier findings.
- However, the updated analysis highlights that other PRC1- and PRC2-associated gene sets, including PDGFR2, CBX6, CBX7, and CBX8, are also commonly enriched across samples.
- Interestingly, CBX2, which was previously prominent, is no longer highly significant.
- Colours represent the NES, and only the most significant gene sets from the outer bubble plot are displayed.
- Nearly all of these top-ranked sets are linked to PRC1 and PRC2 complexes, showing a broader polycomb signature beyond JARID2 alone.
- Sample clustering based on this heatmap clearly stratifies patients into Up responders (red) and Down responders (blue), reinforcing the robustness of polycomb-related enrichment as a stratification marker.
Together, these results extend the original focus on JARID2 and suggest that a wider network of PRC1- and PRC2-associated gene sets may play a central role in determining glioblastoma patient responses to treatment.











