This project analyzes gene expression differences between healthy controls and COVID-19 patients with thrombotic complications using transcriptomic data from the GEO dataset GSE300129.
The goal is to identify:
- Differentially expressed genes (DEGs)
- Key regulatory genes in protein–protein interaction networks
- Enriched biological pathways related to thrombosis in severe COVID-19
GEO Dataset
↓
Data Preprocessing
↓
Differential Expression Analysis
↓
Feature Selection
↓
PPI Network Construction
↓
Network Centrality Analysis (DiCE)
↓
Pathway Enrichment (GO / KEGG)
↓
Visualization
Differential Centrality (DiCE) is a network-based gene prioritization method that identifies genes whose importance within a biological interaction network changes between conditions (for example, healthy vs disease).
Instead of focusing only on gene expression changes, DiCE evaluates how a gene’s network connectivity and influence differ between condition-specific networks. This allows the identification of genes that may not show large expression changes but become more central or influential in disease networks.
The method integrates differential expression analysis, feature selection, and network topology analysis to identify biologically relevant genes associated with disease progression.
This approach enables the discovery of regulatory genes and network hubs that may be overlooked by traditional differential expression analysis alone.
Original publication of the DiCE algorithm:
https://academic.oup.com/nar/article/53/13/gkaf609/8192812?login=false
For more details, see:
Pashaei et al., DiCE: differential centrality-ensemble analysis based on gene expression profiles and protein–protein interaction network, Nucleic Acids Research (2025).
- Statistical significance filter
Genes with significant differential expression after FDR correction (FDR < 0.05) were selected.
- Fold-change filter
Genes were also filtered using a log2 fold change threshold (|log2FC| > 0.5) to ensure meaningful expression differences.
This removes genes with very small expression changes that are unlikely to be biologically relevant.
- Feature selection filter
Information Gain was calculated to measure how informative each gene is for distinguishing Healthy vs Disease samples.
Only genes with information gain greater than the average information gain were retained.
- STRING mapping filter
Genes were mapped to STRING protein IDs, and only successfully mapped genes were kept.
- Interaction filter
Only protein–protein interactions where both interacting proteins are present in the selected gene list were retained.
Together, these filters restrict the network to genes that are statistically significant, informative for classification, and supported by known protein–protein interactions.
Two separate networks were constructed, one for healthy control samples and one for disease samples.
Edges represent protein–protein interactions from the STRING database.
Edge weights were calculated using gene expression correlation distance:
distance = 1 − Pearson correlation
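As a minimal sketch of the edge weighting above, the snippet below attaches both the Pearson correlation and the derived distance to each STRING edge. The function name `build_weighted_network`, the edge attribute names, and the toy expression values are assumptions made for illustration, not the project's actual code.

```python
import networkx as nx
import numpy as np
import pandas as pd


def build_weighted_network(expr, edges):
    """Build one condition-specific network.

    expr  : DataFrame, rows = genes, columns = samples of one condition
    edges : iterable of (gene_a, gene_b) pairs taken from STRING
    """
    corr = expr.T.corr(method="pearson")  # gene-by-gene Pearson correlation
    graph = nx.Graph()
    for gene_a, gene_b in edges:
        if gene_a not in corr.index or gene_b not in corr.index:
            continue  # skip interactions without expression data
        r = corr.loc[gene_a, gene_b]
        if np.isnan(r):
            continue  # correlation is undefined for constant genes
        graph.add_edge(gene_a, gene_b, correlation=float(r), distance=float(1.0 - r))
    return graph


# Toy data (hypothetical values): G2 rises with G1, G3 falls with G1.
expr = pd.DataFrame(
    {"s1": [1.0, 2.0, 5.0], "s2": [2.0, 4.0, 3.0], "s3": [3.0, 6.0, 1.0]},
    index=["G1", "G2", "G3"],
)
network = build_weighted_network(expr, [("G1", "G2"), ("G1", "G3")])
print(network.edges(data=True))
```

Perfectly correlated genes get a distance of 0 and perfectly anti-correlated genes a distance of 2, so the distance always lies in the range [0, 2].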
Two network centrality measures were computed for each gene.
- Betweenness centrality
Measures how frequently a gene lies on the shortest paths between other genes.
This identifies genes that act as communication bridges between biological pathways.
- Eigenvector centrality
Measures gene importance based on connections to other highly connected genes.
This identifies hub genes in the network.
Centrality values were compared between the two networks:
Δ Centrality = |Centrality_disease − Centrality_control|
- Distance is used as weight in betweenness centrality
- Correlation is used as weight in eigenvector centrality
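The comparison above could be sketched with NetworkX roughly as follows. This is an illustration under stated assumptions rather than the DiCE reference implementation: the helper names, the `distance` and `correlation` edge attributes, and the toy graphs are hypothetical, and the sketch assumes non-negative correlation weights because the handling of negative correlations is not described here.

```python
import networkx as nx


def centralities(graph):
    """Return (betweenness, eigenvector) centrality dicts for one network."""
    # Distance is the path cost for betweenness centrality.
    betweenness = nx.betweenness_centrality(graph, weight="distance")
    # Correlation is the connection strength for eigenvector centrality.
    eigenvector = nx.eigenvector_centrality(graph, max_iter=1000, weight="correlation")
    return betweenness, eigenvector


def differential_centrality(control, disease):
    """Absolute centrality change per gene between the two networks."""
    bc_control, ec_control = centralities(control)
    bc_disease, ec_disease = centralities(disease)
    genes = set(control.nodes) | set(disease.nodes)
    return {
        gene: {
            "betweenness": abs(bc_disease.get(gene, 0.0) - bc_control.get(gene, 0.0)),
            "eigenvector": abs(ec_disease.get(gene, 0.0) - ec_control.get(gene, 0.0)),
        }
        for gene in genes
    }


def toy_graph(edges):
    """Hypothetical helper: every edge gets the same toy weights."""
    graph = nx.Graph()
    for a, b in edges:
        graph.add_edge(a, b, correlation=0.5, distance=0.5)
    return graph


# Gene D is peripheral in the control network but the hub of the disease network.
control = toy_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
disease = toy_graph([("D", "A"), ("D", "B"), ("D", "C")])
delta = differential_centrality(control, disease)
print(delta["D"])
```

In this toy example gene D lies on no shortest path in the control network and on every shortest path in the disease network, so it receives the largest betweenness change.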
Dataset GSE300129 contains:
- Healthy control samples
- COVID-19 ICU patients with thrombotic complications
- COVID-19 ICU patients with non-thrombotic complications
Steps:
- Download GEO dataset
- Extract expression matrix
- Extract sample metadata
This processing step can also be done with the GEO2R tool: select the samples, groups, and threshold value, then download the results. Further steps can then be carried out in Python or R.
In this project, however, I handled every step in Python, so I selected my groups and saved them in DataFrames. When exploring the dataset in GEO2R, I observed that the number of significant genes (by p-value) between the non-thrombotic and thrombotic groups was zero, so I merged the two groups into a single disease group and proceeded with the data processing step using healthy and disease samples.
The expression matrix was structured as:
- rows = genes
- columns = samples
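The grouping described above can be sketched with pandas as follows. The sample IDs, the `group` column, its label strings, and the expression values are hypothetical placeholders; the real labels come from the GEO sample metadata.

```python
import pandas as pd

# Hypothetical sample annotation; real labels come from the GEO metadata.
metadata = pd.DataFrame(
    {
        "sample": ["S1", "S2", "S3", "S4"],
        "group": ["healthy", "thrombotic", "non-thrombotic", "healthy"],
    }
).set_index("sample")

# Toy expression matrix: rows = genes, columns = samples.
expr = pd.DataFrame(
    [[5.1, 7.9, 7.5, 5.3], [2.0, 2.1, 1.9, 2.2]],
    index=["GENE_A", "GENE_B"],
    columns=["S1", "S2", "S3", "S4"],
)

# Merge both ICU groups into a single "disease" label.
condition = metadata["group"].map(
    {"healthy": "healthy", "thrombotic": "disease", "non-thrombotic": "disease"}
)

# One genes x samples DataFrame per condition.
healthy = expr.loc[:, condition[condition == "healthy"].index]
disease = expr.loc[:, condition[condition == "disease"].index]
print(healthy.shape, disease.shape)
```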
To standardize gene expression values, Z-score normalization was applied:
Z = (X − mean) / standard deviation
Purpose:
- Remove scale differences between genes
- Improve comparability across samples
- Improve statistical analysis stability
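A minimal sketch of the Z-score step, assuming that the mean and standard deviation are computed per gene across samples (the text does not say whether normalization is per gene or per sample). The function name `zscore_by_gene` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def zscore_by_gene(expr):
    """Standardize each gene (row) to mean 0 and standard deviation 1."""
    mean = expr.mean(axis=1)
    # Population standard deviation; zero is replaced to avoid division by zero.
    std = expr.std(axis=1, ddof=0).replace(0.0, np.nan)
    z = expr.sub(mean, axis=0).div(std, axis=0)
    # Constant genes carry no signal, so their Z-scores are set to 0.
    # Note that this also fills missing values already present in the input.
    return z.fillna(0.0)


expr = pd.DataFrame(
    [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]],
    index=["G1", "G2"],
    columns=["s1", "s2", "s3"],
)
z = zscore_by_gene(expr)
print(z.round(3))
```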
Gene expression differences between healthy and disease samples were tested using Welch’s t-test.
Implemented using:
scipy.stats.ttest_ind
Genes with p < 0.05 are kept.
Because thousands of genes are tested simultaneously, False Discovery Rate (FDR) correction was applied using the Benjamini–Hochberg method.
Genes with FDR < 0.05 were considered significantly differentially expressed genes (DEGs).
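The test and the correction could be combined roughly as below. In `scipy.stats.ttest_ind`, passing `equal_var=False` selects Welch's variant, and `multipletests` with `method="fdr_bh"` applies Benjamini-Hochberg. The function name `welch_with_fdr`, the column names, and the simulated data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests


def welch_with_fdr(healthy, disease, alpha=0.05):
    """Per-gene Welch's t-test followed by Benjamini-Hochberg correction.

    healthy, disease : DataFrames with rows = genes, columns = samples
    """
    t_stat, p_value = ttest_ind(
        disease.to_numpy(), healthy.to_numpy(), axis=1, equal_var=False
    )
    significant, fdr, _, _ = multipletests(p_value, alpha=alpha, method="fdr_bh")
    return pd.DataFrame(
        {"t": t_stat, "p_value": p_value, "fdr": fdr, "significant": significant},
        index=healthy.index,
    )


# Simulated data: 50 genes, 6 samples per group, only gene_0 truly differs.
rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(50)]
healthy = pd.DataFrame(rng.normal(size=(50, 6)), index=genes)
disease = pd.DataFrame(rng.normal(size=(50, 6)), index=genes)
disease.loc["gene_0"] += 10.0

results = welch_with_fdr(healthy, disease)
print(results.loc[results["significant"]])
```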
Expression differences were quantified using log2 fold change:
log2FC = log2(disease / control)
Interpretation:
positive log2FC → upregulated genes
negative log2FC → downregulated genes
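A minimal sketch of the fold-change calculation, assuming non-negative expression values that are not yet log-transformed or Z-scored (a ratio of Z-scores is not meaningful because they can be negative or zero). The function name and the small pseudocount that guards against division by zero are assumptions of this sketch.

```python
import numpy as np
import pandas as pd


def log2_fold_change(healthy, disease, pseudocount=1e-9):
    """log2 of the ratio of per-gene group means (disease over healthy)."""
    mean_healthy = healthy.mean(axis=1) + pseudocount
    mean_disease = disease.mean(axis=1) + pseudocount
    return np.log2(mean_disease / mean_healthy)


# Toy values: G1 is four times higher in disease, G2 is halved.
healthy = pd.DataFrame([[1.0, 3.0], [8.0, 8.0]], index=["G1", "G2"])
disease = pd.DataFrame([[7.0, 9.0], [3.0, 5.0]], index=["G1", "G2"])

log2fc = log2_fold_change(healthy, disease)
kept = log2fc[log2fc.abs() > 0.5]  # fold-change filter |log2FC| > 0.5
print(log2fc)
```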
Following the DiCE algorithm, Information Gain was used for feature selection. It measures how informative each gene is for distinguishing Healthy vs Disease samples.
Implemented with:
sklearn.feature_selection.mutual_info_classif
Genes with information gain greater than the average were selected.
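The selection rule above can be sketched as follows, using `mutual_info_classif` as the information-gain estimate. The function name `select_by_information_gain`, the fixed `random_state`, and the simulated data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def select_by_information_gain(expr, labels, random_state=0):
    """Keep genes whose information gain exceeds the average over all genes.

    expr   : DataFrame, rows = genes, columns = samples
    labels : array of class labels, one per sample (0 = healthy, 1 = disease)
    """
    features = expr.T.to_numpy()  # scikit-learn expects samples x features
    gain = mutual_info_classif(features, labels, random_state=random_state)
    scores = pd.Series(gain, index=expr.index)
    return scores[scores > scores.mean()]


# Simulated data: 10 genes, 20 samples, only gene_0 separates the classes.
rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 10)
values = rng.normal(size=(10, 20))
values[0, labels == 1] += 10.0
expr = pd.DataFrame(values, index=[f"gene_{i}" for i in range(10)])

selected = select_by_information_gain(expr, labels)
print(selected.sort_values(ascending=False))
```

The estimator is stochastic for continuous features, so fixing `random_state` keeps the selected gene list reproducible between runs.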
Protein interaction data were obtained from the STRING database.
Files used:
- 9606.protein.info.v12.0.txt
- 9606.protein.links.v12.0.txt (I renamed it to protein_links.csv)
Steps:
- Map gene symbols to STRING protein IDs
- Filter interactions for selected genes
- Construct a network using NetworkX
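The three steps above might look like the sketch below. It assumes the column names `#string_protein_id` and `preferred_name` in the protein info file and `protein1`, `protein2`, and `combined_score` in the links file; check them against your downloaded files. The function name, protein IDs, and gene symbols are toy placeholders, and small in-memory tables stand in for the real files.

```python
import networkx as nx
import pandas as pd


def build_ppi_network(selected_genes, protein_info, protein_links):
    """Map gene symbols to STRING IDs and keep edges between selected genes."""
    info = protein_info[protein_info["preferred_name"].isin(selected_genes)]
    id_to_gene = dict(zip(info["#string_protein_id"], info["preferred_name"]))
    mapped_ids = set(id_to_gene)

    # Interaction filter: both partners must be in the selected gene list.
    keep = protein_links["protein1"].isin(mapped_ids) & protein_links["protein2"].isin(mapped_ids)

    graph = nx.Graph()
    for row in protein_links[keep].itertuples(index=False):
        graph.add_edge(
            id_to_gene[row.protein1], id_to_gene[row.protein2], score=row.combined_score
        )
    return graph


# Toy stand-ins for the STRING files (placeholder IDs and symbols).
protein_info = pd.DataFrame(
    {
        "#string_protein_id": ["9606.P1", "9606.P2", "9606.P3"],
        "preferred_name": ["GENE_A", "GENE_B", "GENE_C"],
    }
)
protein_links = pd.DataFrame(
    {
        "protein1": ["9606.P1", "9606.P2"],
        "protein2": ["9606.P2", "9606.P3"],
        "combined_score": [900, 700],
    }
)

ppi = build_ppi_network({"GENE_A", "GENE_B"}, protein_info, protein_links)
print(list(ppi.edges(data=True)))
```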
Gene-gene relationships were quantified using correlation-based distance:
distance = 1 − correlation
Two centrality measures were computed:
- Betweenness Centrality measures how often a gene lies on shortest paths between other genes. It indicates potential regulatory or bottleneck genes.
- Eigenvector Centrality measures the influence of a gene based on its connections to other highly connected genes. It identifies hub genes in the network.
Functional enrichment analysis was performed using KEGG pathways and Gene Ontology (GO) terms. This identifies biological processes and pathways significantly associated with the selected genes.
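The enrichment tool used is not named here, so the snippet below only illustrates the over-representation test that such analyses are commonly based on: a one-sided hypergeometric test for a single gene set. The function name and the toy gene sets are hypothetical; real KEGG or GO gene sets would replace them.

```python
from scipy.stats import hypergeom


def enrichment_p_value(selected, pathway, background):
    """One-sided hypergeometric p-value for pathway over-representation."""
    background = set(background)
    selected = set(selected) & background
    pathway = set(pathway) & background
    overlap = len(selected & pathway)
    # P(X >= overlap) when drawing len(selected) genes without replacement
    # from a background that contains len(pathway) pathway genes.
    return hypergeom.sf(overlap - 1, len(background), len(pathway), len(selected))


# Toy sets: 100 background genes, a 10-gene pathway, 10 selected genes, 5 shared.
background = {f"g{i}" for i in range(100)}
pathway = {f"g{i}" for i in range(10)}
selected = {f"g{i}" for i in range(5, 15)}

p_value = enrichment_p_value(selected, pathway, background)
print(p_value)
```

When many pathways are tested, the resulting p-values need the same kind of multiple-testing correction that was applied to the differential expression results.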
1️⃣ Install dependencies
pip install -r requirements.txt
📦 Requirements:
GEOparse pandas numpy scipy statsmodels scikit-learn networkx matplotlib seaborn
2️⃣ Run the analysis notebook or scripts.