🧬 COVID-19 Thrombosis Gene Expression & Network Analysis using the DiCE algorithm

This project analyzes gene expression differences between healthy controls and COVID-19 patients with thrombotic complications using transcriptomic data from the GEO dataset GSE300129.

The goal is to identify:

Differentially expressed genes (DEGs)
Key regulatory genes in protein–protein interaction networks
Enriched biological pathways related to thrombosis in severe COVID-19

Pipeline Overview

GEO Dataset

↓

Data Preprocessing

↓

Differential Expression Analysis

↓

Feature Selection

↓

PPI Network Construction

↓

Network Centrality Analysis (DiCE)

↓

Pathway Enrichment (GO / KEGG)

↓

Visualization

🧠 What is Differential Centrality (DiCE)

Differential Centrality (DiCE) is a network-based gene prioritization method that identifies genes whose importance within a biological interaction network changes between conditions (for example, healthy vs disease).

Instead of focusing only on gene expression changes, DiCE evaluates how a gene’s network connectivity and influence differ between condition-specific networks. This allows the identification of genes that may not show large expression changes but become more central or influential in disease networks.

The method integrates differential expression analysis, feature selection, and network topology analysis to identify biologically relevant genes associated with disease progression.

This approach enables the discovery of regulatory genes and network hubs that may be overlooked by traditional differential expression analysis alone.

Original publication of the DiCE algorithm:

https://academic.oup.com/nar/article/53/13/gkaf609/8192812?login=false

For more details, see:
Pashaei et al., DiCE: differential centrality-ensemble analysis based on gene expression profiles and protein–protein interaction network, Nucleic Acids Research (2025).

The DiCE algorithm generally follows these key phases:

Statistical significance filter

Genes with significant differential expression after FDR correction were selected (FDR < 0.05).

Fold-change filter

Genes were also filtered using a log2 fold change threshold (|log2FC| > 0.5) to ensure meaningful expression differences.

This removes genes with very small expression changes that are unlikely to be biologically relevant.

Feature selection filter

Information Gain was calculated to measure how informative each gene is for distinguishing Healthy vs Disease samples.

Only genes with information gain greater than the average information gain were retained.

STRING mapping filter

Genes were mapped to STRING protein IDs, and only successfully mapped genes were kept.

Interaction filter

Only protein–protein interactions where both interacting proteins are present in the selected gene list were retained.

These filtering steps ensure that the network contains reliable, biologically meaningful genes.

🕸 Network Construction

Two separate networks were constructed, one from the healthy control samples and one from the disease samples.

Edges represent protein–protein interactions from the STRING database.

Edge weights were calculated using gene expression correlation distance:

distance = 1 − Pearson correlation

📊 Centrality Calculation

Two network centrality measures were computed for each gene.

Betweenness Centrality

Measures how frequently a gene lies on the shortest paths between other genes.

This identifies genes that act as communication bridges between biological pathways.

⭐ Eigenvector Centrality

Measures gene importance based on connections to other highly connected genes.

This identifies hub genes in the network.

🔄 Differential Centrality (DiCE)

Centrality values were compared between the two networks:

Δ Centrality = |Centrality_disease − Centrality_control|

⚠️ Note:

Distance is used as weight in betweenness centrality
Correlation is used as weight in eigenvector centrality
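
The comparison above can be sketched with NetworkX. This is a minimal illustration, not the notebook's code: the gene names, edges, and correlation values are made up, and the attribute names `corr` and `distance` are assumptions. The toy correlations are all positive, since the README does not state how negative correlations are handled for eigenvector centrality.

```python
import networkx as nx

def build_graph(edges):
    """Build a graph from (gene_a, gene_b, pearson_r) tuples (toy input)."""
    g = nx.Graph()
    for a, b, r in edges:
        # Each edge carries both weights: the correlation itself and
        # the derived distance 1 - r.
        g.add_edge(a, b, corr=r, distance=1.0 - r)
    return g

def centralities(g):
    # Betweenness uses distance: small distance means a "short" edge.
    btw = nx.betweenness_centrality(g, weight="distance")
    # Eigenvector centrality uses correlation: large value means a strong tie.
    eig = nx.eigenvector_centrality(g, weight="corr", max_iter=1000)
    return btw, eig

def delta(c_disease, c_control):
    """Absolute centrality difference per gene; missing genes count as 0."""
    genes = set(c_disease) | set(c_control)
    return {x: abs(c_disease.get(x, 0.0) - c_control.get(x, 0.0)) for x in genes}

# Hypothetical condition-specific networks: gene D attaches to C in the
# control network and to A in the disease network.
control = build_graph([("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.5), ("C", "D", 0.6)])
disease = build_graph([("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.5), ("A", "D", 0.6)])

btw_c, eig_c = centralities(control)
btw_d, eig_d = centralities(disease)

delta_btw = delta(btw_d, btw_c)
delta_eig = delta(eig_d, eig_c)

# Rank genes by how much their betweenness changed between conditions.
ranking = sorted(delta_btw, key=delta_btw.get, reverse=True)
```

In this toy case genes A and C change their bridging role between the two networks, while B stays a bridge in both, so B has a betweenness difference of zero.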

🦠 Pipeline Applied to COVID-19 Dataset

Dataset GSE300129 contains:

Healthy control samples
COVID-19 ICU patients with thrombotic complications
COVID-19 ICU patients with non-thrombotic complications

Steps:

Download GEO dataset
Extract expression matrix
Extract sample metadata

🧹 Metadata Processing

This processing step can be done with the GEO2R tool: select the samples, groups, and threshold value, then download the result. Further steps are carried out in Python or R.

Instead, I handled each step in Python, so I selected my groups and saved them in DataFrames. When exploring the dataset in GEO2R, I observed that the number of significant genes (by p-value) between the non-thrombotic and thrombotic groups was zero, so I merged the two groups into a single disease group and continued data processing with healthy vs disease.

The expression matrix was structured as:

rows = genes
columns = samples

To standardize gene expression values, Z-score normalization was applied:

Z = (X − mean) / standard deviation

Purpose:

Remove scale differences between genes
Improve comparability across samples
Improve statistical analysis stability
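
A minimal sketch of the Z-score step with pandas. It assumes the normalization is applied per gene (per row), which matches the stated purpose of removing scale differences between genes; the README does not state the axis explicitly. The expression values and sample names are made up.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: rows = genes, columns = samples (hypothetical values).
expr = pd.DataFrame(
    {"S1": [2.0, 10.0], "S2": [4.0, 20.0], "S3": [6.0, 30.0]},
    index=["GENE_A", "GENE_B"],
)

# Z = (X - mean) / standard deviation, computed separately for each gene.
row_mean = expr.mean(axis=1)
row_std = expr.std(axis=1, ddof=0)  # population standard deviation (assumption)
z = expr.sub(row_mean, axis=0).div(row_std, axis=0)

print(z.round(3))
```

After this step both toy genes have identical Z-scores even though GENE_B was measured on a scale five times larger. A gene with zero variance would produce a division by zero, so such genes would need to be dropped beforehand.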

1. Differential Expression Analysis

Gene expression differences between healthy and disease samples were tested using Welch’s t-test.

Implemented using:

scipy.stats.ttest_ind (with equal_var=False, which selects Welch's test instead of the default Student's t-test)

Genes with p < 0.05 are kept.

Multiple Testing Correction

Because thousands of genes are tested simultaneously, False Discovery Rate (FDR) correction was applied using the Benjamini–Hochberg method.

Genes with: FDR < 0.05 were considered significantly differentially expressed genes (DEGs).
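
The two steps above can be sketched as follows. The data are simulated, and the Benjamini–Hochberg adjustment is written out with NumPy to show the formula; `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` returns the same adjusted values. The function name `benjamini_hochberg` is a helper introduced here for illustration.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

# Simulated data: rows = genes, columns = samples (not the GEO dataset).
rng = np.random.default_rng(0)
n_genes = 50
healthy = rng.normal(0.0, 1.0, size=(n_genes, 10))
disease = rng.normal(0.0, 1.0, size=(n_genes, 12))
disease[:5] += 4.0  # the first five genes are shifted in the disease group

# Welch's t-test per gene: equal_var=False drops the equal-variance assumption.
t_stat, p_val = stats.ttest_ind(disease, healthy, axis=1, equal_var=False)

fdr = benjamini_hochberg(p_val)
is_deg = fdr < 0.05
print("DEGs found:", int(is_deg.sum()))
```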

2. Fold Change Calculation

Expression differences were quantified using log2 fold change:

log2FC = log2(treatment / control)

Interpretation:

positive log2FC → upregulated genes

negative log2FC → downregulated genes
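
A small sketch of the fold change calculation, with the disease group in the role of "treatment". A ratio only makes sense for positive values, so this sketch assumes positive expression values on the original scale rather than Z-scores; the README does not say which matrix was used. The pseudocount is also an assumption, added to avoid division by zero.

```python
import numpy as np

def log2_fold_change(disease, control, pseudocount=1.0):
    """log2 of the ratio of group means, per gene (rows = genes)."""
    mean_d = np.mean(disease, axis=1)
    mean_c = np.mean(control, axis=1)
    return np.log2((mean_d + pseudocount) / (mean_c + pseudocount))

# Toy values: gene 0 is higher in disease, gene 1 is lower.
control = np.array([[3.0, 3.0], [7.0, 7.0]])
disease = np.array([[7.0, 7.0], [3.0, 3.0]])

lfc = log2_fold_change(disease, control)
# gene 0: log2(8 / 4) = 1, gene 1: log2(4 / 8) = -1

# Apply the |log2FC| > 0.5 threshold from the DiCE filtering phases.
keep = np.abs(lfc) > 0.5
```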

Feature Selection

Following the DiCE algorithm, Information Gain was used for feature selection. It measures how informative each gene is for distinguishing Healthy vs Disease samples.

Implemented with:

sklearn.feature_selection.mutual_info_classif

Genes with information gain greater than the average were selected.
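
A minimal sketch of this selection on simulated data. Here `mutual_info_classif` serves as the information gain estimate, as named above; the gene names, group sizes, and `random_state` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_per_group = 20
# 0 = Healthy, 1 = Disease
labels = np.array([0] * n_per_group + [1] * n_per_group)

genes = ["GENE_INFO", "NOISE_1", "NOISE_2", "NOISE_3", "NOISE_4"]
values = rng.normal(size=(len(genes), 2 * n_per_group))
values[0, labels == 1] += 5.0  # only the first gene separates the groups

# rows = genes, columns = samples, as in the rest of the pipeline
expr = pd.DataFrame(values, index=genes)

# scikit-learn expects samples in rows, so the matrix is transposed.
mi = mutual_info_classif(expr.T.values, labels, random_state=0)
ig = pd.Series(mi, index=expr.index)

# Keep genes whose information gain is above the average.
selected = ig[ig > ig.mean()].index.tolist()
print(ig.round(3))
print("selected:", selected)
```

Because the threshold is the mean, at least one gene is always dropped, and a single very informative gene can raise the threshold enough to exclude moderately informative ones.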

3. Protein–Protein Interaction (PPI) Network Construction

Protein interaction data were obtained from the STRING database.

Files used:

9606.protein.info.v12.0.txt
9606.protein.links.v12.0.txt (renamed to protein_links.csv)

Steps:

Map gene symbols to STRING protein IDs
Filter interactions for selected genes
Construct a network using NetworkX
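
The three steps above can be sketched with small in-memory tables. The column names (`#string_protein_id`, `preferred_name`, `protein1`, `protein2`, `combined_score`) follow the STRING download files; the protein IDs and gene symbols are placeholders, and how the notebook actually loads the files is not shown here.

```python
import pandas as pd
import networkx as nx

# Stand-ins for the STRING files. With the real downloads, these tables would
# come from pd.read_csv on the info file (tab-separated) and the links file
# (space-separated in the original STRING download).
info = pd.DataFrame({
    "#string_protein_id": ["9606.P1", "9606.P2", "9606.P3", "9606.P4"],
    "preferred_name": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
})
links = pd.DataFrame({
    "protein1": ["9606.P1", "9606.P1", "9606.P2", "9606.P3"],
    "protein2": ["9606.P2", "9606.P3", "9606.P4", "9606.P4"],
    "combined_score": [900, 450, 700, 800],
})

# Genes that survived the earlier filters (GENE_X has no STRING entry).
selected_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_X"]

# Step 1: map gene symbols to STRING protein IDs; unmapped genes are dropped.
symbol_to_id = dict(zip(info["preferred_name"], info["#string_protein_id"]))
id_to_symbol = {v: k for k, v in symbol_to_id.items()}
mapped = {g: symbol_to_id[g] for g in selected_genes if g in symbol_to_id}

# Step 2: keep interactions where both proteins are in the selected set.
kept_ids = set(mapped.values())
mask = links["protein1"].isin(kept_ids) & links["protein2"].isin(kept_ids)
edges = links[mask]

# Step 3: build an undirected graph labelled by gene symbol.
ppi = nx.Graph()
for p1, p2, score in edges.itertuples(index=False):
    ppi.add_edge(id_to_symbol[p1], id_to_symbol[p2], score=int(score))

print(sorted(ppi.edges()))
```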

4. Network Weighting

Gene-gene relationships were quantified using correlation-based distance:

distance = 1 − correlation
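
As a sketch, the distance can be attached to each PPI edge from a gene-by-gene Pearson correlation matrix. The expression values, gene names, and the edge attribute names `corr` and `distance` are assumptions made for illustration.

```python
import pandas as pd
import networkx as nx

# Toy expression matrix for one condition: rows = genes, columns = samples.
expr = pd.DataFrame(
    {
        "S1": [1.0, 2.0, 4.0],
        "S2": [2.0, 4.0, 3.0],
        "S3": [3.0, 6.0, 1.0],
        "S4": [4.0, 8.0, 2.0],
    },
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# DataFrame.corr correlates columns, so transpose to correlate genes.
corr = expr.T.corr(method="pearson")

# Hypothetical PPI edges between the same genes.
ppi = nx.Graph([("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")])

for u, v in ppi.edges():
    r = float(corr.loc[u, v])
    ppi[u][v]["corr"] = r
    ppi[u][v]["distance"] = 1.0 - r  # ranges from 0 (r = 1) to 2 (r = -1)

print(nx.get_edge_attributes(ppi, "distance"))
```

Here GENE_A and GENE_B are perfectly correlated (distance 0), while GENE_A and GENE_C have r = -0.8 and therefore distance 1.8. Computing the correlation separately on healthy and disease samples gives each condition-specific network its own edge weights.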

5. Network Centrality Analysis

Two centrality measures were computed:

Betweenness centrality measures how often a gene lies on shortest paths between other genes, indicating potential regulatory or bottleneck genes.
Eigenvector centrality measures the influence of a gene based on its connections to other highly connected genes, identifying hub genes in the network.

6. Pathway Enrichment Analysis

Functional enrichment analysis was performed using KEGG pathways and Gene Ontology (GO) terms. This identifies the pathways and biological processes significantly associated with the selected genes.
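
The README does not name the enrichment tool, so the following is only a sketch of the over-representation test that underlies many GO/KEGG enrichment tools: a one-sided hypergeometric test. The gene sets and the helper name `enrichment_p` are hypothetical.

```python
from scipy.stats import hypergeom

def enrichment_p(selected, pathway, background):
    """P(overlap >= observed) when drawing len(selected) genes at random."""
    background = set(background)
    selected = set(selected) & background
    pathway = set(pathway) & background
    k = len(selected & pathway)
    # hypergeom.sf(k - 1, M, n, N) = P(X >= k) with
    # M = background size, n = pathway size, N = number of selected genes.
    return hypergeom.sf(k - 1, len(background), len(pathway), len(selected))

# Toy universe of 20 genes.
background = [f"G{i}" for i in range(20)]
selected = ["G0", "G1", "G2", "G3", "G10"]

pathway_hit = ["G0", "G1", "G2", "G3", "G4"]        # 4 of 5 selected genes
pathway_miss = ["G15", "G16", "G17", "G18", "G19"]  # no overlap

p_hit = enrichment_p(selected, pathway_hit, background)
p_miss = enrichment_p(selected, pathway_miss, background)
print(f"hit: {p_hit:.5f}  miss: {p_miss:.1f}")
```

When many pathways are tested, the resulting p-values need multiple testing correction, for example with the same Benjamini–Hochberg method used for the DEGs.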

HOW TO RUN?

1️⃣ Install dependencies


pip install -r requirements.txt

📦 Requirements:

GEOparse pandas numpy scipy statsmodels scikit-learn networkx matplotlib seaborn

2️⃣ Run the analysis notebook (entire_workflow_implementation.ipynb) or scripts.
