anyavala/Gene-expression-analysis-with-KG

🧬 COVID-19 Thrombosis Gene Expression & Network Analysis using the DiCE algorithm

This project analyzes gene expression differences between healthy controls and COVID-19 patients with thrombotic complications using transcriptomic data from the GEO dataset GSE300129.

The goal is to identify:

  • Differentially expressed genes (DEGs)
  • Key regulatory genes in protein–protein interaction networks
  • Enriched biological pathways related to thrombosis in severe COVID-19

Pipeline Overview

GEO Dataset

Data Preprocessing

Differential Expression Analysis

Feature Selection

PPI Network Construction

Network Centrality Analysis (DiCE)

Pathway Enrichment (GO / KEGG)

Visualization

🧠 What is Differential Centrality (DiCE)?

Differential Centrality (DiCE) is a network-based gene prioritization method that identifies genes whose importance within a biological interaction network changes between conditions (for example, healthy vs disease).

Instead of focusing only on gene expression changes, DiCE evaluates how a gene’s network connectivity and influence differ between condition-specific networks. This allows the identification of genes that may not show large expression changes but become more central or influential in disease networks.

The method integrates differential expression analysis, feature selection, and network topology analysis to identify biologically relevant genes associated with disease progression.

This approach enables the discovery of regulatory genes and network hubs that may be overlooked by traditional differential expression analysis alone.

Original publication of the DiCE algorithm:

https://academic.oup.com/nar/article/53/13/gkaf609/8192812?login=false

For more details, see:
Pashaei et al., DiCE: differential centrality-ensemble analysis based on gene expression profiles and protein–protein interaction network, Nucleic Acids Research (2025).

The DiCE algorithm generally follows these key phases:

  1. Statistical significance filter

Genes with significant differential expression after FDR correction were selected (FDR < 0.05).

  2. Fold-change filter

Genes were also filtered using a log2 fold change threshold to ensure meaningful expression differences (|log2FC| > 0.5).

This removes genes with very small expression changes that are unlikely to be biologically relevant.

  3. Feature selection filter

Information Gain was calculated to measure how informative each gene is for distinguishing Healthy vs Disease samples.

Only genes with information gain greater than the average information gain were retained.

  4. STRING mapping filter

Genes were mapped to STRING protein IDs, and only successfully mapped genes were kept.

  5. Interaction filter

Only protein–protein interactions where both interacting proteins are present in the selected gene list were retained.

These filtering steps are intended to restrict the network to genes that are statistically supported, informative for the class label, and present in STRING.
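The first three filters can be sketched as a single function over a per-gene table. This is a minimal illustration, not the repository's code: the table layout, the column names (`fdr`, `log2fc`, `info_gain`), and the function name are hypothetical, and it assumes the information-gain average is taken over the genes that survive the first two filters.

```python
import pandas as pd


def apply_expression_filters(stats, fdr_cutoff=0.05, lfc_cutoff=0.5):
    """Apply the FDR, fold-change, and information-gain filters in sequence.

    `stats` is a hypothetical per-gene table with the columns
    'fdr', 'log2fc', and 'info_gain'.
    """
    passed = stats[(stats["fdr"] < fdr_cutoff) & (stats["log2fc"].abs() > lfc_cutoff)]
    # Assumption: the information-gain average is computed over the genes
    # that survived the first two filters.
    return passed[passed["info_gain"] > passed["info_gain"].mean()]


# Toy values for illustration only (not results from GSE300129).
toy_stats = pd.DataFrame(
    {
        "fdr": [0.01, 0.20, 0.03, 0.04],
        "log2fc": [1.2, 2.0, -0.1, -0.9],
        "info_gain": [0.30, 0.50, 0.40, 0.10],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)
kept = apply_expression_filters(toy_stats)
print(list(kept.index))
```

Keeping the thresholds as keyword arguments makes it easy to check how sensitive the final gene list is to the cutoffs.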

🕸 Network Construction

Two separate networks were constructed, one from the control (healthy) samples and one from the disease samples.

Edges represent protein–protein interactions from the STRING database.

Edge weights were calculated using gene expression correlation distance:

distance = 1 − Pearson correlation
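A minimal sketch of this weighting, assuming a genes x samples expression table for a single condition and a list of gene pairs already taken from STRING. The function name and the edge attribute names (`correlation`, `distance`) are illustrative choices, not taken from the repository.

```python
import networkx as nx
import pandas as pd


def build_weighted_network(expr, edges):
    """Build one condition-specific network.

    expr  : genes x samples table for ONE condition (control or disease)
    edges : iterable of (gene_u, gene_v) pairs, e.g. taken from STRING
    """
    corr = expr.T.corr(method="pearson")  # gene-by-gene Pearson correlation
    graph = nx.Graph()
    for u, v in edges:
        # Skip pairs where either gene is missing from the expression table.
        if u in corr.index and v in corr.index:
            r = float(corr.loc[u, v])
            graph.add_edge(u, v, correlation=r, distance=1.0 - r)
    return graph


# Toy expression values for illustration only.
toy_expr = pd.DataFrame(
    {"s1": [1, 2, 1], "s2": [2, 4, 3], "s3": [3, 6, 2], "s4": [4, 8, 4]},
    index=["G1", "G2", "G3"],
)
net = build_weighted_network(toy_expr, [("G1", "G2"), ("G1", "G3"), ("G1", "G9")])
print(net.edges(data=True))
```

Calling the function once on the control columns and once on the disease columns gives two networks with the same STRING topology but different edge weights.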

📊 Centrality Calculation

Two network centrality measures were computed for each gene.

Betweenness Centrality

Measures how frequently a gene lies on the shortest paths between other genes.

This identifies genes that act as communication bridges between biological pathways.

⭐ Eigenvector Centrality

Measures gene importance based on connections to other highly connected genes.

This identifies hub genes in the network.


🔄 Differential Centrality (DiCE)

Centrality values were compared between the two networks:

Δ Centrality = |Centrality_disease − Centrality_control|

⚠️ Note:

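  • Distance (1 − correlation) is used as the edge weight for betweenness centrality, where smaller weights mean shorter paths
  • Correlation is used as the edge weight for eigenvector centrality, where larger weights mean stronger connections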
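The comparison can be sketched with NetworkX as follows. This is an illustration under stated assumptions: edges carry `distance` and `correlation` attributes (names chosen here), correlations are non-negative (the source does not say how negative correlations are handled), and a gene missing from one network is given centrality 0.

```python
import networkx as nx


def betweenness(graph):
    # Shortest paths use the correlation distance as the edge weight.
    return nx.betweenness_centrality(graph, weight="distance")


def eigenvector(graph):
    # Connection strength uses the correlation itself.
    # Assumption: edge correlations are non-negative.
    return nx.eigenvector_centrality(graph, weight="correlation", max_iter=1000)


def delta_centrality(control, disease, measure):
    """|Centrality_disease - Centrality_control| for every gene in either network."""
    c_control = measure(control)
    c_disease = measure(disease)
    genes = set(c_control) | set(c_disease)
    return {g: abs(c_disease.get(g, 0.0) - c_control.get(g, 0.0)) for g in genes}


def toy_graph(r_ab, r_bc, r_ac):
    # Toy triangle with illustrative correlations (not real data).
    graph = nx.Graph()
    for u, v, r in [("A", "B", r_ab), ("B", "C", r_bc), ("A", "C", r_ac)]:
        graph.add_edge(u, v, correlation=r, distance=1.0 - r)
    return graph


control_net = toy_graph(0.9, 0.9, 0.1)  # A reaches C most cheaply through B
disease_net = toy_graph(0.2, 0.2, 0.9)  # A and C are now directly "close"

delta_bc = delta_centrality(control_net, disease_net, betweenness)
delta_ec = delta_centrality(control_net, disease_net, eigenvector)
print(delta_bc)
print(delta_ec)
```

In the toy example gene B bridges A and C only in the control network, so its betweenness changes while that of A and C does not.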

🦠 Pipeline Applied to COVID-19 Dataset

Dataset GSE300129 contains:

  • Healthy control samples
  • COVID-19 ICU patients with thrombotic complications
  • COVID-19 ICU patients with non-thrombotic complications

Steps:

  1. Download GEO dataset
  2. Extract expression matrix
  3. Extract sample metadata

🧹 Metadata Processing

This processing step can be done with the GEO2R tool: you select the samples, groups, and threshold value, then download the result. Further steps are carried out in Python or R.

Instead, I handled each step in Python: I selected my groups and saved them in dataframes. When exploring the dataset in GEO2R, I observed that the number of significant genes (by p-value) between the non-thrombotic and thrombotic groups was zero, so I merged the two groups into a single disease group and continued data processing with healthy and disease samples.

The expression matrix was structured as:

  • rows = genes

  • columns = samples

To standardize gene expression values, Z-score normalization was applied:

Z = (X − mean) / standard deviation

Purpose:

  • Remove scale differences between genes
  • Improve comparability across samples
  • Improve statistical analysis stability
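A minimal sketch of the normalization, assuming it is applied per gene (across the samples in each row) and uses the population standard deviation. Both choices are assumptions, since the text does not state the axis or the degrees of freedom.

```python
import pandas as pd


def zscore_genes(expr):
    """Z-score each gene (row) across samples (columns)."""
    mean = expr.mean(axis=1)
    std = expr.std(axis=1, ddof=0)
    # Constant genes have zero standard deviation; divide by 1 so they become 0.
    std = std.replace(0, 1.0)
    return expr.sub(mean, axis=0).div(std, axis=0)


# Toy values for illustration only.
toy_expr = pd.DataFrame(
    [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]],
    index=["G1", "G2"],
    columns=["s1", "s2", "s3"],
)
z = zscore_genes(toy_expr)
print(z)
```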

1. Differential Expression Analysis

Gene expression differences between healthy and disease samples were tested using Welch’s t-test.

Implemented using:

scipy.stats.ttest_ind

Genes with p < 0.05 were kept.

Multiple Testing Correction

Because thousands of genes are tested simultaneously, False Discovery Rate (FDR) correction was applied using the Benjamini–Hochberg method.

Genes with FDR < 0.05 were considered significantly differentially expressed genes (DEGs).
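The test and correction can be sketched as below. `scipy.stats.ttest_ind` performs Welch's test when `equal_var=False`, and `statsmodels` provides Benjamini–Hochberg via `multipletests(..., method="fdr_bh")`. The function name, column labels, and the choice to correct over all tested genes are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests


def welch_test_with_fdr(expr, healthy_cols, disease_cols, alpha=0.05):
    """Per-gene Welch's t-test followed by Benjamini-Hochberg correction."""
    healthy = expr[healthy_cols].to_numpy()
    disease = expr[disease_cols].to_numpy()
    # equal_var=False selects Welch's t-test (unequal variances).
    t_stat, p_value = ttest_ind(disease, healthy, axis=1, equal_var=False)
    # Assumption: the correction is applied across all tested genes.
    reject, fdr, _, _ = multipletests(p_value, alpha=alpha, method="fdr_bh")
    return pd.DataFrame(
        {"t": t_stat, "p_value": p_value, "fdr": fdr, "significant": reject},
        index=expr.index,
    )


# Synthetic data for illustration only: 20 genes, 6 samples per group.
rng = np.random.default_rng(0)
healthy_cols = [f"H{i}" for i in range(6)]
disease_cols = [f"D{i}" for i in range(6)]
toy = pd.DataFrame(
    rng.normal(size=(20, 12)),
    index=[f"gene{i}" for i in range(20)],
    columns=healthy_cols + disease_cols,
)
toy.loc["gene0", disease_cols] += 10.0  # one gene with a strong shift

result = welch_test_with_fdr(toy, healthy_cols, disease_cols)
print(result.sort_values("fdr").head())
```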

2. Fold Change Calculation

Expression differences were quantified using log2 fold change:

log2FC = log2(disease / control)

Interpretation:

positive log2FC → upregulated genes

negative log2FC → downregulated genes
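A minimal sketch of the calculation, assuming non-negative expression values on the original scale (not log-transformed or Z-scored, since a ratio of such values is not meaningful). The pseudocount is an assumption added here to avoid division by zero. For data that are already log2-transformed, the difference of group means would be used instead.

```python
import numpy as np
import pandas as pd


def log2_fold_change(expr, healthy_cols, disease_cols, pseudocount=1.0):
    """log2(mean disease / mean healthy) per gene, with a small pseudocount."""
    mean_disease = expr[disease_cols].mean(axis=1)
    mean_healthy = expr[healthy_cols].mean(axis=1)
    return np.log2((mean_disease + pseudocount) / (mean_healthy + pseudocount))


# Toy values for illustration only.
toy_expr = pd.DataFrame(
    {"H1": [3.0, 7.0, 5.0], "H2": [3.0, 7.0, 5.0],
     "D1": [7.0, 3.0, 5.0], "D2": [7.0, 3.0, 5.0]},
    index=["up_gene", "down_gene", "flat_gene"],
)
lfc = log2_fold_change(toy_expr, ["H1", "H2"], ["D1", "D2"])
print(lfc)
```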

Feature Selection

Following the DiCE algorithm, Information Gain was used for feature selection. It measures how informative each gene is for distinguishing Healthy vs Disease samples.

Implemented with:

sklearn.feature_selection.mutual_info_classif

Genes with information gain greater than the average were selected.
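This selection step can be sketched as below, using `mutual_info_classif` as the information-gain estimate. The function name, the fixed `random_state`, and the synthetic data are assumptions for illustration. Note that scikit-learn expects samples in rows, so the genes x samples matrix is transposed.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def select_by_information_gain(expr, labels, random_state=0):
    """Keep genes whose mutual information with the class label exceeds the mean."""
    # expr is genes x samples; scikit-learn wants samples x features.
    scores = mutual_info_classif(expr.T.to_numpy(), labels, random_state=random_state)
    info_gain = pd.Series(scores, index=expr.index, name="info_gain")
    return info_gain[info_gain > info_gain.mean()]


# Synthetic data for illustration only: 10 healthy and 10 disease samples.
rng = np.random.default_rng(1)
labels = np.array([0] * 10 + [1] * 10)
toy = pd.DataFrame(
    rng.normal(size=(5, 20)),
    index=["signal", "noise1", "noise2", "noise3", "noise4"],
)
toy.loc["signal", toy.columns[10:]] += 10.0  # clearly separates the two groups

selected = select_by_information_gain(toy, labels)
print(selected)
```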

3. Protein–Protein Interaction (PPI) Network Construction

Protein interaction data were obtained from the STRING database.

Files used:

  • 9606.protein.info.v12.0.txt

  • 9606.protein.links.v12.0.txt (renamed to protein_links.csv)

Steps:

  • Map gene symbols to STRING protein IDs

  • Filter interactions for selected genes

  • Construct a network using NetworkX
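The mapping and filtering steps can be sketched on in-memory tables as below. The column names (`#string_protein_id`, `preferred_name`, `protein1`, `protein2`, `combined_score`) are assumed to match the headers of the STRING download files, so check them against your copies. The `min_score` parameter is hypothetical, since the source does not mention a score cutoff, and the protein IDs in the example are made up.

```python
import networkx as nx
import pandas as pd


def build_ppi_graph(genes, protein_info, protein_links, min_score=0):
    """Map gene symbols to STRING IDs and keep interactions among selected genes."""
    info = protein_info[protein_info["preferred_name"].isin(set(genes))]
    id_to_gene = dict(zip(info["#string_protein_id"], info["preferred_name"]))
    mapped_ids = set(id_to_gene)

    # Keep an interaction only if BOTH proteins map to selected genes.
    links = protein_links[
        protein_links["protein1"].isin(mapped_ids)
        & protein_links["protein2"].isin(mapped_ids)
        & (protein_links["combined_score"] >= min_score)  # hypothetical cutoff
    ]

    graph = nx.Graph()
    for p1, p2, score in links[["protein1", "protein2", "combined_score"]].itertuples(index=False):
        graph.add_edge(id_to_gene[p1], id_to_gene[p2], string_score=score)
    return graph


# Made-up identifiers for illustration only.
toy_info = pd.DataFrame(
    {
        "#string_protein_id": ["9606.P1", "9606.P2", "9606.P3"],
        "preferred_name": ["GENE_A", "GENE_B", "GENE_C"],
    }
)
toy_links = pd.DataFrame(
    {
        "protein1": ["9606.P1", "9606.P2", "9606.P1"],
        "protein2": ["9606.P2", "9606.P1", "9606.P3"],
        "combined_score": [900, 900, 700],
    }
)
ppi = build_ppi_graph(["GENE_A", "GENE_B", "GENE_X"], toy_info, toy_links)
print(sorted(ppi.nodes()), ppi.number_of_edges())
```

Using an undirected `nx.Graph` collapses the two directions in which an interaction may be listed into a single edge.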

4. Network Weighting

Gene-gene relationships were quantified using correlation-based distance:

  • distance = 1 − correlation

5. Network Centrality Analysis

Two centrality measures were computed:

  • Betweenness Centrality measures how often a gene lies on the shortest paths between other genes. It indicates potential regulatory or bottleneck genes.
  • Eigenvector Centrality measures the influence of a gene based on its connections to other highly connected genes. It identifies hub genes in the network.

6. Pathway Enrichment Analysis

Functional enrichment analysis was performed using KEGG pathways and Gene Ontology (GO) terms. This identifies pathways and biological processes that are significantly over-represented among the selected genes.
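The source does not name the enrichment tool, so the following is only a generic sketch of over-representation analysis for one pathway, using the hypergeometric test from SciPy. The function name, the gene sets, and the background are hypothetical, and in practice the resulting p-values would also need multiple-testing correction across pathways.

```python
from scipy.stats import hypergeom


def enrichment_p_value(selected, pathway, background):
    """P(overlap >= observed) under the hypergeometric null for one pathway."""
    background = set(background)
    selected = set(selected) & background
    pathway = set(pathway) & background
    overlap = len(selected & pathway)
    # sf(k - 1) gives P(X >= k); arguments are (population, successes, draws).
    return float(hypergeom.sf(overlap - 1, len(background), len(pathway), len(selected)))


# Made-up gene sets for illustration only.
background_genes = [f"g{i}" for i in range(20)]
pathway_genes = ["g0", "g1", "g2", "g3", "g4"]
selected_genes = ["g0", "g1", "g2", "g3"]

p_enriched = enrichment_p_value(selected_genes, pathway_genes, background_genes)
p_none = enrichment_p_value(["g10", "g11"], pathway_genes, background_genes)
print(p_enriched, p_none)
```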

HOW TO RUN?

1️⃣ Install dependencies


pip install -r requirements.txt

📦 Requirements:

GEOparse pandas numpy scipy statsmodels scikit-learn networkx matplotlib seaborn

2️⃣ Run the analysis notebook or scripts.
