Differential gene expression analysis of RNA-seq data using R, including volcano plots and functional insights into top regulated genes
This project explores an RNA-seq dataset comparing diseased cell lines and diseased cell lines treated with compound X. The analysis involves differential expression, visualization with a volcano plot, and functional annotation of top regulated genes.
- Generate a volcano plot.
- Determine the upregulated genes (Genes with Log2FC > 1 and pvalue < 0.01)
- Determine the downregulated genes (Genes with Log2FC < -1 and pvalue < 0.01)
- What are the functions of the top 5 upregulated and top 5 downregulated genes? (Use GeneCards.)
The dataset comes from an experiment comparing an untreated diseased cell line with diseased cell lines treated with compound X. For each gene, the change in expression between the two conditions is reported as a log2 fold change (Log2FC), and its statistical significance as a p-value. The data are read from the GitHub Gist URL assigned to `link_to_rnaseq`.
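As a small worked illustration of these two quantities, the sketch below computes a log2 fold change and the -log10 transform of a p-value for one hypothetical gene. The expression values and the p-value are invented for illustration and are not taken from the dataset. The treated/untreated direction of the ratio is an assumption, chosen to match how the volcano plot is interpreted later in this write-up.

```r
# Hypothetical mean expression values for a single gene (illustrative only)
untreated <- 50
treated <- 200

# log2 fold change: positive means higher expression after treatment
# (assumes the ratio is treated / untreated)
log2fc <- log2(treated / untreated)
log2fc  # 2, i.e. a four-fold increase

# -log10 transform used on the volcano plot's y-axis:
# smaller p-values map to larger values
p_value <- 0.001
neg_log_p <- -log10(p_value)
neg_log_p  # 3

# The significance cutoff pvalue < 0.01 corresponds to -log10(p) > 2
cutoff_y <- -log10(0.01)
cutoff_y  # 2
```

On this scale, a gene needs to sit beyond ±1 on the x-axis and above 2 on the y-axis to meet the thresholds used in this project.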
link_to_rnaseq <- "https://gist.githubusercontent.com/stephenturner/806e31fce55a8b7175af/raw/1a507c4c3f9f1baaa3a69187223ff3d3050628d4/results.txt"
rna_seq <- read.table(file = link_to_rnaseq, header = TRUE)
names(rna_seq)
nrow(rna_seq)  # number of genes (rows) in the table
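head(rna_seq)

# Python: export the same dataset to an Excel file
import pandas as pd
url = "https://gist.githubusercontent.com/stephenturner/806e31fce55a8b7175af/raw/1a507c4c3f9f1baaa3a69187223ff3d3050628d4/results.txt"
df = pd.read_csv(url, sep=r"\s+")  # whitespace-delimited; delim_whitespace is deprecated in recent pandas
df.to_excel("dumbseq_dataset.xlsx", index=False)  # needs an Excel writer engine such as openpyxl

# R: -log10 transform of the p-values for the volcano plot's y-axis
rna_seq$negLogP <- -log10(rna_seq$pvalue)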
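# Basic volcano plot: effect size on the x-axis, significance on the y-axis
plot(rna_seq$log2FoldChange, rna_seq$negLogP,
     main = "Volcano Plot of RNA-seq Data",
     xlab = "log2 Fold Change",
     ylab = "-log10(p-value)",
     pch = 20, col = "black")
# Dashed guides at the fold-change and p-value cutoffs
abline(v = c(-1, 1), col = "red", lty = 2)
abline(h = -log10(0.01), col = "blue", lty = 2)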
rna_seq$diffexpressed <- 'NO'
rna_seq$diffexpressed[rna_seq$log2FoldChange > 1 & rna_seq$pvalue < 0.01] <- 'UP'
rna_seq$diffexpressed[rna_seq$log2FoldChange < -1 & rna_seq$pvalue < 0.01] <- 'DOWN'
head(rna_seq)
# Volcano plot coloured by category: UP = red, DOWN = blue, everything else grey
plot(rna_seq$log2FoldChange, rna_seq$negLogP,
     main = "Volcano Plot with Highlighted Genes",
     xlab = "log2 Fold Change",
     ylab = "-log10(p-value)",
     pch = 20,
     col = ifelse(rna_seq$diffexpressed == "UP", "red",
                  ifelse(rna_seq$diffexpressed == "DOWN", "blue", "grey")))
abline(v = c(-1, 1), col = "grey", lty = 2)
abline(h = -log10(0.01), col = "grey", lty = 2)
The volcano plot shows the distribution of genes by log2 fold change (x-axis) and statistical significance (-log10 p-value, y-axis). Red dots in the upper right are genes upregulated in compound X-treated cells relative to untreated cells (Log2FC > 1, pvalue < 0.01). Blue dots in the upper left are genes downregulated after treatment (Log2FC < -1, pvalue < 0.01). Grey dots are genes that do not meet both thresholds and are not considered differentially expressed.
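To make the coloured points easier to read, gene names can be drawn next to the points that pass both thresholds. The following is a minimal sketch in base R on a toy table; the gene names and values are invented, and `toy` and `palette_de` are hypothetical names used only here. It also swaps the nested `ifelse()` for a named colour vector, which is a stylistic alternative rather than the method used above.

```r
# Toy data standing in for rna_seq (values are invented for illustration)
toy <- data.frame(
  Gene = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF"),
  log2FoldChange = c(1.8, -2.1, 0.2, -0.4, 1.2, -1.5),
  pvalue = c(1e-6, 1e-8, 0.40, 0.03, 0.002, 0.20),
  stringsAsFactors = FALSE
)
toy$negLogP <- -log10(toy$pvalue)

# Same thresholds as the analysis: |log2FC| > 1 and pvalue < 0.01
toy$diffexpressed <- "NO"
toy$diffexpressed[toy$log2FoldChange > 1 & toy$pvalue < 0.01] <- "UP"
toy$diffexpressed[toy$log2FoldChange < -1 & toy$pvalue < 0.01] <- "DOWN"

# Named vector lookup gives one colour per category
palette_de <- c(UP = "red", DOWN = "blue", NO = "grey")

plot(toy$log2FoldChange, toy$negLogP,
     pch = 20, col = palette_de[toy$diffexpressed],
     xlab = "log2 Fold Change", ylab = "-log10(p-value)")

# Label only the genes that pass both thresholds
sig <- toy[toy$diffexpressed != "NO", ]
text(sig$log2FoldChange, sig$negLogP, labels = sig$Gene, pos = 3, cex = 0.8)
```

On the real data, labelling every significant gene would likely clutter the plot, so restricting `sig` to the top few genes per direction is a reasonable variation.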
Interpretation: Compound X treatment induces both upregulation and downregulation of multiple genes, suggesting it influences disease-related molecular pathways.
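The statement that multiple genes change in both directions can be backed with counts per category. The sketch below shows one way to do this with `table()` on a toy data frame; the values are invented, and `toy`, `up_all`, and `down_all` are hypothetical names. On the real data, the same calls would be applied to `rna_seq` after the `diffexpressed` column has been created.

```r
# Toy table with the same columns used in the analysis (invented values)
toy <- data.frame(
  Gene = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  log2FoldChange = c(1.8, -2.1, 0.2, -1.4, 1.2),
  pvalue = c(1e-6, 1e-8, 0.40, 0.03, 0.002),
  stringsAsFactors = FALSE
)

# Classify with the thresholds from the analysis
toy$diffexpressed <- "NO"
toy$diffexpressed[toy$log2FoldChange > 1 & toy$pvalue < 0.01] <- "UP"
toy$diffexpressed[toy$log2FoldChange < -1 & toy$pvalue < 0.01] <- "DOWN"

# Count genes in each category
counts <- table(toy$diffexpressed)
print(counts)

# Full gene lists, not only the top five
up_all <- toy$Gene[toy$diffexpressed == "UP"]
down_all <- toy$Gene[toy$diffexpressed == "DOWN"]
```

Note that geneD has a large negative fold change but fails the p-value cutoff, so it stays in the "NO" group; both conditions must hold for a gene to be called.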
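library(dplyr)  # provides %>%, filter(), arrange(), select()

# Top 5 upregulated genes, largest log2 fold change first
up_reg <- rna_seq %>%
  filter(diffexpressed == "UP") %>%
  arrange(desc(log2FoldChange)) %>%
  head(5) %>%
  select(Gene, log2FoldChange, pvalue)
print(up_reg)

| Gene | log2FoldChange | pvalue |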
|---|---|---|
| DTHD1 | 1.540 | 5.594e-05 |
| EMILIN2 | 1.534 | 2.976e-06 |
| PI16 | 1.495 | 1.297e-04 |
| C4orf45 | 1.288 | 2.472e-04 |
| FAM180B | 1.249 | 1.146e-03 |

# Top 5 downregulated genes, most negative log2 fold change first
down_genes <- rna_seq %>%
  filter(diffexpressed == "DOWN") %>%
  arrange(log2FoldChange) %>%
  head(5) %>%
  select(Gene, log2FoldChange, pvalue)
print(down_genes)

| Gene | log2FoldChange | pvalue |
|---|---|---|
| TBX5 | -2.129 | 5.655e-08 |
| IFITM1 | -1.687 | 3.735e-06 |
| TNN | -1.658 | 8.973e-06 |
| COL13A1 | -1.647 | 1.394e-05 |
| IFITM3 | -1.610 | 1.202e-05 |
The analysis identified distinct sets of genes that are upregulated and downregulated after compound X treatment. The upregulated genes (e.g., DTHD1, EMILIN2) suggest enhanced apoptosis and extracellular matrix remodeling, while the downregulated genes (e.g., TBX5, IFITM1) suggest suppression of developmental and immune-related transcriptional programs. These findings point to potential molecular mechanisms by which compound X may exert its therapeutic effects.
Genecard: See here
Task: HackBio