Skip to content

Commit 2c916dc

Browse files
committed
Add doublet detection, PAGA trajectory, T cell subclustering
Pipeline expanded from 6 to 8 steps: - 01: Scrublet doublet detection (36 detected, 34 removed after QC) - 06: PAGA trajectory inference + diffusion pseudotime rooted in CD14+ monocytes. PAGA-initialised UMAP for cleaner layout. - 07: T cell subclustering resolves CD4+ (47.5%) and CD8+ (52.5%) via marker scoring — addresses the main biological gap. - 08: Publication figures now export PDF alongside PNG. 2,604 cells retained (was 2,638 before doublet removal). 5 clusters at resolution 0.5, silhouette 0.204. All seeds set for reproducibility.
1 parent c1f4523 commit 2c916dc

6 files changed

Lines changed: 274 additions & 37 deletions

File tree

README.md

Lines changed: 63 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
55
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/)
66

7-
End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Quality control, normalisation, dimensionality reduction, unsupervised clustering with automated resolution selection, and marker-based cell type annotation on human peripheral blood mononuclear cells.
7+
End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Doublet detection, quality control, normalisation, dimensionality reduction, clustering with automated resolution selection, marker-based cell type annotation, PAGA trajectory inference, and T cell subclustering on human peripheral blood mononuclear cells.
88

99
<p align="center">
1010
<img src="docs/umap_3d_rotation.gif" alt="3D UMAP rotation showing PBMC immune cell clusters" width="600">
@@ -28,63 +28,91 @@ End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https:
2828
## Workflow
2929

3030
```
31-
PBMC 3k (10X Genomics)
31+
PBMC 3k (10X Genomics, 2,700 cells)
3232
3333
34-
01 QC ──────────── Filter: 200 < genes < 2500, mito < 5%
35-
34+
01 QC ──────────── Scrublet doublet detection → filter: 200 < genes < 2500, mito < 5%
35+
(36 doublets detected, 34 removed)
3636
3737
02 Preprocess ──── Normalise (10k), log1p, 2000 HVGs, regress, scale
3838
3939
40-
03 Reduce ──────── PCA (40 PCs) → kNN graph → UMAP
40+
03 Reduce ──────── PCA (40 PCs) → kNN graph → UMAP (random_state=42)
4141
4242
4343
04 Cluster ─────── Leiden at 5 resolutions → silhouette selection (≥5 clusters)
4444
4545
46-
05 Annotate ────── Wilcoxon DE → score against PBMC marker signatures
46+
05 Annotate ────── Wilcoxon DE → score against PBMC marker signatures → 5 cell types
47+
48+
49+
06 Trajectory ──── PAGA graph abstraction → diffusion pseudotime (rooted in CD14+ mono)
4750
4851
49-
06 Figures ─────── Multi-panel publication figure + 3D UMAP
52+
07 Subcluster ──── T cell compartment → resolve CD4+ (47.5%) and CD8+ (52.5%)
53+
54+
55+
08 Figures ─────── Multi-panel publication figure (PNG 300 DPI + PDF vector)
5056
```
5157

5258
## Pipeline
5359

5460
| Step | Script | What it does |
5561
|------|--------|--------------|
56-
| 01 | `01_load_and_qc.py` | Download PBMC 3k, calculate QC metrics (genes/cell, UMI counts, mitochondrial %), filter low-quality cells |
57-
| 02 | `02_preprocess.py` | Normalise to 10k counts/cell, log-transform, select 2,000 highly variable genes, regress out confounders, scale |
58-
| 03 | `03_reduce_dimensions.py` | PCA (40 components), build k-nearest neighbour graph, compute UMAP embedding |
59-
| 04 | `04_cluster.py` | Leiden clustering at 5 resolutions (0.3–1.2), evaluate with silhouette score, select best with a floor of 5 clusters |
60-
| 05 | `05_annotate_cell_types.py` | Wilcoxon rank-sum test for marker genes, score clusters against known PBMC signatures, assign cell types |
61-
| 06 | `06_publication_figures.py` | Multi-panel figure: UMAP, composition bar chart, marker heatmap, summary statistics |
62+
| 01 | `01_load_and_qc.py` | Download PBMC 3k, Scrublet doublet detection, QC metrics, filter low-quality cells and doublets |
63+
| 02 | `02_preprocess.py` | Normalise to 10k counts/cell, log-transform, select 2,000 HVGs, regress out confounders, scale |
64+
| 03 | `03_reduce_dimensions.py` | PCA (40 components), k-nearest neighbour graph, UMAP embedding |
65+
| 04 | `04_cluster.py` | Leiden clustering at 5 resolutions (0.3–1.2), silhouette evaluation, select best with ≥5 cluster floor |
66+
| 05 | `05_annotate_cell_types.py` | Wilcoxon rank-sum DE, score clusters against curated PBMC signatures, assign cell types |
67+
| 06 | `06_trajectory.py` | PAGA partition-based graph abstraction, PAGA-initialised UMAP, diffusion pseudotime |
68+
| 07 | `07_t_cell_subclustering.py` | Extract T cell compartment, subcluster, resolve CD4+/CD8+ via marker scoring |
69+
| 08 | `08_publication_figures.py` | Multi-panel figure with UMAP, composition, marker heatmap, summary (PNG + PDF) |
6270

6371
All scripts are in `scripts/`. Each reads the previous step's `.h5ad` output from `results/`.
6472

6573
## Results
6674

75+
### Cell Type Composition
76+
6777
| Cell Type | Cells | % | Key Markers |
6878
|-----------|-------|---|-------------|
69-
| CD4+ T cells | 1,195 | 45.3 | CD3D, IL7R |
70-
| CD14+ Monocytes | 464 | 17.6 | CD14, LYZ |
71-
| NK cells | 419 | 15.9 | NKG7, GNLY |
72-
| B cells | 342 | 13.0 | MS4A1, CD79A |
73-
| FCGR3A+ Monocytes | 180 | 6.8 | FCGR3A, MS4A7 |
74-
| Dendritic cells | 38 | 1.4 | FCER1A, CST3 |
79+
| CD4+ T cells | 1,192 | 45.8 | CD3D, IL7R |
80+
| CD14+ Monocytes | 636 | 24.4 | CD14, LYZ |
81+
| NK cells | 410 | 15.7 | NKG7, GNLY |
82+
| B cells | 330 | 12.7 | MS4A1, CD79A |
83+
| Dendritic cells | 36 | 1.4 | FCER1A, CST3 |
84+
85+
2,604 cells retained after QC and doublet removal (from 2,700 raw). Clustering selected resolution 0.5 (5 clusters, silhouette 0.204).
86+
87+
### T Cell Subclustering
88+
89+
Subclustering the T cell compartment (1,192 cells) resolves the CD4+/CD8+ boundary that is not visible at the global clustering level:
90+
91+
| Subtype | Cells | % of T cells |
92+
|---------|-------|---|
93+
| CD8+ T | 626 | 52.5 |
94+
| CD4+ T | 566 | 47.5 |
95+
96+
The near-equal split is consistent with healthy donor PBMCs. CD8+ T cells were assigned by scoring CD8A/CD8B/GZMK/GZMA against IL7R/CD4/TCF7/LEF1.
97+
98+
### Trajectory Inference
99+
100+
PAGA connects CD14+ monocytes → dendritic cells (the myeloid differentiation axis) and reveals the T/NK cell cluster neighbourhood in UMAP space. Diffusion pseudotime, rooted in CD14+ monocytes, orders cells along the monocyte-to-DC trajectory.
101+
102+
### Biological Interpretation
75103

76-
The dominance of CD4+ T cells (45%) is expected in healthy donor PBMCs. The ratio of classical (CD14+) to nonclassical (FCGR3A+) monocytes is approximately 2.6:1, consistent with published literature. Dendritic cells are a rare population (1.4%), correctly resolved as a distinct cluster. CD8+ T cells and megakaryocytes are present in the dataset but were not resolved as separate clusters at resolution 0.5 — they likely merge with the CD4+ T cell and monocyte clusters respectively due to shared marker expression (CD3D/CD3E for T cell subtypes).
104+
The dominance of CD4+ T cells (46%) is expected in healthy donor PBMCs. Dendritic cells are a rare population (1.4%), correctly resolved as a distinct cluster despite low cell count. The monocyte population is predominantly classical (CD14+); nonclassical (FCGR3A+) monocytes were not resolved as a separate cluster at resolution 0.5 — they likely merge with the classical monocyte cluster. This is consistent with the resolution-sensitivity of FCGR3A+ monocyte separation observed in the literature.
77105

78-
Clustering selected resolution 0.5 (6 clusters, silhouette 0.196). Silhouette scores in single-cell data are typically low due to continuous rather than discrete cell states; the metric is used here for relative comparison between resolutions, not as an absolute quality measure.
106+
Silhouette scores in single-cell data are typically low due to continuous rather than discrete cell states; the metric is used here for relative comparison between resolutions, not as an absolute quality measure.
79107

80108
## Quick Start
81109

82110
```bash
83111
git clone https://github.com/Ekin-Kahraman/single-cell-rnaseq-immune-profiling.git
84112
cd single-cell-rnaseq-immune-profiling
85113
pip install -e .
86-
python run_pipeline.py # full pipeline (~17s)
87-
python run_pipeline.py --from 4 # resume from step 4
114+
python run_pipeline.py # full pipeline (~38s)
115+
python run_pipeline.py --from 6 # resume from trajectory step
88116
```
89117

90118
## Testing
@@ -98,17 +126,21 @@ pytest -v
98126

99127
## Design Decisions
100128

101-
- **Automated annotation** — Clusters are scored against curated PBMC marker gene sets rather than annotated by manual inspection. This makes the pipeline reproducible and removes subjective judgement.
102-
- **Multi-resolution clustering** — Running Leiden at multiple resolutions and picking by silhouette score (with a biological floor) avoids the common problem of choosing an arbitrary resolution.
129+
- **Doublet detection** — Scrublet integrated before QC filtering. 36 doublets detected (1.3%), 34 removed after other QC filters. Following [Luecken & Theis (2019)](https://doi.org/10.15252/msb.20188746) best practices.
130+
- **Automated annotation** — Clusters scored against curated PBMC marker gene sets rather than manual inspection. Reproducible and removes subjective judgement.
131+
- **Multi-resolution clustering** — Leiden at 5 resolutions with silhouette evaluation and a biological floor of ≥5 clusters.
132+
- **Trajectory inference** — PAGA provides a principled graph abstraction of cell-type connectivity. Diffusion pseudotime orders cells along differentiation axes.
133+
- **T cell subclustering** — Resolves CD4+/CD8+ populations that share CD3D/CD3E expression and cannot be separated at global clustering resolution.
103134
- **Colourblind-friendly palette** — Okabe-Ito colours throughout.
104-
- **Modular scripts** — Each step is independent. Re-run any step without repeating upstream work.
135+
- **Reproducible seeds**`random_state=42` for UMAP, Leiden, Scrublet, and silhouette sampling.
136+
- **Dual-format figures** — PNG (300 DPI) for web, PDF (vector) for publication submission.
105137

106-
## Limitations and Future Work
138+
## Limitations
107139

108-
- **No doublet detection.** Scrublet or similar should precede QC in a production pipeline. Omitted here because PBMC 3k is a clean benchmark with negligible doublet rates.
109-
- **No batch correction.** Single-sample dataset. Multi-sample analyses would require Harmony, scVI, or BBKNN.
110-
- **`regress_out` is debatable.** Used here following the original scanpy tutorial, but Luecken & Theis (2019) suggest regression may overcorrect for well-filtered cells. Included for pedagogical alignment with the standard workflow.
111-
- **CD8+ T cells not resolved.** Would require higher clustering resolution or subclustering of the T cell compartment.
140+
- **Single-sample dataset.** Multi-sample analyses would require batch correction (Harmony, scVI, or BBKNN).
141+
- **`regress_out` is debatable.** Used here following the original scanpy tutorial, but Luecken & Theis (2019) suggest regression may overcorrect for well-filtered cells.
142+
- **No pathway enrichment.** Gene set enrichment (via decoupler or GSEApy) would connect cell types to functional programmes. Planned as a future addition.
143+
- **FCGR3A+ monocytes not resolved.** At resolution 0.5, nonclassical monocytes merge with the CD14+ cluster. Higher resolution or targeted subclustering would separate them.
112144

113145
## Licence
114146

run_pipeline.py

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -20,9 +20,13 @@
2020
("03 Reduce dimensions", "03_reduce_dimensions.py"),
2121
("04 Cluster", "04_cluster.py"),
2222
("05 Annotate cell types", "05_annotate_cell_types.py"),
23-
("06 Publication figures", "06_publication_figures.py"),
23+
("06 Trajectory inference", "06_trajectory.py"),
24+
("07 T cell subclustering", "07_t_cell_subclustering.py"),
25+
("08 Publication figures", "08_publication_figures.py"),
2426
]
2527

28+
N_STEPS = len(STEPS)
29+
2630

2731
def run_pipeline(start_from=1):
2832
print("=" * 60)
@@ -32,10 +36,10 @@ def run_pipeline(start_from=1):
3236
total_start = time.time()
3337
for i, (name, script) in enumerate(STEPS, 1):
3438
if i < start_from:
35-
print(f"\n[{i}/6] {name} -- SKIPPED")
39+
print(f"\n[{i}/{N_STEPS}] {name} -- SKIPPED")
3640
continue
3741
print(f"\n{'─' * 60}")
38-
print(f"[{i}/6] {name}")
42+
print(f"[{i}/{N_STEPS}] {name}")
3943
print("─" * 60)
4044
step_start = time.time()
4145
result = subprocess.run(
@@ -57,7 +61,7 @@ def run_pipeline(start_from=1):
5761
def main():
5862
parser = argparse.ArgumentParser(description="Run single-cell RNA-seq pipeline")
5963
parser.add_argument("--from", dest="start_from", type=int, default=1,
60-
help="Step number to start from (1-6)")
64+
help=f"Step number to start from (1-{N_STEPS})")
6165
args = parser.parse_args()
6266
run_pipeline(start_from=args.start_from)
6367

scripts/01_load_and_qc.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -84,13 +84,22 @@ def run_qc(adata):
8484
plt.close(fig)
8585
print(f"Saved QC plot to {FIG_DIR / '01_qc_metrics.png'}")
8686

87+
# --- Doublet detection ---
88+
sc.pp.scrublet(adata, random_state=42)
89+
n_doublets = adata.obs["predicted_doublet"].sum()
90+
print(f"Scrublet detected {n_doublets} predicted doublets ({n_doublets/adata.n_obs*100:.1f}%)")
91+
8792
# --- Filtering ---
8893
n_before = adata.n_obs
8994
sc.pp.filter_cells(adata, min_genes=MIN_GENES_PER_CELL)
9095
sc.pp.filter_genes(adata, min_cells=MIN_CELLS_PER_GENE)
9196
adata = adata[adata.obs["n_genes_by_counts"] < MAX_GENES_PER_CELL, :].copy()
9297
adata = adata[adata.obs["pct_counts_mt"] < MAX_PCT_MITO, :].copy()
98+
# Remove predicted doublets
99+
n_doublets_remaining = adata.obs["predicted_doublet"].sum()
100+
adata = adata[~adata.obs["predicted_doublet"], :].copy()
93101
n_after = adata.n_obs
102+
print(f"Removed {n_doublets_remaining} doublets from filtered cells")
94103

95104
print(f"Filtered: {n_before} -> {n_after} cells ({n_before - n_after} removed)")
96105
print(f"Genes remaining: {adata.n_vars}")

scripts/06_trajectory.py

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@
1+
"""Step 06: PAGA trajectory inference and diffusion pseudotime."""
2+
3+
import scanpy as sc
4+
import matplotlib.pyplot as plt
5+
from pathlib import Path
6+
7+
RESULTS_DIR = Path("results")
8+
FIG_DIR = RESULTS_DIR / "figures"
9+
10+
11+
def compute_trajectory(adata):
12+
"""Run PAGA and diffusion pseudotime on annotated data."""
13+
# PAGA — partition-based graph abstraction
14+
sc.tl.paga(adata, groups="cell_type")
15+
sc.pl.paga(adata, threshold=0.03, show=False)
16+
plt.savefig(FIG_DIR / "06_paga_graph.png", dpi=150, bbox_inches="tight")
17+
plt.savefig(FIG_DIR / "06_paga_graph.pdf", bbox_inches="tight")
18+
plt.close()
19+
print("Computed PAGA graph")
20+
21+
# Reinitialise UMAP using PAGA as initialisation for cleaner layout
22+
sc.tl.umap(adata, init_pos="paga", random_state=42)
23+
24+
# Diffusion pseudotime — root in CD14+ monocytes (most primitive in PBMC)
25+
sc.tl.diffmap(adata)
26+
27+
# Find a root cell in CD14+ monocytes
28+
monocyte_mask = adata.obs["cell_type"] == "CD14+ Monocytes"
29+
if monocyte_mask.any():
30+
adata.uns["iroot"] = monocyte_mask.values.nonzero()[0][0]
31+
sc.tl.dpt(adata)
32+
print("Computed diffusion pseudotime (rooted in CD14+ monocytes)")
33+
else:
34+
print("Warning: No CD14+ monocytes found, skipping diffusion pseudotime")
35+
36+
return adata
37+
38+
39+
def plot_trajectory(adata):
40+
"""Plot PAGA-initialised UMAP and pseudotime."""
41+
FIG_DIR.mkdir(parents=True, exist_ok=True)
42+
43+
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
44+
45+
# PAGA-initialised UMAP coloured by cell type
46+
sc.pl.umap(adata, color="cell_type", legend_loc="right margin",
47+
frameon=False, ax=axes[0], show=False, title="Cell types (PAGA layout)")
48+
49+
# PAGA connectivity overlaid on UMAP
50+
sc.pl.paga(adata, pos=adata.uns["paga"]["pos"], threshold=0.03,
51+
node_size_scale=1.5, ax=axes[1], show=False,
52+
title="PAGA connectivity")
53+
54+
# Diffusion pseudotime
55+
if "dpt_pseudotime" in adata.obs:
56+
sc.pl.umap(adata, color="dpt_pseudotime", frameon=False,
57+
ax=axes[2], show=False, title="Diffusion pseudotime",
58+
color_map="viridis")
59+
else:
60+
axes[2].text(0.5, 0.5, "Pseudotime not computed",
61+
ha="center", va="center", transform=axes[2].transAxes)
62+
axes[2].set_axis_off()
63+
64+
fig.tight_layout()
65+
fig.savefig(FIG_DIR / "06_trajectory.png", dpi=150, bbox_inches="tight")
66+
fig.savefig(FIG_DIR / "06_trajectory.pdf", bbox_inches="tight")
67+
plt.close(fig)
68+
print(f"Saved trajectory plots to {FIG_DIR / '06_trajectory.png'}")
69+
70+
71+
def main():
72+
in_path = RESULTS_DIR / "05_annotated.h5ad"
73+
adata = sc.read_h5ad(in_path)
74+
print(f"Loaded {in_path}")
75+
76+
adata = compute_trajectory(adata)
77+
plot_trajectory(adata)
78+
79+
out_path = RESULTS_DIR / "06_trajectory.h5ad"
80+
adata.write(out_path)
81+
print(f"Saved trajectory data to {out_path}")
82+
return adata
83+
84+
85+
if __name__ == "__main__":
86+
main()

0 commit comments

Comments
 (0)