Add doublet detection, PAGA trajectory, T cell subclustering

Ekin-Kahraman · Ekin-Kahraman · commit 2c916dc322de · 2026-04-04T12:40:32.000+01:00
Pipeline expanded from 6 to 8 steps:
- 01: Scrublet doublet detection (36 detected, 34 removed after QC)
- 06: PAGA trajectory inference + diffusion pseudotime rooted in
  CD14+ monocytes. PAGA-initialised UMAP for cleaner layout.
- 07: T cell subclustering resolves CD4+ (47.5%) and CD8+ (52.5%)
  via marker scoring — addresses the main biological gap.
- 08: Publication figures now export PDF alongside PNG.

2,604 cells retained (was 2,638 before doublet removal). 5 clusters
at resolution 0.5, silhouette 0.204. All seeds set for reproducibility.
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![Python](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/)
 
-End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Quality control, normalisation, dimensionality reduction, unsupervised clustering with automated resolution selection, and marker-based cell type annotation on human peripheral blood mononuclear cells.
+End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Doublet detection, quality control, normalisation, dimensionality reduction, clustering with automated resolution selection, marker-based cell type annotation, PAGA trajectory inference, and T cell subclustering on human peripheral blood mononuclear cells.
 
 <p align="center">
   <img src="docs/umap_3d_rotation.gif" alt="3D UMAP rotation showing PBMC immune cell clusters" width="600">
@@ -28,63 +28,91 @@ End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https:
 ## Workflow
 
 ```
-PBMC 3k (10X Genomics)
+PBMC 3k (10X Genomics, 2,700 cells)
     │
     ▼
- 01 QC ──────────── Filter: 200 < genes < 2500, mito < 5%
-    │
+ 01 QC ──────────── Scrublet doublet detection → filter: 200 < genes < 2500, mito < 5%
+    │                (36 doublets detected, 34 removed)
     ▼
  02 Preprocess ──── Normalise (10k), log1p, 2000 HVGs, regress, scale
     │
     ▼
- 03 Reduce ──────── PCA (40 PCs) → kNN graph → UMAP
+ 03 Reduce ──────── PCA (40 PCs) → kNN graph → UMAP (random_state=42)
     │
     ▼
  04 Cluster ─────── Leiden at 5 resolutions → silhouette selection (≥5 clusters)
     │
     ▼
- 05 Annotate ────── Wilcoxon DE → score against PBMC marker signatures
+ 05 Annotate ────── Wilcoxon DE → score against PBMC marker signatures → 5 cell types
+    │
+    ▼
+ 06 Trajectory ──── PAGA graph abstraction → diffusion pseudotime (rooted in CD14+ mono)
     │
     ▼
- 06 Figures ─────── Multi-panel publication figure + 3D UMAP
+ 07 Subcluster ──── T cell compartment → resolve CD4+ (47.5%) and CD8+ (52.5%)
+    │
+    ▼
+ 08 Figures ─────── Multi-panel publication figure (PNG 300 DPI + PDF vector)
 ```
 
 ## Pipeline
 
 | Step | Script | What it does |
 |------|--------|--------------|
-| 01 | `01_load_and_qc.py` | Download PBMC 3k, calculate QC metrics (genes/cell, UMI counts, mitochondrial %), filter low-quality cells |
-| 02 | `02_preprocess.py` | Normalise to 10k counts/cell, log-transform, select 2,000 highly variable genes, regress out confounders, scale |
-| 03 | `03_reduce_dimensions.py` | PCA (40 components), build k-nearest neighbour graph, compute UMAP embedding |
-| 04 | `04_cluster.py` | Leiden clustering at 5 resolutions (0.3–1.2), evaluate with silhouette score, select best with a floor of 5 clusters |
-| 05 | `05_annotate_cell_types.py` | Wilcoxon rank-sum test for marker genes, score clusters against known PBMC signatures, assign cell types |
-| 06 | `06_publication_figures.py` | Multi-panel figure: UMAP, composition bar chart, marker heatmap, summary statistics |
+| 01 | `01_load_and_qc.py` | Download PBMC 3k, Scrublet doublet detection, QC metrics, filter low-quality cells and doublets |
+| 02 | `02_preprocess.py` | Normalise to 10k counts/cell, log-transform, select 2,000 HVGs, regress out confounders, scale |
+| 03 | `03_reduce_dimensions.py` | PCA (40 components), k-nearest neighbour graph, UMAP embedding |
+| 04 | `04_cluster.py` | Leiden clustering at 5 resolutions (0.3–1.2), silhouette evaluation, select best with ≥5 cluster floor |
+| 05 | `05_annotate_cell_types.py` | Wilcoxon rank-sum DE, score clusters against curated PBMC signatures, assign cell types |
+| 06 | `06_trajectory.py` | PAGA partition-based graph abstraction, PAGA-initialised UMAP, diffusion pseudotime |
+| 07 | `07_t_cell_subclustering.py` | Extract T cell compartment, subcluster, resolve CD4+/CD8+ via marker scoring |
+| 08 | `08_publication_figures.py` | Multi-panel figure with UMAP, composition, marker heatmap, summary (PNG + PDF) |
 
 All scripts are in `scripts/`. Each reads the previous step's `.h5ad` output from `results/`.
 
 ## Results
 
+### Cell Type Composition
+
 | Cell Type | Cells | % | Key Markers |
 |-----------|-------|---|-------------|
-| CD4+ T cells | 1,195 | 45.3 | CD3D, IL7R |
-| CD14+ Monocytes | 464 | 17.6 | CD14, LYZ |
-| NK cells | 419 | 15.9 | NKG7, GNLY |
-| B cells | 342 | 13.0 | MS4A1, CD79A |
-| FCGR3A+ Monocytes | 180 | 6.8 | FCGR3A, MS4A7 |
-| Dendritic cells | 38 | 1.4 | FCER1A, CST3 |
+| CD4+ T cells | 1,192 | 45.8 | CD3D, IL7R |
+| CD14+ Monocytes | 636 | 24.4 | CD14, LYZ |
+| NK cells | 410 | 15.7 | NKG7, GNLY |
+| B cells | 330 | 12.7 | MS4A1, CD79A |
+| Dendritic cells | 36 | 1.4 | FCER1A, CST3 |
+
+2,604 cells retained after QC and doublet removal (from 2,700 raw). Clustering selected resolution 0.5 (5 clusters, silhouette 0.204).
+
+### T Cell Subclustering
+
+Subclustering the T cell compartment (1,192 cells) resolves the CD4+/CD8+ boundary that is not visible at the global clustering level:
+
+| Subtype | Cells | % of T cells |
+|---------|-------|---|
+| CD8+ T | 626 | 52.5 |
+| CD4+ T | 566 | 47.5 |
+
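+The split is close to even, which is more CD8-rich than the roughly 2:1 CD4:CD8 ratio usually reported for healthy donors, so these proportions should be read as approximate. CD8+ T cells were assigned by scoring CD8A/CD8B/GZMK/GZMA against IL7R/CD4/TCF7/LEF1.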
+
+### Trajectory Inference
+
+PAGA connects CD14+ monocytes → dendritic cells (the myeloid differentiation axis) and reveals the T/NK cell cluster neighbourhood in UMAP space. Diffusion pseudotime, rooted in CD14+ monocytes, orders cells along the monocyte-to-DC trajectory.
+
+### Biological Interpretation
 
-The dominance of CD4+ T cells (45%) is expected in healthy donor PBMCs. The ratio of classical (CD14+) to nonclassical (FCGR3A+) monocytes is approximately 2.6:1, consistent with published literature. Dendritic cells are a rare population (1.4%), correctly resolved as a distinct cluster. CD8+ T cells and megakaryocytes are present in the dataset but were not resolved as separate clusters at resolution 0.5 — they likely merge with the CD4+ T cell and monocyte clusters respectively due to shared marker expression (CD3D/CD3E for T cell subtypes).
+The T cell cluster, labelled CD4+ T cells at the global level, is the largest population (46%), as expected in healthy donor PBMCs; subclustering shows that it contains both CD4+ and CD8+ cells. Dendritic cells are a rare population (1.4%), resolved as a distinct cluster despite the low cell count. The monocyte population is predominantly classical (CD14+); nonclassical (FCGR3A+) monocytes were not resolved as a separate cluster at resolution 0.5 and likely merge with the classical monocyte cluster. This is consistent with the resolution-sensitivity of FCGR3A+ monocyte separation observed in the literature.
 
-Clustering selected resolution 0.5 (6 clusters, silhouette 0.196). Silhouette scores in single-cell data are typically low due to continuous rather than discrete cell states; the metric is used here for relative comparison between resolutions, not as an absolute quality measure.
+Silhouette scores in single-cell data are typically low due to continuous rather than discrete cell states; the metric is used here for relative comparison between resolutions, not as an absolute quality measure.
 
 ## Quick Start
 
 ```bash
 git clone https://github.com/Ekin-Kahraman/single-cell-rnaseq-immune-profiling.git
 cd single-cell-rnaseq-immune-profiling
 pip install -e .
-python run_pipeline.py            # full pipeline (~17s)
-python run_pipeline.py --from 4   # resume from step 4
+python run_pipeline.py            # full pipeline (~38s)
+python run_pipeline.py --from 6   # resume from trajectory step
 ```
 
 ## Testing
@@ -98,17 +126,21 @@ pytest -v
 
 ## Design Decisions
 
-- **Automated annotation** — Clusters are scored against curated PBMC marker gene sets rather than annotated by manual inspection. This makes the pipeline reproducible and removes subjective judgement.
-- **Multi-resolution clustering** — Running Leiden at multiple resolutions and picking by silhouette score (with a biological floor) avoids the common problem of choosing an arbitrary resolution.
+- **Doublet detection** — Scrublet integrated before QC filtering. 36 doublets detected (1.3%), 34 removed after other QC filters. Following [Luecken & Theis (2019)](https://doi.org/10.15252/msb.20188746) best practices.
+- **Automated annotation** — Clusters scored against curated PBMC marker gene sets rather than manual inspection. Reproducible and removes subjective judgement.
+- **Multi-resolution clustering** — Leiden at 5 resolutions with silhouette evaluation and a biological floor of ≥5 clusters.
+- **Trajectory inference** — PAGA provides a principled graph abstraction of cell-type connectivity. Diffusion pseudotime orders cells along differentiation axes.
+- **T cell subclustering** — Resolves CD4+/CD8+ populations that share CD3D/CD3E expression and cannot be separated at global clustering resolution.
 - **Colourblind-friendly palette** — Okabe-Ito colours throughout.
-- **Modular scripts** — Each step is independent. Re-run any step without repeating upstream work.
+- **Reproducible seeds** — `random_state=42` for UMAP, Leiden, Scrublet, and silhouette sampling.
+- **Dual-format figures** — PNG (300 DPI) for web, PDF (vector) for publication submission.
 
-## Limitations and Future Work
+## Limitations
 
-- **No doublet detection.** Scrublet or similar should precede QC in a production pipeline. Omitted here because PBMC 3k is a clean benchmark with negligible doublet rates.
-- **No batch correction.** Single-sample dataset. Multi-sample analyses would require Harmony, scVI, or BBKNN.
-- **`regress_out` is debatable.** Used here following the original scanpy tutorial, but Luecken & Theis (2019) suggest regression may overcorrect for well-filtered cells. Included for pedagogical alignment with the standard workflow.
-- **CD8+ T cells not resolved.** Would require higher clustering resolution or subclustering of the T cell compartment.
+- **Single-sample dataset.** Multi-sample analyses would require batch correction (Harmony, scVI, or BBKNN).
+- **`regress_out` is debatable.** Used here following the original scanpy tutorial, but Luecken & Theis (2019) suggest regression may overcorrect for well-filtered cells.
+- **No pathway enrichment.** Gene set enrichment (via decoupler or GSEApy) would connect cell types to functional programmes. Planned as a future addition.
+- **FCGR3A+ monocytes not resolved.** At resolution 0.5, nonclassical monocytes merge with the CD14+ cluster. Higher resolution or targeted subclustering would separate them.
 
 ## Licence
 
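Reviewer note on the README's multi-resolution clustering rule (Leiden at 5 resolutions, silhouette evaluation, floor of ≥5 clusters). A minimal sketch of that selection logic, not the code in `04_cluster.py`: the `Candidate` type, the `select_resolution` helper, and every score except the 0.5 entry (5 clusters, silhouette 0.204) are hypothetical and exist only for illustration.

```python
from typing import Mapping, NamedTuple, Optional


class Candidate(NamedTuple):
    n_clusters: int
    silhouette: float


def select_resolution(candidates: Mapping[float, Candidate],
                      min_clusters: int = 5) -> Optional[float]:
    """Return the resolution with the highest silhouette among those
    that meet the cluster floor, or None if none qualifies."""
    eligible = {res: c for res, c in candidates.items()
                if c.n_clusters >= min_clusters}
    if not eligible:
        return None
    return max(eligible, key=lambda res: eligible[res].silhouette)


# Hypothetical scores; only the 0.5 entry is taken from the README.
scores = {
    0.3: Candidate(n_clusters=4, silhouette=0.230),  # best score, but below the floor
    0.5: Candidate(n_clusters=5, silhouette=0.204),
    0.8: Candidate(n_clusters=7, silhouette=0.150),
    1.0: Candidate(n_clusters=8, silhouette=0.120),
    1.2: Candidate(n_clusters=9, silhouette=0.100),
}

print(select_resolution(scores))  # 0.5
```

The floor matters because a coarse resolution can score highest on silhouette while merging populations that are biologically distinct; here the hypothetical 0.3 entry is excluded for that reason.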
diff --git a/run_pipeline.py b/run_pipeline.py
@@ -20,9 +20,13 @@
     ("03 Reduce dimensions", "03_reduce_dimensions.py"),
     ("04 Cluster", "04_cluster.py"),
     ("05 Annotate cell types", "05_annotate_cell_types.py"),
-    ("06 Publication figures", "06_publication_figures.py"),
+    ("06 Trajectory inference", "06_trajectory.py"),
+    ("07 T cell subclustering", "07_t_cell_subclustering.py"),
+    ("08 Publication figures", "08_publication_figures.py"),
 ]
 
+N_STEPS = len(STEPS)
+
 
 def run_pipeline(start_from=1):
     print("=" * 60)
@@ -32,10 +36,10 @@ def run_pipeline(start_from=1):
     total_start = time.time()
     for i, (name, script) in enumerate(STEPS, 1):
         if i < start_from:
-            print(f"\n[{i}/6] {name} -- SKIPPED")
+            print(f"\n[{i}/{N_STEPS}] {name} -- SKIPPED")
             continue
         print(f"\n{'─' * 60}")
-        print(f"[{i}/6] {name}")
+        print(f"[{i}/{N_STEPS}] {name}")
         print("─" * 60)
         step_start = time.time()
         result = subprocess.run(
@@ -57,7 +61,7 @@ def run_pipeline(start_from=1):
 def main():
     parser = argparse.ArgumentParser(description="Run single-cell RNA-seq pipeline")
     parser.add_argument("--from", dest="start_from", type=int, default=1,
-                        help="Step number to start from (1-6)")
+                        help=f"Step number to start from (1-{N_STEPS})")
     args = parser.parse_args()
     run_pipeline(start_from=args.start_from)
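Reviewer note: the help text now tracks `N_STEPS`, but the loop shown above would skip every step without an error if `--from` were given a value above `N_STEPS`. A minimal sketch of range validation with an argparse `type` function follows; `step_number` and `build_parser` are hypothetical names, and `N_STEPS` is hard-coded to 8 here only to keep the example self-contained.

```python
import argparse

N_STEPS = 8  # stands in for len(STEPS) after this commit


def step_number(value: str) -> int:
    """argparse type function: accept only integers in 1..N_STEPS."""
    try:
        step = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if not 1 <= step <= N_STEPS:
        raise argparse.ArgumentTypeError(
            f"step must be between 1 and {N_STEPS}, got {step}")
    return step


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run single-cell RNA-seq pipeline")
    parser.add_argument("--from", dest="start_from", type=step_number, default=1,
                        help=f"Step number to start from (1-{N_STEPS})")
    return parser


args = build_parser().parse_args(["--from", "6"])
print(args.start_from)  # 6
```

With this in place, an out-of-range value makes argparse print a usage error and exit instead of running nothing.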
 
diff --git a/scripts/01_load_and_qc.py b/scripts/01_load_and_qc.py
@@ -84,13 +84,22 @@ def run_qc(adata):
     plt.close(fig)
     print(f"Saved QC plot to {FIG_DIR / '01_qc_metrics.png'}")
 
+    # --- Doublet detection ---
+    sc.pp.scrublet(adata, random_state=42)
+    n_doublets = adata.obs["predicted_doublet"].sum()
+    print(f"Scrublet detected {n_doublets} predicted doublets ({n_doublets/adata.n_obs*100:.1f}%)")
+
     # --- Filtering ---
     n_before = adata.n_obs
     sc.pp.filter_cells(adata, min_genes=MIN_GENES_PER_CELL)
     sc.pp.filter_genes(adata, min_cells=MIN_CELLS_PER_GENE)
     adata = adata[adata.obs["n_genes_by_counts"] < MAX_GENES_PER_CELL, :].copy()
     adata = adata[adata.obs["pct_counts_mt"] < MAX_PCT_MITO, :].copy()
+    # Remove predicted doublets
+    n_doublets_remaining = adata.obs["predicted_doublet"].sum()
+    adata = adata[~adata.obs["predicted_doublet"], :].copy()
     n_after = adata.n_obs
+    print(f"Removed {n_doublets_remaining} doublets from filtered cells")
 
     print(f"Filtered: {n_before} -> {n_after} cells ({n_before - n_after} removed)")
     print(f"Genes remaining: {adata.n_vars}")
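Reviewer note on why the counts differ (36 detected, 34 removed): Scrublet runs on all cells, but the removal count is taken after the gene and mitochondrial filters, so doublets that already failed QC are not counted again. A toy sketch of that accounting in plain Python; the `Cell` class, `passes_qc`, `doublet_accounting`, and the five example cells are hypothetical, and the thresholds are the ones quoted in the README workflow.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    n_genes: int
    pct_mito: float
    predicted_doublet: bool


def passes_qc(cell: Cell, min_genes: int = 200, max_genes: int = 2500,
              max_pct_mito: float = 5.0) -> bool:
    """Gene-count and mitochondrial filters, applied before doublet removal."""
    return min_genes <= cell.n_genes < max_genes and cell.pct_mito < max_pct_mito


def doublet_accounting(cells):
    """Return (doublets detected, doublets removed after QC, cells retained)."""
    detected = sum(c.predicted_doublet for c in cells)
    after_qc = [c for c in cells if passes_qc(c)]
    removed = sum(c.predicted_doublet for c in after_qc)
    retained = sum(not c.predicted_doublet for c in after_qc)
    return detected, removed, retained


cells = [
    Cell(900, 2.0, False),
    Cell(1200, 3.1, False),
    Cell(2400, 1.5, True),   # passes QC, so the doublet filter removes it
    Cell(3100, 2.2, True),   # too many genes: already dropped by QC
    Cell(150, 1.0, False),   # too few genes
]

print(doublet_accounting(cells))  # (2, 1, 2)
```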
diff --git a/scripts/06_trajectory.py b/scripts/06_trajectory.py
@@ -0,0 +1,87 @@
+"""Step 06: PAGA trajectory inference and diffusion pseudotime."""
+
+import scanpy as sc
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+RESULTS_DIR = Path("results")
+FIG_DIR = RESULTS_DIR / "figures"
+
+
+def compute_trajectory(adata):
+    """Run PAGA and diffusion pseudotime on annotated data."""
+    FIG_DIR.mkdir(parents=True, exist_ok=True)  # needed before the PAGA plot is saved below
+    # PAGA — partition-based graph abstraction
+    sc.tl.paga(adata, groups="cell_type")
+    sc.pl.paga(adata, threshold=0.03, show=False)
+    plt.savefig(FIG_DIR / "06_paga_graph.png", dpi=150, bbox_inches="tight")
+    plt.savefig(FIG_DIR / "06_paga_graph.pdf", bbox_inches="tight")
+    plt.close()
+    print("Computed PAGA graph")
+
+    # Recompute UMAP initialised from the PAGA node positions (stored by sc.pl.paga above) for a cleaner layout
+    sc.tl.umap(adata, init_pos="paga", random_state=42)
+
+    # Diffusion pseudotime, rooted in CD14+ monocytes (start of the monocyte-to-DC axis)
+    sc.tl.diffmap(adata)
+
+    # Use the first CD14+ monocyte (by index) as the root cell
+    monocyte_mask = adata.obs["cell_type"] == "CD14+ Monocytes"
+    if monocyte_mask.any():
+        adata.uns["iroot"] = monocyte_mask.values.nonzero()[0][0]
+        sc.tl.dpt(adata)
+        print("Computed diffusion pseudotime (rooted in CD14+ monocytes)")
+    else:
+        print("Warning: No CD14+ monocytes found, skipping diffusion pseudotime")
+
+    return adata
+
+
+def plot_trajectory(adata):
+    """Plot PAGA-initialised UMAP and pseudotime."""
+    FIG_DIR.mkdir(parents=True, exist_ok=True)
+
+    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
+
+    # PAGA-initialised UMAP coloured by cell type
+    sc.pl.umap(adata, color="cell_type", legend_loc="right margin",
+               frameon=False, ax=axes[0], show=False, title="Cell types (PAGA layout)")
+
+    # PAGA graph drawn at the node positions stored by the earlier sc.pl.paga call (not UMAP coordinates)
+    sc.pl.paga(adata, pos=adata.uns["paga"]["pos"], threshold=0.03,
+               node_size_scale=1.5, ax=axes[1], show=False,
+               title="PAGA connectivity")
+
+    # Diffusion pseudotime
+    if "dpt_pseudotime" in adata.obs:
+        sc.pl.umap(adata, color="dpt_pseudotime", frameon=False,
+                   ax=axes[2], show=False, title="Diffusion pseudotime",
+                   color_map="viridis")
+    else:
+        axes[2].text(0.5, 0.5, "Pseudotime not computed",
+                     ha="center", va="center", transform=axes[2].transAxes)
+        axes[2].set_axis_off()
+
+    fig.tight_layout()
+    fig.savefig(FIG_DIR / "06_trajectory.png", dpi=150, bbox_inches="tight")
+    fig.savefig(FIG_DIR / "06_trajectory.pdf", bbox_inches="tight")
+    plt.close(fig)
+    print(f"Saved trajectory plots to {FIG_DIR / '06_trajectory.png'}")
+
+
+def main():
+    in_path = RESULTS_DIR / "05_annotated.h5ad"
+    adata = sc.read_h5ad(in_path)
+    print(f"Loaded {in_path}")
+
+    adata = compute_trajectory(adata)
+    plot_trajectory(adata)
+
+    out_path = RESULTS_DIR / "06_trajectory.h5ad"
+    adata.write(out_path)
+    print(f"Saved trajectory data to {out_path}")
+    return adata
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/07_t_cell_subclustering.py b/scripts/07_t_cell_subclustering.py
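Reviewer note: the body of this script is not shown in the diff. The README describes the CD4+/CD8+ assignment as scoring CD8A/CD8B/GZMK/GZMA against IL7R/CD4/TCF7/LEF1, so the following is only a sketch of that idea on toy data, not the script's implementation. The helpers `marker_score` and `classify_t_cell`, the mean-based score, and the expression values are all assumptions; a scanpy-based version would more likely build on `sc.tl.score_genes`.

```python
from statistics import mean

CD8_MARKERS = ("CD8A", "CD8B", "GZMK", "GZMA")
CD4_MARKERS = ("IL7R", "CD4", "TCF7", "LEF1")


def marker_score(expr: dict, markers: tuple) -> float:
    """Mean expression over the markers present in the profile (0.0 if none)."""
    values = [expr[g] for g in markers if g in expr]
    return mean(values) if values else 0.0


def classify_t_cell(expr: dict) -> str:
    """Label a cell CD8+ T if its CD8 score exceeds its CD4 score, else CD4+ T."""
    cd8 = marker_score(expr, CD8_MARKERS)
    cd4 = marker_score(expr, CD4_MARKERS)
    return "CD8+ T" if cd8 > cd4 else "CD4+ T"


# Hypothetical log-normalised expression values for two cells.
cells = {
    "cell_a": {"CD8A": 2.1, "CD8B": 1.8, "GZMK": 1.2, "GZMA": 0.9,
               "IL7R": 0.3, "CD4": 0.0, "TCF7": 0.4, "LEF1": 0.1},
    "cell_b": {"CD8A": 0.0, "CD8B": 0.1, "GZMK": 0.2, "GZMA": 0.0,
               "IL7R": 1.9, "CD4": 0.8, "TCF7": 1.4, "LEF1": 1.1},
}

labels = {name: classify_t_cell(expr) for name, expr in cells.items()}
print(labels)  # {'cell_a': 'CD8+ T', 'cell_b': 'CD4+ T'}
```

Because GZMK and GZMA are cytotoxic markers rather than CD8-specific ones, a score of this kind can pull activated CD4+ cells towards the CD8+ label, so the resulting split is sensitive to the marker choice.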
diff --git a/scripts/08_publication_figures.py b/scripts/08_publication_figures.py