Add doublet detection, PAGA trajectory, T cell subclustering
Pipeline expanded from 6 to 8 steps:
- 01: Scrublet doublet detection (36 detected, 34 removed after QC)
- 06: PAGA trajectory inference + diffusion pseudotime rooted in
CD14+ monocytes. PAGA-initialised UMAP for cleaner layout.
- 07: T cell subclustering resolves CD4+ (47.5%) and CD8+ (52.5%)
via marker scoring — addresses the main biological gap.
- 08: Publication figures now export PDF alongside PNG.
2,604 cells retained (was 2,638 before doublet removal). 5 clusters
at resolution 0.5, silhouette 0.204. All seeds set for reproducibility.
-End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Quality control, normalisation, dimensionality reduction, unsupervised clustering with automated resolution selection, and marker-based cell type annotation on human peripheral blood mononuclear cells.
+End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Doublet detection, quality control, normalisation, dimensionality reduction, clustering with automated resolution selection, marker-based cell type annotation, PAGA trajectory inference, and T cell subclustering on human peripheral blood mononuclear cells.
 All scripts are in `scripts/`. Each reads the previous step's `.h5ad` output from `results/`.

 ## Results

+### Cell Type Composition
+
 | Cell Type | Cells | % | Key Markers |
 |-----------|-------|---|-------------|
-| CD4+ T cells | 1,195 | 45.3 | CD3D, IL7R |
-| CD14+ Monocytes | 464 | 17.6 | CD14, LYZ |
-| NK cells | 419 | 15.9 | NKG7, GNLY |
-| B cells | 342 | 13.0 | MS4A1, CD79A |
-| FCGR3A+ Monocytes | 180 | 6.8 | FCGR3A, MS4A7 |
-| Dendritic cells | 38 | 1.4 | FCER1A, CST3 |
+| CD4+ T cells | 1,192 | 45.8 | CD3D, IL7R |
+| CD14+ Monocytes | 636 | 24.4 | CD14, LYZ |
+| NK cells | 410 | 15.7 | NKG7, GNLY |
+| B cells | 330 | 12.7 | MS4A1, CD79A |
+| Dendritic cells | 36 | 1.4 | FCER1A, CST3 |
+
+2,604 cells retained after QC and doublet removal (from 2,700 raw). Clustering selected resolution 0.5 (5 clusters, silhouette 0.204).
+
+### T Cell Subclustering
+
+Subclustering the T cell compartment (1,192 cells) resolves the CD4+/CD8+ boundary that is not visible at the global clustering level:
+
+| Subtype | Cells | % of T cells |
+|---------|-------|--------------|
+| CD8+ T | 626 | 52.5 |
+| CD4+ T | 566 | 47.5 |
+
+The near-equal split is consistent with healthy donor PBMCs. CD8+ T cells were assigned by scoring CD8A/CD8B/GZMK/GZMA against IL7R/CD4/TCF7/LEF1.
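
The diff states which marker sets are scored against each other but does not show the scoring function itself. As a rough illustration only, the sketch below assumes the score is the mean expression of the CD8 set minus the mean expression of the CD4 set, with a threshold at zero. The function names, the zero threshold, and the toy expression values are all hypothetical; the actual pipeline may instead use something like scanpy's `sc.tl.score_genes`.

```python
from statistics import mean

# Marker sets as listed in the README text.
CD8_MARKERS = ("CD8A", "CD8B", "GZMK", "GZMA")
CD4_MARKERS = ("IL7R", "CD4", "TCF7", "LEF1")


def marker_score(cell, markers):
    """Mean expression over a marker set; genes absent from the cell count as 0."""
    return mean(cell.get(gene, 0.0) for gene in markers)


def assign_subtype(cell):
    """Label a T cell by comparing CD8 and CD4 marker scores (assumed rule)."""
    diff = marker_score(cell, CD8_MARKERS) - marker_score(cell, CD4_MARKERS)
    return "CD8+ T" if diff > 0 else "CD4+ T"


# Toy, made-up expression values purely for illustration.
toy_cells = {
    "cell_a": {"CD8A": 2.1, "CD8B": 1.8, "GZMK": 1.2, "IL7R": 0.3},
    "cell_b": {"IL7R": 2.4, "CD4": 0.9, "TCF7": 1.5, "LEF1": 1.1, "GZMA": 0.2},
}
labels = {name: assign_subtype(cell) for name, cell in toy_cells.items()}
```

A difference of means is the simplest way to score one set "against" another; per-cluster rather than per-cell assignment would follow the same pattern with cluster-averaged expression.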
+
+### Trajectory Inference
+
+PAGA connects CD14+ monocytes → dendritic cells (the myeloid differentiation axis) and reveals the T/NK cell cluster neighbourhood in UMAP space. Diffusion pseudotime, rooted in CD14+ monocytes, orders cells along the monocyte-to-DC trajectory.
+
+### Biological Interpretation

-The dominance of CD4+ T cells (45%) is expected in healthy donor PBMCs. The ratio of classical (CD14+) to nonclassical (FCGR3A+) monocytes is approximately 2.6:1, consistent with published literature. Dendritic cells are a rare population (1.4%), correctly resolved as a distinct cluster. CD8+ T cells and megakaryocytes are present in the dataset but were not resolved as separate clusters at resolution 0.5; they likely merge with the CD4+ T cell and monocyte clusters respectively due to shared marker expression (CD3D/CD3E for T cell subtypes).
+The dominance of CD4+ T cells (46%) is expected in healthy donor PBMCs. Dendritic cells are a rare population (1.4%), correctly resolved as a distinct cluster despite low cell count. The monocyte population is predominantly classical (CD14+); nonclassical (FCGR3A+) monocytes were not resolved as a separate cluster at resolution 0.5 and likely merge with the classical monocyte cluster. This is consistent with the resolution-sensitivity of FCGR3A+ monocyte separation observed in the literature.

-Clustering selected resolution 0.5 (6 clusters, silhouette 0.196). Silhouette scores in single-cell data are typically low due to continuous rather than discrete cell states; the metric is used here for relative comparison between resolutions, not as an absolute quality measure.
+Silhouette scores in single-cell data are typically low due to continuous rather than discrete cell states; the metric is used here for relative comparison between resolutions, not as an absolute quality measure.
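
The README describes choosing a Leiden resolution by silhouette score subject to a minimum cluster count, but the selection code is not part of this diff. A minimal sketch of that rule, assuming the per-resolution results are already available as `(n_clusters, silhouette)` pairs, might look like the following. The function name is hypothetical, and every value except the resolution 0.5 entry (5 clusters, silhouette 0.204) is invented for illustration.

```python
def select_resolution(results, min_clusters=5):
    """Pick the resolution with the highest silhouette among those that
    yield at least `min_clusters` clusters (the "biological floor").

    results: dict mapping resolution -> (n_clusters, silhouette)
    """
    eligible = {
        res: vals for res, vals in results.items() if vals[0] >= min_clusters
    }
    if not eligible:
        raise ValueError("no resolution meets the minimum cluster count")
    # Compare silhouettes only relative to one another, as the README notes.
    return max(eligible, key=lambda res: eligible[res][1])


# Only the 0.5 entry comes from the README; the rest are made-up placeholders.
example_results = {
    0.3: (4, 0.231),   # hypothetical: best silhouette, but below the floor
    0.5: (5, 0.204),   # reported in the README
    0.8: (7, 0.171),   # hypothetical
    1.0: (8, 0.150),   # hypothetical
    1.2: (9, 0.132),   # hypothetical
}
chosen = select_resolution(example_results)
```

The floor matters in exactly the situation shown by the hypothetical 0.3 entry: a coarse clustering can score a higher silhouette while merging biologically distinct populations.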
 python run_pipeline.py --from 4  # resume from step 4
+python run_pipeline.py  # full pipeline (~38s)
+python run_pipeline.py --from 6  # resume from trajectory step
```
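
The internals of `run_pipeline.py` are not shown in this diff. As a hedged sketch of how a `--from N` resume flag could be handled, the example below uses `argparse` and simply returns the step numbers to execute. The function name `steps_to_run` and the constant `N_STEPS` are hypothetical; the step count of 8 is taken from the commit message.

```python
import argparse

N_STEPS = 8  # the commit message says the pipeline now has 8 steps


def steps_to_run(argv=None):
    """Return the list of step numbers to execute, honouring --from."""
    parser = argparse.ArgumentParser(description="Run the scRNA-seq pipeline")
    parser.add_argument(
        "--from",
        dest="start",  # "from" is a Python keyword, so store it under another name
        type=int,
        default=1,
        choices=range(1, N_STEPS + 1),
        metavar="STEP",
        help="first step to run; earlier steps are skipped",
    )
    args = parser.parse_args(argv)
    return list(range(args.start, N_STEPS + 1))


full_run = steps_to_run([])
resume_run = steps_to_run(["--from", "6"])
```

Resuming this way relies on each step reading the previous step's `.h5ad` output from `results/`, so skipped steps must already have been run at least once.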

 ## Testing
@@ -98,17 +126,21 @@ pytest -v

 ## Design Decisions

-- **Automated annotation**: Clusters are scored against curated PBMC marker gene sets rather than annotated by manual inspection. This makes the pipeline reproducible and removes subjective judgement.
-- **Multi-resolution clustering**: Running Leiden at multiple resolutions and picking by silhouette score (with a biological floor) avoids the common problem of choosing an arbitrary resolution.
+- **Doublet detection**: Scrublet integrated before QC filtering. 36 doublets detected (1.3%), 34 removed after other QC filters. Following [Luecken & Theis (2019)](https://doi.org/10.15252/msb.20188746) best practices.
+- **Automated annotation**: Clusters scored against curated PBMC marker gene sets rather than manual inspection. Reproducible and removes subjective judgement.
+- **Multi-resolution clustering**: Leiden at 5 resolutions with silhouette evaluation and a biological floor of ≥5 clusters.
+- **Trajectory inference**: PAGA provides a principled graph abstraction of cell-type connectivity. Diffusion pseudotime orders cells along differentiation axes.
+- **T cell subclustering**: Resolves CD4+/CD8+ populations that share CD3D/CD3E expression and cannot be separated at global clustering resolution.
 - **Modular scripts**: Each step is independent. Re-run any step without repeating upstream work.
+- **Reproducible seeds**: `random_state=42` for UMAP, Leiden, Scrublet, and silhouette sampling.
+- **Dual-format figures**: PNG (300 DPI) for web, PDF (vector) for publication submission.

-## Limitations and Future Work
+## Limitations

-- **No doublet detection.** Scrublet or similar should precede QC in a production pipeline. Omitted here because PBMC 3k is a clean benchmark with negligible doublet rates.
-- **No batch correction.** Single-sample dataset. Multi-sample analyses would require Harmony, scVI, or BBKNN.
-- **`regress_out` is debatable.** Used here following the original scanpy tutorial, but Luecken & Theis (2019) suggest regression may overcorrect for well-filtered cells. Included for pedagogical alignment with the standard workflow.
-- **CD8+ T cells not resolved.** Would require higher clustering resolution or subclustering of the T cell compartment.
+- **Single-sample dataset.** Multi-sample analyses would require batch correction (Harmony, scVI, or BBKNN).
+- **`regress_out` is debatable.** Used here following the original scanpy tutorial, but Luecken & Theis (2019) suggest regression may overcorrect for well-filtered cells.
+- **No pathway enrichment.** Gene set enrichment (via decoupler or GSEApy) would connect cell types to functional programmes. Planned as a future addition.
+- **FCGR3A+ monocytes not resolved.** At resolution 0.5, nonclassical monocytes merge with the CD14+ cluster. Higher resolution or targeted subclustering would separate them.