Commit 8fda738

Tighten README signal
1 parent 3930f7a commit 8fda738

1 file changed

Lines changed: 13 additions & 13 deletions

README.md

@@ -8,7 +8,7 @@ Single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.re

 Retains 2,604 PBMCs, resolves 5 major immune populations plus CD4/CD8 T-cell subclusters, with CI-tested reproducibility.

-## Production Readiness
+## Engineering Evidence

 - CI runs unit tests across Python 3.10, 3.11, and 3.12.
 - CI also runs the complete PBMC pipeline on Python 3.12 and validates generated `.h5ad`, CSV, PNG, PDF, and manifest artefacts.
@@ -24,12 +24,12 @@ Retains 2,604 PBMCs, resolves 5 major immune populations plus CD4/CD8 T-cell sub

 ![Publication Figure](docs/publication_figure.png)

-**Panel A** UMAP coloured by unsupervised Leiden clusters. **Panel B** Same embedding coloured by assigned cell type. **Panel C** Cell type proportions. **Panel D** Z-scored expression of canonical marker genes per cell type (red = high, blue = low). **Panel E** Summary statistics.
+**Panel A** - UMAP coloured by unsupervised Leiden clusters. **Panel B** - Same embedding coloured by assigned cell type. **Panel C** - Cell type proportions. **Panel D** - Z-scored expression of canonical marker genes per cell type (red = high, blue = low). **Panel E** - Summary statistics.
 </details>

 ## Dataset

-**10X Genomics PBMC 3k** 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced on the Chromium platform. This is the standard benchmark dataset used across [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html), [Seurat](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html), and other single-cell frameworks.
+**10X Genomics PBMC 3k** - 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced on the Chromium platform. This is the standard benchmark dataset used across [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html), [Seurat](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html), and other single-cell frameworks.

 - **Direct download**: [filtered gene-barcode matrices](https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) (5.9 MB)
 - **Reference**: Zheng et al. (2017) [Massively parallel digital transcriptional profiling of single cells](https://doi.org/10.1038/ncomms14049). *Nature Communications* 8, 14049.
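
The Panel D caption in the hunk above refers to z-scored marker expression per cell type. As a minimal sketch of what that transformation usually means, the snippet below standardises each gene across cell types. The function name `zscore_per_gene`, the input layout, and the use of the population standard deviation are assumptions made for illustration; the README does not say how the pipeline computes its z-scores.

```python
from statistics import mean, pstdev
from typing import Dict


def zscore_per_gene(expr: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Standardise each gene across cell types.

    expr maps gene -> {cell_type: mean expression}. This layout is hypothetical
    and chosen only to keep the example self-contained.
    """
    scaled = {}
    for gene, by_type in expr.items():
        values = list(by_type.values())
        mu = mean(values)
        sigma = pstdev(values)  # population std; an assumption, not confirmed by the README
        scaled[gene] = {
            # A gene with no variation across cell types gets 0.0 instead of dividing by zero.
            cell_type: 0.0 if sigma == 0 else (value - mu) / sigma
            for cell_type, value in by_type.items()
        }
    return scaled


# Illustrative numbers only, not values from the PBMC 3k analysis.
example = {"CD3D": {"T cell": 3.0, "B cell": 1.0}}
z = zscore_per_gene(example)
```

Because each gene is scaled independently, a heatmap of these values highlights which cell type expresses a marker most strongly relative to the others, regardless of the gene's absolute expression level.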
@@ -141,7 +141,7 @@ PAGA connects CD14+ monocytes → dendritic cells (the myeloid differentiation a

 ### Biological Interpretation

-The dominance of CD4+ T cells (46%) is expected in healthy donor PBMCs. Dendritic cells are a rare population (1.4%), correctly resolved as a distinct cluster despite low cell count. The monocyte population is predominantly classical (CD14+); nonclassical (FCGR3A+) monocytes were not resolved as a separate cluster at resolution 0.5 they likely merge with the classical monocyte cluster. This is consistent with the resolution-sensitivity of FCGR3A+ monocyte separation observed in the literature.
+The dominance of CD4+ T cells (46%) is expected in healthy donor PBMCs. Dendritic cells are a rare population (1.4%), correctly resolved as a distinct cluster despite low cell count. The monocyte population is predominantly classical (CD14+); nonclassical (FCGR3A+) monocytes were not resolved as a separate cluster at resolution 0.5 - they likely merge with the classical monocyte cluster. This is consistent with the resolution-sensitivity of FCGR3A+ monocyte separation observed in the literature.

 Silhouette scores in single-cell data are typically low due to continuous rather than discrete cell states; the metric is used here for relative comparison between resolutions, not as an absolute quality measure.

@@ -169,22 +169,22 @@ pytest -v

 ## Design Decisions

-- **Doublet detection** Scrublet integrated before QC filtering. 36 doublets detected (1.3%), 34 removed after other QC filters. Recommended by [Luecken & Theis (2019)](https://doi.org/10.15252/msb.20188746).
-- **Automated annotation** Clusters scored against curated PBMC marker gene sets rather than manual inspection. The marker sets are themselves a subjective choice but encoding them explicitly makes the annotation reproducible and auditable.
-- **Multi-resolution clustering** Leiden at 5 resolutions with silhouette evaluation. The ≥5 cluster floor reflects the known minimum of major PBMC lineages (T cells, B cells, monocytes, NK, DCs).
-- **Trajectory inference** PAGA provides a principled graph abstraction of cell-type connectivity. Diffusion pseudotime rooted in CD14+ monocytes because they are the most primitive myeloid progenitor in PBMCs the expected starting point of the monocyte-to-DC differentiation axis.
-- **T cell subclustering** Resolves CD4+/CD8+ populations that share CD3D/CD3E expression and cannot be separated at global clustering resolution.
-- **Colourblind-friendly palette** Okabe-Ito colours throughout.
-- **Reproducible seeds** `random_state=42` for UMAP, Leiden, Scrublet, and silhouette sampling.
-- **Dual-format figures** PNG (300 DPI) for web, PDF (vector) for publication submission.
+- **Doublet detection** - Scrublet integrated before QC filtering. 36 doublets detected (1.3%), 34 removed after other QC filters. Recommended by [Luecken & Theis (2019)](https://doi.org/10.15252/msb.20188746).
+- **Automated annotation** - Clusters scored against curated PBMC marker gene sets rather than manual inspection. The marker sets are themselves a subjective choice - but encoding them explicitly makes the annotation reproducible and auditable.
+- **Multi-resolution clustering** - Leiden at 5 resolutions with silhouette evaluation. The ≥5 cluster floor reflects the known minimum of major PBMC lineages (T cells, B cells, monocytes, NK, DCs).
+- **Trajectory inference** - PAGA provides a principled graph abstraction of cell-type connectivity. Diffusion pseudotime rooted in CD14+ monocytes because they are the most primitive myeloid progenitor in PBMCs - the expected starting point of the monocyte-to-DC differentiation axis.
+- **T cell subclustering** - Resolves CD4+/CD8+ populations that share CD3D/CD3E expression and cannot be separated at global clustering resolution.
+- **Colourblind-friendly palette** - Okabe-Ito colours throughout.
+- **Reproducible seeds** - `random_state=42` for UMAP, Leiden, Scrublet, and silhouette sampling.
+- **Dual-format figures** - PNG (300 DPI) for web, PDF (vector) for publication submission.

 ## Limitations

 - **Single-sample dataset.** Multi-sample analyses would require batch correction (Harmony, scVI, or BBKNN).
 - **`regress_out` is debatable.** Used here following the original scanpy tutorial, but Luecken & Theis (2019) suggest regression may overcorrect for well-filtered cells.
 - **No pathway enrichment.** Gene set enrichment (via decoupler or GSEApy) would connect cell types to functional programmes. Planned as a future addition.
 - **FCGR3A+ monocytes not resolved.** At resolution 0.5, nonclassical monocytes merge with the CD14+ cluster. Higher resolution or targeted subclustering would separate them.
-- **Megakaryocytes not resolved.** The PBMC 3k dataset contains a small platelet/megakaryocyte population (PPBP+, PF4+) that merges with other clusters at this resolution. The canonical scanpy tutorial resolves 8 cell types from this dataset; our pipeline resolves 5 at the global level + 2 via T cell subclustering. The difference is resolution choice we optimise for silhouette score rather than maximising cluster count.
+- **Megakaryocytes not resolved.** The PBMC 3k dataset contains a small platelet/megakaryocyte population (PPBP+, PF4+) that merges with other clusters at this resolution. The canonical scanpy tutorial resolves 8 cell types from this dataset; our pipeline resolves 5 at the global level + 2 via T cell subclustering. The difference is resolution choice - we optimise for silhouette score rather than maximising cluster count.

 ## Licence
