|
1 | 1 | # Single-Cell RNA-seq Immune Cell Profiling |
2 | 2 |
|
3 | | -End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Demonstrates quality control, normalisation, dimensionality reduction, clustering with automated resolution selection, and marker-based cell type annotation on human PBMC data. |
| 3 | +End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Quality control, normalisation, dimensionality reduction, unsupervised clustering with automated resolution selection, and marker-based cell type annotation on human peripheral blood mononuclear cells. |
4 | 4 |
|
5 | 5 | <p align="center"> |
6 | 6 | <img src="docs/umap_3d_rotation.gif" alt="3D UMAP rotation showing PBMC immune cell clusters" width="600"> |
7 | 7 | </p> |
8 | 8 |
|
9 | 9 | <details> |
10 | | -<summary>Publication figure (static)</summary> |
| 10 | +<summary>Static publication figure</summary> |
11 | 11 |
|
12 | 12 |  |
| 13 | + |
| 14 | +**Panel A** — UMAP coloured by unsupervised Leiden clusters. **Panel B** — Same embedding coloured by assigned cell type. **Panel C** — Cell type proportions. **Panel D** — Z-scored expression of canonical marker genes per cell type (red = high, blue = low). **Panel E** — Summary statistics. |
13 | 15 | </details> |
14 | 16 |
|
15 | 17 | ## Dataset |
16 | 18 |
|
17 | | -**10X Genomics PBMC 3k** -- 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced on the Chromium platform. A standard benchmark dataset for single-cell analysis pipelines. |
| 19 | +**10X Genomics PBMC 3k** — 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced on the Chromium platform. This is the standard benchmark dataset used across [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html), [Seurat](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html), and other single-cell frameworks. |
18 | 20 |
|
19 | | -- **Source**: [10X Genomics](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k) |
20 | | -- **Reference**: Zheng et al. (2017) *Nature Communications* |
| 21 | +- **Direct download**: [filtered gene-barcode matrices](https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) (5.9 MB) |
| 22 | +- **Reference**: Zheng et al. (2017) [Massively parallel digital transcriptional profiling of single cells](https://doi.org/10.1038/ncomms14049). *Nature Communications* 8, 14049. |
21 | 23 |
|
22 | 24 | ## Pipeline |
23 | 25 |
|
24 | | -| Step | Script | Description | |
25 | | -|------|--------|-------------| |
26 | | -| 01 | `scripts/01_load_and_qc.py` | Download data, calculate QC metrics (genes/cell, counts, mito %), filter | |
27 | | -| 02 | `scripts/02_preprocess.py` | Normalise (10k/cell), log-transform, select 2,000 HVGs, regress covariates, scale | |
28 | | -| 03 | `scripts/03_reduce_dimensions.py` | PCA (40 components), neighbour graph, UMAP embedding | |
29 | | -| 04 | `scripts/04_cluster.py` | Leiden clustering at 5 resolutions, silhouette-based selection (min 5 clusters) | |
30 | | -| 05 | `scripts/05_annotate_cell_types.py` | Wilcoxon DE, marker gene scoring, automated cell type assignment | |
31 | | -| 06 | `scripts/06_publication_figures.py` | Multi-panel publication figure (UMAP, composition, heatmap) | |
| 26 | +| Step | Script | What it does | |
| 27 | +|------|--------|--------------| |
| 28 | +| 01 | `01_load_and_qc.py` | Download PBMC 3k, calculate QC metrics (genes/cell, UMI counts, mitochondrial %), filter low-quality cells | |
| 29 | +| 02 | `02_preprocess.py` | Normalise to 10k counts/cell, log-transform, select 2,000 highly variable genes, regress out confounders, scale | |
| 30 | +| 03 | `03_reduce_dimensions.py` | PCA (40 components), build k-nearest neighbour graph, compute UMAP embedding | |
| 31 | +| 04 | `04_cluster.py` | Leiden clustering at 5 resolutions (0.3–1.2), evaluate with silhouette score, select best with a floor of 5 clusters | |
| 32 | +| 05 | `05_annotate_cell_types.py` | Wilcoxon rank-sum test for marker genes, score clusters against known PBMC signatures, assign cell types | |
| 33 | +| 06 | `06_publication_figures.py` | Multi-panel figure: UMAP, composition bar chart, marker heatmap, summary statistics | |
| 34 | + |
| 35 | +All scripts are in `scripts/`. Each reads the previous step's `.h5ad` output from `results/`. |
32 | 36 |
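As a rough illustration of the normalisation in step 02, the sketch below rescales one cell's counts to a total of 10,000 and applies `log1p`. It is a pure-Python sketch of the arithmetic only: the function name `normalise_log1p` and the example counts are hypothetical, and the pipeline itself presumably relies on scanpy rather than code like this.

```python
import math


def normalise_log1p(counts, target_sum=10_000):
    """Scale one cell's counts to `target_sum`, then apply log(1 + x).

    Hypothetical helper for illustration; not taken from the repository.
    """
    total = sum(counts)
    if total == 0:
        # An empty cell has nothing to rescale; return zeros.
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]


# Made-up UMI counts for three genes in a single cell.
example_cell = [1, 3, 6]
normalised = normalise_log1p(example_cell)
```

Undoing the log with `math.expm1` and summing the values recovers the 10,000 target, which is a quick sanity check on the scaling.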
|
33 | 37 | ## Results |
34 | 38 |
|
35 | | -| Cell Type | Count | Proportion | |
36 | | -|-----------|-------|------------| |
37 | | -| CD4+ T cells | 1,195 | 45.3% | |
38 | | -| CD14+ Monocytes | 464 | 17.6% | |
39 | | -| NK cells | 419 | 15.9% | |
40 | | -| B cells | 342 | 13.0% | |
41 | | -| FCGR3A+ Monocytes | 180 | 6.8% | |
42 | | -| Dendritic cells | 38 | 1.4% | |
| 39 | +| Cell Type | Cells | % | Key Markers | |
| 40 | +|-----------|-------|---|-------------| |
| 41 | +| CD4+ T cells | 1,195 | 45.3 | CD3D, IL7R | |
| 42 | +| CD14+ Monocytes | 464 | 17.6 | CD14, LYZ | |
| 43 | +| NK cells | 419 | 15.9 | NKG7, GNLY | |
| 44 | +| B cells | 342 | 13.0 | MS4A1, CD79A | |
| 45 | +| FCGR3A+ Monocytes | 180 | 6.8 | FCGR3A, MS4A7 | |
| 46 | +| Dendritic cells | 38 | 1.4 | FCER1A, CST3 | |
43 | 47 |
|
44 | | -Best clustering: Leiden resolution 0.5, 6 clusters, silhouette score 0.196. |
| 48 | +These proportions are consistent with expected PBMC composition from a healthy donor. Clustering selected resolution 0.5 (6 clusters, silhouette 0.196). |
45 | 49 |
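The selection rule (highest silhouette score among resolutions that yield at least 5 clusters) can be sketched as below. Only the 0.5 entry (6 clusters, silhouette 0.196) comes from the results above. The other resolutions, cluster counts, and scores are placeholder values invented to show how the floor behaves, and `select_resolution` is a hypothetical name, not the repository's function.

```python
def select_resolution(results, min_clusters=5):
    """Pick the resolution with the best silhouette score, subject to a floor.

    `results` maps resolution -> (n_clusters, silhouette_score).
    Hypothetical helper for illustration.
    """
    eligible = {
        res: (n, score)
        for res, (n, score) in results.items()
        if n >= min_clusters
    }
    if not eligible:
        raise ValueError("no resolution meets the minimum cluster count")
    # Highest silhouette score among the eligible resolutions.
    return max(eligible, key=lambda res: eligible[res][1])


candidates = {
    0.3: (4, 0.210),   # placeholder: scores higher but falls below the floor
    0.5: (6, 0.196),   # reported in the results above
    0.8: (8, 0.150),   # placeholder
    1.0: (9, 0.130),   # placeholder
    1.2: (11, 0.110),  # placeholder
}
best = select_resolution(candidates)
```

With these placeholder values the floor matters: without it, the coarser 0.3 solution would win on silhouette score alone.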
|
46 | 50 | ## Quick Start |
47 | 51 |
|
48 | 52 | ```bash |
49 | | -# Clone |
50 | 53 | git clone https://github.com/Ekin-Kahraman/single-cell-rnaseq-immune-profiling.git |
51 | 54 | cd single-cell-rnaseq-immune-profiling |
52 | | - |
53 | | -# Install |
54 | 55 | pip install -e . |
55 | | - |
56 | | -# Run full pipeline (~17 seconds) |
57 | | -python run_pipeline.py |
58 | | - |
59 | | -# Resume from a specific step |
60 | | -python run_pipeline.py --from 4 |
| 56 | +python run_pipeline.py # full pipeline (~17s) |
| 57 | +python run_pipeline.py --from 4 # resume from step 4 |
61 | 58 | ``` |
62 | 59 |
|
63 | | -## Requirements |
64 | | - |
65 | | -- Python >= 3.10 |
66 | | -- scanpy >= 1.10 |
67 | | -- leidenalg >= 0.10 |
68 | | -- scikit-learn >= 1.3 |
69 | | - |
70 | | -Full dependency list in `pyproject.toml`. |
71 | | - |
72 | 60 | ## Testing |
73 | 61 |
|
74 | 62 | ```bash |
75 | 63 | pip install -e ".[dev]" |
76 | 64 | pytest -v |
77 | 65 | ``` |
78 | 66 |
|
79 | | -## Project Structure |
80 | | - |
81 | | -``` |
82 | | -. |
83 | | -├── scripts/ # Analysis pipeline (01-06) |
84 | | -├── tests/ # pytest test suite |
85 | | -├── data/ # Raw data (auto-downloaded) |
86 | | -├── results/ # Output: h5ad files, figures, CSVs |
87 | | -├── run_pipeline.py # Pipeline orchestrator |
88 | | -├── pyproject.toml # Dependencies and project config |
89 | | -└── CITATION.cff # Citation metadata |
90 | | -``` |
| 67 | +7 tests covering QC filtering, normalisation, HVG selection, clustering, and marker gene validation. CI runs on Python 3.10, 3.11, and 3.12. |
91 | 68 |
|
92 | | -## Key Design Decisions |
| 69 | +## Design Decisions |
93 | 70 |
|
94 | | -- **Automated cell type annotation**: Clusters are assigned to cell types by scoring against curated PBMC marker gene sets, not manual inspection. |
95 | | -- **Multi-resolution clustering**: Leiden is run at 5 resolutions (0.3-1.2) and the best is selected by silhouette score with a biological floor of 5 clusters. |
96 | | -- **Colourblind-friendly palette**: Publication figures use the Okabe-Ito palette for accessibility. |
97 | | -- **Modular scripts**: Each step reads the previous step's output from disk. Steps can be re-run independently. |
| 71 | +- **Automated annotation** — Clusters are scored against curated PBMC marker gene sets rather than annotated by manual inspection. This makes the pipeline reproducible and removes subjective judgement. |
| 72 | +- **Multi-resolution clustering** — Running Leiden at multiple resolutions and picking by silhouette score (with a biological floor) avoids the common problem of choosing an arbitrary resolution. |
| 73 | +- **Colourblind-friendly palette** — Okabe-Ito colours throughout. |
| 74 | +- **Modular scripts** — Each step reads the previous step's output from disk, so any step can be re-run without repeating upstream work. |
98 | 75 |
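A minimal sketch of marker-based annotation, assuming each cluster is summarised by its mean expression per gene and scored by the average over each cell type's markers. The marker sets are the key markers listed in the results table. The scoring rule, the name `annotate_cluster`, and the example expression profile are assumptions for illustration, so the pipeline's actual scoring may differ.

```python
# Key markers per cell type, taken from the results table.
MARKERS = {
    "CD4+ T cells": ["CD3D", "IL7R"],
    "CD14+ Monocytes": ["CD14", "LYZ"],
    "NK cells": ["NKG7", "GNLY"],
    "B cells": ["MS4A1", "CD79A"],
    "FCGR3A+ Monocytes": ["FCGR3A", "MS4A7"],
    "Dendritic cells": ["FCER1A", "CST3"],
}


def annotate_cluster(mean_expr, markers=MARKERS):
    """Return (best_cell_type, scores) for one cluster.

    `mean_expr` maps gene -> mean expression in the cluster; genes that
    are missing count as 0. Hypothetical helper for illustration.
    """
    scores = {
        cell_type: sum(mean_expr.get(g, 0.0) for g in genes) / len(genes)
        for cell_type, genes in markers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up mean expression profile for a single cluster.
cluster_profile = {"MS4A1": 2.1, "CD79A": 1.8, "CD3D": 0.1, "LYZ": 0.3}
label, scores = annotate_cluster(cluster_profile)
```

Because the label is a deterministic function of the marker sets and the cluster profile, re-running the step gives the same assignment, which is the reproducibility argument made above.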
|
99 | 76 | ## Licence |
100 | 77 |
|
|