Commit 73d7312

Publication polish: fix broken dataset link, add marker genes to results table, tighten README
1 parent d728986 commit 73d7312

1 file changed

README.md

Lines changed: 34 additions & 57 deletions
@@ -1,100 +1,77 @@
11
# Single-Cell RNA-seq Immune Cell Profiling
22

3-
End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Demonstrates quality control, normalisation, dimensionality reduction, clustering with automated resolution selection, and marker-based cell type annotation on human PBMC data.
3+
End-to-end single-cell RNA-seq analysis pipeline in Python using [scanpy](https://scanpy.readthedocs.io/). Quality control, normalisation, dimensionality reduction, unsupervised clustering with automated resolution selection, and marker-based cell type annotation on human peripheral blood mononuclear cells.
44

55
<p align="center">
66
<img src="docs/umap_3d_rotation.gif" alt="3D UMAP rotation showing PBMC immune cell clusters" width="600">
77
</p>
88

99
<details>
10-
<summary>Publication figure (static)</summary>
10+
<summary>Static publication figure</summary>
1111

1212
![Publication Figure](docs/publication_figure.png)
13+
14+
**Panel A** — UMAP coloured by unsupervised Leiden clusters. **Panel B** — Same embedding coloured by assigned cell type. **Panel C** — Cell type proportions. **Panel D** — Z-scored expression of canonical marker genes per cell type (red = high, blue = low). **Panel E** — Summary statistics.
1315
</details>
1416

1517
## Dataset
1618

17-
**10X Genomics PBMC 3k** -- 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced on the Chromium platform. A standard benchmark dataset for single-cell analysis pipelines.
19+
**10X Genomics PBMC 3k**: 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced on the Chromium platform. This is a standard benchmark dataset, used in the [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) and [Seurat](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html) tutorials and by other single-cell frameworks.
1820

19-
- **Source**: [10X Genomics](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)
20-
- **Reference**: Zheng et al. (2017) *Nature Communications*
21+
- **Direct download**: [filtered gene-barcode matrices](https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) (5.9 MB)
22+
- **Reference**: Zheng et al. (2017) [Massively parallel digital transcriptional profiling of single cells](https://doi.org/10.1038/ncomms14049). *Nature Communications* 8, 14049.
2123

2224
## Pipeline
2325

24-
| Step | Script | Description |
25-
|------|--------|-------------|
26-
| 01 | `scripts/01_load_and_qc.py` | Download data, calculate QC metrics (genes/cell, counts, mito %), filter |
27-
| 02 | `scripts/02_preprocess.py` | Normalise (10k/cell), log-transform, select 2,000 HVGs, regress covariates, scale |
28-
| 03 | `scripts/03_reduce_dimensions.py` | PCA (40 components), neighbour graph, UMAP embedding |
29-
| 04 | `scripts/04_cluster.py` | Leiden clustering at 5 resolutions, silhouette-based selection (min 5 clusters) |
30-
| 05 | `scripts/05_annotate_cell_types.py` | Wilcoxon DE, marker gene scoring, automated cell type assignment |
31-
| 06 | `scripts/06_publication_figures.py` | Multi-panel publication figure (UMAP, composition, heatmap) |
26+
| Step | Script | What it does |
27+
|------|--------|--------------|
28+
| 01 | `01_load_and_qc.py` | Download PBMC 3k, calculate QC metrics (genes/cell, UMI counts, mitochondrial %), filter low-quality cells |
29+
| 02 | `02_preprocess.py` | Normalise to 10k counts/cell, log-transform, select 2,000 highly variable genes, regress out confounders, scale |
30+
| 03 | `03_reduce_dimensions.py` | PCA (40 components), build k-nearest neighbour graph, compute UMAP embedding |
31+
| 04 | `04_cluster.py` | Leiden clustering at 5 resolutions (0.3–1.2), evaluate with silhouette score, select best with a floor of 5 clusters |
32+
| 05 | `05_annotate_cell_types.py` | Wilcoxon rank-sum test for marker genes, score clusters against known PBMC signatures, assign cell types |
33+
| 06 | `06_publication_figures.py` | Multi-panel figure: UMAP, composition bar chart, marker heatmap, summary statistics |
34+
35+
All scripts are in `scripts/`. Each reads the previous step's `.h5ad` output from `results/`.
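
As a rough illustration of the "normalise to 10k counts/cell, log-transform" part of step 02, here is a minimal pure-Python sketch. It is not the code in `02_preprocess.py`; the function name `normalise_log1p` and the toy matrix are assumptions made for this example. In scanpy the equivalent operations are usually written as `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`.

```python
import math


def normalise_log1p(counts, target_sum=10_000):
    """Scale each cell's counts to sum to target_sum, then apply log(1 + x).

    counts is a list of cells, each a list of raw per-gene counts.
    (Hypothetical helper for illustration; not part of the repository.)
    """
    normalised = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts stay all-zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalised.append([math.log1p(c * scale) for c in cell])
    return normalised


# Hypothetical toy matrix: 2 cells x 3 genes of raw UMI counts.
# The second cell was sequenced 10x deeper but has the same composition.
toy = [[1, 3, 0], [10, 30, 0]]
result = normalise_log1p(toy)
```

After normalisation the two toy cells have identical profiles, which is the point of the step: differences in sequencing depth no longer dominate downstream comparisons.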
3236

3337
## Results
3438

35-
| Cell Type | Count | Proportion |
36-
|-----------|-------|------------|
37-
| CD4+ T cells | 1,195 | 45.3% |
38-
| CD14+ Monocytes | 464 | 17.6% |
39-
| NK cells | 419 | 15.9% |
40-
| B cells | 342 | 13.0% |
41-
| FCGR3A+ Monocytes | 180 | 6.8% |
42-
| Dendritic cells | 38 | 1.4% |
39+
| Cell Type | Cells | % | Key Markers |
40+
|-----------|-------|---|-------------|
41+
| CD4+ T cells | 1,195 | 45.3 | CD3D, IL7R |
42+
| CD14+ Monocytes | 464 | 17.6 | CD14, LYZ |
43+
| NK cells | 419 | 15.9 | NKG7, GNLY |
44+
| B cells | 342 | 13.0 | MS4A1, CD79A |
45+
| FCGR3A+ Monocytes | 180 | 6.8 | FCGR3A, MS4A7 |
46+
| Dendritic cells | 38 | 1.4 | FCER1A, CST3 |
4347

44-
Best clustering: Leiden resolution 0.5, 6 clusters, silhouette score 0.196.
48+
These proportions are consistent with expected PBMC composition from a healthy donor. Clustering selected resolution 0.5 (6 clusters, silhouette 0.196).
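
The percentages in the table can be recomputed from the cell counts. The short check below assumes the percentages are taken relative to the annotated cells in the table (which sum to 2,638, fewer than the 2,700 cells in the raw dataset, presumably because of QC filtering); the variable names are chosen for this example.

```python
# Cell counts copied from the results table above.
cell_counts = {
    "CD4+ T cells": 1195,
    "CD14+ Monocytes": 464,
    "NK cells": 419,
    "B cells": 342,
    "FCGR3A+ Monocytes": 180,
    "Dendritic cells": 38,
}

# Assumption: the denominator is the number of annotated cells in the table.
total_cells = sum(cell_counts.values())

# Percentage of each cell type, rounded to one decimal place as in the table.
percentages = {
    cell_type: round(100 * n / total_cells, 1)
    for cell_type, n in cell_counts.items()
}

for cell_type, pct in percentages.items():
    print(f"{cell_type}: {pct}%")
```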
4549

4650
## Quick Start
4751

4852
```bash
49-
# Clone
5053
git clone https://github.com/Ekin-Kahraman/single-cell-rnaseq-immune-profiling.git
5154
cd single-cell-rnaseq-immune-profiling
52-
53-
# Install
5455
pip install -e .
55-
56-
# Run full pipeline (~17 seconds)
57-
python run_pipeline.py
58-
59-
# Resume from a specific step
60-
python run_pipeline.py --from 4
56+
python run_pipeline.py # full pipeline (~17s)
57+
python run_pipeline.py --from 4 # resume from step 4
6158
```
6259

63-
## Requirements
64-
65-
- Python >= 3.10
66-
- scanpy >= 1.10
67-
- leidenalg >= 0.10
68-
- scikit-learn >= 1.3
69-
70-
Full dependency list in `pyproject.toml`.
71-
7260
## Testing
7361

7462
```bash
7563
pip install -e ".[dev]"
7664
pytest -v
7765
```
7866

79-
## Project Structure
80-
81-
```
82-
.
83-
├── scripts/ # Analysis pipeline (01-06)
84-
├── tests/ # pytest test suite
85-
├── data/ # Raw data (auto-downloaded)
86-
├── results/ # Output: h5ad files, figures, CSVs
87-
├── run_pipeline.py # Pipeline orchestrator
88-
├── pyproject.toml # Dependencies and project config
89-
└── CITATION.cff # Citation metadata
90-
```
67+
7 tests covering QC filtering, normalisation, HVG selection, clustering, and marker gene validation. CI runs on Python 3.10, 3.11, and 3.12.
9168

92-
## Key Design Decisions
69+
## Design Decisions
9370

94-
- **Automated cell type annotation**: Clusters are assigned to cell types by scoring against curated PBMC marker gene sets, not manual inspection.
95-
- **Multi-resolution clustering**: Leiden is run at 5 resolutions (0.3-1.2) and the best is selected by silhouette score with a biological floor of 5 clusters.
96-
- **Colourblind-friendly palette**: Publication figures use the Okabe-Ito palette for accessibility.
97-
- **Modular scripts**: Each step reads the previous step's output from disk. Steps can be re-run independently.
71+
- **Automated annotation**: Clusters are scored against curated PBMC marker gene sets rather than annotated by manual inspection. This makes the pipeline reproducible and removes subjective judgement.
72+
- **Multi-resolution clustering**: Running Leiden at multiple resolutions and picking by silhouette score (with a biological floor) avoids the common problem of choosing an arbitrary resolution.
73+
- **Colourblind-friendly palette**: Okabe-Ito colours throughout.
74+
- **Modular scripts**: Each step reads the previous step's output from disk, so any step can be re-run without repeating upstream work.
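
The multi-resolution selection rule can be sketched as follows. This is a minimal illustration, not the code in `04_cluster.py`: the function name `select_resolution` is an assumption, and every entry in `candidates` is made up for the example except resolution 0.5 (6 clusters, silhouette 0.196), which is the result reported above.

```python
def select_resolution(results, min_clusters=5):
    """Pick the resolution with the highest silhouette score,
    ignoring resolutions that yield fewer than min_clusters clusters.

    results maps resolution -> (n_clusters, silhouette_score).
    (Hypothetical helper for illustration; not part of the repository.)
    """
    eligible = {
        res: (n, score)
        for res, (n, score) in results.items()
        if n >= min_clusters
    }
    if not eligible:
        raise ValueError(f"no resolution produced at least {min_clusters} clusters")
    # Among the eligible resolutions, the best silhouette score wins.
    return max(eligible, key=lambda res: eligible[res][1])


# Illustrative values only; just the 0.5 entry comes from the README.
candidates = {
    0.3: (4, 0.250),   # best raw silhouette, but below the 5-cluster floor
    0.5: (6, 0.196),
    0.8: (8, 0.150),
    1.0: (9, 0.120),
    1.2: (11, 0.100),
}

best = select_resolution(candidates)
print(f"Selected resolution: {best}")
```

The floor matters because silhouette score tends to favour coarse partitions. In the made-up numbers above, resolution 0.3 scores highest but merges populations that are known to be distinct in PBMCs, so it is excluded before the comparison.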
9875

9976
## Licence
10077
