Skip to content

Ekin-Kahraman/single-cell-rnaseq-immune-profiling

Repository files navigation

Single-Cell RNA-seq Immune Cell Profiling

CI License: MIT Python

Single-cell RNA-seq analysis pipeline in Python using scanpy. Doublet detection, quality control, normalisation, dimensionality reduction, clustering with automated resolution selection, marker-based cell type annotation, PAGA trajectory inference, and T cell subclustering on human peripheral blood mononuclear cells.

Retains 2,604 PBMCs, resolves 5 major immune populations plus CD4/CD8 T-cell subclusters, with CI-tested reproducibility.

Engineering Evidence

  • CI runs unit tests across Python 3.10, 3.11, and 3.12.
  • CI also runs the complete PBMC pipeline on Python 3.12 and validates generated .h5ad, CSV, PNG, PDF, and manifest artefacts.
  • results/output_manifest.csv is generated on each full run with file sizes and SHA-256 checksums for pipeline outputs.
  • scripts/validate_outputs.py checks retained cell counts, required annotations, T-cell subtyping, marker tables, figures, and manifest integrity.

3D UMAP rotation showing PBMC immune cell clusters

Static publication figure

Publication Figure

Panel A - UMAP coloured by unsupervised Leiden clusters. Panel B - Same embedding coloured by assigned cell type. Panel C - Cell type proportions. Panel D - Z-scored expression of canonical marker genes per cell type (red = high, blue = low). Panel E - Summary statistics.

Dataset

10X Genomics PBMC 3k - 2,700 peripheral blood mononuclear cells from a healthy donor, sequenced on the Chromium platform. This is the standard benchmark dataset used across scanpy, Seurat, and other single-cell frameworks.

Workflow

PBMC 3k (10X Genomics, 2,700 cells)
    │
    ▼
 01 QC ──────────── Scrublet doublet detection → filter: 200 < genes < 2500, mito < 5%
    │                (36 doublets detected, 34 removed)
    ▼
 02 Preprocess ──── Normalise (10k), log1p, 2000 HVGs, regress, scale
    │
    ▼
 03 Reduce ──────── PCA (40 PCs) → kNN graph → UMAP (random_state=42)
    │
    ▼
 04 Cluster ─────── Leiden at 5 resolutions → silhouette selection (≥5 clusters)
    │
    ▼
 05 Annotate ────── Wilcoxon DE → score against PBMC marker signatures → 5 cell types
    │
    ▼
 06 Trajectory ──── PAGA graph abstraction → diffusion pseudotime (rooted in CD14+ mono)
    │
    ▼
 07 Subcluster ──── T cell compartment → resolve CD4+ (47.5%) and CD8+ (52.5%)
    │
    ▼
 08 Figures ─────── Multi-panel publication figure (PNG 300 DPI + PDF vector)
    │
    ▼
 09 Manifest ─────── SHA-256 manifest for generated analysis artefacts

Pipeline

Step Script What it does
01 01_load_and_qc.py Download PBMC 3k, Scrublet doublet detection, QC metrics, filter low-quality cells and doublets
02 02_preprocess.py Normalise to 10k counts/cell, log-transform, select 2,000 HVGs, regress out confounders, scale
03 03_reduce_dimensions.py PCA (40 components), k-nearest neighbour graph, UMAP embedding
04 04_cluster.py Leiden clustering at 5 resolutions (0.3–1.2), silhouette evaluation, select best with ≥5 cluster floor
05 05_annotate_cell_types.py Wilcoxon rank-sum DE, score clusters against curated PBMC signatures, assign cell types
06 06_trajectory.py PAGA partition-based graph abstraction, PAGA-initialised UMAP, diffusion pseudotime
07 07_t_cell_subclustering.py Extract T cell compartment, subcluster, resolve CD4+/CD8+ via marker scoring
08 08_publication_figures.py Multi-panel figure with UMAP, composition, marker heatmap, summary (PNG + PDF)
09 09_output_manifest.py Generate checksums and file-size manifest for analysis artefacts

Project Structure

single-cell-rnaseq-immune-profiling/
├── scripts/
│   ├── 01_load_and_qc.py          QC + Scrublet doublet detection
│   ├── 02_preprocess.py            Normalise, HVG, regress, scale
│   ├── 03_reduce_dimensions.py     PCA, kNN graph, UMAP
│   ├── 04_cluster.py               Multi-resolution Leiden + silhouette
│   ├── 05_annotate_cell_types.py   Marker DE + automated annotation
│   ├── 06_trajectory.py            PAGA + diffusion pseudotime
│   ├── 07_t_cell_subclustering.py  CD4+/CD8+ resolution
│   ├── 08_publication_figures.py   Multi-panel figure (PNG + PDF)
│   ├── 09_output_manifest.py        Output checksums and file sizes
│   ├── validate_outputs.py          Full-run output validator
│   └── palette.py                  Shared Okabe-Ito colourblind palette
├── tests/
│   └── test_pipeline.py            7 tests (QC, normalisation, clustering, markers)
├── docs/
│   ├── umap_3d_rotation.gif        Animated 3D UMAP
│   └── publication_figure.png      Static multi-panel figure
├── run_pipeline.py                 CLI runner with --from resume flag
├── pyproject.toml                  Dependencies + metadata
├── requirements-lock.txt           Pinned versions for reproducibility
└── CITATION.cff

Each script reads the previous step's .h5ad output from results/.

Results

Cell Type Composition

Cell Type Cells % Key Markers
CD4+ T cells 1,192 45.8 CD3D, IL7R
CD14+ Monocytes 636 24.4 CD14, LYZ
NK cells 410 15.7 NKG7, GNLY
B cells 330 12.7 MS4A1, CD79A
Dendritic cells 36 1.4 FCER1A, CST3

2,604 cells retained after QC and doublet removal (from 2,700 raw). Clustering selected resolution 0.5 (5 clusters, silhouette 0.204).

T Cell Subclustering

Subclustering the T cell compartment (1,192 cells) resolves the CD4+/CD8+ boundary that is not visible at the global clustering level:

Subtype Cells % of T cells
CD8+ T 626 52.5
CD4+ T 566 47.5

The near-equal split is consistent with healthy donor PBMCs. CD8+ T cells were assigned by scoring CD8A/CD8B/GZMK/GZMA against IL7R/CD4/TCF7/LEF1.

Trajectory Inference

PAGA connects CD14+ monocytes → dendritic cells (the myeloid differentiation axis) and reveals the T/NK cell cluster neighbourhood in UMAP space. Diffusion pseudotime, rooted in CD14+ monocytes, orders cells along the monocyte-to-DC trajectory.

Biological Interpretation

The dominance of CD4+ T cells (46%) is expected in healthy donor PBMCs. Dendritic cells are a rare population (1.4%), correctly resolved as a distinct cluster despite low cell count. The monocyte population is predominantly classical (CD14+); nonclassical (FCGR3A+) monocytes were not resolved as a separate cluster at resolution 0.5 - they likely merge with the classical monocyte cluster. This is consistent with the resolution-sensitivity of FCGR3A+ monocyte separation observed in the literature.

Silhouette scores in single-cell data are typically low due to continuous rather than discrete cell states; the metric is used here for relative comparison between resolutions, not as an absolute quality measure.

Quick Start

git clone https://github.com/Ekin-Kahraman/single-cell-rnaseq-immune-profiling.git
cd single-cell-rnaseq-immune-profiling
pip install -e .                  # or: pip install -r requirements-lock.txt
python run_pipeline.py            # full pipeline (~38s)
python run_pipeline.py --from 6   # resume from trajectory step
python scripts/validate_outputs.py

For exact reproducibility, use requirements-lock.txt which pins all dependency versions.

Testing

pip install -e ".[dev]"
pytest -v

7 tests covering QC filtering, normalisation, HVG selection, clustering, and marker gene validation. CI runs tests on Python 3.10, 3.11, and 3.12, then runs and validates the full pipeline on Python 3.12.

Design Decisions

  • Doublet detection - Scrublet integrated before QC filtering. 36 doublets detected (1.3%), 34 removed after other QC filters. Recommended by Luecken & Theis (2019).
  • Automated annotation - Clusters scored against curated PBMC marker gene sets rather than manual inspection. The marker sets are themselves a subjective choice - but encoding them explicitly makes the annotation reproducible and auditable.
  • Multi-resolution clustering - Leiden at 5 resolutions with silhouette evaluation. The ≥5 cluster floor reflects the known minimum of major PBMC lineages (T cells, B cells, monocytes, NK, DCs).
  • Trajectory inference - PAGA provides a principled graph abstraction of cell-type connectivity. Diffusion pseudotime rooted in CD14+ monocytes because they are the most primitive myeloid progenitor in PBMCs - the expected starting point of the monocyte-to-DC differentiation axis.
  • T cell subclustering - Resolves CD4+/CD8+ populations that share CD3D/CD3E expression and cannot be separated at global clustering resolution.
  • Colourblind-friendly palette - Okabe-Ito colours throughout.
  • Reproducible seeds - random_state=42 for UMAP, Leiden, Scrublet, and silhouette sampling.
  • Dual-format figures - PNG (300 DPI) for web, PDF (vector) for publication submission.

Limitations

  • Single-sample dataset. Multi-sample analyses would require batch correction (Harmony, scVI, or BBKNN).
  • regress_out is debatable. Used here following the original scanpy tutorial, but Luecken & Theis (2019) suggest regression may overcorrect for well-filtered cells.
  • No pathway enrichment. Gene set enrichment (via decoupler or GSEApy) would connect cell types to functional programmes. Planned as a future addition.
  • FCGR3A+ monocytes not resolved. At resolution 0.5, nonclassical monocytes merge with the CD14+ cluster. Higher resolution or targeted subclustering would separate them.
  • Megakaryocytes not resolved. The PBMC 3k dataset contains a small platelet/megakaryocyte population (PPBP+, PF4+) that merges with other clusters at this resolution. The canonical scanpy tutorial resolves 8 cell types from this dataset; our pipeline resolves 5 at the global level + 2 via T cell subclustering. The difference is resolution choice - we optimise for silhouette score rather than maximising cluster count.

Licence

MIT

About

End-to-end single-cell RNA-seq immune cell profiling pipeline in Python (scanpy, PBMC 3k)

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages