sanjaysgk/ipg is a bioinformatics pipeline for immunopeptidogenomics: it builds a personalised cryptic peptide search database from RNA-seq, then searches it against immunopeptidomics MS/MS data to identify non-canonical (cryptic) peptides. It implements the method of Scull et al. (2021) as a reproducible nf-core-style Nextflow pipeline.
The pipeline runs in independent steps selected with --step:
--step db_construct — RNA-seq → cryptic peptide FASTA
- Align reads with two-pass STAR and infer strandedness with RSeQC.
- Assemble transcripts with StringTie and reconcile with the reference annotation via gffcompare.
- Prepare BAMs following GATK4 RNA-seq best practice (MarkDuplicates → SplitNCigarReads → two-pass BQSR).
- Call somatic variants with Mutect2 in tumour-only mode.
- Build the cryptic peptide database with the IPG custom C tools (curate_vcf, alt_liftover, triple_translate, squish).
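The name triple_translate suggests translation of each transcript in three forward reading frames. The sketch below illustrates that general idea in Python under that assumption; it is not the IPG C implementation, and the function names and the `min_len` cut-off are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a nucleotide string codon by codon; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translate(seq: str) -> list[str]:
    """Return the translations of the three forward reading frames."""
    return [translate(seq[offset:]) for offset in range(3)]


def open_reading_segments(protein: str, min_len: int = 8) -> list[str]:
    """Split a translation at stop codons ('*') and keep the longer segments.

    min_len is a hypothetical cut-off chosen for illustration only.
    """
    return [segment for segment in protein.split("*") if len(segment) >= min_len]
```

Splitting at stop codons rather than requiring a start codon keeps segments that begin mid-frame, which is one plausible way to retain non-canonical sequence; whether the IPG tools do this is not stated here.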
--step ms_search — MS/MS → identified cryptic peptides
- Search each sample's spectra against its cryptic database with MSFragger, Comet and Sage.
- Rescore PSMs with MS2Rescore + mokapot FDR and integrate engines at a configurable peptide-level FDR (default 1%).
- Optional de novo discovery lane (--run_denovo, InstaNovo): predicts peptides directly from spectra and classifies them as canonical, cryptic or novel.
- Optional immunoinformatics (HLA binding, motif clustering, quantification) and a cryptic-discovery report.
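To make the peptide-level FDR threshold concrete, here is a minimal sketch of target-decoy q-value estimation in Python. It illustrates the general technique only and is not how mokapot or this pipeline computes confidence; the function names and the input tuple layout are assumptions made for the example.

```python
def peptide_q_values(psms):
    """Estimate peptide-level q-values by simple target-decoy counting.

    psms: iterable of (peptide, score, is_decoy); a higher score is better.
    Returns {peptide: q_value} for target peptides only.
    """
    # Keep only the best-scoring PSM for each peptide.
    best = {}
    for peptide, score, is_decoy in psms:
        if peptide not in best or score > best[peptide][0]:
            best[peptide] = (score, is_decoy)

    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)

    # FDR at each rank = decoys seen so far / targets seen so far.
    targets = decoys = 0
    fdrs = []
    for _, (_, is_decoy) in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value = smallest FDR at this rank or at any lower-scoring rank.
    q_values = {}
    running_min = float("inf")
    for (peptide, (_, is_decoy)), fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if not is_decoy:
            q_values[peptide] = running_min
    return q_values


def accept(q_values, threshold=0.01):
    """Return the target peptides whose q-value is at or below the threshold."""
    return sorted(p for p, q in q_values.items() if q <= threshold)
```

With a threshold of 0.01, `accept` keeps only peptides whose estimated q-value is at most 1%, which mirrors the default described above.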
Note
New to Nextflow? See the nf-core installation docs. The repository ships a pixi environment that pins every tool — install it with pixi install (curl -fsSL https://pixi.sh/install.sh | bash if you don't have pixi).
Prepare a samplesheet:
samplesheet.csv
sample,fastq_1,fastq_2,strandedness
SAMPLE,/path/to/R1.fastq.gz,/path/to/R2.fastq.gz,reverse

Build the cryptic peptide database:
pixi run nextflow run . \
-profile singularity \
--step db_construct \
--input samplesheet.csv \
--outdir results \
-params-file reference.yaml

Warning
Provide parameters via the CLI or a -params-file, not via a custom -c config file.
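Before launching a run, it can help to sanity-check the samplesheet. The sketch below is a hypothetical helper, not part of the pipeline: it only checks the four-column header shown above and flags rows with the wrong field count or empty cells. Whether an empty fastq_2 is permitted is not stated here, so treat that message as a prompt to consult docs/usage.md.

```python
import csv
import io

# Column names as shown in the example samplesheet.
EXPECTED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def check_samplesheet(text: str) -> list[str]:
    """Return a list of human-readable problems found in samplesheet text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_COLUMNS:
        return ["header must be " + ",".join(EXPECTED_COLUMNS)]

    problems = []
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                f"line {line_no}: expected {len(EXPECTED_COLUMNS)} fields, found {len(row)}"
            )
            continue
        for name, value in zip(EXPECTED_COLUMNS, row):
            if not value.strip():
                problems.append(f"line {line_no}: empty {name}")
    return problems
```

For a file on disk, something like `check_samplesheet(open("samplesheet.csv").read())` would return an empty list when no problems are found.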
To try the pipeline on the bundled chr22 test data, run with -profile test,pixi. For the full reference-genome parameters, the MS-search samplesheet, the --step ms_search and --step post_ms workflows, and all options, see docs/usage.md.
- Database construction: results/db_construct/<sample>/<sample>_cryptic.fasta
- MS search: the integrated peptide table under results/ms_search/<sample>/
- A MultiQC report and Nextflow execution reports under results/pipeline_info/
See docs/output.md for the full output description.
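As a quick check on the database construction output, the cryptic FASTA can be summarised with a few lines of Python. This is a minimal sketch that assumes only the standard FASTA layout (a `>` header line followed by sequence lines); the header contents of the real file are not described here, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarise(records):
    """Return entry count, total residues, and shortest/longest sequence length."""
    lengths = [len(sequence) for _, sequence in records]
    return {
        "entries": len(lengths),
        "residues": sum(lengths),
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
    }
```

To apply it to a real run, open the sample's `_cryptic.fasta` file under results/db_construct/ and pass the file handle to `read_fasta`, then pass the resulting records to `summarise`.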
| Profile | Purpose |
|---|---|
| pixi | Run every tool from the local pixi env (no containers) |
| singularity / docker | Pull biocontainers (HPC / cloud) |
| monash | SLURM on the Monash M3 comp partition (xy86 account) |
| test | Use the bundled chr22 test data |
sanjaysgk/ipg was written by Sanjay SG Krishna (@sanjaysgk), Li Lab, Monash University, porting the immunopeptidogenomics method and custom C tools developed by Kate Scull (Purcell Lab; kescull/immunopeptidogenomics). The work was supervised by Chen Li (Li Lab) and Anthony W. Purcell (Purcell Lab), Monash University.
Contributions and bug reports are welcome — please open a GitHub issue or a pull request.
If you use sanjaysgk/ipg, please cite the method paper:
Scull KE, Pandey K, Ramarathinam SH, Purcell AW. Immunopeptidogenomics: harnessing RNA-seq to illuminate the dark immunopeptidome. Mol Cell Proteomics. 2021;20:100143. doi:10.1016/j.mcpro.2021.100143
A reference list for every tool in the pipeline is in CITATIONS.md. This pipeline is built with Nextflow and the nf-core framework (Ewels et al., Nat Biotechnol. 2020, doi:10.1038/s41587-020-0439-x).
MIT — see LICENSE.