New pipeline: nf-core/lrp2 #145

@JTL-lab

Pipeline title/name

lrp2

Keywords

proteogenomics, proteomics, mass-spectrometry, long-read-sequencing, splicing, rna-isoforms, splicing-analysis, proteoforms, pacbio

What is it about?

LRP2 is a pipeline for long-read proteogenomics analysis (see Miller et al. 2022). It takes PacBio full-length transcript reads and/or raw mass spectrometry files as input and performs transcript discovery and quality control, ORF prediction, differential expression analysis, and proteomics database generation and search to validate protein isoforms (splice proteoforms).

Please provide a schematic diagram of the proposed pipeline

[Figure: LRP2 workflow diagram]

What would a minimal first release of this pipeline include?

Five modular subworkflows:

  1. PacBio Isocall: Align full-length transcript reads and collapse reads to isoforms
  2. Transcriptome: Classify transcripts with SQANTI3, filter predicted artifacts, assign hash-based isoform IDs
  3. Predicted proteome: Predict ORFs with CPAT, classify proteins with SQANTI protein
  4. Multi-sample analysis (optional): Differential splicing (LeafCutter), expression and usage (edgeR, DRIMSeq)
  5. Proteomics (optional): Build custom protein search databases, convert raw MS files with msconvert, search with FragPipe or MetaMorpheus, map peptides to isoforms
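
Step 2 above mentions hash-based isoform IDs. As a rough illustration of the general idea (not LRP2's actual scheme, which is not described in this proposal), the sketch below derives a stable ID from a transcript's chromosome, strand, and splice-junction chain, so that the same structure receives the same ID in every sample. The function name `isoform_hash_id`, the `iso_` prefix, and the 12-character truncation are all hypothetical.

```python
import hashlib

def isoform_hash_id(chrom, strand, junctions, length=12):
    """Return a deterministic ID for a transcript structure.

    Hypothetical scheme for illustration only; LRP2's real ID format may differ.
    junctions is an iterable of (donor, acceptor) coordinate pairs.
    """
    # Sort junctions so the ID does not depend on input order.
    chain = ",".join(f"{d}-{a}" for d, a in sorted(junctions))
    key = f"{chrom}|{strand}|{chain}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return "iso_" + digest[:length]

# Same junction chain in a different order gives the same ID;
# changing the strand gives a different one.
a = isoform_hash_id("chr1", "+", [(100, 200), (300, 400)])
b = isoform_hash_id("chr1", "+", [(300, 400), (100, 200)])
c = isoform_hash_id("chr1", "-", [(100, 200), (300, 400)])
print(a == b, a == c)
```

Hashing the structure rather than numbering isoforms per run is what would let IDs be compared across samples and reruns without a shared lookup table.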

bioRxiv preprint (the Results section and main figure of the paper showcase outputs that can be obtained with the pipeline): https://www.biorxiv.org/content/10.64898/2026.05.27.728216v1

We plan to develop these additional features in later releases:

  • Inbuilt support for RefSeq annotations.
  • Extension to Oxford Nanopore reads.
  • Incorporation of gene fusions, SNPs, and indels into long-read processing subworkflows.
  • Processing of multiple ORFs per transcript for proteomic (FragPipe) searches.
  • Support for TMT-based quantification for multiplexed experiments.
  • Further incorporation of perplexity-based metrics to improve protein inference (see Schertzer et al. 2025).

I confirm my proposed pipeline will follow nf-core guidelines. Most importantly, my pipeline will:

  • be built with Nextflow.
  • pass nf-core lint tests and use standardized parameters.
  • be community-owned and developed within the nf-core organization.
  • be open source under the MIT license with proper credits and acknowledgments.
  • have a descriptive name that is all lowercase and without punctuation.
  • use the nf-core pipeline template and predominantly use official nf-core modules.
  • focus on a specific data/analysis type with appropriate scope.
  • have properly maintained documentation.
  • be bundled using versioned Docker/Singularity containers.

Why do we need a new pipeline?

No existing pipeline supports end-to-end, cohort-scale long-read proteogenomics. Current nf-core pipelines handle transcriptomics (nf-core/rnaseq, nf-core/isoseq) or proteomics (nf-core/proteomicslfq, nf-core/mspepid), but do not integrate both modalities. In practice, researchers conduct these analyses piecemeal, limiting reproducibility and scalability. Furthermore, most existing proteogenomics workflows have historically relied upon short-read RNA-seq (SRS), which does not capture full-length, high-confidence transcript structures.

LRP2 addresses this critical gap in proteogenomics by integrating long-read RNA-seq with mass spectrometry-based proteomics. It builds sample-specific protein databases from long-read RNA-seq data, which enables the mapping of both annotated and novel peptides.
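
As a loose sketch of what building a sample-specific search database could involve, the snippet below merges reference proteins with ORF-predicted proteins, collapses identical sequences, and emits FASTA text. The function name `build_search_db`, the pipe-joined header format, and the example IDs and sequences are hypothetical; the actual LRP2 database step is more involved and is not reproduced here.

```python
def build_search_db(reference, predicted):
    """Merge two {protein_id: sequence} dicts into FASTA text.

    Identical sequences are collapsed into one entry whose header lists
    every source ID, so a match can be traced back to all of them.
    Illustrative only; not LRP2's actual database format.
    """
    by_seq = {}
    for source in (reference, predicted):
        for pid, seq in source.items():
            by_seq.setdefault(seq, []).append(pid)
    lines = []
    for seq, ids in by_seq.items():
        lines.append(">" + "|".join(ids))
        lines.append(seq)
    return "\n".join(lines) + "\n"

# Toy input: one predicted protein duplicates a reference entry, one is novel.
fasta = build_search_db(
    {"ref1": "MKTAYIAK"},
    {"novel1": "MKTAYIAK", "novel2": "MKTAYLAK"},
)
print(fasta)
```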

Beyond integration, LRP2 introduces several methodological advances. We include successor tools developed in collaboration with the original authors (PacBio Isocall, long-read LeafCutter, and SQANTI protein). LRP2 supports multiple proteomic search engines (FragPipe, MetaMorpheus) across both DDA and DIA acquisition modes with default or custom workflows. Finally, LRP2 allows for multi-condition differential analysis at four resolutions: gene expression (edgeR), transcript expression/usage (edgeR, DRIMSeq), ORF expression/usage (edgeR, DRIMSeq), and subisoform-level quantification (long-read LeafCutter).
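
The distinction between expression and usage in the paragraph above can be made concrete with a small sketch. Usage is taken here as each isoform's share of its gene's total counts, which is the kind of proportion that differential-usage methods compare between conditions. This is a plain-Python illustration with made-up counts and hypothetical names; it does not call edgeR or DRIMSeq and is not how LRP2 computes its statistics.

```python
from collections import defaultdict

def isoform_usage(counts, gene_of):
    """Compute each isoform's fraction of its gene's total counts.

    counts:  dict isoform_id -> read count (illustrative values)
    gene_of: dict isoform_id -> gene_id
    """
    gene_totals = defaultdict(float)
    for iso, n in counts.items():
        gene_totals[gene_of[iso]] += n
    usage = {}
    for iso, n in counts.items():
        total = gene_totals[gene_of[iso]]
        # Genes with zero total counts get usage 0.0 rather than a division error.
        usage[iso] = n / total if total > 0 else 0.0
    return usage

gene_of = {"iso_A": "GENE1", "iso_B": "GENE1"}
control = isoform_usage({"iso_A": 90, "iso_B": 10}, gene_of)
treated = isoform_usage({"iso_A": 45, "iso_B": 5}, gene_of)
print(control, treated)
```

In this toy case the gene's total counts halve between conditions while the isoform proportions stay the same, which is why expression and usage are worth testing separately.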

Who would be interested?

LRP2 addresses a critical gap for researchers working at the intersection of long-read RNA-seq, ORF detection, and mass spectrometry-based proteomics by offering an integrated, scalable pipeline that is accessible to biologists without deep bioinformatics expertise. We have already engaged multiple groups generating large-scale, consortium-level matched RNA/MS data who require exactly this capability, including groups affiliated with T2T, 1KG, CPTAC, NIH NIMH, and a Harvard COPD cohort.

The pipeline's flexible data requirements are a key strength: LRP2 accommodates long-read RNA-seq alone, mass spectrometry alone, or paired datasets. The RNA subworkflows (S1 to S4) and the proteomic subworkflow (S5) can run independently, depending on the samplesheet contents. This modularity makes LRP2 useful across a wide range of experimental designs.
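
One way to picture the samplesheet-driven modularity described above is a small routing function that inspects which data types are present and returns the subworkflows to run. The column names `condition`, `lrs_reads`, and `ms_raw`, and the rule that S4 needs at least two conditions, are assumptions made for illustration; they are not LRP2's actual samplesheet schema or selection logic.

```python
import csv
import io

def select_subworkflows(samplesheet_text):
    """Return the subworkflow labels implied by a CSV samplesheet.

    Hypothetical columns: sample, condition, lrs_reads, ms_raw.
    """
    rows = list(csv.DictReader(io.StringIO(samplesheet_text)))
    rna_rows = [r for r in rows if (r.get("lrs_reads") or "").strip()]
    has_ms = any((r.get("ms_raw") or "").strip() for r in rows)
    conditions = {r.get("condition", "") for r in rna_rows}
    selected = []
    if rna_rows:
        selected += ["S1", "S2", "S3"]
        # Assumed rule: the optional multi-sample analysis needs >= 2 conditions.
        if len(conditions) >= 2:
            selected.append("S4")
    if has_ms:
        selected.append("S5")
    return selected

# Paired design: both samples have long reads, only one has MS data.
sheet = """sample,condition,lrs_reads,ms_raw
k562_1,K562,k562_1.bam,k562_1.raw
hepg2_1,HepG2,hepg2_1.bam,
"""
print(select_subworkflows(sheet))
```

Under these assumptions, a sheet with only `ms_raw` entries would select S5 alone, and a single-condition RNA sheet would skip S4.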

Primary audiences include:

  • Proteogenomics researchers studying protein isoform diversity across tissues, conditions, or disease states
  • Cancer biologists investigating tumor-specific isoforms and their translational products
  • Rare disease researchers for whom short-read sequencing misses clinically relevant variants
  • Consortia generating long-read RNA and/or MS data at scale

LRP2 is designed for scalability across hundreds of samples and supports both human and mouse data, making it well-suited for consortium-scale and multi-species studies.

What has been done so far

  • Codebase: The main pipeline code is complete, with all 5 proposed subworkflows implemented on the main branch:
    • S1: PacBio Isocall (alignment, profiling, isoform calling)
    • S2: Transcriptome QC and filtering (SQANTI3, hashids, artifact removal)
    • S3: ORF prediction and protein classification (CPAT, SQANTI-protein, filtering)
    • S4: Multi-sample differential analysis (LR LeafCutter, edgeR, DRIMSeq)
    • S5: Proteomics (custom database generation, MSConvert, FragPipe/MetaMorpheus, novel peptide mapping)
  • Testing and validation:
    • LRP2 has been tested on paired LRS and MS data from ENCODE4 (Reese et al. 2023) and ProteomeXchange (PXD024364) for K562 and HepG2 cell lines (Sinitcyn et al. 2023).
    • We have added two test profiles based on minimal subsets of this data: test_dda (tests combined LRS and DDA MS data) and test_rna (tests only LRS).
    • Unit testing development for all modules is currently underway.
    • We have tested the full pipeline on three HPCs: UVA’s Rivanna (Slurm), NYGC (Slurm), and Mt Sinai (LSF).
    • We have stress-tested single runs with up to 26 matched RNA samples and up to 96 MS fractions.
  • nf-core compliance and linting:
    • The pipeline was created following the nf-core pipelines template.
    • Work is currently underway on the fix/nfcore_linting branch to ensure full compliance with all CI/CD checks and linting standards.

URL to existing work (if applicable)

https://github.com/sheynkman-lab/LRP2

Are there any similar existing nf-core pipelines?

mspepid, quantms (no longer maintained), proteomicslfq (deprecated as of DSL2), proteogenomicsdb (deprecated as of DSL2)
