Q: best practices? #357

@sklages

I am experimenting with setting up a stable pipeline for Oxford Nanopore direct RNA (dRNA) data.

Currently I have a dataset of 20 samples/fastqs (each is 6-7GB in size).

In a first (naive) approach, I submitted 20 IsoQuant jobs to our Slurm cluster.
Each job used 24 cores, a pre-built minimap2 index, and a pre-built hg38.gencode.v47.annotation.db.

Jobs were distributed with a maximum of three on each node.

But - and this is something I cannot avoid at the moment - the data and working directories are all on the same storage, mounted via InfiniBand, which provides up to 100 Gbit/s and, due to the protocol, very low latency.

I observe very high I/O for these jobs (even with --high_memory) during the "<..>processing chromosome<..>" stage; each pipeline run took about 40-54 h to finish.

command

  • $yml_input contains a single fastq record
isoquant.py \
    --clean_start \
    --force \
    --high_memory \
    --yaml $yml_input \
    --data_type nanopore \
    --aligner minimap2 \
    --index $ref_genome_mmi \
    --reference $ref_genome_fa \
    --genedb $ref_anno_db \
    --complete_genedb \
    --fl_data \
    --transcript_quantification unique_only \
    --threads $THREADS \
    --output $RES_DIR

some data info

2025-12-11 11:00:52,064 - INFO - Running IsoQuant version 3.10.0
2025-12-11 11:00:52,084 - INFO - Novel unspliced transcripts will not be reported, set --report_novel_unspliced true to discover them
2025-12-11 11:00:52,084 - INFO -  === IsoQuant pipeline started ===
2025-12-11 11:00:52,084 - INFO - Python version: 3.13.11 | packaged by conda-forge | (main, Dec  6 2025, 11:24:03) [GCC 14.3.0]
2025-12-11 11:00:52,084 - INFO - gffutils version: 0.13
2025-12-11 11:00:52,084 - INFO - pysam version: 0.23.3
2025-12-11 11:00:52,084 - INFO - pyfaidx version: 0.9.0.3
2025-12-11 11:00:52,084 - INFO - Reading reference genome from /path/to/coldstore/references/hg38/hg38.fa
2025-12-11 11:00:52,087 - INFO - Converting gene annotation file /path/to/hg38.gencode.v47.annotation.db to .bed format
2025-12-11 11:04:10,337 - INFO - Gene database BED written to results_sub07_Z44/hg38.gencode.v47.annotation.bed
2025-12-11 11:04:10,342 - INFO - Aligning /path/to/sample_clean.fq.gz to the reference, alignments will be saved to /path/to/sample_clean_7ba614_3b811a_1665c2.bam
2025-12-11 11:04:10,350 - INFO - Running minimap2 version 2.30-r1287 (takes a while)
2025-12-11 11:38:09,968 - INFO - Sorting alignments
2025-12-11 11:39:35,069 - INFO - Indexing alignments
2025-12-11 11:40:38,224 - INFO - Loading gene database from /path/to/ref_data/hg38.gencode.v47.annotation.db
2025-12-11 11:40:38,498 - INFO - Loading reference genome from /path/to/coldstore/references/hg38/hg38.fa
2025-12-11 11:40:38,502 - INFO - Processing 1 experiment
2025-12-11 11:40:38,502 - INFO - Secondary alignments will not be used
2025-12-11 11:40:38,502 - INFO - Processing experiment sub07_Z4
2025-12-11 11:40:38,502 - INFO - Experiment has 1 BAM file: results/sample_clean_7ba614_3b811a_1665c2.bam
2025-12-11 11:42:49,016 - INFO - Total number of chromosomes to be processed 25: chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY
2025-12-11 11:42:49,020 - INFO - Collecting read alignments
2025-12-11 11:42:50,143 - INFO - Processing chromosome chrY
<..>
2025-12-13 15:15:29,326 - INFO - Finished processing chromosome chr2
<..>
2025-12-13 15:15:29,739 - INFO - primary: 6974053
2025-12-13 15:15:29,739 - INFO - secondary: 4930682
2025-12-13 15:15:29,739 - INFO - supplementary: 605636
2025-12-13 15:15:29,739 - INFO - unaligned: 47775
2025-12-13 15:15:29,745 - INFO - Finishing read assignment, total assignments 6479381, polyA percentage 82.3
<..>
2025-12-13 15:15:29,749 - INFO - Total assignments used for analysis: 6479381, polyA tail detected in 5329436 (82.3%)
2025-12-13 15:15:29,750 - INFO - Processing assigned reads XXX
2025-12-13 15:15:29,750 - INFO - Transcript models construction is turned on
2025-12-13 15:15:29,767 - INFO - Transcript construction options:
2025-12-13 15:15:29,767 - INFO -   Novel monoexonic transcripts will be reported: no
2025-12-13 15:15:29,767 - INFO -   PolyA tails are required for multi-exon transcripts to be reported: yes
2025-12-13 15:15:29,767 - INFO -   PolyA tails are required for 2-exon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO -   PolyA tails are required for known monoexon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO -   PolyA tails are required for novel monoexon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO -   Splice site reporting level: only_canonical
2025-12-13 15:15:29,902 - INFO - Processing chromosome chr2
<..>
2025-12-13 16:50:57,261 - INFO - Read assignment statistics
2025-12-13 16:50:57,262 - INFO - ambiguous: 1639559
2025-12-13 16:50:57,262 - INFO - inconsistent: 294791
2025-12-13 16:50:57,262 - INFO - inconsistent_ambiguous: 516353
2025-12-13 16:50:57,262 - INFO - inconsistent_non_intronic: 707139
2025-12-13 16:50:57,262 - INFO - intergenic: 13754
2025-12-13 16:50:57,262 - INFO - noninformative: 1008044
2025-12-13 16:50:57,262 - INFO - unique: 2047350
2025-12-13 16:50:57,262 - INFO - unique_minor_difference: 252391
<..>
2025-12-13 16:50:57,791 - INFO - Transcript model statistics
2025-12-13 16:50:57,791 - INFO - known: 21688
2025-12-13 16:50:57,791 - INFO - novel_in_catalog: 2420
2025-12-13 16:50:57,791 - INFO - novel_not_in_catalog: 2164
<..>
2025-12-13 16:51:30,430 - INFO - Processed experiment XXX
2025-12-13 16:51:30,430 - INFO - Processed 1 experiment
2025-12-13 16:51:30,430 - INFO -  === IsoQuant pipeline finished ===
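
As a quick sanity check on the log above, the read assignment categories can be summed and compared with the reported "total assignments 6479381". The sketch below copies the eight counts from the log by hand; the variable names are illustrative and not part of IsoQuant.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Counts copied verbatim from the "Read assignment statistics" log lines.
stats='ambiguous: 1639559
inconsistent: 294791
inconsistent_ambiguous: 516353
inconsistent_non_intronic: 707139
intergenic: 13754
noninformative: 1008044
unique: 2047350
unique_minor_difference: 252391'

# Sum all categories and pull out the "unique" count.
total=$(awk -F': ' '{ sum += $2 } END { print sum }' <<< "$stats")
unique=$(awk -F': ' '$1 == "unique" { print $2 }' <<< "$stats")

# Share of assignments in the "unique" category, one decimal place.
unique_pct=$(awk -v u="$unique" -v t="$total" 'BEGIN { printf "%.1f", 100 * u / t }')

echo "total=$total unique=$unique (${unique_pct}%)"
```

The categories add up to 6479381, matching the total reported by IsoQuant, and roughly a third of the assignments fall into the `unique` category.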

This one took about 54h to finish. Is this something you'd expect?
Is there a way to reduce I/O and/or to speed up things?
Or did I choose the parameters - umm - unwisely?
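
To see where the roughly 54 h go, the log timestamps can be subtracted stage by stage. This is a minimal sketch that assumes GNU `date` (for `-d`) and assumes the elided `<..>` lines belong to the stage that surrounds them; `elapsed` is a helper name made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the time between two IsoQuant log timestamps as XhYYmZZs.
elapsed() {
    local start end s
    # Drop the ",milliseconds" suffix and convert to epoch seconds (UTC).
    start=$(date -u -d "${1%,*}" +%s)
    end=$(date -u -d "${2%,*}" +%s)
    s=$(( end - start ))
    printf '%dh%02dm%02ds\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

# Whole run: "pipeline started" to "pipeline finished".
elapsed "2025-12-11 11:00:52,084" "2025-12-13 16:51:30,430"
# minimap2 alignment: "Running minimap2" to "Sorting alignments".
elapsed "2025-12-11 11:04:10,350" "2025-12-11 11:38:09,968"
# First per-chromosome pass: "Processing chromosome chrY" to "Finished processing chromosome chr2".
elapsed "2025-12-11 11:42:50,143" "2025-12-13 15:15:29,326"
# Second per-chromosome pass (transcript models): "Processing chromosome chr2" to "Read assignment statistics".
elapsed "2025-12-13 15:15:29,902" "2025-12-13 16:50:57,261"
```

By this arithmetic the run took 53h50m38s in total, of which about 34 minutes were minimap2 alignment, 51h32m39s the first per-chromosome pass, and 1h35m28s the transcript model construction pass.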

Any hints, recommendations, help are welcome :-)

best,
Sven

Labels: performance (issues related to computational performance)
