Description
I am experimenting with setting up a stable pipeline for Oxford Nanopore dRNA data.
Currently I have a dataset of 20 samples/FASTQ files (each 6-7 GB in size).
In a first (naive) approach, I submitted 20 IsoQuant jobs to our Slurm cluster.
Each job ran with 24 cores, a pre-built minimap2 index, and a pre-built hg38.gencode.v47.annotation.db.
Jobs were distributed with a maximum of three on each node.
However, and this is something I cannot avoid at the moment, the data and the working directory are all on the same storage, mounted via InfiniBand, which provides up to 100 Gbit/s and, due to the protocol, very low latency.
I observe very high I/O for these jobs (even with --high_memory) during the "<..>processing chromosome<..>" phase; each pipeline run took about 40-54 h to finish.
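To put a number on "very high I/O", one option on Linux is to sample the per-process counters in /proc/<pid>/io twice and take the difference. The following is a minimal sketch under that assumption; the helper names `parse_proc_io`, `read_proc_io`, and `io_delta` are hypothetical, and the sample counter values are made up purely to show the arithmetic. On network file systems the `read_bytes`/`write_bytes` fields may not reflect all traffic, so `rchar`/`wchar` are reported as well.

```python
def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def read_proc_io(pid):
    """Read the current I/O counters of a running process (Linux only)."""
    with open(f"/proc/{pid}/io") as handle:
        return parse_proc_io(handle.read())


def io_delta(before, after):
    """Counter growth between two snapshots of the same process."""
    return {key: after[key] - before[key] for key in after if key in before}


# Illustrative snapshots with made-up values (not measured on the jobs above).
SNAPSHOT_1 = "rchar: 1000\nwchar: 200\nread_bytes: 4096\nwrite_bytes: 0\n"
SNAPSHOT_2 = "rchar: 5000\nwchar: 900\nread_bytes: 8192\nwrite_bytes: 4096\n"

delta = io_delta(parse_proc_io(SNAPSHOT_1), parse_proc_io(SNAPSHOT_2))
print(delta)
```

In practice one would call `read_proc_io(pid)` for the IsoQuant process (and its worker processes) at a fixed interval and divide each delta by the interval to get bytes per second.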
Command ($yml_input contains a single fastq record):
isoquant.py \
--clean_start \
--force \
--high_memory \
--yaml $yml_input \
--data_type nanopore \
--aligner minimap2 \
--index $ref_genome_mmi \
--reference $ref_genome_fa \
--genedb $ref_anno_db \
--complete_genedb \
--fl_data \
--transcript_quantification unique_only \
--threads $THREADS \
--output $RES_DIR
Some data info:
2025-12-11 11:00:52,064 - INFO - Running IsoQuant version 3.10.0
2025-12-11 11:00:52,084 - INFO - Novel unspliced transcripts will not be reported, set --report_novel_unspliced true to discover them
2025-12-11 11:00:52,084 - INFO - === IsoQuant pipeline started ===
2025-12-11 11:00:52,084 - INFO - Python version: 3.13.11 | packaged by conda-forge | (main, Dec 6 2025, 11:24:03) [GCC 14.3.0]
2025-12-11 11:00:52,084 - INFO - gffutils version: 0.13
2025-12-11 11:00:52,084 - INFO - pysam version: 0.23.3
2025-12-11 11:00:52,084 - INFO - pyfaidx version: 0.9.0.3
2025-12-11 11:00:52,084 - INFO - Reading reference genome from /path/to/coldstore/references/hg38/hg38.fa
2025-12-11 11:00:52,087 - INFO - Converting gene annotation file /path/to/hg38.gencode.v47.annotation.db to .bed format
2025-12-11 11:04:10,337 - INFO - Gene database BED written to results_sub07_Z44/hg38.gencode.v47.annotation.bed
2025-12-11 11:04:10,342 - INFO - Aligning /path/to/sample_clean.fq.gz to the reference, alignments will be saved to /path/to/sample_clean_7ba614_3b811a_1665c2.bam
2025-12-11 11:04:10,350 - INFO - Running minimap2 version 2.30-r1287 (takes a while)
2025-12-11 11:38:09,968 - INFO - Sorting alignments
2025-12-11 11:39:35,069 - INFO - Indexing alignments
2025-12-11 11:40:38,224 - INFO - Loading gene database from /path/to/ref_data/hg38.gencode.v47.annotation.db
2025-12-11 11:40:38,498 - INFO - Loading reference genome from /path/to/coldstore/references/hg38/hg38.fa
2025-12-11 11:40:38,502 - INFO - Processing 1 experiment
2025-12-11 11:40:38,502 - INFO - Secondary alignments will not be used
2025-12-11 11:40:38,502 - INFO - Processing experiment sub07_Z4
2025-12-11 11:40:38,502 - INFO - Experiment has 1 BAM file: results/sample_clean_7ba614_3b811a_1665c2.bam
2025-12-11 11:42:49,016 - INFO - Total number of chromosomes to be processed 25: chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY
2025-12-11 11:42:49,020 - INFO - Collecting read alignments
2025-12-11 11:42:50,143 - INFO - Processing chromosome chrY
<..>
2025-12-13 15:15:29,326 - INFO - Finished processing chromosome chr2
<..>
2025-12-13 15:15:29,739 - INFO - primary: 6974053
2025-12-13 15:15:29,739 - INFO - secondary: 4930682
2025-12-13 15:15:29,739 - INFO - supplementary: 605636
2025-12-13 15:15:29,739 - INFO - unaligned: 47775
2025-12-13 15:15:29,745 - INFO - Finishing read assignment, total assignments 6479381, polyA percentage 82.3
<..>
2025-12-13 15:15:29,749 - INFO - Total assignments used for analysis: 6479381, polyA tail detected in 5329436 (82.3%)
2025-12-13 15:15:29,750 - INFO - Processing assigned reads XXX
2025-12-13 15:15:29,750 - INFO - Transcript models construction is turned on
2025-12-13 15:15:29,767 - INFO - Transcript construction options:
2025-12-13 15:15:29,767 - INFO - Novel monoexonic transcripts will be reported: no
2025-12-13 15:15:29,767 - INFO - PolyA tails are required for multi-exon transcripts to be reported: yes
2025-12-13 15:15:29,767 - INFO - PolyA tails are required for 2-exon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO - PolyA tails are required for known monoexon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO - PolyA tails are required for novel monoexon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO - Splice site reporting level: only_canonical
2025-12-13 15:15:29,902 - INFO - Processing chromosome chr2
<..>
2025-12-13 16:50:57,261 - INFO - Read assignment statistics
2025-12-13 16:50:57,262 - INFO - ambiguous: 1639559
2025-12-13 16:50:57,262 - INFO - inconsistent: 294791
2025-12-13 16:50:57,262 - INFO - inconsistent_ambiguous: 516353
2025-12-13 16:50:57,262 - INFO - inconsistent_non_intronic: 707139
2025-12-13 16:50:57,262 - INFO - intergenic: 13754
2025-12-13 16:50:57,262 - INFO - noninformative: 1008044
2025-12-13 16:50:57,262 - INFO - unique: 2047350
2025-12-13 16:50:57,262 - INFO - unique_minor_difference: 252391
<..>
2025-12-13 16:50:57,791 - INFO - Transcript model statistics
2025-12-13 16:50:57,791 - INFO - known: 21688
2025-12-13 16:50:57,791 - INFO - novel_in_catalog: 2420
2025-12-13 16:50:57,791 - INFO - novel_not_in_catalog: 2164
<..>
2025-12-13 16:51:30,430 - INFO - Processed experiment XXX
2025-12-13 16:51:30,430 - INFO - Processed 1 experiment
2025-12-13 16:51:30,430 - INFO - === IsoQuant pipeline finished ===
This one took about 54h to finish. Is this something you'd expect?
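To see where the wall-clock time goes, the timestamps in the log above can be differenced between phase markers. This is a minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS,mmm - LEVEL - message` layout shown in the excerpt; the function names and the choice of marker messages are my own, and treating each marker as the start of a "phase" is an interpretation of the log, not something IsoQuant defines.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_line(line):
    """Split 'timestamp - LEVEL - message' into (datetime, message), or None."""
    parts = line.strip().split(" - ", 2)
    if len(parts) != 3:
        return None
    try:
        stamp = datetime.strptime(parts[0], LOG_TIME_FORMAT)
    except ValueError:
        return None
    return stamp, parts[2]


def stage_durations(lines, markers):
    """Hours from the first occurrence of each marker to the next marker found."""
    hits, seen = [], set()
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue  # skip elided '<..>' lines and anything unparseable
        stamp, message = parsed
        for marker in markers:
            if marker not in seen and message.startswith(marker):
                seen.add(marker)
                hits.append((marker, stamp))
    return [
        (name, (next_stamp - start).total_seconds() / 3600)
        for (name, start), (_, next_stamp) in zip(hits, hits[1:])
    ]


# Lines copied from the log excerpt above.
LOG_LINES = [
    "2025-12-11 11:00:52,084 - INFO - === IsoQuant pipeline started ===",
    "2025-12-11 11:04:10,350 - INFO - Running minimap2 version 2.30-r1287 (takes a while)",
    "2025-12-11 11:42:49,020 - INFO - Collecting read alignments",
    "<..>",
    "2025-12-13 15:15:29,750 - INFO - Processing assigned reads XXX",
    "2025-12-13 16:51:30,430 - INFO - === IsoQuant pipeline finished ===",
]
MARKERS = [
    "=== IsoQuant pipeline started",
    "Running minimap2",
    "Collecting read alignments",
    "Processing assigned reads",
    "=== IsoQuant pipeline finished",
]

durations = stage_durations(LOG_LINES, MARKERS)
for name, hours in durations:
    print(f"{hours:7.2f} h  from '{name}'")
```

On this excerpt, the span from "Collecting read alignments" to "Processing assigned reads" accounts for roughly 51.5 h of the total, while alignment with minimap2 (including sorting and indexing) takes well under an hour.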
Is there a way to reduce I/O and/or speed things up?
Or did I choose the parameters - umm - unwisely?
Any hints, recommendations, or help are welcome :-)
best,
Sven