Q: best practices?

I am experimenting with setting up a stable pipeline for **Oxford Nanopore dRNA** data.

Currently I have a dataset of 20 samples, one FASTQ file each (6-7 GB per file).

In a first (naive) approach, I submitted 20 `isoquant` jobs to our Slurm cluster.
Each job ran with 24 cores, a pre-built `minimap2` index, and a pre-built `hg38.gencode.v47.annotation.db`.

Jobs were distributed with a maximum of three on each node. 
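
For illustration, a minimal sketch of what such a submission can look like as a Slurm job array. This is an assumption about the layout, not my exact script: the `yaml/sample_<N>.yaml` and `results/sample_<N>` paths are hypothetical, and how many jobs end up on one node depends on node size and scheduler configuration, which is not expressed here. The remaining IsoQuant options are the ones listed in the command section.

```bash
#!/usr/bin/env bash
#SBATCH --job-name=isoquant
#SBATCH --array=1-20            # one array task per sample
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24      # 24 cores per job
set -euo pipefail

# Fall back to defaults so the script can be dry-run outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
THREADS="${SLURM_CPUS_PER_TASK:-24}"

# Hypothetical layout: one YAML file and one output directory per sample.
yml_input="yaml/sample_${task_id}.yaml"
RES_DIR="results/sample_${task_id}"

# Only the per-sample options are shown; add the remaining options here.
cmd=(isoquant.py --yaml "$yml_input" --threads "$THREADS" --output "$RES_DIR")

# Print the command by default; set RUN=1 to actually execute it.
if [[ "${RUN:-0}" == "1" ]]; then
    "${cmd[@]}"
else
    printf '%s\n' "${cmd[*]}"
fi
```

Without `RUN=1` the script only prints the command it would run, which makes it easy to check the per-sample paths before submitting with `sbatch`.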

However, and this is something I cannot avoid at the moment, data and working directory all sit on the same storage. It is mounted via InfiniBand, which provides up to 100 Gbit/s and, due to the protocol, very low latency.

I observe very high I/O for these jobs during the "<..>processing chromosome<..>" stage, even with `--high_memory`. Each pipeline run took about 40-54 h to finish.

### command 
- `$yml_input` contains a single fastq record
```bash
isoquant.py \
    --clean_start \
    --force \
    --high_memory \
    --yaml "$yml_input" \
    --data_type nanopore \
    --aligner minimap2 \
    --index "$ref_genome_mmi" \
    --reference "$ref_genome_fa" \
    --genedb "$ref_anno_db" \
    --complete_genedb \
    --fl_data \
    --transcript_quantification unique_only \
    --threads "$THREADS" \
    --output "$RES_DIR"
```

### some data info
```
2025-12-11 11:00:52,064 - INFO - Running IsoQuant version 3.10.0
2025-12-11 11:00:52,084 - INFO - Novel unspliced transcripts will not be reported, set --report_novel_unspliced true to discover them
2025-12-11 11:00:52,084 - INFO -  === IsoQuant pipeline started ===
2025-12-11 11:00:52,084 - INFO - Python version: 3.13.11 | packaged by conda-forge | (main, Dec  6 2025, 11:24:03) [GCC 14.3.0]
2025-12-11 11:00:52,084 - INFO - gffutils version: 0.13
2025-12-11 11:00:52,084 - INFO - pysam version: 0.23.3
2025-12-11 11:00:52,084 - INFO - pyfaidx version: 0.9.0.3
2025-12-11 11:00:52,084 - INFO - Reading reference genome from /path/to/coldstore/references/hg38/hg38.fa
2025-12-11 11:00:52,087 - INFO - Converting gene annotation file /path/to/hg38.gencode.v47.annotation.db to .bed format
2025-12-11 11:04:10,337 - INFO - Gene database BED written to results_sub07_Z44/hg38.gencode.v47.annotation.bed
2025-12-11 11:04:10,342 - INFO - Aligning /path/to/sample_clean.fq.gz to the reference, alignments will be saved to /path/to/sample_clean_7ba614_3b811a_1665c2.bam
2025-12-11 11:04:10,350 - INFO - Running minimap2 version 2.30-r1287 (takes a while)
2025-12-11 11:38:09,968 - INFO - Sorting alignments
2025-12-11 11:39:35,069 - INFO - Indexing alignments
2025-12-11 11:40:38,224 - INFO - Loading gene database from /path/to/ref_data/hg38.gencode.v47.annotation.db
2025-12-11 11:40:38,498 - INFO - Loading reference genome from /path/to/coldstore/references/hg38/hg38.fa
2025-12-11 11:40:38,502 - INFO - Processing 1 experiment
2025-12-11 11:40:38,502 - INFO - Secondary alignments will not be used
2025-12-11 11:40:38,502 - INFO - Processing experiment sub07_Z4
2025-12-11 11:40:38,502 - INFO - Experiment has 1 BAM file: results/sample_clean_7ba614_3b811a_1665c2.bam
2025-12-11 11:42:49,016 - INFO - Total number of chromosomes to be processed 25: chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY
2025-12-11 11:42:49,020 - INFO - Collecting read alignments
2025-12-11 11:42:50,143 - INFO - Processing chromosome chrY
<..>
2025-12-13 15:15:29,326 - INFO - Finished processing chromosome chr2
<..>
2025-12-13 15:15:29,739 - INFO - primary: 6974053
2025-12-13 15:15:29,739 - INFO - secondary: 4930682
2025-12-13 15:15:29,739 - INFO - supplementary: 605636
2025-12-13 15:15:29,739 - INFO - unaligned: 47775
2025-12-13 15:15:29,745 - INFO - Finishing read assignment, total assignments 6479381, polyA percentage 82.3
<..>
2025-12-13 15:15:29,749 - INFO - Total assignments used for analysis: 6479381, polyA tail detected in 5329436 (82.3%)
2025-12-13 15:15:29,750 - INFO - Processing assigned reads XXX
2025-12-13 15:15:29,750 - INFO - Transcript models construction is turned on
2025-12-13 15:15:29,767 - INFO - Transcript construction options:
2025-12-13 15:15:29,767 - INFO -   Novel monoexonic transcripts will be reported: no
2025-12-13 15:15:29,767 - INFO -   PolyA tails are required for multi-exon transcripts to be reported: yes
2025-12-13 15:15:29,767 - INFO -   PolyA tails are required for 2-exon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO -   PolyA tails are required for known monoexon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO -   PolyA tails are required for novel monoexon transcripts to be reported: yes
2025-12-13 15:15:29,768 - INFO -   Splice site reporting level: only_canonical
2025-12-13 15:15:29,902 - INFO - Processing chromosome chr2
<..>
2025-12-13 16:50:57,261 - INFO - Read assignment statistics
2025-12-13 16:50:57,262 - INFO - ambiguous: 1639559
2025-12-13 16:50:57,262 - INFO - inconsistent: 294791
2025-12-13 16:50:57,262 - INFO - inconsistent_ambiguous: 516353
2025-12-13 16:50:57,262 - INFO - inconsistent_non_intronic: 707139
2025-12-13 16:50:57,262 - INFO - intergenic: 13754
2025-12-13 16:50:57,262 - INFO - noninformative: 1008044
2025-12-13 16:50:57,262 - INFO - unique: 2047350
2025-12-13 16:50:57,262 - INFO - unique_minor_difference: 252391
<..>
2025-12-13 16:50:57,791 - INFO - Transcript model statistics
2025-12-13 16:50:57,791 - INFO - known: 21688
2025-12-13 16:50:57,791 - INFO - novel_in_catalog: 2420
2025-12-13 16:50:57,791 - INFO - novel_not_in_catalog: 2164
<..>
2025-12-13 16:51:30,430 - INFO - Processed experiment XXX
2025-12-13 16:51:30,430 - INFO - Processed 1 experiment
2025-12-13 16:51:30,430 - INFO -  === IsoQuant pipeline finished ===
```
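
To see where the time goes, the stage durations can be read off the timestamps in the log above. A small sketch, assuming GNU `date` is available; the helper name `elapsed` is mine, and the stage boundaries are my reading of the (elided) log lines:

```bash
# Elapsed time between two log timestamps, printed as "Hh MMm SSs".
# Timestamps are interpreted as UTC so DST changes cannot skew the result.
elapsed() {
    local start end diff
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    diff=$(( end - start ))
    printf '%dh %02dm %02ds\n' $(( diff / 3600 )) $(( diff % 3600 / 60 )) $(( diff % 60 ))
}

# minimap2 alignment: "Running minimap2" -> "Sorting alignments"
elapsed "2025-12-11 11:04:10" "2025-12-11 11:38:09"   # 0h 33m 59s
# read assignment: first "Processing chromosome" -> "Finishing read assignment"
elapsed "2025-12-11 11:42:50" "2025-12-13 15:15:29"   # 51h 32m 39s
# transcript model construction: -> "Read assignment statistics"
elapsed "2025-12-13 15:15:29" "2025-12-13 16:50:57"   # 1h 35m 28s
# whole run: "Running IsoQuant" -> "pipeline finished"
elapsed "2025-12-11 11:00:52" "2025-12-13 16:51:30"   # 53h 50m 38s
```

So the first "Processing chromosome" pass (read assignment) accounts for roughly 51.5 h of the run, while alignment takes about half an hour.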

This one took about 54h to finish. Is this something you'd expect? 
Is there a way to reduce I/O and/or to speed up things?
Or did I choose the parameters - umm - unwisely?

Any hints, recommendations, or help are welcome :-)

best,
Sven

