Description
Hi @andrewprzh
Thanks a lot for developing IsoQuant. It's been a pleasure using it for isoform quantification/discovery.
My apologies for the long question, but this is a performance issue I've been facing for a while. I'm following up on a previous issue I posted in July (#209), and I suspect others have posted similar issues about memory usage. That being said, I've been very keen on using the combination of isoseq + IsoQuant, because I find IsoQuant's approach to identifying read assignments and declaring novel transcripts quite reasonable.
However, since last July we have expanded our cohort from 15 samples to 61, with hundreds of thousands of cells and tens of millions of HiFi reads per sample. Already with 15 samples I was struggling with memory requirements, so at the time I decided to split the BAM files by chromosome. The obvious disadvantage is that this approach cannot handle supplementary alignments and could potentially affect fusion gene/transcript discovery, but that seemed like a fair price to pay.
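For reference, the per-chromosome split was along these lines (a minimal sketch, assuming coordinate-sorted and indexed BAMs; the file naming is just illustrative):

# split an indexed BAM into one BAM per reference sequence
for chrom in $(samtools idxstats ${sampleBam} | cut -f1 | grep -v '^\*$'); do
    samtools view -b ${sampleBam} ${chrom} > ${sampleBam%.bam}.${chrom}.bam
    samtools index ${sampleBam%.bam}.${chrom}.bam
done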
Now with 61 samples even that approach didn't work, as some chromosomes took more than 5 days, over 400 GB of memory and 20 CPUs to run with IsoQuant v3.6.2 (I also tried with as few as 2 CPUs via the -t command-line argument). So I had to think about another approach. I'm currently implementing a further sharding approach in which I find valid "splitting points" across all samples. I do this by:
1-Finding unmapped regions per sample:
samtools view -F 4 -b ${sampleBam} ${chrom} | bedtools bamtobed -i - | bedtools merge > $mappedRegionsBed
bedtools complement -i $mappedRegionsBed -g chrom.sizes.txt > $unMappedRegionsBed
2-Intersection of all unmapped regions across samples:
bedtools multiinter -i ${unMappedRegionsBed[@]} | awk -v count_all_files="${#unMappedRegionsBed[@]}" '$4==count_all_files' | cut -f1,2,3 > unmapped_regions_across_all_samples.bed
Now these represent regions that are not covered by any read, including reads containing gaps (i.e. N in their CIGAR string). Again, this doesn't take care of supplementary alignments, so I've simply decided to filter them out and accept that this is the cost of parallelisation.
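For completeness, the filtering itself is just the standard SAM flag filter (a minimal sketch; 2048 / 0x800 is the supplementary-alignment flag):

# drop supplementary alignments (SAM flag 0x800 = 2048) before sharding
samtools view -b -F 2048 ${sampleBam} > ${sampleBam%.bam}.noSupp.bam
samtools index ${sampleBam%.bam}.noSupp.bam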
3-Chunking strategy
I'm trying multiple strategies to choose the best set of "splitting points" so that the sharded BAMs end up as balanced in read count as possible, but this is another story (I can provide code if interested).
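As a rough illustration of the idea (not my actual code; the per-interval counting, the 2M-read target and the file names are just placeholders), one greedy variant counts primary reads in the intervals between consecutive splitting points and then merges neighbouring intervals until each shard reaches the target:

# candidate_chunks.bed: intervals delimited by the splitting points, sorted by chrom/start
# count primary alignments per candidate interval (excluding secondary/supplementary: -F 0x900)
while read -r chrom start end; do
    n=$(samtools view -c -F 0x900 ${sampleBam} ${chrom}:$((start + 1))-${end})
    printf '%s\t%s\t%s\t%s\n' "$chrom" "$start" "$end" "$n"
done < candidate_chunks.bed > chunk_counts.bed

# greedily merge adjacent intervals until each shard holds roughly 2M reads
awk -v target=2000000 'BEGIN{OFS="\t"}
{
    if (open && $1 != c) { print c, s, e; open = 0 }   # never merge across chromosomes
    if (!open) { c = $1; s = $2; open = 1; acc = 0 }
    acc += $4; e = $3
    if (acc >= target) { print c, s, e; open = 0 }
}
END { if (open) print c, s, e }' chunk_counts.bed > shards.bed

Each line of shards.bed then defines one region to extract per sample with samtools view, and because every boundary falls inside a region not covered by reads in any sample, no read can end up in two shards.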
My question is:
Can you foresee any serious problems with this approach? The obvious advantage is that it is now possible to run much smaller chunks that don't require hundreds of GB of memory. The disadvantage is that there will have to be a bit of playing around to ensure things can be collected appropriately afterwards. For example, I will have to modify novel transcript names, since the same ID (e.g. transcript123.chr1) will appear in multiple chunks; see the renaming sketch below.
Other than supplementary alignments and transcript naming, is there any other reason not to continue with this approach?
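The renaming I have in mind is just a chunk-specific suffix on each novel transcript ID (a minimal sketch; the *.transcript_models.gtf file name and the chunk07 suffix are assumptions, and the same suffix would also have to be applied to the matching transcript IDs in the count tables):

# append a chunk-specific suffix to novel transcript IDs so merged annotations stay unique
# (assumes novel IDs start with "transcript", as in transcript123.chr1; known ENST IDs are untouched)
chunk=chunk07
sed -E 's/transcript_id "(transcript[^"]*)"/transcript_id "\1.'"${chunk}"'"/g' \
    ${chunk}.transcript_models.gtf > ${chunk}.transcript_models.renamed.gtf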
My command
isoquant.py --reference ${fasta} --genedb gencodev46.db --complete_genedb --sqanti_output --bam ${bams[@]} --labels ${sample_ids[@]} --data_type pacbio_ccs -o ${region} -p ${region} --count_exons --check_canonical --read_group tag:CB -t 3 --counts_format linear --bam_tags CB --no_secondary --clean_start