Description
KIN appears to be much slower when running `badger rerun` on previously archived BAM files.
The slowdown is most pronounced in KIN's `makeHapProbs()` function, which computes pseudo-haploid variant calls (see `input_preparation_functions.py:42-70`).
Assumed cause(s)
The slower performance appears to be due to the accumulation of `@PG` header lines in the archived simulated BAM files: first when running `samtools cram` in the archiving procedure, and then when running `samtools split` when decompressing the archive.
KIN, for its part, uses the pysam library, which appears to re-read the header lines between every record of the BAM file. See the pysam FAQ entry "BAM files with a large number of reference sequences are slow".
The slower performance is thus due to the accumulation of @PG header lines in the bam files, which are continuously being read by pysam when processing random pseudo-haploid variant calls.
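One way to check this assumption on a given file is to count the header lines by record type. The sketch below is only illustrative: it assumes the header has already been dumped to text (for example with `samtools view -H`), the function name `count_header_records` is hypothetical, and the example header is a made-up toy rather than one taken from an actual archive.

```python
from collections import Counter


def count_header_records(header_text: str) -> Counter:
    """Count SAM header lines by record type (@HD, @SQ, @RG, @PG, @CO)."""
    counts = Counter()
    for line in header_text.splitlines():
        if line.startswith("@"):
            # The record type is the first tab-separated field, e.g. "@PG".
            counts[line.split("\t", 1)[0]] += 1
    return counts


# Toy header (hypothetical values), standing in for `samtools view -H` output.
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:1\tLN:1000",
    "@PG\tID:samtools\tPN:samtools",
    "@PG\tID:samtools.1\tPN:samtools\tPP:samtools",
])

pg_count = count_header_records(example_header)["@PG"]
print(pg_count)
```

A `@PG` count that increases after each archive and decompress cycle would be consistent with the cause assumed here.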
Proposed solution(s):
- Add `--no-PG` when running `samtools cram` in the archive procedure, and when running `samtools split` in the decompressing procedure.
- Alternatively, `samtools reset` can be used to remove unwanted `@PG` lines. This requires samtools-1.16.
- ...Or simply use a grep command when splitting files.
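The grep-style option can be sketched in Python, working on SAM text rather than on binary BAM. This is not the implementation used by badger: the function name `strip_pg_lines` is hypothetical, and the input is assumed to be text such as the output of `samtools view -h`.

```python
def strip_pg_lines(sam_text: str) -> str:
    """Return SAM text with every @PG header line removed.

    Roughly equivalent to filtering the text through `grep -v '^@PG'`.
    Alignment records never start with '@PG<TAB>', so they pass through.
    """
    return "".join(
        line
        for line in sam_text.splitlines(keepends=True)
        if not line.startswith("@PG\t")
    )


# Toy input (hypothetical values): three header lines and one record.
example_sam = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:1000\n"
    "@PG\tID:samtools\tPN:samtools\n"
    "read1\t0\t1\t10\t30\t4M\t*\t0\t0\tACGT\tIIII\n"
)

cleaned = strip_pg_lines(example_sam)
print(cleaned, end="")
```

The filtered text would still need to be converted back to BAM afterwards, so the `--no-PG` option is likely the simpler fix where it is available.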