KIN becomes significantly slower when running badger rerun. #5

@MaelLefeuvre

Description

KIN appears to run much slower when badger rerun is executed on previously archived BAM files.

The slowdown is most pronounced in KIN's makeHapProbs() function, when computing pseudo-haploid variant calls (see: input_preparation_functions.py:42-70).

Assumed cause(s)

The slowdown appears to be caused by the accumulation of @PG header lines in the archived simulated BAM files: first when running samtools cram during the archiving procedure, and then again when running samtools split while decompressing the archive.

KIN, in turn, relies on the pysam library, which appears to re-read header lines between every record of the BAM file. See: Pysam FAQ - BAM files with a large number of reference sequences are slow

The accumulated @PG header lines are thus continuously re-read by pysam while KIN processes random pseudo-haploid variant calls, which would explain the longer runtime.
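To check whether a given file is affected, one could count the @PG lines in its header. The sketch below is only an illustration: it works on plain header text (for instance the output of `samtools view -H`), and the function name and the toy header are made up for this example rather than taken from KIN or badger.

```python
from collections import Counter


def count_header_record_types(header_text: str) -> Counter:
    """Count SAM header lines by record type (@HD, @SQ, @RG, @PG, @CO)."""
    counts = Counter()
    for line in header_text.splitlines():
        if line.startswith("@"):
            # The record type is the first tab-separated field, e.g. "@PG".
            counts[line.split("\t", 1)[0]] += 1
    return counts


# Toy header (hypothetical), mimicking a file with many accumulated @PG lines.
toy_header = "\n".join(
    ["@HD\tVN:1.6\tSO:coordinate", "@SQ\tSN:1\tLN:1000"]
    + [f"@PG\tID:samtools.{i}\tPN:samtools" for i in range(500)]
)

counts = count_header_record_types(toy_header)
print(counts["@PG"], "of", sum(counts.values()), "header lines are @PG")
```

A header where @PG lines vastly outnumber every other record type would be consistent with the cause assumed above.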

Proposed solution(s):

  • Add --no-PG when running samtools cram in the archive procedure, and when running samtools split in the decompressing procedure.
  • Alternatively, samtools reset can be used to remove unwanted @PG lines. This requires samtools-1.16.
  • Or simply filter out the @PG lines with a grep command when splitting files.
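The grep-style option can be sketched in Python as a filter over the header text. This is a minimal sketch under assumptions: the function name and the sample header are hypothetical, and it only handles the header text itself, not the rewriting of the BAM file.

```python
def strip_pg_lines(header_text: str) -> str:
    """Return the SAM header text with every @PG line removed."""
    kept = [
        line
        for line in header_text.splitlines()
        # Keep every header line whose record type is not @PG.
        if line.split("\t", 1)[0] != "@PG"
    ]
    return "\n".join(kept) + "\n" if kept else ""


# Hypothetical header with two accumulated @PG lines.
sample_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:1000\n"
    "@PG\tID:samtools\tPN:samtools\n"
    "@PG\tID:samtools.1\tPN:samtools\tPP:samtools\n"
)

clean_header = strip_pg_lines(sample_header)
print(clean_header, end="")
```

The cleaned header would still have to be applied to the file, for example with samtools reheader. That step is not shown here and has not been tested against the badger archives.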

Labels: bug (something isn't working), perf (issues related to runtime and memory performance)