`KIN` becomes significantly slower when running `badger rerun`.

# Description

The software [`KIN`](https://github.com/DivyaratanPopli/Kinship_Inference/tree/v3.1.3) appears to be much *(**much**)* slower when running `badger rerun` on previously archived bam files.

This 'sluggishness' is most prominent in the `makeHapProbs()` function of `KIN`, when computing pseudo-haploid variant calls (see: [input_preparation_functions.py:42-70](https://github.com/DivyaratanPopli/Kinship_Inference/blob/891bcaa263ba249439daa7e9e89277555a664cdd/pypackage/kingaroo/KINgaroo/KINgaroo_scripts/input_preparation_functions.py#L42-L70)).

# Assumed cause(s)

This slower performance appears to be due to the accumulation of `@PG` header lines in the archived simulated bam files: first when running `samtools cram` in the archiving procedure, and then again when running `samtools split` while decompressing the archive.

`KIN`, for its part, makes use of the `pysam` library, which appears to re-read header lines between every record of the bam file. See: [Pysam FAQ - BAM files with a large number of reference sequences are slow](https://pysam.readthedocs.io/en/latest/faq.html#bam-files-with-a-large-number-of-reference-sequences-are-slow)

The slowdown is thus most likely caused by these accumulated `@PG` header lines, which `pysam` keeps reading over and over while processing random pseudo-haploid variant calls.
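To check whether a given file is affected, one could count the header lines per record type and compare a freshly simulated bam file against one that went through the archive. The snippet below is only a minimal sketch: the function name and the example header are hypothetical, and the header text is assumed to come from something like `samtools view -H file.bam` (or `str(bam.header)` in `pysam`).

```python
from collections import Counter


def count_header_record_types(header_text: str) -> Counter:
    """Count SAM header lines by record type (@HD, @SQ, @RG, @PG, @CO)."""
    counts = Counter()
    for line in header_text.splitlines():
        if line.startswith("@"):
            # The record type is the first tab-separated field, e.g. "@PG".
            counts[line.split("\t", 1)[0]] += 1
    return counts


# Hypothetical header, for illustration only (not taken from a real archive).
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:1\tLN:1000",
    "@PG\tID:samtools\tPN:samtools",
    "@PG\tID:samtools.1\tPN:samtools\tPP:samtools",
])

counts = count_header_record_types(example_header)
print(counts["@PG"])
```

A `@PG` count that grows with each archive/rerun cycle would be consistent with the assumed cause above.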

# Proposed solution(s):

- [ ] Add `--no-PG` when running `samtools cram` in the archive procedure, and when running `samtools split` in the decompressing procedure.
- [ ] Alternatively, `samtools reset` can be used to remove unwanted `@PG` lines. This requires samtools-1.16.
- [ ] ...Or simply filter out the `@PG` lines with a `grep` command when splitting files.
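The grep-style option in the list above amounts to dropping every header line that starts with `@PG`. Here is a minimal sketch of that filter on plain header text; the function name and the sample header are hypothetical, and this has not been tested against the actual `badger` archive layout.

```python
def strip_pg_lines(header_text: str) -> str:
    """Return the SAM header text without its @PG records."""
    kept = [
        line
        for line in header_text.splitlines()
        # Match the "@PG" record type exactly (first tab-separated field).
        if line.split("\t", 1)[0] != "@PG"
    ]
    # Re-terminate every kept line, as in a regular SAM header.
    return "".join(line + "\n" for line in kept)


# Hypothetical header, for illustration only.
sample_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:1000\n"
    "@PG\tID:samtools\tPN:samtools\n"
    "@PG\tID:samtools.1\tPN:samtools\tPP:samtools\n"
)

cleaned_header = strip_pg_lines(sample_header)
print(cleaned_header, end="")
```

One possible way to apply the cleaned header would be to write it to a file and pass it to `samtools reheader`. Note that this only touches the header: any `PG:Z:` tags on individual reads are left as they are.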
