Skip to content

BUG: VCF requirements not clearly stated & wrong seqnames derived #70

Open
@KuechlerO

Description

Firstly thx for your amazing work.
I really like the concept behind your tool :)

Now concerning the bug:
I worked with a fresh Colab notebook

  • MMSplice version:
    Installation with pip: mmsplice-2.4.0

  • Python version:
    Python 3.10.13

  • Operating System:
    Colab notebook

Description

I wanted to try out mmsplice, so I've downloaded your example data and gave it a go.
But even though I used your test data, I constantly received the following error:
ValueError: Fasta chrom names do not match with vcf chrom names.

After hours of wrapping my head around and hacking with the package code I found out:

  1. Not only a VCF-file is required as input, but also its index version (i.e. vcf.gz.tbi).
    That was never stated (check your ReadMe)...
    Please fix that.

  2. Also, when hacking around, I realized the parsing of the seqnames seems buggy.

  • It only parses seqnames when an indexed VCF file is provided
  • But it also always includes {'1'} as seqname for the VCF, no matter what is provided?!
# In your vcf_dataloader.py
def _check_chrom_annotation(self):
        # I've added these two lines
        fasta_chroms = set(self.fasta.fasta.keys())
        vcf_chroms = set(self.vcf.seqnames)

        print("fasta: ", fasta_chroms, flush=True)
        print("vcf_chroms", vcf_chroms, flush=True)
        if not fasta_chroms.intersection(vcf_chroms):
            raise ValueError(
                'Fasta chrom names do not match with vcf chrom names')

--> Output:
fasta:  {'17'}
vcf_chroms {'1', '17'}
...

The VCF seqnames should only include 17, since I've just provided your example.

What I Did

Take a look:
https://colab.research.google.com/drive/1hx6PAYT_lKuEYtnHCq0PN2lyvNBNDud1?usp=sharing

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions