MMSEQ error #22

@pbmch

Hi all,
I'm working with the Pandagma fam pipeline to identify and analyze gene families in Arachis hypogaea. I'm following the recommended step-by-step flow from the GitHub repository, and the ingest step runs without any apparent errors:

Run ingest: from fasta and gff or bed data, create fasta with IDs containing positional info.
Get position information from the main annotation sets (protein).
Adding positional information to fasta file arahy.Tifrunner.gnm2.ann2.PVFB.protein_FIX.faa
Adding positional information to fasta file arahy.Tifrunner.gnm2.ann1.4K0L.protein_FIX.faa
Adding positional information to fasta file arahy.Tifrunner.gnm1.ann1.CCJH.protein_FIX.faa
Adding positional information to fasta file arahy.BaileyII.gnm1.ann1.PQM7.protein_FIX.faa
Get position information from the main annotation sets (cds).
Adding positional information to fasta file arahy.Tifrunner.gnm2.ann2.PVFB.cds.fna
Adding positional information to fasta file arahy.Tifrunner.gnm2.ann1.4K0L.cds_FIX.fna
Adding positional information to fasta file arahy.Tifrunner.gnm1.ann1.CCJH.cds_FIX.fna
Adding positional information to fasta file arahy.BaileyII.gnm1.ann1.PQM7.cds_FIX.fna
Get position information from the extra annotation sets (protein), if any.
Adding positional information to extra fasta file Prot-TIFRUNNER-BES1BZR1.faa
Get position information from the extra annotation sets (cds), if any.
Adding positional information to extra fasta file cds.fna
Count starting sequences, for later comparisons
run_clean

However, the mmseqs step gives me two different kinds of issues depending on how I filter my .faa files:

  1. With long sequences (2000–4000 aa)
    If I keep all protein sequences, including some that are 2000–4000 amino acids long, mmseqs crashes with a segmentation fault:

scoreIdentical has different length L: ...
Segmentation fault (core dumped)

  2. Filtering to ≤ 1000 aa
    If I filter the .faa files to keep only sequences no longer than 1000 amino acids, the mmseqs step completes without errors, but the resulting .m8 files in 03_mmseqs/ are empty (0 bytes).
    This prevents the downstream clustering steps (e.g. mcl) from forming valid families, since no sequence pairs are found.
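
For reference, the length filter described above can be sketched roughly as follows. This is a minimal Python illustration, not the exact commands used on these files; `read_fasta` and `filter_by_length` are hypothetical helper names, and treating the cutoff as inclusive and ignoring a trailing `*` stop symbol are both assumptions.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle.

    Multi-line sequences are joined into a single string.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, max_len=1000):
    """Keep records whose sequence has at most max_len residues.

    Assumption: a trailing '*' (stop symbol) is not counted as a residue.
    """
    for header, seq in records:
        if len(seq.rstrip("*")) <= max_len:
            yield header, seq


# Small in-memory demonstration (hypothetical sequences).
demo = io.StringIO(">short\nMKVLA\n>long\nMKVLAMKVLAMKVLA\n")
kept = list(filter_by_length(read_fasta(demo), max_len=10))
```

Here `kept` holds only the `short` record. Counting how many records survive per input file would show whether the empty .m8 output could simply reflect too few remaining sequences.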

Things I've already checked

All .faa files are correctly formatted: one line per sequence, no blank headers, and only valid amino acids.

I used awk and grep to verify FASTA formatting and detect any problematic entries.

I converted all files to UNIX format using dos2unix.

I adjusted the clust_iden and clust_cov parameters in fam.conf to 0.30 to make clustering more permissive.

The 02_fasta_prot/ directory contains all processed FASTA files after ingest, as expected.

I am running Pandagma inside a local Conda environment, without using Singularity or Docker.
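
The formatting checks listed above were done with awk and grep; an equivalent check could be sketched in Python as below. This is only an illustration under stated assumptions: `check_fasta` is a hypothetical helper, and the accepted residue alphabet (the 20 standard amino acids plus B, J, O, U, X, Z and `*`) is my assumption, not something defined by Pandagma or mmseqs.

```python
import io

# Assumed alphabet: standard residues, ambiguity codes, and the stop symbol.
VALID = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*")


def check_fasta(handle):
    """Return a list of (line_number, problem) tuples for a FASTA handle."""
    problems = []
    pending_header = None  # line number of a header still waiting for sequence
    for lineno, raw in enumerate(handle, start=1):
        line = raw.rstrip("\n")
        if line.endswith("\r"):
            problems.append((lineno, "CRLF line ending"))
            line = line.rstrip("\r")
        if line.startswith(">"):
            if pending_header is not None:
                problems.append((pending_header, "header without sequence"))
            if not line[1:].strip():
                problems.append((lineno, "blank header"))
            pending_header = lineno
        elif not line.strip():
            problems.append((lineno, "blank line"))
        else:
            pending_header = None
            bad = sorted(set(line.upper()) - VALID)
            if bad:
                problems.append((lineno, "invalid residues: " + "".join(bad)))
    if pending_header is not None:
        problems.append((pending_header, "header without sequence"))
    return problems


# A clean record produces no problems; a broken one is reported per line.
clean = check_fasta(io.StringIO(">seq1\nMKVLA\n"))
broken = check_fasta(io.StringIO(">\nMK1\n"))
```

An empty list from `check_fasta` for every file in 02_fasta_prot/ would be consistent with the checks described above.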

System info

RAM: ~16 GB
Running Pandagma with -n 2 or -n 4 threads
Using: pandagma fam -c fam.conf -s mmseqs

Any ideas about why this is happening?
Could the empty .m8 files be caused by too few sequences after filtering, or might it still be a memory-related issue with longer sequences?

Thank you very much in advance for any suggestions or help!
