MMSEQ error #22

Hi all,
I'm working with the Pandagma fam pipeline to identify and analyze gene families in Arachis hypogaea. I'm following the recommended step-by-step flow from the GitHub repository, and the ingest step runs without any apparent errors:

> Run ingest: from fasta and gff or bed data, create fasta with IDs containing positional info.
  Get position information from the main annotation sets (protein).
  Adding positional information to fasta file arahy.Tifrunner.gnm2.ann2.PVFB.protein_FIX.faa
  Adding positional information to fasta file arahy.Tifrunner.gnm2.ann1.4K0L.protein_FIX.faa
  Adding positional information to fasta file arahy.Tifrunner.gnm1.ann1.CCJH.protein_FIX.faa
  Adding positional information to fasta file arahy.BaileyII.gnm1.ann1.PQM7.protein_FIX.faa
  Get position information from the main annotation sets (cds).
  Adding positional information to fasta file arahy.Tifrunner.gnm2.ann2.PVFB.cds.fna
  Adding positional information to fasta file arahy.Tifrunner.gnm2.ann1.4K0L.cds_FIX.fna
  Adding positional information to fasta file arahy.Tifrunner.gnm1.ann1.CCJH.cds_FIX.fna
  Adding positional information to fasta file arahy.BaileyII.gnm1.ann1.PQM7.cds_FIX.fna
  Get position information from the extra annotation sets (protein), if any.
  Adding positional information to extra fasta file Prot-TIFRUNNER-BES1BZR1.faa
  Get position information from the extra annotation sets (cds), if any.
  Adding positional information to extra fasta file cds.fna
  Count starting sequences, for later comparisons
run_clean

However, the mmseqs step gives me two different kinds of issues depending on how I filter my .faa files:
1. With long sequences (up to 2000–4000 aa)
If I keep all protein sequences, including some that are 2000–4000 amino acids long, the mmseqs step crashes with a segmentation fault, with messages such as:

scoreIdentical has different length L: ...
Segmentation fault (core dumped)
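
To show how many sequences are affected, here is a minimal sketch of how the long sequences in each file could be listed. The function names and the 2000 aa threshold are my own illustration, not part of Pandagma or MMseqs2, and the sketch assumes uncompressed FASTA files with a `.faa` extension:

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None and line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_summary(path, threshold=2000):
    """Return (number of sequences, longest length, IDs longer than threshold)."""
    total, longest, over = 0, 0, []
    for header, seq in read_fasta(path):
        total += 1
        n = len(seq.rstrip("*"))  # ignore a trailing stop symbol, if present
        longest = max(longest, n)
        if n > threshold:
            # Use the first whitespace-delimited token of the header as the ID.
            over.append((header.split() or [""])[0])
    return total, longest, over


def report(directory="02_fasta_prot", pattern="*.faa", threshold=2000):
    """Print one summary line per FASTA file in the directory (assumed layout)."""
    for fasta in sorted(Path(directory).glob(pattern)):
        total, longest, over = length_summary(fasta, threshold)
        print(f"{fasta.name}\t{total} seqs\tmax {longest} aa\t"
              f"{len(over)} longer than {threshold} aa")
```

Calling `report("02_fasta_prot")` from the work directory would print one line per file, which makes it easy to see whether the crash coincides with a particular annotation set.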

2. Filtering to ≤ 1000 aa
If I filter the .faa files to include only sequences shorter than 1000 amino acids, the mmseqs step completes without errors, but the resulting .m8 files in 03_mmseqs/ are empty (0 bytes).
This prevents downstream clustering steps (e.g. mcl) from forming valid families, since no sequence pairs are found.
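
For reference, the length filter I mean is roughly equivalent to the sketch below. The function name `filter_fasta_by_length` is hypothetical (it is not a Pandagma utility), and the code assumes plain-text FASTA input; it also reports how many records are kept and dropped, so the effect of the cutoff is visible:

```python
def filter_fasta_by_length(src, dst, max_len=1000):
    """Copy records with at most max_len residues from src to dst.

    Returns (kept, dropped). A trailing '*' stop symbol is not counted
    toward the length, but the sequence is written unchanged.
    """
    kept = dropped = 0
    header, chunks = None, []

    def flush(out):
        # Write the record collected so far, if it passes the length cutoff.
        nonlocal kept, dropped
        if header is None:
            return
        seq = "".join(chunks)
        if len(seq.rstrip("*")) <= max_len:
            out.write(f">{header}\n{seq}\n")
            kept += 1
        else:
            dropped += 1

    with open(src) as inp, open(dst, "w") as out:
        for line in inp:
            line = line.strip()
            if line.startswith(">"):
                flush(out)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        flush(out)  # do not forget the last record
    return kept, dropped
```

For example, `filter_fasta_by_length("in.faa", "out.faa", max_len=1000)` returns a pair such as `(kept, dropped)` that can be compared across the annotation sets.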


Things I've already checked

All .faa files are correctly formatted: one line per sequence, no blank headers, and only valid amino acids.

I used awk and grep to verify FASTA formatting and detect any problematic entries.

I converted all files to UNIX format using dos2unix.

I adjusted the clust_iden and clust_cov parameters in fam.conf to 0.30 to make clustering more permissive.

The 02_fasta_prot/ directory contains all processed FASTA files after ingest, as expected.

I am running Pandagma inside a local Conda environment, without using Singularity or Docker.
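
The formatting checks above can be sketched as follows. This is an illustration rather than the exact awk and grep commands I ran: `check_fasta` is a hypothetical helper, and the set of accepted residue symbols in `VALID` is my assumption about what counts as a valid amino acid character (it accepts ambiguity codes and `*`, but flags `.` and `-`):

```python
# Assumed alphabet: 20 standard residues plus X, B, Z, U, O and the stop symbol '*'.
VALID = set("ACDEFGHIKLMNPQRSTVWYXBZUO*")


def check_fasta(path):
    """Return a list of (line_number, problem) tuples for a protein FASTA file."""
    problems = []
    seen_ids = set()
    current_id = None   # None until the first header is seen
    current_len = 0     # residues collected for the current record
    header_line = 0     # line number of the current header
    with open(path, "rb") as handle:  # binary mode so CRLF endings stay visible
        for lineno, raw in enumerate(handle, start=1):
            if raw.endswith(b"\r\n"):
                problems.append((lineno, "CRLF line ending"))
            line = raw.decode("ascii", errors="replace").strip()
            if not line:
                continue
            if line.startswith(">"):
                if current_id is not None and current_len == 0:
                    problems.append((header_line, "record has no sequence"))
                fields = line[1:].split()
                if not fields:
                    problems.append((lineno, "empty header"))
                    current_id = ""
                else:
                    current_id = fields[0]
                    if current_id in seen_ids:
                        problems.append((lineno, f"duplicate ID {current_id}"))
                    seen_ids.add(current_id)
                current_len, header_line = 0, lineno
            else:
                if current_id is None:
                    problems.append((lineno, "sequence before first header"))
                bad = set(line.upper()) - VALID
                if bad:
                    problems.append(
                        (lineno, "unexpected characters: " + "".join(sorted(bad)))
                    )
                current_len += len(line)
    if current_id is not None and current_len == 0:
        problems.append((header_line, "record has no sequence"))
    return problems
```

An empty list from `check_fasta("some_file.faa")` means none of these particular problems were found; it does not rule out other issues.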


System info

RAM: ~16 GB
Running Pandagma with -n 2 or -n 4 threads
Using: pandagma fam -c fam.conf -s mmseqs

Any ideas about why this is happening?
Could the empty .m8 files be caused by too few sequences after filtering, or might it still be a memory-related issue with longer sequences?
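
In case it helps with diagnosis, this is a small sketch of how I can report the number of hits in each output file. The helper name `m8_hit_counts` is my own, and it only assumes that the `.m8` files are plain text with one hit per line:

```python
from pathlib import Path


def m8_hit_counts(directory="03_mmseqs"):
    """Return {file name: number of non-empty lines} for every .m8 file."""
    counts = {}
    for path in sorted(Path(directory).glob("*.m8")):
        with open(path) as handle:
            counts[path.name] = sum(1 for line in handle if line.strip())
    return counts
```

A count of 0 for every file would confirm that no sequence pairs are being reported at all, rather than only for some comparisons.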

Thank you very much in advance for any suggestions or help! 
