Hi all,
I'm working with the Pandagma fam pipeline to identify and analyze gene families in Arachis hypogaea. I'm following the recommended step-by-step flow from the GitHub repository, and the ingest step runs without any apparent errors:
```
Run ingest: from fasta and gff or bed data, create fasta with IDs containing positional info.
Get position information from the main annotation sets (protein).
Adding positional information to fasta file arahy.Tifrunner.gnm2.ann2.PVFB.protein_FIX.faa
Adding positional information to fasta file arahy.Tifrunner.gnm2.ann1.4K0L.protein_FIX.faa
Adding positional information to fasta file arahy.Tifrunner.gnm1.ann1.CCJH.protein_FIX.faa
Adding positional information to fasta file arahy.BaileyII.gnm1.ann1.PQM7.protein_FIX.faa
Get position information from the main annotation sets (cds).
Adding positional information to fasta file arahy.Tifrunner.gnm2.ann2.PVFB.cds.fna
Adding positional information to fasta file arahy.Tifrunner.gnm2.ann1.4K0L.cds_FIX.fna
Adding positional information to fasta file arahy.Tifrunner.gnm1.ann1.CCJH.cds_FIX.fna
Adding positional information to fasta file arahy.BaileyII.gnm1.ann1.PQM7.cds_FIX.fna
Get position information from the extra annotation sets (protein), if any.
Adding positional information to extra fasta file Prot-TIFRUNNER-BES1BZR1.faa
Get position information from the extra annotation sets (cds), if any.
Adding positional information to extra fasta file cds.fna
Count starting sequences, for later comparisons
run_clean
```
However, the mmseqs step gives me two different kinds of issues depending on how I filter my .faa files:
- With long sequences (2000–4000 aa)
If I keep all protein sequences, including some in the 2000–4000 amino acid range, I get segmentation fault errors such as:
```
scoreIdentical has different length L: ...
Segmentation fault (core dumped)
```
- Filtering to ≤ 1000 aa
If I filter the .faa files to include only sequences of at most 1000 amino acids, the mmseqs step completes without errors, but the resulting .m8 files in 03_mmseqs/ are empty (0 bytes).
This prevents downstream clustering steps (e.g. mcl) from forming valid families, since no sequence pairs are found.
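For reference, the length filter I applied is essentially the following (filenames are placeholders; this assumes the linearized one-line-per-sequence layout described below):

```shell
# Keep only records whose sequence is at most 1000 aa.
# Assumes a linearized FASTA: one '>' header line followed by one sequence line.
awk '/^>/ { header = $0; next }
     length($0) <= 1000 { print header; print $0 }' input.faa > filtered.faa
```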
Things I've already checked
All .faa files are correctly formatted: one line per sequence, no blank headers, and only valid amino acids.
I used awk and grep to verify FASTA formatting and detect any problematic entries.
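Concretely, the checks I ran were along these lines (file name is a placeholder):

```shell
# Blank headers: a '>' with nothing after it (expect no output).
grep -n '^>$' proteins.faa

# Non-amino-acid characters in sequence lines (expect no output).
# The class allows the 20 standard residues plus X, B, Z, J, U, O and '*'.
grep -v '^>' proteins.faa | grep -n '[^ACDEFGHIKLMNPQRSTVWYXBZJUO*]'

# Records whose header is immediately followed by another header (empty sequence).
awk 'prev ~ /^>/ && /^>/ { print "empty record: " prev } { prev = $0 }' proteins.faa
```

None of these reported anything for my files.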
I converted all files to UNIX format using dos2unix.
I adjusted the clust_iden and clust_cov parameters in fam.conf to 0.30 to make clustering more permissive.
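For context, the relevant lines in my copy of fam.conf now read as follows (I'm reproducing them as plain variable assignments, the way they appear in my config; quoting style may differ from the shipped example):

```
clust_iden='0.30'
clust_cov='0.30'
```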
The 02_fasta_prot/ directory contains all processed FASTA files after ingest, as expected.
I am running Pandagma inside a local Conda environment, without using Singularity or Docker.
System info
RAM: ~16 GB
Running Pandagma with -n 2 or -n 4 threads
Using: `pandagma fam -c fam.conf -s mmseqs`
Any ideas about why this is happening?
Could the empty .m8 files be caused by too few sequences after filtering, or might it still be a memory-related issue with longer sequences?
Thank you very much in advance for any suggestions or help!