Expose mmseqs --db-load-mode in precompute_alignments_mmseqs.py #579

Open
XunOuyang wants to merge 1 commit into aqlaboratory:main from XunOuyang:feat/190-mmseqs-db-load-mode

@XunOuyang

Summary

Addresses #190.

scripts/precompute_alignments_mmseqs.py forwards a fixed set of positional
arguments to scripts/colabfold_search.sh, the last of which is the
db-load-mode value used by every underlying mmseqs call
(search, expandaln, align, filterresult, result2msa, ...). That value
was hardcoded to "0", so there was no way to change it from the Python
entry point, which is exactly the problem reported in #190.

This PR exposes it as a CLI argument and forwards it through:

parser.add_argument(
    "--db_load_mode", type=int, default=0, choices=[0, 1, 2, 3],
    help="mmseqs database preload mode, forwarded as --db-load-mode ...",
)

and replaces the hardcoded literal with str(args.db_load_mode).
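
For illustration, here is a minimal sketch of how the parsed value ends up as the last positional argument. The `build_search_cmd` helper and the placeholder for the other positional arguments are hypothetical; the real script passes more arguments to colabfold_search.sh, and the help string here is paraphrased rather than copied from the diff.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--db_load_mode", type=int, default=0, choices=[0, 1, 2, 3],
        # Paraphrased help text; the wording in the actual diff may differ.
        help="mmseqs database preload mode (0: auto, 1: fread, 2: mmap, 3: mmap+touch)",
    )
    return parser


def build_search_cmd(args, leading_args):
    """Hypothetical helper: append the load mode as the final positional value."""
    # subprocess argument lists must contain strings, hence str(...).
    return ["scripts/colabfold_search.sh", *leading_args, str(args.db_load_mode)]


args = make_parser().parse_args(["--db_load_mode", "2"])
# "<other positional arguments>" stands in for the unchanged leading arguments.
cmd = build_search_cmd(args, ["<other positional arguments>"])
print(cmd)
```

Keeping the value last means the positional contract with the shell wrapper is unchanged; only the literal is replaced.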

Why it helps

--db-load-mode controls how mmseqs loads databases (0: auto, 1: fread,
2: mmap, 3: mmap+touch). When a large multi-sequence FASTA is processed in
chunks (--fasta_chunk_size) against the same precomputed index, the wrapper
is invoked once per chunk. With mode 0, the databases may be read again on
every invocation; with mode 2 or 3 they are memory-mapped, so pages that
stay in memory can be reused across chunks, which can substantially cut
wall-clock time at the cost of higher memory use. This is the same knob
ColabFold exposes.

Behavior / compatibility

  • Default is 0, so existing invocations are unchanged.
  • choices=[0, 1, 2, 3] rejects invalid values early with a clear argparse error.

Note on the CPU question in the issue

The issue also asks how to maximize CPU usage. mmseqs already uses all
available cores by default (it has its own --threads default), so no change is
needed for that; users who want to limit threads can set the MMSEQS_NUM_THREADS
environment variable. I kept this PR focused on the db-load-mode ask in the title.
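
For anyone launching the script from Python, a small sketch of capping the thread count through the environment. The `mmseqs_env` helper is hypothetical and not part of this PR; it only shows how the MMSEQS_NUM_THREADS variable mentioned above could be set for a child process.

```python
import os


def mmseqs_env(num_threads=None, base_env=None):
    """Hypothetical helper: copy an environment and optionally cap mmseqs threads."""
    env = dict(os.environ if base_env is None else base_env)
    if num_threads is not None:
        # Environment values must be strings.
        env["MMSEQS_NUM_THREADS"] = str(num_threads)
    return env


env = mmseqs_env(num_threads=8, base_env={"PATH": "/usr/bin"})
print(env)
# The dict can then be passed as env=env to subprocess.run(...) when
# invoking scripts/precompute_alignments_mmseqs.py.
```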

Test plan

  • python -m py_compile scripts/precompute_alignments_mmseqs.py passes.
  • Argparse verified: default resolves to 0, --db_load_mode 2 parses to 2,
    and out-of-range values (e.g. 5) are rejected.
  • The forwarded positional ordering into colabfold_search.sh is unchanged
    except for the now-configurable final value.
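
The argparse checks above can be reproduced with a standalone sketch along these lines. It rebuilds only the new argument rather than importing the real script, so `make_parser` and `parse_mode` are illustrative names, not code from this PR.

```python
import argparse
import contextlib
import io


def make_parser() -> argparse.ArgumentParser:
    # Mirrors the argument added in this PR; the help text is paraphrased.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--db_load_mode", type=int, default=0, choices=[0, 1, 2, 3],
        help="mmseqs database preload mode",
    )
    return parser


def parse_mode(argv):
    """Return the parsed mode, or None if argparse rejects the input."""
    try:
        # argparse prints a usage message to stderr before exiting; silence it.
        with contextlib.redirect_stderr(io.StringIO()):
            return make_parser().parse_args(argv).db_load_mode
    except SystemExit:
        return None


assert parse_mode([]) == 0                          # default
assert parse_mode(["--db_load_mode", "2"]) == 2     # valid value
assert parse_mode(["--db_load_mode", "5"]) is None  # out of range, rejected
```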

…qs.py

The db-load-mode forwarded to colabfold_search.sh (and onward to every
mmseqs call) was hardcoded to "0", so users had no way to change how
databases are loaded from the precompute_alignments_mmseqs.py entry point.

Expose a --db_load_mode argument (0: auto, 1: fread, 2: mmap, 3: mmap+touch)
and forward it to the search wrapper. The default remains 0 to preserve the
current behavior. When a large input FASTA is processed in chunks against the
same precomputed index, keeping the databases resident in memory (mode 2 or
3) avoids re-reading them from disk on every chunk and can substantially
speed up the search.

Addresses aqlaboratory#190

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>