Drop the legacy per-KO HMM path from getKEGGModelForOrganism#636

Merged: edkerk merged 4 commits into develop3 from refactor/kegg-drop-legacy-hmm on Jun 11, 2026

edkerk (Member) commented Jun 11, 2026

Summary

Stacks on #630. Now that the proteome is searched against the prebuilt concatenated kegg116 KO HMM library in a single hmmsearch, the legacy per-KO branch that built HMMs locally and searched them one KO at a time is no longer needed. This removes it and everything only it used.
Making new KEGG artefacts and HMMs will be done via raven-python from now on.
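
A minimal sketch of how the output of such a single search could be regrouped per KO. It assumes hmmsearch is run with `--tblout` (for example `hmmsearch --tblout hits.tbl <library.hmm> <proteome.fasta>`) and that each profile in the library is named after its KO ID; neither is confirmed by this PR. `parse_tblout` is a hypothetical helper, written in Python for illustration only, since the RAVEN function itself is MATLAB.

```python
from collections import defaultdict


def parse_tblout(text):
    """Group hmmsearch --tblout rows by profile (query) name.

    Returns {ko_id: {gene_id: best full-sequence E-value}}.
    """
    hits = defaultdict(dict)
    for line in text.splitlines():
        # Comment and blank lines carry no hits.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # tblout columns: 0 = target (sequence) name, 2 = query (profile) name,
        # 4 = full-sequence E-value.
        gene, ko, evalue = fields[0], fields[2], float(fields[4])
        previous = hits[ko].get(gene)
        if previous is None or evalue < previous:
            hits[ko][gene] = evalue
    return {ko: dict(genes) for ko, genes in hits.items()}


# Made-up rows in tblout layout; gene and KO names are illustrative.
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias
geneA  -  K00001  -  1e-120  400.1  0.0  2e-120  399.8  0.0  1.0  1  0  0  1  1  1  1  -
geneB  -  K00001  -  3e-60   201.5  0.1  4e-60   201.1  0.1  1.0  1  0  0  1  1  1  1  -
geneA  -  K00002  -  1e-10   40.2   0.0  2e-10   39.9   0.0  1.0  1  0  0  1  1  1  1  -
"""

hits_by_ko = parse_tblout(SAMPLE)
```

Because the profile library is the query, one output file covers every KO, so the per-KO grouping that the old loop produced implicitly has to be rebuilt from the query-name column.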

Removed

  • The else useConcatLib block: legacy per-KO search and the local HMM-training pipeline (cd-hit clustering, mafft alignment, hmmbuild). The function now always uses the concatenated library (downloading/extracting it if needed).
  • Legacy dataDir detection (pre-extracted per-KO directory, .zip archive) and the useConcatLib flag.
  • The fasta/aligned/hmms working sub-directories.
  • The now-unused seqIdentity and nSequences parameters.
  • Bundled binaries: software/cd-hit/ (3 files), software/mafft/ (83 files), software/hmmer/hmmbuild(.mac). hmmsearch is kept.
  • checkInstallation CD-HIT/MAFFT checks + chmod entries; .gitattributes, software/versions.txt, CONTRIBUTING.md entries.
  • cdhitTests.m, mafftTests.m; hmmerTests.m becomes an hmmsearch smoke test.

Kept

hmmsearch, getPhylDist + keggPhylDist.mat, the maxPhylDist parameter, the keggdb/getModelFromKEGG global-model path, and the no-FASTA KEGG-annotation reconstruction mode (which uses getPhylDist/maxPhylDist). Docstring rewritten to describe the two remaining modes.

Verification

getKEGGModelForOrganism.m: 1171 → ~745 lines. checkcode clean on the function and checkInstallation. hmmerTests (hmmsearch smoke) passes; tReconstruction has 0 failures (data-gated tests skip as before). No remaining .m/config references to cd-hit/mafft/hmmbuild.
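
The "no remaining references" check could be reproduced with a small scan like the one below. This is a sketch under assumptions: the search pattern, the file extensions and the helper name `find_references` are illustrative choices, not the procedure actually used for this PR.

```python
import os
import re

# Tool names removed by this PR; the optional hyphen covers "cdhit" and "cd-hit".
PATTERN = re.compile(r"cd-?hit|mafft|hmmbuild", re.IGNORECASE)
# File types to inspect (an assumption about what counts as ".m/config").
EXTENSIONS = (".m", ".txt", ".md", ".gitattributes")


def find_references(root):
    """Return (relative path, line number, line) for every matching line."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the git object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            if not name.endswith(EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if PATTERN.search(line):
                        rel = os.path.relpath(path, root)
                        found.append((rel, lineno, line.strip()))
    return found
```

Calling `find_references(".")` from the repository root should return an empty list if the removal is complete.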

⚠️ The end-to-end reconstruction (KEGG dump, network download, external tools) can't be exercised in CI here, so reviewers with KEGG access should sanity-check a real homology run.

edkerk added 3 commits June 11, 2026 15:19
…search

Replace the per-KO hmmsearch loop in getKEGGModelForOrganism with a single
hmmsearch of the whole query proteome against the concatenated kegg116_<domain>
HMM library, downloaded as a gzip-compressed flatfile and gunzipped on demand.
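
"Gunzipped on demand" can be sketched as follows, in Python for illustration only (the RAVEN code is MATLAB). `ensure_unpacked` is a hypothetical helper, and the download step is left out: the compressed file is assumed to be on disk already.

```python
import gzip
import os
import shutil


def ensure_unpacked(gz_path):
    """Return the path of the decompressed file, gunzipping only if it is missing."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a .gz file, got: " + gz_path)
    out_path = gz_path[:-3]
    if os.path.exists(out_path):
        # Already extracted by an earlier run; nothing to do.
        return out_path
    # Write to a temporary name first so an interrupted run never leaves
    # a truncated library under the final name.
    tmp_path = out_path + ".part"
    with gzip.open(gz_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.replace(tmp_path, out_path)
    return out_path
```

The temporary-file step is a design choice of this sketch, not something the commit describes.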

The profile library is the query and the proteome the target sequence database,
so the reported per-hit E-values match RAVEN's historical per-KO hmmsearch
(same search direction, same effective database size). The cut-off,
minScoreRatioKO/minScoreRatioG filters and model assembly are therefore
unchanged - thousands of hmmsearch invocations simply collapse into one, with
no hmmpress/hmmscan and no new bundled binary (hmmsearch already ships).
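
For orientation, the unchanged filters can be sketched roughly as below. This is an approximation and not the RAVEN code: it assumes the cut-off is an E-value threshold and that both ratios compare log(E-value) with the log of the best E-value, per KO for minScoreRatioKO and per gene for minScoreRatioG. The default values, the order of the two pruning steps and the floor applied to E-values of zero are also assumptions of this sketch.

```python
import math


def filter_hits(hits, cutoff=1e-50, min_ratio_ko=0.3, min_ratio_g=0.8):
    """Filter {ko: {gene: evalue}} and return the surviving (ko, gene) pairs."""
    if not 0 < cutoff < 1:
        raise ValueError("cutoff must lie strictly between 0 and 1")
    floor = 1e-300  # assumed floor so that log() is defined for E-values of 0

    # 1. Significance cut-off on the E-value; keep log(E) for the ratio tests.
    logs = {
        (ko, gene): math.log(max(evalue, floor))
        for ko, genes in hits.items()
        for gene, evalue in genes.items()
        if evalue <= cutoff
    }

    # 2. Within each KO, drop genes that score far below that KO's best gene.
    best_ko = {}
    for (ko, _gene), value in logs.items():
        best_ko[ko] = min(best_ko.get(ko, 0.0), value)
    logs = {k: v for k, v in logs.items() if v / best_ko[k[0]] >= min_ratio_ko}

    # 3. For each gene, drop KOs that score far below that gene's best KO.
    best_gene = {}
    for (_ko, gene), value in logs.items():
        best_gene[gene] = min(best_gene.get(gene, 0.0), value)
    logs = {k: v for k, v in logs.items() if v / best_gene[k[1]] >= min_ratio_g}

    return sorted(logs)
```

Since the E-values keep the same meaning as in the per-KO searches, a filter of this shape can run on the combined hit table without retuning its thresholds.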

A legacy directory of per-KO HMMs is still honoured when already extracted, and
per-organism phylogenetic-distance subsampling is skipped for the fixed
prebuilt prok90/euk90 libraries.

Added timeout to function tests to prevent hanging.

The reconstruction now relies solely on the prebuilt concatenated KO HMM
library queried in a single hmmsearch. The legacy branch that built HMMs
locally and searched them per-KO is removed, together with everything only
it used:
- the per-KO-directory and .zip dataDir detection (the concatenated
  library is always used / downloaded)
- the local HMM-training pipeline (cd-hit clustering, mafft alignment,
  hmmbuild) and the fasta/aligned/hmms working directories
- the now-unused seqIdentity and nSequences parameters
- the bundled cd-hit and mafft binaries and the hmmbuild binary (hmmsearch
  is kept), with matching changes in checkInstallation, .gitattributes,
  software/versions.txt and CONTRIBUTING.md
- the cdhitTests and mafftTests; hmmerTests becomes an hmmsearch smoke test

hmmsearch, getPhylDist/keggPhylDist.mat, maxPhylDist, the keggdb global
model path and the no-FASTA KEGG-annotation reconstruction mode are all
retained. The docstring is updated to describe the two remaining modes.
Base automatically changed from feat/kegg-hmmsearch to develop3 June 11, 2026 19:06
@github-actions

Function test results

190 tests: 170 ✅ passed, 20 💤 skipped, 0 ❌ failed
21 suites, 1 file, 35s ⏱️

Results for commit 435aba3.

@edkerk edkerk merged commit 570a490 into develop3 Jun 11, 2026
2 checks passed
@edkerk edkerk deleted the refactor/kegg-drop-legacy-hmm branch June 11, 2026 19:37