Drop the legacy per-KO HMM path from getKEGGModelForOrganism #636
Merged
…search Replace the per-KO hmmsearch loop in getKEGGModelForOrganism with a single hmmsearch of the whole query proteome against the concatenated kegg116_<domain> HMM library, downloaded as a gzip-compressed flatfile and gunzipped on demand. The profile library is the query and the proteome the target sequence database, so the reported per-hit E-values match RAVEN's historical per-KO hmmsearch (same search direction, same effective database size). The cut-off, minScoreRatioKO/minScoreRatioG filters and model assembly are therefore unchanged - thousands of hmmsearch invocations simply collapse into one, with no hmmpress/hmmscan and no new bundled binary (hmmsearch already ships). A legacy directory of per-KO HMMs is still honoured when already extracted, and per-organism phylogenetic-distance subsampling is skipped for the fixed prebuilt prok90/euk90 libraries.
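The gunzip-on-demand step and the single library-versus-proteome call described above can be sketched as follows. This is a minimal illustration in Python, not the MATLAB code from the PR: the helper names `ensure_unzipped` and `build_hmmsearch_cmd` and all file names are hypothetical, and the only HMMER options used are the standard `--cpu`, `-E` and `--tblout` flags of `hmmsearch`.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def ensure_unzipped(gz_path: Path) -> Path:
    """Gunzip gz_path next to itself once; later calls reuse the result."""
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    if not out_path.exists():
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path


def build_hmmsearch_cmd(hmm_library: Path, proteome: Path, tblout: Path,
                        cutoff: float, cpus: int = 1) -> List[str]:
    """Build one hmmsearch call: profile library as query, proteome as target."""
    return [
        "hmmsearch",
        "--cpu", str(cpus),
        "-E", str(cutoff),          # per-sequence E-value reporting threshold
        "--tblout", str(tblout),    # tabular per-hit output for later filtering
        str(hmm_library),           # <hmmfile>: the concatenated KO profiles
        str(proteome),              # <seqdb>: the query organism's proteins
    ]
```

The returned list could be passed to `subprocess.run(cmd, check=True)`. Keeping the profiles in the `<hmmfile>` position is what preserves the search direction of the old per-KO loop, so one call stands in for the many earlier ones.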
Added timeout to function tests to prevent hanging.
The reconstruction now relies solely on the prebuilt concatenated KO HMM library queried in a single hmmsearch. The legacy branch that built HMMs locally and searched them per-KO is removed, together with everything only it used:
- the per-KO-directory and .zip dataDir detection (the concatenated library is always used / downloaded)
- the local HMM-training pipeline (cd-hit clustering, mafft alignment, hmmbuild) and the fasta/aligned/hmms working directories
- the now-unused seqIdentity and nSequences parameters
- the bundled cd-hit and mafft binaries and the hmmbuild binary (hmmsearch is kept), with matching changes in checkInstallation, .gitattributes, software/versions.txt and CONTRIBUTING.md
- the cdhitTests and mafftTests; hmmerTests becomes an hmmsearch smoke test

hmmsearch, getPhylDist/keggPhylDist.mat, maxPhylDist, the keggdb global model path and the no-FASTA KEGG-annotation reconstruction mode are all retained. The docstring is updated to describe the two remaining modes.
Function test results: 190 tests, 170 ✅, 35s ⏱️. Results for commit 435aba3.
Summary
Stacks on #630. Now that the proteome is searched against the prebuilt concatenated kegg116 KO HMM library in a single hmmsearch, the legacy per-KO branch that built HMMs locally and searched them one KO at a time is no longer needed. This PR removes it and everything only it used.

Making new KEGG artefacts and HMMs will be done via raven-python from now on.
Removed
- else useConcatLib block: legacy per-KO search and the local HMM-training pipeline (cd-hit clustering, mafft alignment, hmmbuild). The function now always uses the concatenated library (downloading/extracting it if needed).
- dataDir detection (pre-extracted per-KO directory, .zip archive) and the useConcatLib flag.
- fasta/aligned/hmms working sub-directories.
- seqIdentity and nSequences parameters.
- software/cd-hit/ (3 files), software/mafft/ (83 files), software/hmmer/hmmbuild(.mac). hmmsearch is kept.
- checkInstallation CD-HIT/MAFFT checks + chmod entries; .gitattributes, software/versions.txt, CONTRIBUTING.md entries.
- cdhitTests.m, mafftTests.m; hmmerTests.m becomes an hmmsearch smoke test.

Kept
hmmsearch, getPhylDist + keggPhylDist.mat, the maxPhylDist parameter, the keggdb/getModelFromKEGG global-model path, and the no-FASTA KEGG-annotation reconstruction mode (which uses getPhylDist/maxPhylDist). Docstring rewritten to describe the two remaining modes.

Verification
- getKEGGModelForOrganism.m: 1171 → ~745 lines.
- checkcode clean on the function and checkInstallation.
- hmmerTests (hmmsearch smoke) passes; tReconstruction has 0 failures (data-gated tests skip as before).
- No remaining .m/config references to cd-hit/mafft/hmmbuild.
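
A check for leftover references to the removed tools could look something like the sketch below. It is a hypothetical Python illustration, not the procedure actually used for this PR: the function name `find_stale_references`, the regular expression and the list of file suffixes are all assumptions chosen for the example.

```python
import re
from pathlib import Path
from typing import Dict, List

# Tool names taken from the PR text; matching is case-insensitive (assumption).
REMOVED_TOOLS = re.compile(r"cd-?hit|mafft|hmmbuild", re.IGNORECASE)


def find_stale_references(root: Path,
                          suffixes=(".m", ".txt", ".md")) -> Dict[str, List[int]]:
    """Map each matching file (relative to root) to the line numbers that hit."""
    hits: Dict[str, List[int]] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(errors="ignore").splitlines()
        numbers = [i for i, line in enumerate(lines, start=1)
                   if REMOVED_TOOLS.search(line)]
        if numbers:
            hits[str(path.relative_to(root))] = numbers
    return hits
```

An empty dictionary from `find_stale_references(Path("."))` would correspond to the "no remaining references" statement above. Dotfiles such as `.gitattributes` have no suffix and are skipped by this sketch, so they would need a separate check.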