Logan Proteins minor fix #1860
bgruening left a comment
Thanks for the fix! The three changes here are correct and well-scoped:
- `.shed.yml` URL typo (`search_proteins` → `logan_proteins`) ✅
- `@VERSION_SUFFIX@` bump ✅
- FASTA-extension symlink in `search_database.xml` ✅
While reviewing I went through the rest of the wrapper and found a few things worth folding in (or addressing in a quick follow-up). Requesting changes mostly because of #1 and #2 below — they're closely related to the fix in this PR and the same class of bug.
Most likely needs to land here
1. `embed_query.xml` very likely has the same FASTA-extension bug this PR fixes for `search_database.xml`.
   At `embed_query.xml` lines 14–15, `--query_sequences '$query_sequences'` is passed the raw Galaxy `.dat` path. If `embed_query.py` infers FASTA from the suffix (which the sibling fix strongly implies), embedding will fail on real datasets the same way `search_database.py` did. Suggest applying the same `ln -s '$query_sequences' query_sequences.fasta` pattern here.
2. `embed_query.xml` help text describes a UI toggle that doesn't exist.
   Line 55 says "Force CPU usage: By default, CPU will be used. Disable this option to use GPU instead if available." But the only mechanism is the `GALAXY_LOGAN_PROTEIN_SEARCH_FORCE_CPU` env var read at lines 8–12. There's no `<param>` exposing this. Either add a boolean param, or reword the help so users know this is an admin/job-destination setting.
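To make the failure mode in #1 concrete, here is a minimal sketch. It assumes `embed_query.py` decides the input format from the file suffix, which I have not verified. `looks_like_fasta`, `link_as_fasta`, the suffix set, and the `dataset_42.dat` filename are hypothetical stand-ins, not code from the wrapper.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical stand-in for suffix-based format detection; the real
# embed_query.py may do something different.
FASTA_SUFFIXES = {".fasta", ".fa", ".faa"}


def looks_like_fasta(path):
    """Return True if the path's suffix is one we treat as FASTA."""
    return Path(path).suffix.lower() in FASTA_SUFFIXES


def link_as_fasta(dataset_path, workdir):
    """Python equivalent of `ln -s '$query_sequences' query_sequences.fasta`."""
    link = Path(workdir) / "query_sequences.fasta"
    os.symlink(dataset_path, link)
    return link


with tempfile.TemporaryDirectory() as workdir:
    # Galaxy hands tools an opaque dataset path ending in .dat.
    dataset = Path(workdir) / "dataset_42.dat"
    dataset.write_text(">query_1\nMKT\n")

    raw_ok = looks_like_fasta(dataset)        # suffix check on the raw path
    link = link_as_fasta(dataset, workdir)
    linked_ok = looks_like_fasta(link)        # suffix check on the symlink
    same_content = link.read_text() == dataset.read_text()
```

If the script does sniff by suffix, the raw `.dat` path is rejected while the symlink is accepted and still reads the same bytes, which is why the one-line `ln -s` is enough.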
Worth fixing while you're in here (cheap)
3. `tool-data/faiss_database.loc.sample` header is boilerplate from another tool: it claims the file is for "metagenomics files." Replace with text that actually describes the FAISS protein database loc format. Also align the example row with the canonical 4-column format used in `test-data/faiss_database.loc` (`faiss-demo-db-20260203\tFAISS Test Database\t1.2.0\t/path`); the current sample row uses a different value style.
4. `test-data/queries.fasta` contains a nucleotide sequence labeled `test_protein_2`. Line 4 starts `ATGGCTAGCAAAGGAGAAGAA…` (looks like GFP CDS). It's labeled as a protein but isn't one. Either rename the record to indicate it's a nucleotide test case, or replace it with a real protein sequence; otherwise it's a copy-paste footgun for users using this as a reference.
5. `outfmt` param in `search_database.xml` is marked `optional="true"` but always has a default. Lines 31–38 set `value="0" selected="true"`, so the `#if $outfmt:` guard at line 17 is effectively dead code: `0` is always passed. Either drop `optional="true"` or remove the default if "unset" should mean "use upstream default."
6. `-F` flag in `embed_query.xml` line 17 is undocumented. A short trailing comment (`<!-- force overwrite of existing output files -->` or whatever it actually means) would help future maintainers.
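For the `test-data/queries.fasta` point, a quick check along these lines would catch the mislabeled record. This is a rough heuristic sketch, not part of the wrapper: `read_fasta`, `looks_like_nucleotide`, and the 0.9 threshold are my own assumptions, the `made_up_protein` record is invented for contrast, and the nucleotide record uses only the prefix visible at line 4.

```python
NUCLEOTIDE_ALPHABET = set("ACGTUN")


def read_fasta(text):
    """Parse FASTA text into a {record_id: sequence} dict."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Record ID is the first whitespace-delimited token of the header.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line.upper()
    return records


def looks_like_nucleotide(sequence, threshold=0.9):
    """Heuristic: True if nearly all residues are nucleotide letters."""
    if not sequence:
        return False
    hits = sum(1 for residue in sequence if residue in NUCLEOTIDE_ALPHABET)
    return hits / len(sequence) >= threshold


# Illustrative input only; the first record is invented, the second is the
# visible prefix of the record flagged in the review.
example = """>made_up_protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>test_protein_2
ATGGCTAGCAAAGGAGAAGAA
"""

suspects = [name for name, seq in read_fasta(example).items()
            if looks_like_nucleotide(seq)]
```

A heuristic like this will misfire on very short or low-complexity peptides, so treat it as a lint hint rather than a hard gate.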
Follow-up PR material (don't block this one)
7. Test coverage is thin. The `search_database` test at lines 47–57 only asserts `"No results, exiting"` because the bundled FAISS DB at `test-data/test-db/faiss/db.txt` is a 0-byte stub. So we currently verify the script bails gracefully on an empty index, not that the FAISS+MMseqs2 pipeline actually returns matches. A small real index that produces ≥1 hit would substantially raise the safety floor; today, a regression in the actual search logic could land green.
8. `macros.xml` citations are thin. Only the upstream `search_protein` GitHub repo is cited. Please add at minimum:
   - FAISS: Johnson et al., 10.1109/TBDATA.2019.2921572
   - MMseqs2: Steinegger & Söding, 10.1038/nbt.3988
   - The GLM2 / gLM2 model citation (whichever paper the container actually uses)
9. Docker-only requirements. `macros.xml` lines 5–9 declare only `quay.io/bgruening/logan-protein:1.2.0`, with no `<requirement type="package">`. Galaxy instances without Docker support can't install this tool. Likely unavoidable given the GLM2 + torch + FAISS + MMseqs2 stack, but worth a one-liner in `.shed.yml` / help so admins know what to expect.
10. `search_database.xml` help mentions intermediate files that aren't exposed as outputs (lines 94–99: `query_results.tsv`, `unique_centroids.fasta`, `matches.top_hit`, …). Either expose the useful ones as `<data>` outputs, or trim the help so users aren't told about files they can't access.
11. Suite naming inconsistency: directory is `logan_proteins` (plural), suite is `suite_logan_protein` (singular), tool IDs use `logan_protein_*` (singular). Tool IDs are stable so don't change them; just flagging for new files.
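On the test-coverage point, one cheap guard for the follow-up would be a check that no test fixture is a 0-byte stub. The sketch below is only an illustration under my own assumptions: `find_empty_fixtures` is a hypothetical helper, and the directory it scans is a throwaway copy of the layout described above, not the real `test-data/` tree.

```python
import tempfile
from pathlib import Path


def find_empty_fixtures(test_data_dir):
    """Return sorted relative paths of zero-byte files under test_data_dir."""
    root = Path(test_data_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Recreate the shape of the stub described in the review.
    (root / "test-db" / "faiss").mkdir(parents=True)
    (root / "test-db" / "faiss" / "db.txt").touch()
    (root / "queries.fasta").write_text(">q1\nMKT\n")

    empty = find_empty_fixtures(root)
```

A non-empty file is of course not proof of a usable index; the real fix is still a small index that returns at least one hit.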
Happy to approve once #1 and #2 are addressed (or with a quick justification that #1 isn't actually a problem). Everything else can be its own PR.
FOR CONTRIBUTOR:
There are two labels that let you ignore specific (false positive) tool linter errors:
- `skip-version-check`: Use it if only a subset of the tools has been updated in a suite.
- `skip-url-check`: Use it if GitHub CI sees 403 errors, but the URLs work.