Skip to content

Logan Proteins minor fix#1860

Open
SaimMomin12 wants to merge 4 commits intobgruening:masterfrom
SaimMomin12:tools/fix-search-proteins
Open

Logan Proteins minor fix#1860
SaimMomin12 wants to merge 4 commits intobgruening:masterfrom
SaimMomin12:tools/fix-search-proteins

Conversation

@SaimMomin12
Copy link
Copy Markdown
Collaborator

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

There are two labels that allow to ignore specific (false positive) tool linter errors:

  • skip-version-check: Use it if only a subset of the tools has been updated in a suite.
  • skip-url-check: Use it if github CI sees 403 errors, but the URLs work.

@SaimMomin12 SaimMomin12 requested a review from bgruening May 6, 2026 07:18
Copy link
Copy Markdown
Owner

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! The three changes here are correct and well-scoped:

  • .shed.yml URL typo (search_proteinslogan_proteins) ✅
  • @VERSION_SUFFIX@ bump ✅
  • FASTA-extension symlink in search_database.xml

While reviewing I went through the rest of the wrapper and found a few things worth folding in (or addressing in a quick follow-up). Requesting changes mostly because of #1 and #2 below — they're closely related to the fix in this PR and the same class of bug.

Most likely needs to land here

  1. embed_query.xml very likely has the same FASTA-extension bug this PR fixes for search_database.xml.
    At embed_query.xml line 14–15, --query_sequences '$query_sequences' is passed the raw Galaxy .dat path. If embed_query.py infers FASTA from the suffix (which the sibling fix strongly implies), embedding will fail on real datasets the same way search_database.py did. Suggest applying the same ln -s '$query_sequences' query_sequences.fasta pattern here.

  2. embed_query.xml help text describes a UI toggle that doesn't exist.
    Line 55 says "Force CPU usage: By default, CPU will be used. Disable this option to use GPU instead if available." — but the only mechanism is the GALAXY_LOGAN_PROTEIN_SEARCH_FORCE_CPU env var read at lines 8–12. There's no <param> exposing this. Either add a boolean param, or reword the help so users know this is an admin/job-destination setting.

Worth fixing while you're in here (cheap)

  1. tool-data/faiss_database.loc.sample header is boilerplate from another tool — it claims the file is for "metagenomics files." Replace with text that actually describes the FAISS protein database loc format. Also align the example row with the canonical 4-column format used in test-data/faiss_database.loc (faiss-demo-db-20260203\tFAISS Test Database\t1.2.0\t/path); the current sample row uses a different value style.

  2. test-data/queries.fasta contains a nucleotide sequence labeled test_protein_2. Line 4 starts ATGGCTAGCAAAGGAGAAGAA… (looks like GFP CDS). It's labeled as a protein but isn't one. Either rename the record to indicate it's a nucleotide test case, or replace it with a real protein sequence — otherwise it's a copy-paste footgun for users using this as a reference.

  3. outfmt param in search_database.xml is marked optional="true" but always has a default. Lines 31–38 set value="0" selected="true", so the #if $outfmt: guard at line 17 is effectively dead code — 0 is always passed. Either drop optional="true" or remove the default if "unset" should mean "use upstream default."

  4. -F flag in embed_query.xml line 17 is undocumented. A short trailing comment (<!-- force overwrite of existing output files --> or whatever it actually means) would help future maintainers.

Follow-up PR material (don't block this one)

  1. Test coverage is thin. The search_database test at lines 47–57 only asserts "No results, exiting" because the bundled FAISS DB at test-data/test-db/faiss/db.txt is a 0-byte stub. So we currently verify the script bails gracefully on an empty index — not that the FAISS+MMseqs2 pipeline actually returns matches. A small real index that produces ≥1 hit would substantially raise the safety floor; today, a regression in the actual search logic could land green.

  2. macros.xml citations are thin. Only the upstream search_protein GitHub repo is cited. Please add at minimum:

    • FAISS — Johnson et al., 10.1109/TBDATA.2019.2921572
    • MMseqs2 — Steinegger & Söding, 10.1038/nbt.3988
    • The GLM2 / gLM2 model citation (whichever paper the container actually uses)
  3. Docker-only requirements. macros.xml lines 5–9 declare only quay.io/bgruening/logan-protein:1.2.0 — no <requirement type="package">. Galaxy instances without Docker support can't install this tool. Likely unavoidable given the GLM2 + torch + FAISS + MMseqs2 stack, but worth a one-liner in .shed.yml / help so admins know what to expect.

  4. search_database.xml help mentions intermediate files that aren't exposed as outputs (lines 94–99: query_results.tsv, unique_centroids.fasta, matches.top_hit, …). Either expose the useful ones as <data> outputs, or trim the help so users aren't told about files they can't access.

  5. Suite naming inconsistency — directory is logan_proteins (plural), suite is suite_logan_protein (singular), tool IDs use logan_protein_* (singular). Tool IDs are stable so don't change them; just flagging for new files.

Happy to approve once #1 and #2 are addressed (or with a quick justification that #1 isn't actually a problem). Everything else can be its own PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants