Rag tool #1818

Draft
LucaBEMan wants to merge 13 commits into bgruening:master from LucaBEMan:rag-tool
@LucaBEMan LucaBEMan commented Mar 19, 2026

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

Description

This PR adds a new Galaxy tool: RAG Retriever.

The tool performs document retrieval for Retrieval-Augmented Generation (RAG) workflows using LlamaIndex and HuggingFace sentence-transformers embeddings. It extracts the most relevant text chunks from input documents based on semantic similarity to a user query.

The output is a context file that can be used as input for downstream LLM tools.
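The retrieval step described above can be illustrated with a minimal, standard-library-only sketch. It is not the tool's implementation: a toy bag-of-words vector stands in for the sentence-transformers embedding, and the names `embed`, `cosine`, and `retrieve` are hypothetical. Only the ranking idea (score each chunk against the query, keep the top k) carries over.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunks, query, top_k=2):
    """Return the top_k chunks ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:top_k]


chunks = [
    "Galaxy is a platform for reproducible data analysis.",
    "Sentence embeddings map text to dense vectors.",
    "Bananas are rich in potassium.",
]
context = retrieve(chunks, "How does Galaxy support data analysis?", top_k=1)

# The retrieved chunks would then be joined into the context text
# that a downstream LLM tool consumes.
context_text = "\n\n".join(context)
```

A real embedding model would replace `embed` with dense vectors, so that semantically related text scores highly even without shared words.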


Example Workflow

This tool is intended to be used together with the LLM Hub tool.

Typical workflow:

  1. Upload one or more documents (PDF, JSON, TXT, CSV, Markdown, etc.)
  2. Upload an embedding model archive (.tar.gz / .tgz) containing a HuggingFace sentence-transformers model
  3. Run RAG Retriever to extract relevant context chunks
  4. Pass the generated rag_context.txt file to the LLM Hub tool
  5. Ask a question using the retrieved context

Pipeline overview:

Documents + Embedding Model Archive + Question → RAG Retriever → Context → LLM Hub → Answer

Example Galaxy workflow combining RAG Retriever and LLM Hub:

[Image: Galaxy workflow]

Notes

  • The tool does not generate answers itself, but prepares high-quality context for LLMs.
  • Supports multiple input formats including PDF, JSON, TXT, CSV, and Markdown.
  • Accepts a HuggingFace sentence-transformers embedding model as a .tar.gz / .tgz archive input.
  • The archive-based model input preserves the original model directory structure required for loading custom embedding models.
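As a rough sketch of how an archive-based model input might be unpacked before loading, the snippet below extracts a .tar.gz / .tgz file and locates the model directory inside it. This is an illustration under assumptions, not the tool's code: the function name `extract_model_archive` is hypothetical, and using `config.json` as the marker file that identifies the model directory is an assumption about the archive layout.

```python
import os
import tarfile
import tempfile


def extract_model_archive(archive_path, dest_dir, marker="config.json"):
    """Extract a gzip-compressed tar archive and return the directory containing `marker`."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Reject entries that would land outside dest_dir (e.g. "../x").
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
            # Assumed policy: allow only regular files and directories.
            if not (member.isfile() or member.isdir()):
                raise ValueError(f"unsupported entry in archive: {member.name}")
        tar.extractall(dest_root)
    # The original directory structure is preserved, so search for the marker file.
    for root, _dirs, files in os.walk(dest_root):
        if marker in files:
            return root
    raise FileNotFoundError(f"no directory containing {marker} found in archive")


# Demonstration with a tiny archive built on the fly.
with tempfile.TemporaryDirectory() as tmp:
    model_src = os.path.join(tmp, "my_model")
    os.makedirs(model_src)
    with open(os.path.join(model_src, "config.json"), "w") as fh:
        fh.write("{}")
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(model_src, arcname="my_model")
    model_dir = extract_model_archive(archive, os.path.join(tmp, "extracted"))
    found_name = os.path.basename(model_dir)
```

The returned directory path is what a model loader would be pointed at; validating member paths before extraction guards against archives that try to write outside the destination.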

@anuprulez
Contributor

Running planemo lint on the tool's XML reports the following errors:

.. ERROR (XSD): Invalid XML: Element 'param', attribute 'separator': The attribute 'separator' is not allowed.
.. CHECK (TestsNoValid): 1 test(s) found.
.. INFO (OutputsNumber): 1 outputs found.
.. INFO (InputsNum): Found 4 input parameters.
.. WARNING (HelpInvalidRST): Invalid reStructuredText found in help - [<string>:59: (WARNING/2) Title underline too short.

How it works
.....
].
.. CHECK (HelpPresent): Tool contains help section.
.. CHECK (ToolIDValid): Tool defines an id [rag_retriever].
.. CHECK (ToolNameValid): Tool defines a name [RAG Retriever].
.. CHECK (ToolProfileValid): Tool specifies profile version [24.2].
.. CHECK (ToolVersionValid): Tool defines a version [1.0.0].
.. ERROR (ValidDatatypes): Unknown datatype [jsonl] used in param element
.. ERROR (ValidDatatypes): Unknown datatype [md] used in param element
.. INFO (CommandInfo): Tool contains a command.
.. WARNING (CitationsMissing): No citations found, consider adding citations to your tool.
Failed linting
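The HelpInvalidRST warning above arises because a reStructuredText title underline must be at least as long as the title text ("How it works" has 12 characters, the underline has 5). A small heuristic check for this, sketched in Python with a hypothetical function name `short_underlines`, could look like the following. It only approximates the docutils rules and is not part of planemo.

```python
def short_underlines(rst_text):
    """Return (title_line_number, title) pairs whose underline is shorter than the title.

    Heuristic: an underline is a line of two or more repetitions of a single
    punctuation character directly below a non-empty line.
    """
    lines = rst_text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title = lines[i - 1].rstrip()
        underline = lines[i].rstrip()
        is_adornment = (
            len(underline) >= 2
            and len(set(underline)) == 1
            and not underline[0].isalnum()
            and not underline[0].isspace()
        )
        if title.strip() and is_adornment and len(underline) < len(title):
            # i is the 0-based index of the underline, i.e. the 1-based title line number.
            problems.append((i, title.strip()))
    return problems


help_text = "Intro\n=====\n\nHow it works\n.....\n"
issues = short_underlines(help_text)

# One way to repair a flagged title: repeat the adornment character to the title's length.
fixed_underline = "." * len("How it works")
```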


from sentence_transformers import SentenceTransformer

models = [
Contributor

Can we take these models as inputs to the tool? Otherwise we are stuck with these models, and users cannot use their preferred embedding models. It is now easy to accept such models: all of Hugging Face is available inside Galaxy, so a model can be imported directly from the file uploader into any history and then passed to the tool. A better embedding model may also become available in the future.

Author

I’m not sure if I fully understand your suggestion.
Do you mean that the tool should accept any Hugging Face embedding model as a dataset input (e.g., uploaded or provided within a Galaxy history)?
If so, is there already an existing Galaxy tool that can download a Hugging Face model and make it available as a dataset, or would this require adding a separate tool for that purpose?

Owner

What was wrong with the Hugging Face loc file?
