
feat: implement hybrid retrieval (semantic search + BM25 + metadata filtering) #14

Open
GovindhKishore wants to merge 8 commits into ga4gh:main from GovindhKishore:feat/hybrid-retrieval

Conversation

@GovindhKishore

Closes #13

What This PR Does

Implements the retrieval layer for GA4GH RegBot. Given a compliance check query string, the retriever fetches the most relevant GA4GH policy chunks from ChromaDB using a hybrid search approach that combines semantic search and BM25 keyword search.

Why Hybrid Search

Pure semantic search is insufficient for legal compliance use cases. Exact legal terms such as "opt-out", "withdrawal", and "de-identification" need precise keyword matching, which vector search alone can miss. BM25 handles this. Combined, the two cover both meaning and exact terms.
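
To illustrate the keyword side, here is a minimal, self-contained sketch of Okapi BM25 scoring in plain Python. It is not the code in retriever.py: the tokenizer, the sample sentences, and the k1/b values are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep hyphenated terms such as "opt-out" as single tokens.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (docs must be non-empty)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical sample sentences, not taken from the GA4GH corpus.
docs = [
    "Participants may opt-out of data sharing at any time.",
    "Participants can decline further involvement in the study.",
    "Data is stored after de-identification.",
]
scores = bm25_scores("opt-out", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first sentence scores above zero for "opt-out". The second sentence is close in meaning but does not contain the literal term, which is the gap the semantic retriever is meant to cover.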

How It Works

The pipeline runs in five steps:

  • Input: a CHECK_QUERY string from config.py
  • Retrieve: the ChromaDB semantic retriever (cosine similarity) and the BM25 retriever (exact keyword matching) both run on the query
  • Fuse: EnsembleRetriever merges and deduplicates the two result lists (50/50 weight)
  • Filter: the optional category/subcategory filter is applied
  • Return: the top k chunks are returned with full metadata

Category and Subcategory Filtering

Retrieval scopes to relevant documents per check type:

  • Universal checks -> no filter, search all documents
  • Pediatric checks -> subcategory="pediatric" chunks only
  • Familial checks -> subcategory="familial" chunks only

This prevents irrelevant toolkit documents from polluting results for checks that only apply to specific study types.
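
The scoping rules above can be sketched as a plain metadata filter. This is a minimal illustration, not the implementation in retriever.py: the function name `filter_chunks`, the dict layout of a chunk, and the category values "framework" and "toolkit" are hypothetical; only the subcategory values "pediatric" and "familial" come from this PR.

```python
def filter_chunks(chunks, category=None, subcategory=None):
    """Keep chunks whose metadata matches every filter that is set.

    Passing no filter (the universal case) returns all chunks unchanged.
    """
    wanted = {"category": category, "subcategory": subcategory}
    active = {key: value for key, value in wanted.items() if value is not None}
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in active.items())
    ]


# Hypothetical chunks; the real ones come from ChromaDB.
chunks = [
    {"text": "General consent clause", "metadata": {"category": "framework", "subcategory": None}},
    {"text": "Assent for minors", "metadata": {"category": "toolkit", "subcategory": "pediatric"}},
    {"text": "Sharing results with relatives", "metadata": {"category": "toolkit", "subcategory": "familial"}},
]

universal = filter_chunks(chunks)                            # no filter: all chunks
pediatric = filter_chunks(chunks, subcategory="pediatric")   # pediatric chunks only
familial = filter_chunks(chunks, category="toolkit", subcategory="familial")
```

Filtering after fusion, as in this sketch, can return fewer than k chunks when few candidates match; pushing the filter into the retrievers themselves is the alternative design.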

Files Added

  • src/retrieval/retriever.py : hybrid retrieval engine
  • src/retrieval/__init__.py : module init
  • tests/test_retriever.py : 7 tests covering retrieval correctness
  • tests/__init__.py : module init

Tests - All 7 Passing

  • retriever loads without errors
  • returns results for real CHECK_QUERY strings from config.py
  • every chunk carries source, category, subcategory, section metadata
  • k parameter is respected
  • category filter returns only matching chunks
  • subcategory filter returns only matching chunks
  • BM25 catches exact legal terms: "withdraw", "opt-out"

How to Run Tests

pytest tests/test_retriever.py -v

Next Step

Compliance checker: feeds the retrieved chunks and the uploaded consent form text into an LLM prompt and generates a verdict with citations.

AI Use Transparency

Claude was used as a reference tool to clarify concepts and validate thinking. All architectural decisions and code review were done by me.

@ReemHamraz

Hi @GovindhKishore, I noticed a few things I'd like to point out (this is purely from a learning perspective because I am trying to understand all the different approaches and what would work best for this project). So, looking closely at the implementation for a regulatory compliance use case, there are a few architectural vulnerabilities in how the retrieval and fusion are handled here:

Score Normalization in EnsembleRetriever: Using a 50/50 weight in LangChain's EnsembleRetriever without normalizing the underlying scores is mathematically risky for this domain. BM25 scores are unbounded (scaling with term frequency), whereas ChromaDB’s cosine similarity is bounded between -1 and 1. In practice, this means BM25 will frequently overpower the semantic scores.

Upstream Dependency on Arbitrary Chunking: While this PR handles the retrieval logic, its effectiveness is severely bottlenecked by the 500-character chunking implemented in your previous PR. BM25 relies heavily on exact keyword matching within specific contextual boundaries. Because 500-character splits arbitrarily slice through legal sub-clauses, the BM25 retriever will miss exact matches that span those hard chunk boundaries.

Given that RegBot requires zero-hallucination citation grounding, you might need to rethink the chunking strategy upstream and implement a more mathematically sound fusion algorithm here to guarantee retrieval quality.

@dedyli I'd love to get your thoughts on these architectural choices, especially regarding the strict citation constraints of the project (and suggestions/tips that I could keep in mind too)

@GovindhKishore
Author

Hi @ReemHamraz, thanks for the detailed feedback, genuinely appreciate it.

On score normalization: LangChain's EnsembleRetriever actually uses Reciprocal Rank Fusion internally, which works on rank positions rather than raw scores. So BM25's unbounded scores do not directly compete against cosine similarity values the way you described. That said, your point about empirically validating the weights is fair and something worth testing.
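
For reference, the rank-based fusion described above can be sketched as weighted Reciprocal Rank Fusion in plain Python. This illustrates the general technique and is not LangChain's source: the function name `weighted_rrf`, the chunk ids, and the constant c=60 (a commonly used value) are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion.

    Each document receives weight / (c + rank) from every list it appears in,
    where rank is 1-based. Raw retriever scores are never used.
    """
    if len(rankings) != len(weights):
        raise ValueError("one weight per ranking is required")
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (c + rank)
    # Highest fused score first; each id appears once, so duplicates are merged.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical result lists, best match first.
semantic_ranking = ["c2", "c1", "c3"]
bm25_ranking = ["c1", "c4", "c2"]

fused_ranking = weighted_rrf([semantic_ranking, bm25_ranking], weights=[0.5, 0.5])
# fused_ranking == ["c1", "c2", "c4", "c3"]
```

Chunk c1 comes out on top because both retrievers rank it highly. Only list positions enter the calculation, so rescaling the raw BM25 scores would leave the fused order unchanged.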

On chunking boundaries: It is a valid concern. Hard splits can break legal sub-clauses across chunks and hurt BM25 exact matching. The 500-character chunk size was intentionally chosen to balance two things: keeping chunks small enough for precise retrieval while keeping them large enough to provide sufficient context to the LLM when generating compliance verdicts. Too small and the LLM loses context, too large and retrieval precision drops. The overlap helps reduce boundary breakage but does not fully eliminate it. Clause-aware splitting would be a stronger long-term approach for regulatory text specifically.
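
To make the boundary effect concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is not the chunker from the earlier PR: the function name `chunk_text`, the toy sizes (40 characters with a 10-character overlap instead of 500), and the sample text are assumptions for illustration.

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into fixed-size character windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Toy text in which "opt-out" straddles the 40-character boundary.
text = "a" * 36 + " opt-out " + "b" * 36

no_overlap = chunk_text(text, size=40, overlap=0)
with_overlap = chunk_text(text, size=40, overlap=10)

split_without_overlap = not any("opt-out" in chunk for chunk in no_overlap)
kept_with_overlap = any("opt-out" in chunk for chunk in with_overlap)
```

Without overlap the term is cut into "opt" and "-out", so an exact-match lookup cannot find it in any chunk. With overlap the second window contains it whole. A phrase longer than the overlap can still be split, which is why overlap reduces the problem without eliminating it.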

I will wait for @dedyli's thoughts before making any architectural changes. Happy to work together on improvements once there is direction from the maintainer.

@ReemHamraz

Thanks for clarifying that!!
I totally missed that LangChain’s EnsembleRetriever uses Reciprocal Rank Fusion internally, so the raw score scale issue I mentioned would indeed not apply in that case. Appreciate the explanation!
Your point about validating the weights empirically makes sense as well.
I’ll wait to see what @dedyli thinks about the direction here. This has been really helpful for understanding the tradeoffs involved.

Thanks!

@GovindhKishore
Author

@ReemHamraz Glad it was helpful! Looking forward to hearing @dedyli's thoughts


Development

Successfully merging this pull request may close these issues.

feat: hybrid retrieval engine - semantic search + BM25 + category filtering
