feat: implement hybrid retrieval (semantic search + BM25 + metadata filtering)#14
Conversation
Hi @GovindhKishore, I noticed a few things I'd like to point out (purely from a learning perspective, since I'm trying to understand the different approaches and what would work best for this project). Looking closely at the implementation for a regulatory compliance use case, there are a few architectural vulnerabilities in how the retrieval and fusion are handled here:

Score normalization in EnsembleRetriever: Using a 50/50 weight in LangChain's EnsembleRetriever without normalizing the underlying scores is mathematically risky for this domain. BM25 scores are unbounded (they scale with term frequency), whereas ChromaDB's cosine similarity is bounded between -1 and 1. In practice, this means BM25 will frequently overpower the semantic scores.

Upstream dependency on arbitrary chunking: While this PR handles the retrieval logic, its effectiveness is severely bottlenecked by the 500-character chunking implemented in your previous PR. BM25 relies heavily on exact keyword matching within specific contextual boundaries. Because 500-character splits arbitrarily slice through legal sub-clauses, the BM25 retriever will miss exact matches that span those hard token boundaries.

Given that RegBot requires zero-hallucination citation grounding, you might need to rethink the chunking strategy upstream and implement a more mathematically sound fusion algorithm here to guarantee retrieval quality. @dedyli I'd love to get your thoughts on these architectural choices, especially regarding the project's strict citation constraints (and any suggestions/tips I could keep in mind too).
Hi @ReemHamraz, thanks for the detailed feedback, genuinely appreciated.

On score normalization: LangChain's EnsembleRetriever actually uses Reciprocal Rank Fusion internally, which operates on rank positions rather than raw scores. So BM25's unbounded scores do not directly compete against cosine similarity values the way you described. That said, your point about empirically validating the weights is fair and worth testing.

On chunking boundaries: It is a valid concern; hard splits can break legal sub-clauses across chunks and hurt BM25 exact matching. The 500-character chunk size was intentionally chosen to balance two things: keeping chunks small enough for precise retrieval while keeping them large enough to give the LLM sufficient context when generating compliance verdicts. Too small and the LLM loses context; too large and retrieval precision drops. The overlap reduces boundary breakage but does not fully eliminate it. Clause-aware splitting would be a stronger long-term approach for regulatory text specifically.

I will wait for @dedyli's thoughts before making any architectural changes. Happy to work together on improvements once there is direction from the maintainer.
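For anyone following the fusion discussion: Reciprocal Rank Fusion scores each document by its rank position in every retriever's result list, never by the retrievers' raw scores, which is why unbounded BM25 values cannot numerically dominate bounded cosine similarities. A minimal pure-Python sketch of weighted RRF (the function name, document IDs, and the conventional k=60 constant are illustrative, not taken from the PR code):

```python
def rrf_fuse(rankings, weights, k=60):
    """Weighted Reciprocal Rank Fusion over ranked result lists.

    Each document earns weight / (k + rank) for every list it appears in,
    so only rank positions matter -- raw retriever scores never mix.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["chunk_misc", "chunk_opt_out"]
semantic_ranking = ["chunk_opt_out", "chunk_consent"]
fused = rrf_fuse([bm25_ranking, semantic_ranking], weights=[0.5, 0.5])
# "chunk_opt_out" appears in both lists, so it fuses to the top.
```

Note how a document ranked second by both retrievers can still beat a document ranked first by only one of them, which is the behaviour the maintainer describes.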
Thanks for clarifying that!
@ReemHamraz Glad it was helpful! Looking forward to hearing @dedyli's thoughts |
Closes #13
What This PR Does
Implements the retrieval layer for GA4GH RegBot. Given a compliance check query string, the retriever fetches the most relevant GA4GH policy chunks from ChromaDB using a hybrid search approach that combines semantic search and BM25 keyword search.
Why Hybrid Search
Pure semantic search is insufficient for legal compliance use cases. Exact legal terms like "opt-out", "withdrawal", and "de-identification" need the precise keyword matching that vector search alone misses; BM25 handles this. Combined, the two cover both meaning and exact terms.
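To make that complementarity concrete, here is a toy Okapi BM25 scorer in pure Python (standard k1=1.5, b=0.75 defaults; the function and document contents are illustrative, not from the PR's retriever). An exact legal term such as "opt-out" scores highly only in documents that literally contain it, which is exactly the signal a purely semantic retriever can lose:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25: score tokenised documents against query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    scores = [0.0] * N
    for term in query_terms:
        df = sum(term in d for d in docs)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        for i, doc in enumerate(docs):
            tf = doc.count(term)                   # term frequency
            denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
            scores[i] += idf * tf * (k1 + 1) / denom
    return scores

docs = [
    "participants may opt-out of data sharing at any time".split(),
    "data sharing policies for genomic research studies".split(),
]
scores = bm25_scores(["opt-out"], docs)
# Only the first document contains the literal token "opt-out",
# so only it receives a non-zero score.
```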
How It Works
CHECK_QUERY string (from config.py) -> ChromaDB semantic retriever (cosine similarity) + BM25 retriever (exact keyword matching) -> EnsembleRetriever merges + deduplicates (50/50 weight) -> optional category/subcategory filter applied -> top k chunks returned with full metadata.
Category and Subcategory Filtering
Retrieval is scoped to the relevant documents for each check type. This prevents irrelevant toolkit documents from polluting results for checks that only apply to specific study types.
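The scoping described above can be sketched as a simple post-retrieval metadata filter. A minimal version in pure Python (the chunk structure and the `category`/`subcategory` field names are hypothetical; the PR's actual metadata schema and ChromaDB `where`-filter wiring may differ):

```python
def filter_chunks(chunks, category=None, subcategory=None):
    """Keep only chunks whose metadata matches the requested scope.

    A None filter means "no constraint" on that field.
    """
    def matches(meta):
        if category is not None and meta.get("category") != category:
            return False
        if subcategory is not None and meta.get("subcategory") != subcategory:
            return False
        return True

    return [c for c in chunks if matches(c["metadata"])]

chunks = [
    {"text": "...", "metadata": {"category": "consent", "subcategory": "pediatric"}},
    {"text": "...", "metadata": {"category": "toolkit", "subcategory": "general"}},
]
scoped = filter_chunks(chunks, category="consent")
# Toolkit chunks are excluded for a consent-specific check.
```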
Files Added
Tests - All 7 Passing
How to Run Tests
pytest tests/test_retriever.py -v
Next Step
Compliance checker - feeds retrieved chunks + uploaded consent form text into LLM prompt and generates verdict with citations.
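One possible shape for that prompt assembly, sketched purely for illustration (the checker does not exist yet in this PR, and every name here is hypothetical). Numbering the quoted excerpts and instructing the model to cite only those markers is one common way to keep citations grounded in retrieved text:

```python
def build_verdict_prompt(query, chunks, consent_text):
    """Assemble a grounded prompt: each policy chunk is quoted verbatim
    with a [n] marker, so the LLM can only cite passages it was given."""
    cited = "\n\n".join(
        f"[{i}] ({c['metadata']['source']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f"Compliance check: {query}\n\n"
        f"Relevant GA4GH policy excerpts:\n{cited}\n\n"
        f"Consent form under review:\n{consent_text}\n\n"
        "Give a verdict (compliant / non-compliant / unclear) and cite "
        "excerpts by their [n] markers only."
    )

chunks = [{"text": "Participants may opt-out at any time.",
           "metadata": {"source": "GA4GH-consent-policy"}}]
prompt = build_verdict_prompt("Is an opt-out clause present?", chunks,
                              "I consent to data sharing ...")
```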
AI Use Transparency
Claude was used as a reference tool to clarify concepts and validate thinking. All architectural decisions and code review were done by me.