feat: implement hybrid retrieval (semantic search + BM25 + metadata filtering)#14
Conversation
Hi @GovindhKishore, I noticed a few things I'd like to point out (purely from a learning perspective, since I'm trying to understand the different approaches and what would work best for this project). Looking closely at the implementation for a regulatory compliance use case, there are a few architectural vulnerabilities in how the retrieval and fusion are handled here:

Score normalization in EnsembleRetriever: Using a 50/50 weight in LangChain's EnsembleRetriever without normalizing the underlying scores is mathematically risky for this domain. BM25 scores are unbounded (they scale with term frequency), whereas ChromaDB's cosine similarity is bounded between -1 and 1. In practice, this means BM25 will frequently overpower the semantic scores.

Upstream dependency on arbitrary chunking: While this PR handles the retrieval logic, its effectiveness is severely bottlenecked by the 500-character chunking implemented in your previous PR. BM25 relies heavily on exact keyword matching within specific contextual boundaries. Because 500-character splits arbitrarily slice through legal sub-clauses, the BM25 retriever will miss exact matches that span those hard token boundaries.

Given that RegBot requires zero-hallucination citation grounding, you might need to rethink the chunking strategy upstream and implement a more mathematically sound fusion algorithm here to guarantee retrieval quality. @dedyli I'd love to get your thoughts on these architectural choices, especially regarding the project's strict citation constraints (and any suggestions/tips I could keep in mind too).
Hi @ReemHamraz, thanks for the detailed feedback, genuinely appreciated.

On score normalization: LangChain's EnsembleRetriever actually uses Reciprocal Rank Fusion internally, which operates on rank positions rather than raw scores. So BM25's unbounded scores do not directly compete against cosine similarity values the way you described. That said, your point about empirically validating the weights is fair and worth testing.

On chunking boundaries: It is a valid concern; hard splits can break legal sub-clauses across chunks and hurt BM25 exact matching. The 500-character chunk size was intentionally chosen to balance two things: keeping chunks small enough for precise retrieval while keeping them large enough to give the LLM sufficient context when generating compliance verdicts. Too small and the LLM loses context; too large and retrieval precision drops. The overlap reduces boundary breakage but does not fully eliminate it. Clause-aware splitting would be a stronger long-term approach for regulatory text specifically.

I will wait for @dedyli's thoughts before making any architectural changes. Happy to work together on improvements once there is direction from the maintainer.
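For anyone following the fusion discussion: Reciprocal Rank Fusion scores each document by its rank position in every retriever's result list, never by the retrievers' raw scores, which is why unbounded BM25 values cannot numerically dominate bounded cosine similarities. A minimal pure-Python sketch of weighted RRF (the function name, document IDs, and the conventional k=60 constant are illustrative, not taken from the PR code):

```python
def rrf_fuse(rankings, weights, k=60):
    """Weighted Reciprocal Rank Fusion over ranked result lists.

    Each document earns weight / (k + rank) for every list it appears in,
    so only rank positions matter -- raw retriever scores never mix.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["chunk_misc", "chunk_opt_out"]
semantic_ranking = ["chunk_opt_out", "chunk_consent"]
fused = rrf_fuse([bm25_ranking, semantic_ranking], weights=[0.5, 0.5])
# "chunk_opt_out" appears in both lists, so it fuses to the top.
```

Note how a document ranked second by both retrievers can still beat a document ranked first by only one of them, which is the behaviour the maintainer describes.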
Thanks for clarifying that!
@ReemHamraz Glad it was helpful! Looking forward to hearing @dedyli's thoughts |
Closes #13
What This PR Does
Implements the retrieval layer for GA4GH RegBot. Given a compliance check query string, the retriever fetches the most relevant GA4GH policy chunks from ChromaDB using a hybrid search approach that combines semantic search and BM25 keyword search.
Why Hybrid Search
Pure semantic search is insufficient for legal compliance use cases. Exact legal terms like "opt-out", "withdrawal", and "de-identification" need the precise keyword matching that vector search alone misses; BM25 handles this. Combined, the two cover both meaning and exact terms.
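To make that complementarity concrete, here is a toy Okapi BM25 scorer in pure Python (standard k1=1.5, b=0.75 defaults; the function and document contents are illustrative, not from the PR's retriever). An exact legal term such as "opt-out" scores highly only in documents that literally contain it, which is exactly the signal a purely semantic retriever can lose:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25: score tokenised documents against query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    scores = [0.0] * N
    for term in query_terms:
        df = sum(term in d for d in docs)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        for i, doc in enumerate(docs):
            tf = doc.count(term)                   # term frequency
            denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
            scores[i] += idf * tf * (k1 + 1) / denom
    return scores

docs = [
    "participants may opt-out of data sharing at any time".split(),
    "data sharing policies for genomic research studies".split(),
]
scores = bm25_scores(["opt-out"], docs)
# Only the first document contains the literal token "opt-out",
# so only it receives a non-zero score.
```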
How It Works
CHECK_QUERY string (from config.py) -> ChromaDB semantic retriever (cosine similarity) + BM25 retriever (exact keyword matching) -> EnsembleRetriever merges + deduplicates (50/50 weight) -> optional category/subcategory filter applied -> top k chunks returned with full metadata.
Category and Subcategory Filtering
Retrieval is scoped to the relevant documents for each check type. This prevents irrelevant toolkit documents from polluting results for checks that only apply to specific study types.
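The scoping described above can be sketched as a simple post-retrieval metadata filter. A minimal version in pure Python (the chunk structure and the `category`/`subcategory` field names are hypothetical; the PR's actual metadata schema and ChromaDB `where`-filter wiring may differ):

```python
def filter_chunks(chunks, category=None, subcategory=None):
    """Keep only chunks whose metadata matches the requested scope.

    A None filter means "no constraint" on that field.
    """
    def matches(meta):
        if category is not None and meta.get("category") != category:
            return False
        if subcategory is not None and meta.get("subcategory") != subcategory:
            return False
        return True

    return [c for c in chunks if matches(c["metadata"])]

chunks = [
    {"text": "...", "metadata": {"category": "consent", "subcategory": "pediatric"}},
    {"text": "...", "metadata": {"category": "toolkit", "subcategory": "general"}},
]
scoped = filter_chunks(chunks, category="consent")
# Toolkit chunks are excluded for a consent-specific check.
```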
Files Added
Tests - All 7 Passing
How to Run Tests
pytest tests/test_retriever.py -v
Next Step
Compliance checker - feeds retrieved chunks + uploaded consent form text into LLM prompt and generates verdict with citations.
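One possible shape for that prompt assembly, sketched purely for illustration (the checker does not exist yet in this PR, and every name here is hypothetical). Numbering the quoted excerpts and instructing the model to cite only those markers is one common way to keep citations grounded in retrieved text:

```python
def build_verdict_prompt(query, chunks, consent_text):
    """Assemble a grounded prompt: each policy chunk is quoted verbatim
    with a [n] marker, so the LLM can only cite passages it was given."""
    cited = "\n\n".join(
        f"[{i}] ({c['metadata']['source']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f"Compliance check: {query}\n\n"
        f"Relevant GA4GH policy excerpts:\n{cited}\n\n"
        f"Consent form under review:\n{consent_text}\n\n"
        "Give a verdict (compliant / non-compliant / unclear) and cite "
        "excerpts by their [n] markers only."
    )

chunks = [{"text": "Participants may opt-out at any time.",
           "metadata": {"source": "GA4GH-consent-policy"}}]
prompt = build_verdict_prompt("Is an opt-out clause present?", chunks,
                              "I consent to data sharing ...")
```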
AI Use Transparency
Claude was used as a reference tool to clarify concepts and validate thinking. All architectural decisions and code review were done by me.