
feat: implement complete document ingestion pipeline - load, chunk, embed, vectorstore #10

Open

GovindhKishore wants to merge 6 commits into ga4gh:main from GovindhKishore:feat/ingest-embed-vectorstore

Conversation

@GovindhKishore

What This PR Does

Implements the complete document ingestion pipeline for GA4GH RegBot from scratch. A researcher or contributor can now clone the repo, run python ingest.py, and have a fully populated ChromaDB vector store built from all 10 official GA4GH policy documents (I selected 10 documents to cover the relevant consent guidance properly) in a single command.

Pipeline Flow

GA4GH PDFs -> pdf_loader.py -> embedder.py -> vector_store.py -> chroma_db/

Step 1: pdf_loader.py reads each PDF page by page, splits the text into 500-character chunks, and attaches metadata to every chunk: source name, category, subcategory, section heading, and page number.
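For reference, a minimal sketch of this step, assuming LangChain's PyPDFLoader (which sets the page metadata, per the schema below) and a character splitter; the helper name, chunk overlap, and omission of section-heading detection are simplifications, not the exact contents of pdf_loader.py:

```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 500  # characters per chunk, as described above

def load_and_chunk(pdf_path: Path, source: str, category: str,
                   subcategory: str = "") -> list:
    """Load one PDF page by page, split it into ~500-character chunks,
    and attach citation metadata to every chunk."""
    pages = PyPDFLoader(str(pdf_path)).load()   # one Document per page
    splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE,
                                              chunk_overlap=50)  # overlap assumed
    chunks = splitter.split_documents(pages)
    for chunk in chunks:
        chunk.metadata.update({
            "source": source,            # human-readable document name
            "filename": pdf_path.name,
            "category": category,        # e.g. consent_requirements
            "subcategory": subcategory,  # e.g. pediatric, familial
            # "page" is already set by PyPDFLoader;
            # section-heading detection is omitted from this sketch
        })
    return chunks
```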

Step 2: embedder.py initialises the all-MiniLM-L6-v2 model locally. No API key required. Runs entirely on CPU.
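A minimal sketch of this step, assuming the LangChain HuggingFaceEmbeddings wrapper around sentence-transformers (the function name is illustrative):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

def get_embedder() -> HuggingFaceEmbeddings:
    """Initialise all-MiniLM-L6-v2 locally: no API key, CPU only."""
    return HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"},
    )
```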

Step 3: vector_store.py takes the chunks and embedder, converts every chunk into a 384-dimensional vector, and persists everything to chroma_db/.
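Sketched with the same caveats (vector_store.py may organise this differently):

```python
from langchain_community.vectorstores import Chroma

def build_vector_store(chunks, embedder, persist_dir: str = "chroma_db"):
    """Embed every chunk (384-dimensional vectors via all-MiniLM-L6-v2)
    and persist the collection to disk."""
    db = Chroma.from_documents(
        documents=chunks,
        embedding=embedder,
        persist_directory=persist_dir,
    )
    db.persist()  # write the collection out to chroma_db/
    return db
```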

Documents Ingested

The following 10 GA4GH policy documents were added to data/frameworks/ and successfully embedded during testing:

  • Framework for Responsible Sharing of Genomic Data
  • GA4GH Consent Policy (POL 002 v2.0)
  • GA4GH Data Privacy and Security Policy (POL 001 v2.0)
  • Machine Readable Consent Guidance (MRCG)
  • Consent Clauses for Genomic Research (2020)
  • Consent Toolkit: Clinical Genomic (D015 v6.0)
  • Consent Toolkit: Rare Disease
  • Familial Consent Clauses (D011 v1.0)
  • Consent Clauses for Large Scale Initiatives (D014 v1.0)
  • Pediatric Consent to Genetic Research (D012a v1.0)

All 10 documents were successfully loaded, chunked, and embedded. Total vectors stored in chroma_db/: 637

Metadata Schema Per Chunk

Every chunk stored in ChromaDB carries:

  • source : human-readable document name for citations
  • filename : raw filename
  • category : document role (consent_requirements, duo_mapping, etc.)
  • subcategory : study context for toolkit docs (pediatric, familial, etc.)
  • section : detected section heading for citation grounding
  • page : page number set by PyPDFLoader

This metadata is what allows RegBot to produce citations like:
"Non-compliant: required by GA4GH Consent Policy (POL 002 v2.0), Section II Transparency" instead of returning anonymous text.

Files Added

  • config.py : single source of truth for all constants
  • src/ingestion/pdf_loader.py : load PDFs, chunk, attach metadata
  • src/embeddings/embedder.py : initialise HuggingFace embedding model
  • src/ingestion/vector_store.py : build and persist ChromaDB
  • ingest.py : CLI orchestrator, chains all three above (sketched after this list)
  • requirements.txt : updated with pinned dependency versions
  • .gitignore : excludes chroma_db/, .venv/, __pycache__/, data/frameworks/
  • README.md : project overview, setup instructions, compliance check catalogue
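The orchestrator roughly follows the shape below; the module paths come from the file list above, the function names reuse the illustrative sketches, and the per-document category mapping (which actually lives in config.py) is collapsed to a constant:

```python
from pathlib import Path

from src.ingestion.pdf_loader import load_and_chunk
from src.embeddings.embedder import get_embedder
from src.ingestion.vector_store import build_vector_store

DATA_DIR = Path("data/frameworks")

def main() -> None:
    pdfs = sorted(DATA_DIR.glob("*.pdf"))
    print(f"Found {len(pdfs)} PDF(s) in {DATA_DIR}/")
    chunks = []
    for pdf in pdfs:
        # the real code maps each file to its source/category via config.py
        chunks.extend(load_and_chunk(pdf, source=pdf.stem,
                                     category="consent_requirements"))
    print(f"Total chunks ready for embedding: {len(chunks)}")
    build_vector_store(chunks, get_embedder())

if __name__ == "__main__":
    main()
```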

Dependency Conflicts Faced and Resolved

During local testing the ingestion pipeline failed with:

ImportError: cannot import name 'cached_download' from huggingface_hub

Root cause: sentence-transformers==2.2.2 internally calls from huggingface_hub import cached_download, which newer huggingface-hub releases have removed. The originally listed requirements pulled in incompatible versions of four packages.

The following versions were pinned to resolve all conflicts:

  • huggingface-hub 0.36.2 -> 0.20.3
    reason: cached_download still exists in 0.20.3

  • transformers 4.57.6 -> 4.36.2
    reason: compatible with huggingface-hub 0.20.3

  • numpy 2.4.2 -> 1.26.4
    reason: langchain requires numpy < 2

  • packaging 26.0 -> 23.2
    reason: langchain-core requires packaging < 24

All versions are now pinned in requirements.txt. Anyone running pip install -r requirements.txt will get a working environment with no dependency conflicts.
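The relevant excerpt of requirements.txt (other pinned packages omitted):

```
sentence-transformers==2.2.2
huggingface-hub==0.20.3
transformers==4.36.2
numpy==1.26.4
packaging==23.2
```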

Verified with:
python -c "from sentence_transformers import SentenceTransformer; print('ok')"
output: ok

How to Run

pip install -r requirements.txt
python ingest.py

Result

Found 10 PDF(s) in data/frameworks/
Total chunks ready for embedding: 637
Loading embedding model: all-MiniLM-L6-v2
ChromaDB built at: chroma_db/
Total vectors stored: 637

Next Step

Hybrid retrieval pipeline: semantic search (ChromaDB) + BM25 keyword search combined via LangChain EnsembleRetriever.
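A hedged sketch of what that could look like with the LangChain retriever APIs (the weights and k values below are placeholders, not decided yet):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

def build_hybrid_retriever(chunks, embedder):
    """Combine semantic search over chroma_db/ with BM25 keyword search."""
    semantic = Chroma(
        persist_directory="chroma_db",
        embedding_function=embedder,
    ).as_retriever(search_kwargs={"k": 5})
    keyword = BM25Retriever.from_documents(chunks)  # requires rank_bm25
    keyword.k = 5
    return EnsembleRetriever(retrievers=[semantic, keyword],
                             weights=[0.5, 0.5])
```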

AI Use Transparency

Claude (Anthropic) was used as a coding assistant to accelerate implementation. All architectural decisions and code review were done by me.

Closes #9

@GovindhKishore (Author)

Hi @dedyli, I've just submitted this PR and I'm happy to make any changes based on your feedback. Please let me know if anything looks off or needs improvement.

@GovindhKishore (Author)

Hi @dedyli, note that the ingestion pipeline is currently implemented as a standalone ingest.py at the root rather than inside the existing main.py scaffold. Happy to move it into main.py or restructure based on your feedback.

