feat: implement complete document ingestion pipeline - load, chunk, embed, vectorstore #10
Open
GovindhKishore wants to merge 6 commits into ga4gh:main from
Conversation
Author
Hi @dedyli, just submitted this PR and happy to make any changes based on your feedback. Please let me know if anything looks off or needs improvement.
Author
Hi @dedyli, note that the ingestion pipeline is currently implemented as a standalone ingest.py at the root rather than inside the existing main.py scaffold. Happy to move it into main.py or restructure based on your feedback.
What This PR Does
Implements the complete document ingestion pipeline for GA4GH RegBot from scratch. A researcher or contributor can now clone the repo, run
python ingest.py, and have a fully populated ChromaDB vector store built from all 10 official GA4GH policy documents (10 documents were used to cover the required consent information) in a single command.
Pipeline Flow
GA4GH PDFs -> pdf_loader.py -> embedder.py -> vector_store.py -> chroma_db/
Step 1: pdf_loader.py reads each PDF page by page, splits it into 500-character chunks, and attaches metadata to every chunk, including source name, category, subcategory, section heading, and page number.
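To make the chunking step concrete, here is a minimal sketch of the splitting logic described above. The 500-character chunk size and the metadata fields come from this PR; the function name `chunk_page` and the dict-based chunk shape are illustrative, and PDF text extraction itself (handled in pdf_loader.py) is assumed to have already produced the page text.

```python
def chunk_page(text, metadata, chunk_size=500):
    """Split one page of extracted PDF text into fixed-size chunks,
    attaching the page-level metadata to every chunk (sketch)."""
    chunks = []
    for start in range(0, len(text), chunk_size):
        chunks.append({
            "text": text[start:start + chunk_size],
            "metadata": dict(metadata),  # copy so chunks don't share mutable state
        })
    return chunks

# Hypothetical example: one 1200-character page of a policy document
page_meta = {
    "source": "GA4GH Consent Policy",
    "filename": "consent_policy.pdf",
    "category": "consent_requirements",
    "subcategory": "",
    "section": "II Transparency",
    "page": 4,
}
chunks = chunk_page("x" * 1200, page_meta)
print(len(chunks))  # 3 chunks: 500 + 500 + 200 characters
```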
Step 2: embedder.py initialises the all-MiniLM-L6-v2 model locally. No API key required. Runs entirely on CPU.
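A rough sketch of what embedder.py's model wrapper might look like. The model name `all-MiniLM-L6-v2` and the 384-dimensional output are stated in this PR; the `Embedder` class shape and lazy loading are illustrative choices, not the actual implementation.

```python
class Embedder:
    """Illustrative wrapper around a local SentenceTransformer model."""
    MODEL_NAME = "all-MiniLM-L6-v2"
    DIM = 384  # output dimensionality of this model

    def __init__(self):
        self._model = None  # loaded lazily so importing this module stays cheap

    def _load(self):
        if self._model is None:
            # Assumes sentence-transformers is installed (see the version
            # pins in this PR); no API key needed, runs on CPU by default.
            from sentence_transformers import SentenceTransformer
            self._model = SentenceTransformer(self.MODEL_NAME)
        return self._model

    def embed(self, texts):
        """Return one 384-dimensional vector per input string."""
        return self._load().encode(texts, show_progress_bar=False).tolist()
```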
Step 3: vector_store.py takes the chunks and embedder, converts every chunk into a 384-dimensional vector, and persists everything to chroma_db/.
Documents Ingested
The following 10 GA4GH policy documents were added to data/frameworks/ and successfully embedded during testing:
All 10 documents were successfully loaded, chunked, and embedded. Total vectors stored in chroma_db/: 637
Metadata Schema Per Chunk
Every chunk stored in ChromaDB carries:
source : human readable document name for citations
filename : raw filename
category : document role (consent_requirements, duo_mapping, etc.)
subcategory : study context for toolkit docs (pediatric, familial, etc.)
section : detected section heading for citation grounding
page : page number set by PyPDFLoader
This metadata is what allows RegBot to produce citations like:
"Non-compliant: required by GA4GH Consent Policy (POL 002 v2.0), Section II Transparency" instead of returning anonymous text.
Files Added
Dependency Conflicts Faced and Resolved
During local testing the ingestion pipeline failed with:
Root cause: sentence-transformers==2.2.2 internally calls
from huggingface_hub import cached_download, which was removed in huggingface-hub >= 0.20. The originally listed requirements pulled in incompatible versions of four packages. The following versions were pinned to resolve all conflicts:
huggingface-hub 0.36.2 -> 0.20.3
reason: cached_download still exists in 0.20.3
transformers 4.57.6 -> 4.36.2
reason: compatible with huggingface-hub 0.20.3
numpy 2.4.2 -> 1.26.4
reason: langchain requires numpy < 2
packaging 26.0 -> 23.2
reason: langchain-core requires packaging < 24
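Taken together, the pins above correspond to a requirements.txt fragment along these lines (a sketch of the relevant entries only; other project dependencies are omitted):

```
sentence-transformers==2.2.2
huggingface-hub==0.20.3
transformers==4.36.2
numpy==1.26.4
packaging==23.2
```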
All versions are now pinned in requirements.txt. Anyone running
pip install -r requirements.txt will get a working environment with no dependency conflicts. Verified with:
python -c "from sentence_transformers import SentenceTransformer; print('ok')"
output: ok
How to Run
Result
Next Step
Hybrid retrieval pipeline: semantic search (ChromaDB) + BM25 keyword search combined via LangChain EnsembleRetriever.
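As a rough illustration of the planned combination step: LangChain's EnsembleRetriever fuses the ranked lists from its sub-retrievers with weighted Reciprocal Rank Fusion, which in pure Python looks roughly like this (document IDs and weights here are hypothetical):

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked result lists via weighted Reciprocal Rank Fusion,
    the scheme EnsembleRetriever is based on (illustrative sketch)."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs):
            # Earlier ranks contribute more; c dampens the influence of rank 0
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["chunk_A", "chunk_B", "chunk_C"]   # hypothetical ChromaDB hits
keyword  = ["chunk_C", "chunk_A", "chunk_D"]   # hypothetical BM25 hits
fused = weighted_rrf([semantic, keyword], [0.5, 0.5])
```

Documents that appear near the top of both lists (like chunk_A) accumulate score from both retrievers and rise in the fused ranking.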
AI Use Transparency
Claude (Anthropic) was used as a coding assistant to accelerate implementation. All architectural decisions and code review were done by me.
Closes #9