
feat: implement complete document ingestion pipeline - load, chunk, embed, vectorstore #10

Open

GovindhKishore wants to merge 6 commits into ga4gh:main from GovindhKishore:feat/ingest-embed-vectorstore

Conversation

@GovindhKishore

What This PR Does

Implements the complete document ingestion pipeline for GA4GH RegBot from scratch. A researcher or contributor can now clone the repo, run python ingest.py, and have a fully populated ChromaDB vector store built from all 10 official GA4GH policy documents (I selected 10 documents to cover the relevant consent guidance properly) in a single command.

Pipeline Flow

GA4GH PDFs -> pdf_loader.py -> embedder.py -> vector_store.py -> chroma_db/

Step 1: pdf_loader.py reads each PDF page by page, splits the text into 500-character chunks, and attaches metadata to every chunk: source name, category, subcategory, section heading, and page number.
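For reference, a minimal sketch of this step, assuming LangChain's PyPDFLoader (which sets the page metadata, per the schema below) and a character splitter; the helper name, chunk overlap, and omission of section-heading detection are simplifications, not the exact contents of pdf_loader.py:

```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 500  # characters per chunk, as described above

def load_and_chunk(pdf_path: Path, source: str, category: str,
                   subcategory: str = "") -> list:
    """Load one PDF page by page, split it into ~500-character chunks,
    and attach citation metadata to every chunk."""
    pages = PyPDFLoader(str(pdf_path)).load()   # one Document per page
    splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE,
                                              chunk_overlap=50)  # overlap assumed
    chunks = splitter.split_documents(pages)
    for chunk in chunks:
        chunk.metadata.update({
            "source": source,            # human-readable document name
            "filename": pdf_path.name,
            "category": category,        # e.g. consent_requirements
            "subcategory": subcategory,  # e.g. pediatric, familial
            # "page" is already set by PyPDFLoader;
            # section-heading detection is omitted from this sketch
        })
    return chunks
```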

Step 2: embedder.py initialises the all-MiniLM-L6-v2 model locally. No API key required. Runs entirely on CPU.
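A minimal sketch of this step, assuming the LangChain HuggingFaceEmbeddings wrapper around sentence-transformers (the function name is illustrative):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

def get_embedder() -> HuggingFaceEmbeddings:
    """Initialise all-MiniLM-L6-v2 locally: no API key, CPU only."""
    return HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"},
    )
```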

Step 3: vector_store.py takes the chunks and embedder, converts every chunk into a 384-dimensional vector, and persists everything to chroma_db/.
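Sketched with the same caveats (vector_store.py may organise this differently):

```python
from langchain_community.vectorstores import Chroma

def build_vector_store(chunks, embedder, persist_dir: str = "chroma_db"):
    """Embed every chunk (384-dimensional vectors via all-MiniLM-L6-v2)
    and persist the collection to disk."""
    db = Chroma.from_documents(
        documents=chunks,
        embedding=embedder,
        persist_directory=persist_dir,
    )
    db.persist()  # write the collection out to chroma_db/
    return db
```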

Documents Ingested

The following 10 GA4GH policy documents were added to data/frameworks/ and successfully embedded during testing:

  • Framework for Responsible Sharing of Genomic Data
  • GA4GH Consent Policy (POL 002 v2.0)
  • GA4GH Data Privacy and Security Policy (POL 001 v2.0)
  • Machine Readable Consent Guidance (MRCG)
  • Consent Clauses for Genomic Research (2020)
  • Consent Toolkit: Clinical Genomic (D015 v6.0)
  • Consent Toolkit: Rare Disease
  • Familial Consent Clauses (D011 v1.0)
  • Consent Clauses for Large Scale Initiatives (D014 v1.0)
  • Pediatric Consent to Genetic Research (D012a v1.0)

All 10 documents were successfully loaded, chunked, and embedded. Total vectors stored in chroma_db/: 637

Metadata Schema Per Chunk

Every chunk stored in ChromaDB carries:

  • source : human-readable document name for citations
  • filename : raw filename
  • category : document role (consent_requirements, duo_mapping, etc.)
  • subcategory : study context for toolkit docs (pediatric, familial, etc.)
  • section : detected section heading for citation grounding
  • page : page number set by PyPDFLoader

This metadata is what allows RegBot to produce citations like:
"Non-compliant: required by GA4GH Consent Policy (POL 002 v2.0), Section II Transparency" instead of returning anonymous text.

Files Added

  • config.py : single source of truth for all constants
  • src/ingestion/pdf_loader.py : load PDFs, chunk, attach metadata
  • src/embeddings/embedder.py : initialise HuggingFace embedding model
  • src/ingestion/vector_store.py : build and persist ChromaDB
  • ingest.py : CLI orchestrator, chains all three above (sketched after this list)
  • requirements.txt : updated with pinned dependency versions
  • .gitignore : excludes chroma_db/, .venv/, __pycache__/, data/frameworks/
  • README.md : project overview, setup instructions, compliance check catalogue
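The orchestrator roughly follows the shape below; the module paths come from the file list above, the function names reuse the illustrative sketches, and the per-document category mapping (which actually lives in config.py) is collapsed to a constant:

```python
from pathlib import Path

from src.ingestion.pdf_loader import load_and_chunk
from src.embeddings.embedder import get_embedder
from src.ingestion.vector_store import build_vector_store

DATA_DIR = Path("data/frameworks")

def main() -> None:
    pdfs = sorted(DATA_DIR.glob("*.pdf"))
    print(f"Found {len(pdfs)} PDF(s) in {DATA_DIR}/")
    chunks = []
    for pdf in pdfs:
        # the real code maps each file to its source/category via config.py
        chunks.extend(load_and_chunk(pdf, source=pdf.stem,
                                     category="consent_requirements"))
    print(f"Total chunks ready for embedding: {len(chunks)}")
    build_vector_store(chunks, get_embedder())

if __name__ == "__main__":
    main()
```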

Dependency Conflicts Faced and Resolved

During local testing the ingestion pipeline failed with:

ImportError: cannot import name 'cached_download' from huggingface_hub

Root cause: sentence-transformers==2.2.2 internally calls from huggingface_hub import cached_download, which newer huggingface-hub releases have removed. The originally listed requirements pulled in incompatible versions of four packages.

The following versions were pinned to resolve all conflicts:

  • huggingface-hub 0.36.2 -> 0.20.3
    reason: cached_download still exists in 0.20.3

  • transformers 4.57.6 -> 4.36.2
    reason: compatible with huggingface-hub 0.20.3

  • numpy 2.4.2 -> 1.26.4
    reason: langchain requires numpy < 2

  • packaging 26.0 -> 23.2
    reason: langchain-core requires packaging < 24

All versions are now pinned in requirements.txt. Anyone running pip install -r requirements.txt will get a working environment with no dependency conflicts.
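The relevant excerpt of requirements.txt (other pinned packages omitted):

```
sentence-transformers==2.2.2
huggingface-hub==0.20.3
transformers==4.36.2
numpy==1.26.4
packaging==23.2
```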

Verified with:
python -c "from sentence_transformers import SentenceTransformer; print('ok')"
output: ok

How to Run

pip install -r requirements.txt
python ingest.py

Result

Found 10 PDF(s) in data/frameworks/
Total chunks ready for embedding: 637
Loading embedding model: all-MiniLM-L6-v2
ChromaDB built at: chroma_db/
Total vectors stored: 637

Next Step

Hybrid retrieval pipeline: semantic search (ChromaDB) + BM25 keyword search combined via LangChain EnsembleRetriever.
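A hedged sketch of what that could look like with the LangChain retriever APIs (the weights and k values below are placeholders, not decided yet):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

def build_hybrid_retriever(chunks, embedder):
    """Combine semantic search over chroma_db/ with BM25 keyword search."""
    semantic = Chroma(
        persist_directory="chroma_db",
        embedding_function=embedder,
    ).as_retriever(search_kwargs={"k": 5})
    keyword = BM25Retriever.from_documents(chunks)  # requires rank_bm25
    keyword.k = 5
    return EnsembleRetriever(retrievers=[semantic, keyword],
                             weights=[0.5, 0.5])
```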

AI Use Transparency

Claude (Anthropic) was used as a coding assistant to accelerate implementation. All architectural decisions and code review were done by me.

Closes #9

@GovindhKishore (Author)

Hi @dedyli, I've just submitted this PR and I'm happy to make any changes based on your feedback. Please let me know if anything looks off or needs improvement.

@GovindhKishore (Author)

Hi @dedyli, note that the ingestion pipeline is currently implemented as a standalone ingest.py at the root rather than inside the existing main.py scaffold. Happy to move it into main.py or restructure based on your feedback.

