## Summary
This issue covers building the foundational document ingestion pipeline: load GA4GH policy documents (PDF/text), split them into chunks, generate embeddings, and store them in a vector database for later retrieval.
This is Phase 1 of the GA4GH-RegBot roadmap and a prerequisite for the RAG-based compliance analysis pipeline (Phase 2).
## Requirements

### Technical Decisions
- Embedding model: `all-MiniLM-L6-v2` via sentence-transformers (local, free, no API key)
- Vector store: ChromaDB with persistent storage
- Text splitting: LangChain `RecursiveCharacterTextSplitter` (500-character chunks, 50-character overlap)
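The chunking decision above can be illustrated with a minimal sketch. This is a naive fixed-window splitter showing the 500/50 semantics, not LangChain's `RecursiveCharacterTextSplitter` (which additionally tries to break on paragraph and sentence boundaries); the function name is hypothetical.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Naive fixed-window splitter approximating the 500/50 chunking decision.

    Consecutive chunks share `chunk_overlap` characters so that clauses
    straddling a chunk boundary are still retrievable from at least one chunk.
    """
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap  # advance 450 chars per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks


# A 1200-character document yields 3 overlapping chunks.
doc = "".join(str(i % 10) for i in range(1200))
chunks = split_text(doc)
```

Each chunk is at most 500 characters, and the last 50 characters of one chunk equal the first 50 of the next.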
## Acceptance Criteria
- Running `python -m src.main ingest data/ga4gh_framework_excerpt.txt` successfully ingests the sample document
- Running `python -m src.main query "consent requirements"` returns relevant clauses with source citations
- All unit tests pass (`python -m pytest tests/ -v`)
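The retrieval step behind the `query` command can be sketched as a similarity search over stored chunks, returning each hit with its source metadata for citation. This is a toy in-memory version with hand-picked 2-d vectors; in the real pipeline ChromaDB performs this search over the MiniLM embeddings, and the `store`/`query` names here are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy vector store: (embedding, chunk text, source metadata).
# In the real pipeline this lives in a persistent ChromaDB collection.
store = [
    ([0.9, 0.1], "Consent must be obtained before data are shared.",
     {"source": "ga4gh_framework_excerpt.txt", "chunk": 0}),
    ([0.1, 0.9], "Data should be encrypted at rest and in transit.",
     {"source": "ga4gh_framework_excerpt.txt", "chunk": 1}),
]


def query(query_embedding: list[float], top_k: int = 1):
    """Return the top_k most similar chunks, each with its citation metadata."""
    ranked = sorted(store, key=lambda rec: cosine(query_embedding, rec[0]),
                    reverse=True)
    return [(text, meta) for _, text, meta in ranked[:top_k]]


# A query embedding close to the "consent" chunk retrieves it, with citation.
results = query([1.0, 0.0])
```

The returned metadata (`source`, `chunk`) is what the CLI would format into the source citations required by the acceptance criteria.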