
Build document ingestion pipeline for GA4GH policy documents #11

@ReemHamraz

Description


Summary

This issue covers the foundational document ingestion pipeline: load GA4GH policy documents (PDF/text), split them into chunks, generate embeddings, and store them in a vector database for later retrieval.
This is Phase 1 of the GA4GH-RegBot roadmap and a prerequisite for the RAG-based compliance analysis pipeline (Phase 2).

Requirements

  • Load PDF and plain-text documents with source metadata (filename, page number); see the loader sketch after this list
  • Split documents into overlapping text chunks (configurable size/overlap)
  • Generate embeddings using a local model (no API key required)
  • Store embeddings in ChromaDB with metadata for citation support
  • Provide a retrieval interface that returns relevant chunks with source citations (see the retrieval sketch under Acceptance Criteria)
  • CLI commands for ingesting documents and querying the vector store
  • Unit tests covering loader, chunker, and retriever modules
  • Sample GA4GH policy excerpt for development and testing
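A minimal sketch of the loading requirement, assuming pypdf as the PDF backend; the module path and the load_document name are illustrative, not decided here:

```python
# src/ingestion/loader.py (illustrative layout)
# Loads a PDF or plain-text file and attaches source metadata to each segment.
from pathlib import Path

from pypdf import PdfReader  # assumed PDF backend; any per-page text extractor works


def load_document(path: str) -> list[dict]:
    """Return a list of {"text", "metadata"} dicts carrying filename and page number."""
    p = Path(path)
    docs = []
    if p.suffix.lower() == ".pdf":
        reader = PdfReader(str(p))
        for page_num, page in enumerate(reader.pages, start=1):
            text = page.extract_text() or ""
            if text.strip():
                docs.append({
                    "text": text,
                    "metadata": {"source": p.name, "page": page_num},
                })
    else:
        # Plain text: treat the whole file as a single "page".
        docs.append({
            "text": p.read_text(encoding="utf-8"),
            "metadata": {"source": p.name, "page": 1},
        })
    return docs
```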

Technical Decisions

  • Embedding model: all-MiniLM-L6-v2 via sentence-transformers (local, free, no API key)
  • Vector store: ChromaDB with persistent storage
  • Text splitting: LangChain RecursiveCharacterTextSplitter (500 char chunks, 50 char overlap)
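A hedged sketch of how these decisions could fit together at ingest time; the collection name, persistence path, and chunk ID scheme are placeholders, not settled choices:

```python
# Illustrative ingest flow: chunk -> embed -> store in ChromaDB.
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
model = SentenceTransformer("all-MiniLM-L6-v2")               # local model, no API key
client = chromadb.PersistentClient(path="chroma_db")          # placeholder storage path
collection = client.get_or_create_collection("ga4gh_policies")  # placeholder name


def ingest(docs: list[dict]) -> None:
    """Chunk each loaded document and store embeddings with citation metadata."""
    for doc in docs:
        chunks = splitter.split_text(doc["text"])
        embeddings = model.encode(chunks).tolist()
        collection.add(
            ids=[f"{doc['metadata']['source']}-p{doc['metadata']['page']}-c{i}"
                 for i in range(len(chunks))],
            documents=chunks,
            embeddings=embeddings,
            metadatas=[doc["metadata"]] * len(chunks),
        )
```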

Acceptance Criteria

  1. Running python -m src.main ingest data/ga4gh_framework_excerpt.txt successfully ingests the sample document
  2. Running python -m src.main query "consent requirements" returns relevant clauses with source citations
  3. All unit tests pass (python -m pytest tests/ -v)
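A sketch of the retrieval side behind acceptance criterion 2, again with placeholder names (query_chunks, the collection name and path); it embeds the query with the same local model and returns the top chunks with their source citations:

```python
# Illustrative retrieval: embed the query, fetch top-k chunks, format citations.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("ga4gh_policies")


def query_chunks(text: str, k: int = 3) -> list[dict]:
    """Return the k most relevant chunks with (filename, page) citations."""
    results = collection.query(
        query_embeddings=model.encode([text]).tolist(),
        n_results=k,
    )
    return [
        {"text": chunk, "citation": f"{meta['source']}, page {meta['page']}"}
        for chunk, meta in zip(results["documents"][0], results["metadatas"][0])
    ]


# Rough usage, matching acceptance criterion 2:
# for hit in query_chunks("consent requirements"):
#     print(hit["citation"], "->", hit["text"][:80])
```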
