
Build document ingestion pipeline for GA4GH policy documents #11

@ReemHamraz

Description


Summary

This issue covers the foundational document ingestion pipeline: load GA4GH policy documents (PDF/text), split them into chunks, generate embeddings, and store them in a vector database for later retrieval.
This is Phase 1 of the GA4GH-RegBot roadmap and a prerequisite for the RAG-based compliance analysis pipeline (Phase 2).

Requirements

  • Load PDF and plain-text documents with source metadata (filename, page number); see the loader sketch after this list
  • Split documents into overlapping text chunks (configurable size/overlap)
  • Generate embeddings using a local model (no API key required)
  • Store embeddings in ChromaDB with metadata for citation support
  • Provide a retrieval interface that returns relevant chunks with source citations (see the retrieval sketch under Acceptance Criteria)
  • CLI commands for ingesting documents and querying the vector store
  • Unit tests covering loader, chunker, and retriever modules
  • Sample GA4GH policy excerpt for development and testing
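A minimal sketch of the loading requirement, assuming pypdf as the PDF backend; the module path and the load_document name are illustrative, not decided here:

```python
# src/ingestion/loader.py (illustrative layout)
# Loads a PDF or plain-text file and attaches source metadata to each segment.
from pathlib import Path

from pypdf import PdfReader  # assumed PDF backend; any per-page text extractor works


def load_document(path: str) -> list[dict]:
    """Return a list of {"text", "metadata"} dicts carrying filename and page number."""
    p = Path(path)
    docs = []
    if p.suffix.lower() == ".pdf":
        reader = PdfReader(str(p))
        for page_num, page in enumerate(reader.pages, start=1):
            text = page.extract_text() or ""
            if text.strip():
                docs.append({
                    "text": text,
                    "metadata": {"source": p.name, "page": page_num},
                })
    else:
        # Plain text: treat the whole file as a single "page".
        docs.append({
            "text": p.read_text(encoding="utf-8"),
            "metadata": {"source": p.name, "page": 1},
        })
    return docs
```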

Technical Decisions

  • Embedding model: all-MiniLM-L6-v2 via sentence-transformers (local, free, no API key)
  • Vector store: ChromaDB with persistent storage
  • Text splitting: LangChain RecursiveCharacterTextSplitter (500 char chunks, 50 char overlap)
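A hedged sketch of how these decisions could fit together at ingest time; the collection name, persistence path, and chunk ID scheme are placeholders, not settled choices:

```python
# Illustrative ingest flow: chunk -> embed -> store in ChromaDB.
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
model = SentenceTransformer("all-MiniLM-L6-v2")               # local model, no API key
client = chromadb.PersistentClient(path="chroma_db")          # placeholder storage path
collection = client.get_or_create_collection("ga4gh_policies")  # placeholder name


def ingest(docs: list[dict]) -> None:
    """Chunk each loaded document and store embeddings with citation metadata."""
    for doc in docs:
        chunks = splitter.split_text(doc["text"])
        embeddings = model.encode(chunks).tolist()
        collection.add(
            ids=[f"{doc['metadata']['source']}-p{doc['metadata']['page']}-c{i}"
                 for i in range(len(chunks))],
            documents=chunks,
            embeddings=embeddings,
            metadatas=[doc["metadata"]] * len(chunks),
        )
```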

Acceptance Criteria

  1. Running python -m src.main ingest data/ga4gh_framework_excerpt.txt successfully ingests the sample document
  2. Running python -m src.main query "consent requirements" returns relevant clauses with source citations
  3. All unit tests pass (python -m pytest tests/ -v)
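A sketch of the retrieval side behind acceptance criterion 2, again with placeholder names (query_chunks, the collection name and path); it embeds the query with the same local model and returns the top chunks with their source citations:

```python
# Illustrative retrieval: embed the query, fetch top-k chunks, format citations.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("ga4gh_policies")


def query_chunks(text: str, k: int = 3) -> list[dict]:
    """Return the k most relevant chunks with (filename, page) citations."""
    results = collection.query(
        query_embeddings=model.encode([text]).tolist(),
        n_results=k,
    )
    return [
        {"text": chunk, "citation": f"{meta['source']}, page {meta['page']}"}
        for chunk, meta in zip(results["documents"][0], results["metadatas"][0])
    ]


# Rough usage, matching acceptance criterion 2:
# for hit in query_chunks("consent requirements"):
#     print(hit["citation"], "->", hit["text"][:80])
```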
