ga4gh/GA4GH-RegBot

Google Summer of Code – Global Alliance for Genomics and Health

Overview

RegBot is an open-source tool from the Global Alliance for Genomics and Health (GA4GH) Regulatory and Ethics Work Stream (REWS) for cross-border genomic data sharing. It complements the Alliance’s Regulatory & Ethics Toolkit by retrieving GA4GH and related policy provisions that match researcher-supplied consent / data-use text and returning citation-grounded JSON for DPO, IRB, and DAC review. Its output is not a compliance ruling or legal advice.

Documentation

  • docs/DESIGN.md — architecture, data model, evaluation plan (GSoC design doc)
  • docs/corpus_manifest.yaml — planned regulatory corpus inventory (placeholder; documents added as they are mentor-approved)
  • examples/DEMO.md — local end-to-end demo

What works today

  • Ingest policy PDFs or .txt files into a local Chroma store plus a JSON manifest (chunk ids, page hints, source metadata).
  • Hybrid retrieval: embedding search + BM25, merged with reciprocal rank fusion.
  • Compliance pass: JSON-mode LLM via Ollama by default (e.g. llama3, configurable with REGBOT_OLLAMA_MODEL). Set REGBOT_LLM_PROVIDER=openai and OPENAI_API_KEY to use OpenAI instead. If no LLM is reachable (or on API failure), a keyword heuristic fallback still returns grounded chunk ids.
  • Web UI (recommended): FastAPI + Next.js in frontend/ — see Run the web UI below.
  • Streamlit UI (legacy): upload + paste flows (src/streamlit_app.py).
  • CLI: python -m src.main … (see below).
  • Citation grounding (programmatic): each recommendations[] item must be { "text": "...", "evidence_chunk_ids": ["..."] } with ids taken only from retrieved chunks; optional citations[] must respect the same allow-list. A failed check triggers an automatic rewrite request that includes the allow-list. On the LLM path, optional token-overlap filtering (REGBOT_MIN_TOKEN_OVERLAP) drops low-overlap recommendations.
  • PDF eval harness: eval subcommand ingests a real GA4GH PDF and prints retrieval hits for built-in or custom queries (for manual review / building a gold set later).
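
The citation-grounding rule above can be illustrated with a small allow-list check. This is a minimal sketch, not RegBot's actual code: the function name `ungrounded_ids`, the sample chunk ids, and the assumption that `citations[]` entries are plain chunk-id strings are all hypothetical.

```python
from typing import Any


def ungrounded_ids(report: dict[str, Any], allowed_ids: set[str]) -> list[str]:
    """Return every cited chunk id that is not in the retrieved allow-list."""
    bad: list[str] = []
    for rec in report.get("recommendations", []):
        bad.extend(i for i in rec.get("evidence_chunk_ids", []) if i not in allowed_ids)
    # Assumption: optional citations[] entries are plain chunk-id strings.
    bad.extend(c for c in report.get("citations", []) if c not in allowed_ids)
    return bad


# Hypothetical ids standing in for the chunks returned by retrieval.
allowed = {"policy-p3-c1", "policy-p7-c2"}
report = {
    "recommendations": [
        {"text": "State the data-use limits.", "evidence_chunk_ids": ["policy-p3-c1"]},
        {"text": "Name the access committee.", "evidence_chunk_ids": ["policy-p9-c4"]},
    ],
    "citations": ["policy-p7-c2"],
}
print(ungrounded_ids(report, allowed))
```

A non-empty result is the kind of failure that would prompt a rewrite request carrying the allow-list; an empty result means every cited id came from the retrieved chunks.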

Quickstart (Development)

  • Prerequisites: Python 3.10–3.12 (CI uses 3.11). Python 3.14 is not supported yet for the full stack (native wheels for parts of the ML/Chroma toolchain often lag).
  • Create a virtual environment and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
  • Configure environment variables:

    • Export variables in your shell (recommended)
    • If you use a local .env, keep it private and do not commit it
  • LLM (default: local Ollama)
    Install Ollama, run ollama pull llama3 (or another tag you set in REGBOT_OLLAMA_MODEL), and keep the daemon running (ollama serve or brew services start ollama on macOS). No OPENAI_API_KEY is required for this path.

  • Embeddings (first ingest)
    The embedding model is downloaded from Hugging Face on first use. If downloads are slow or fail, try a longer timeout (HF_HUB_DOWNLOAD_TIMEOUT, seconds) or a mirror (REGBOT_HF_ENDPOINT=https://hf-mirror.com — sets HF_ENDPOINT for the Hub client).

  • Ingest a policy file into ./data/regbot_store (use --reset when reloading the same corpus):

python -m src.main ingest --path path/to/policy.pdf --reset
  • Batch ingest from the corpus inventory (downloads go under data/corpus/; see docs/corpus_manifest.yaml):
python -m src.main ingest-manifest --dry-run
python -m src.main ingest-manifest --reset
python -m src.main ingest-manifest --tier P0 --reset
  • Check a consent / data-use text file:
python -m src.main check --consent path/to/consent.txt
  • Run the web UI (FastAPI + Next.js) from the repo root:
# Terminal 1 — API (repo root, venv active)
uvicorn src.api.app:app --reload --port 8000

# Terminal 2 — frontend
cd frontend && npm install && npm run dev

Open http://localhost:3000. The Next.js dev server proxies /api/* to the API on port 8000.

  • Run the legacy Streamlit UI:
python -m streamlit run src/streamlit_app.py
  • End-to-end sample (synthetic policy + consent under examples/):
python examples/run_demo.py

Evaluate retrieval on a real GA4GH PDF (use --reset when reloading the same corpus):

python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --top-k 8

Use your own query list (one line per query):

python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --queries-file examples/eval/queries_ga4gh.txt

Optionally append a full compliance JSON report for a consent file:

python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --consent path/to/consent.txt

Run tests

python -m unittest discover -s tests -p "test*.py" -v

Environment Variables

  • REGBOT_LLM_PROVIDER: ollama (default) — local LLM via Ollama’s OpenAI-compatible HTTP API (no OpenAI key). Set to openai to use OpenAI’s hosted API instead.
  • OPENAI_API_KEY: Required only when REGBOT_LLM_PROVIDER=openai. The OpenAI model is set with REGBOT_LLM_MODEL (default gpt-4o-mini).
  • REGBOT_OLLAMA_MODEL: Tag known to Ollama (default llama3). Examples: llama3, mistral, mistral:latest.
  • REGBOT_OLLAMA_BASE_URL: Ollama HTTP host only (default http://127.0.0.1:11434); /v1 is appended automatically for the OpenAI-compatible routes.
  • REGBOT_OLLAMA_API_KEY: Sent as the Bearer/API key to Ollama’s shim (default ollama; ignored by Ollama).
  • REGBOT_STORE: On-disk store directory (default ./data/regbot_store).
  • REGBOT_EMBEDDING_MODEL: SentenceTransformers model id (default sentence-transformers/all-MiniLM-L6-v2).
  • HF_HUB_DOWNLOAD_TIMEOUT: Hugging Face Hub download timeout in seconds (embedding model on first use). The app sets a higher default when unset; increase if you see read timeouts.
  • REGBOT_HF_ENDPOINT: If set, copied to HF_ENDPOINT (e.g. https://hf-mirror.com where Hub mirrors are used).
  • REGBOT_MIN_TOKEN_OVERLAP: On the LLM path, minimum token recall between each recommendation and cited chunk texts (default 0.06). Set to 0 to disable dropping low-overlap rows.
  • REGBOT_CHROMA_ANONYMIZED_TELEMETRY: Set to 1 to enable Chroma client telemetry; default is off (0).
  • REGBOT_OPENAI_MAX_RETRIES: Retries for the OpenAI Python client (used for both OpenAI API and Ollama’s compatible endpoint; default 3).
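
To make the REGBOT_MIN_TOKEN_OVERLAP setting more concrete, here is one plausible reading of "token recall": the fraction of a recommendation's distinct tokens that also occur in its cited chunk texts. The tokenizer, the function name `token_recall`, and the sample strings are assumptions for illustration; RegBot's own scoring may differ.

```python
import os
import re


def token_recall(recommendation: str, chunk_texts: list[str]) -> float:
    """Fraction of the recommendation's distinct tokens found in the cited chunks."""
    rec_tokens = set(re.findall(r"[a-z0-9]+", recommendation.lower()))
    if not rec_tokens:
        return 0.0
    chunk_tokens = set(re.findall(r"[a-z0-9]+", " ".join(chunk_texts).lower()))
    return len(rec_tokens & chunk_tokens) / len(rec_tokens)


# The default of 0.06 is taken from the variable description; 0 disables filtering.
threshold = float(os.environ.get("REGBOT_MIN_TOKEN_OVERLAP", "0.06"))

score = token_recall(
    "Consent must describe international data transfer",
    ["Data transfer across borders requires explicit consent."],
)
# Keep the recommendation when filtering is disabled or the score clears the bar.
keep = threshold <= 0 or score >= threshold
print(score, keep)
```

With a low default such as 0.06, a filter of this kind would only drop recommendations that share almost no vocabulary with the chunks they cite.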

Architecture (implemented vs planned)

  • Core: Python 3, package under src/regbot/ (ingest, hybrid retrieval, compliance, optional local embedding download helpers).
  • Embeddings: sentence-transformers + Hugging Face Hub (minimal file set; ONNX-heavy artifacts skipped where possible).
  • Vector store: Chroma persistent files under REGBOT_STORE/chroma plus manifest.json for BM25 text.
  • Retrieval: cosine similarity in Chroma + rank-bm25, fused via reciprocal rank fusion; optional metadata category filter.
  • LLM: Default: Ollama (llama3 or REGBOT_OLLAMA_MODEL) via OpenAI-compatible chat completions + JSON parsing. Optional: REGBOT_LLM_PROVIDER=openai with OPENAI_API_KEY. Fallback: keyword heuristic if OpenAI is selected without a key, or after LLM errors (e.g. Ollama not running).
  • UI: FastAPI + Next.js web UI (frontend/, recommended); legacy Streamlit UI (src/streamlit_app.py).
  • Optional / roadmap: LangChain or LlamaIndex adapters on top of the same stores (not required by the current code); richer offline evaluation (Ragas, human labels); structured per-recommendation evidence (e.g. quotes).
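
The reciprocal rank fusion step named in the retrieval bullet can be sketched as follows. This is a generic illustration of the technique, not RegBot's implementation: the function name, the chunk ids, and the constant k = 60 (a commonly used value) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical ranked results from the two retrievers.
embedding_hits = ["c2", "c1", "c3"]
bm25_hits = ["c1", "c4", "c2"]
fused = reciprocal_rank_fusion([embedding_hits, bm25_hits])
print(fused)
```

Because only ranks are used, the embedding similarities and BM25 scores never need to be normalized onto a common scale, which is a common reason to choose this fusion method for hybrid retrieval.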
