GitHub - ga4gh/GA4GH-RegBot

Google Summer of Code – Global Alliance for Genomics and Health

Overview

RegBot is an open-source tool from the Global Alliance for Genomics and Health (GA4GH) Regulatory and Ethics Work Stream (REWS) that supports cross-border genomic data sharing. It complements the Alliance’s Regulatory & Ethics Toolkit by matching researcher-supplied consent / data-use text against GA4GH and related policy provisions, and it returns citation-grounded JSON for DPO, IRB, and DAC review. It does not issue compliance rulings or legal advice.

Documentation

docs/DESIGN.md — architecture, data model, evaluation plan (GSoC design doc)
docs/corpus_manifest.yaml — planned regulatory corpus inventory (placeholder; documents added as they are mentor-approved)
examples/DEMO.md — local end-to-end demo

What works today

Ingest policy PDFs or .txt files into a local Chroma store plus a JSON manifest (chunk ids, page hints, source metadata).
Hybrid retrieval: embedding search + BM25, merged with reciprocal rank fusion.
Compliance pass: JSON-mode LLM via Ollama by default (e.g. llama3, configurable with REGBOT_OLLAMA_MODEL). Set REGBOT_LLM_PROVIDER=openai and OPENAI_API_KEY to use OpenAI instead. If no LLM is reachable (or on API failure), a keyword heuristic fallback still returns grounded chunk ids.
Web UI (recommended): FastAPI + Next.js in frontend/ — see Run the web UI below.
Streamlit UI (legacy): upload + paste flows (src/streamlit_app.py).
CLI: python -m src.main … (see below).
Citation grounding (programmatic): Each recommendations[] item must be { "text": "...", "evidence_chunk_ids": ["..."] } with ids taken only from retrieved chunks; optional citations[] must also respect the same allow-list. A failed check triggers an automatic rewrite request that includes the allow-list. On the LLM path, recommendations can additionally be filtered by token overlap with their cited chunks (REGBOT_MIN_TOKEN_OVERLAP).
PDF eval harness: eval subcommand ingests a real GA4GH PDF and prints retrieval hits for built-in or custom queries (for manual review / building a gold set later).
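The grounding rules above can be illustrated with a short sketch. It is not RegBot's actual code: the function names, the tokenizer, the requirement that each item cite at least one chunk, and the exact definition of "token recall" (share of recommendation tokens that also appear in the cited chunks) are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase alphanumeric tokens (a deliberately simple, assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_recall(recommendation: str, evidence_texts: list) -> float:
    """Share of the recommendation's tokens that also occur in the cited chunks."""
    rec_tokens = tokenize(recommendation)
    if not rec_tokens:
        return 0.0
    evidence_tokens = set().union(*(tokenize(t) for t in evidence_texts))
    return len(rec_tokens & evidence_tokens) / len(rec_tokens)


def ground_recommendations(report: dict, retrieved: dict, min_overlap: float = 0.06):
    """Split recommendations into (kept, rejected).

    `retrieved` maps chunk id -> chunk text and acts as the allow-list.
    """
    kept, rejected = [], []
    for item in report.get("recommendations", []):
        ids = item.get("evidence_chunk_ids") or []
        # Allow-list check: assumed to need at least one id, all from retrieval.
        if not ids or any(cid not in retrieved for cid in ids):
            rejected.append(item)
            continue
        # Optional overlap filter; min_overlap = 0 disables it.
        texts = [retrieved[cid] for cid in ids]
        if min_overlap > 0 and token_recall(item.get("text", ""), texts) < min_overlap:
            rejected.append(item)
            continue
        kept.append(item)
    return kept, rejected


# Hypothetical data: chunk "c9" was never retrieved, so the second item fails.
retrieved = {"c1": "Data may be shared across borders with explicit consent."}
report = {
    "recommendations": [
        {"text": "Obtain explicit consent before data are shared across borders.",
         "evidence_chunk_ids": ["c1"]},
        {"text": "Encrypt backups.", "evidence_chunk_ids": ["c9"]},
    ]
}
kept, rejected = ground_recommendations(report, retrieved)
```

In a pipeline like the one described, the rejected items would be the ones sent back to the LLM with the allow-list for a rewrite.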

Quickstart (Development)

Prerequisites: Python 3.10–3.12 (CI uses 3.11). Python 3.14 is not supported yet for the full stack (native wheels for parts of the ML/Chroma toolchain often lag).
Create a virtual environment and install dependencies:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Configure environment variables:
- Export variables in your shell (recommended)
- If you use a local .env, keep it private and do not commit it
LLM (default: local Ollama)
Install Ollama, run ollama pull llama3 (or another tag you set in REGBOT_OLLAMA_MODEL), and keep the daemon running (ollama serve or brew services start ollama on macOS). No OPENAI_API_KEY is required for this path.
Embeddings (first ingest)
The embedding model is downloaded from Hugging Face on first use. If downloads are slow or fail, try a longer timeout (HF_HUB_DOWNLOAD_TIMEOUT, seconds) or a mirror (REGBOT_HF_ENDPOINT=https://hf-mirror.com — sets HF_ENDPOINT for the Hub client).
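As a rough illustration of the two settings above, the sketch below copies REGBOT_HF_ENDPOINT into HF_ENDPOINT and sets a download timeout only when none is configured. The helper name `apply_hub_env` and the 60-second value are placeholders, not RegBot's actual code or defaults. A step like this typically needs to run before the Hub client library is imported.

```python
import os
from typing import MutableMapping, Optional


def apply_hub_env(
    env: Optional[MutableMapping[str, str]] = None,
    default_timeout: str = "60",  # placeholder value, not RegBot's real default
) -> MutableMapping[str, str]:
    """Prepare Hugging Face Hub variables from RegBot-style settings."""
    env = os.environ if env is None else env
    endpoint = env.get("REGBOT_HF_ENDPOINT")
    if endpoint:
        # Mirror setting: forwarded to the variable the Hub client reads.
        env["HF_ENDPOINT"] = endpoint
    # Keep a user-provided timeout; only fill in a value when unset.
    env.setdefault("HF_HUB_DOWNLOAD_TIMEOUT", default_timeout)
    return env


# Example with a plain dict, so the real process environment stays untouched.
demo_env = apply_hub_env({"REGBOT_HF_ENDPOINT": "https://hf-mirror.com"})
```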
Ingest a policy file into ./data/regbot_store (use --reset when reloading the same corpus):

python -m src.main ingest --path path/to/policy.pdf --reset

Batch ingest from the corpus inventory (downloads go under data/corpus/; see docs/corpus_manifest.yaml):

python -m src.main ingest-manifest --dry-run
python -m src.main ingest-manifest --reset
python -m src.main ingest-manifest --tier P0 --reset

Check a consent / data-use text file:

python -m src.main check --consent path/to/consent.txt

Run the web UI (FastAPI + Next.js) from the repo root:

# Terminal 1 — API (repo root, venv active)
uvicorn src.api.app:app --reload --port 8000

# Terminal 2 — frontend
cd frontend && npm install && npm run dev

Open http://localhost:3000. The Next.js dev server proxies /api/* to the API on port 8000.

Run the legacy Streamlit UI:

python -m streamlit run src/streamlit_app.py

End-to-end sample (synthetic policy + consent under examples/):

python examples/run_demo.py

Evaluate retrieval on a real GA4GH PDF (use --reset when reloading the same corpus):

python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --top-k 8

Use your own query list (one line per query):

python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --queries-file examples/eval/queries_ga4gh.txt

Optionally append a full compliance JSON report for a consent file:

python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --consent path/to/consent.txt

Run tests

python -m unittest discover -s tests -p "test*.py" -v

Environment Variables

REGBOT_LLM_PROVIDER: ollama (default) — local LLM via Ollama’s OpenAI-compatible HTTP API (no OpenAI key). Set to openai to use OpenAI’s hosted API instead.
OPENAI_API_KEY: Required only when REGBOT_LLM_PROVIDER=openai. Model: REGBOT_LLM_MODEL (default gpt-4o-mini).
REGBOT_OLLAMA_MODEL: Tag known to Ollama (default llama3). Examples: llama3, mistral, mistral:latest.
REGBOT_OLLAMA_BASE_URL: Ollama HTTP host only (default http://127.0.0.1:11434); /v1 is appended automatically for the OpenAI-compatible routes.
REGBOT_OLLAMA_API_KEY: Sent as the Bearer/API key to Ollama’s shim (default ollama; ignored by Ollama).
REGBOT_STORE: On-disk store directory (default ./data/regbot_store).
REGBOT_EMBEDDING_MODEL: SentenceTransformers model id (default sentence-transformers/all-MiniLM-L6-v2).
HF_HUB_DOWNLOAD_TIMEOUT: Hugging Face Hub download timeout in seconds (embedding model on first use). The app sets a higher default when unset; increase if you see read timeouts.
REGBOT_HF_ENDPOINT: If set, copied to HF_ENDPOINT (e.g. https://hf-mirror.com where Hub mirrors are used).
REGBOT_MIN_TOKEN_OVERLAP: On the LLM path, minimum token recall between each recommendation and cited chunk texts (default 0.06). Set to 0 to disable dropping low-overlap rows.
REGBOT_CHROMA_ANONYMIZED_TELEMETRY: Set to 1 to enable Chroma client telemetry; default is off (0).
REGBOT_OPENAI_MAX_RETRIES: Retries for the OpenAI Python client (used for both OpenAI API and Ollama’s compatible endpoint; default 3).
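To make the defaults above concrete, here is a minimal sketch of how the LLM-related variables could be resolved into one settings object. The `RegBotSettings` class and `load_settings` function are hypothetical names for illustration; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RegBotSettings:
    provider: str
    model: str
    base_url: Optional[str]  # None means "use the client's default endpoint"
    api_key: Optional[str]
    store: str
    min_token_overlap: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> RegBotSettings:
    """Resolve settings from environment variables, applying documented defaults."""
    env = os.environ if env is None else env
    provider = env.get("REGBOT_LLM_PROVIDER", "ollama").strip().lower()
    if provider == "openai":
        model = env.get("REGBOT_LLM_MODEL", "gpt-4o-mini")
        base_url = None
        # May be None; the README says a keyword heuristic is used in that case.
        api_key = env.get("OPENAI_API_KEY")
    else:
        model = env.get("REGBOT_OLLAMA_MODEL", "llama3")
        host = env.get("REGBOT_OLLAMA_BASE_URL", "http://127.0.0.1:11434")
        # The variable holds the host only; /v1 is appended for the
        # OpenAI-compatible routes.
        base_url = host.rstrip("/") + "/v1"
        api_key = env.get("REGBOT_OLLAMA_API_KEY", "ollama")
    return RegBotSettings(
        provider=provider,
        model=model,
        base_url=base_url,
        api_key=api_key,
        store=env.get("REGBOT_STORE", "./data/regbot_store"),
        min_token_overlap=float(env.get("REGBOT_MIN_TOKEN_OVERLAP", "0.06")),
    )


# With no variables set, the local Ollama defaults apply.
defaults = load_settings({})
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the real environment.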

Architecture (implemented vs planned)

Core: Python 3, package under src/regbot/ (ingest, hybrid retrieval, compliance, optional local embedding download helpers).
Embeddings: sentence-transformers + Hugging Face Hub (minimal file set; ONNX-heavy artifacts skipped where possible).
Vector store: Chroma persistent files under REGBOT_STORE/chroma plus manifest.json for BM25 text.
Retrieval: cosine similarity in Chroma + rank-bm25, fused via reciprocal rank fusion; optional metadata category filter.
LLM: Default: Ollama (llama3 or REGBOT_OLLAMA_MODEL) via OpenAI-compatible chat completions + JSON parsing. Optional: REGBOT_LLM_PROVIDER=openai with OPENAI_API_KEY. Fallback: keyword heuristic if OpenAI is selected without a key, or after LLM errors (e.g. Ollama not running).
UI: FastAPI + Next.js web UI in frontend/ (recommended); Streamlit (src/streamlit_app.py, legacy).
Optional / roadmap: LangChain or LlamaIndex adapters on top of the same stores (not required by the current code); richer offline evaluation (Ragas, human labels); structured per-recommendation evidence (e.g. quotes).
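For readers unfamiliar with reciprocal rank fusion, the following sketch shows the general technique used to merge an embedding ranking with a BM25 ranking. It is a generic illustration, not RegBot's implementation: the function name, the smoothing constant k = 60 (a commonly used value), and the tie-breaking rule are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Higher fused scores come first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk ids from the two retrievers.
dense_hits = ["c2", "c1", "c3"]  # embedding (cosine) ranking
bm25_hits = ["c1", "c4", "c2"]   # BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because the fusion uses ranks rather than raw scores, cosine similarities and BM25 scores never need to be put on a common scale.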
