Overview
RegBot is a Global Alliance for Genomics and Health (GA4GH) Regulatory and Ethics Work Stream (REWS) open-source tool for cross-border genomic data sharing. It complements the Alliance’s Regulatory & Ethics Toolkit by retrieving GA4GH and related policy provisions against researcher-supplied consent / data-use text and returning citation-grounded JSON for DPO, IRB, and DAC review. It does not issue compliance rulings or legal advice.
Documentation
- docs/DESIGN.md: architecture, data model, evaluation plan (GSoC design doc)
- docs/corpus_manifest.yaml: planned regulatory corpus inventory (placeholder; documents added as they are mentor-approved)
- examples/DEMO.md: local end-to-end demo
What works today
- Ingest policy PDFs or .txt files into a local Chroma store plus a JSON manifest (chunk ids, page hints, source metadata).
- Hybrid retrieval: embedding search + BM25, merged with reciprocal rank fusion.
- Compliance pass: JSON-mode LLM via Ollama by default (e.g. llama3, configurable with REGBOT_OLLAMA_MODEL). Set REGBOT_LLM_PROVIDER=openai and OPENAI_API_KEY to use OpenAI instead. If no LLM is reachable (or on API failure), a keyword heuristic fallback still returns grounded chunk ids.
- Web UI (recommended): FastAPI + Next.js in frontend/ (see "Run the web UI" below).
- Streamlit UI (legacy): upload + paste flows (src/streamlit_app.py).
- CLI: python -m src.main … (see below).
- Citation grounding (programmatic): Each recommendations[] item must be { "text": "...", "evidence_chunk_ids": ["..."] } with ids taken only from retrieved chunks; optional citations[] must also respect the same allow-list. Failed checks trigger automatic rewrite requests with the allow-list; optional token-overlap filtering on the LLM path (REGBOT_MIN_TOKEN_OVERLAP).
- PDF eval harness: the eval subcommand ingests a real GA4GH PDF and prints retrieval hits for built-in or custom queries (for manual review / building a gold set later).
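The citation-grounding rule in the list above can be illustrated with a minimal sketch. It is not RegBot's actual code: the function names, the tokenizer, and the definition of "token recall" (share of recommendation tokens that also appear in the cited chunk texts) are assumptions made for illustration. Only the item shape and the 0.06 default come from this README. The optional citations[] field is left out because its item shape is not described here.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens (a deliberately simple, assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_grounding(
    report: dict,
    retrieved: dict[str, str],
    min_overlap: float = 0.06,
) -> tuple[list[dict], list[str]]:
    """Hypothetical helper: split recommendations into kept rows and problems.

    `retrieved` maps chunk id -> chunk text for the chunks returned by retrieval;
    its keys form the allow-list.
    """
    allowed = set(retrieved)
    kept: list[dict] = []
    problems: list[str] = []
    for i, rec in enumerate(report.get("recommendations", [])):
        ids = rec.get("evidence_chunk_ids") or []
        unknown = [cid for cid in ids if cid not in allowed]
        if not ids or unknown:
            problems.append(
                f"recommendations[{i}]: evidence ids missing or outside allow-list: {unknown}"
            )
            continue
        # Assumed recall definition: recommendation tokens found in cited chunks.
        rec_tokens = _tokens(rec.get("text", ""))
        cited_tokens = set().union(*(_tokens(retrieved[cid]) for cid in ids))
        recall = len(rec_tokens & cited_tokens) / len(rec_tokens) if rec_tokens else 0.0
        if min_overlap > 0 and recall < min_overlap:
            problems.append(
                f"recommendations[{i}]: token overlap {recall:.2f} below {min_overlap}"
            )
            continue
        kept.append(rec)
    return kept, problems
```

In a flow like the one described above, the `problems` list could be sent back to the LLM together with the allow-list as a rewrite request; passing `min_overlap=0` disables the overlap filter.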
Quickstart (Development)
- Prerequisites: Python 3.10–3.12 (CI uses 3.11). Python 3.14 is not supported yet for the full stack (native wheels for parts of the ML/Chroma toolchain often lag).
- Create a virtual environment and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
- Configure environment variables:
  - Export variables in your shell (recommended)
  - If you use a local .env, keep it private and do not commit it
- LLM (default: local Ollama): Install Ollama, run ollama pull llama3 (or another tag you set in REGBOT_OLLAMA_MODEL), and keep the daemon running (ollama serve, or brew services start ollama on macOS). No OPENAI_API_KEY is required for this path.
- Embeddings (first ingest): The embedding model is downloaded from Hugging Face on first use. If downloads are slow or fail, try a longer timeout (HF_HUB_DOWNLOAD_TIMEOUT, seconds) or a mirror (REGBOT_HF_ENDPOINT=https://hf-mirror.com, which sets HF_ENDPOINT for the Hub client).
- Ingest a policy file into ./data/regbot_store (use --reset when reloading the same corpus):
python -m src.main ingest --path path/to/policy.pdf --reset
- Batch ingest from the corpus inventory (downloads go under data/corpus/; see docs/corpus_manifest.yaml):
python -m src.main ingest-manifest --dry-run
python -m src.main ingest-manifest --reset
python -m src.main ingest-manifest --tier P0 --reset
- Check a consent / data-use text file:
python -m src.main check --consent path/to/consent.txt
- Run the web UI (FastAPI + Next.js) from the repo root:
# Terminal 1 — API (repo root, venv active)
uvicorn src.api.app:app --reload --port 8000
# Terminal 2 — frontend
cd frontend && npm install && npm run dev
Open http://localhost:3000. The Next.js dev server proxies /api/* to the API on port 8000.
- Run the legacy Streamlit UI:
python -m streamlit run src/streamlit_app.py
- End-to-end sample (synthetic policy + consent under examples/):
python examples/run_demo.py
Evaluate retrieval on a real GA4GH PDF (use --reset when reloading the same corpus):
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --top-k 8
Use your own query list (one line per query):
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --queries-file examples/eval/queries_ga4gh.txt
Optionally append a full compliance JSON report for a consent file:
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --consent path/to/consent.txt
Run tests
python -m unittest discover -s tests -p "test*.py" -v
Environment Variables
- REGBOT_LLM_PROVIDER: ollama (default) uses a local LLM via Ollama’s OpenAI-compatible HTTP API (no OpenAI key). Set to openai to use OpenAI’s hosted API instead.
- OPENAI_API_KEY: Required only when REGBOT_LLM_PROVIDER=openai. Model: REGBOT_LLM_MODEL (default gpt-4o-mini).
- REGBOT_OLLAMA_MODEL: Tag known to Ollama (default llama3). Examples: llama3, mistral, mistral:latest.
- REGBOT_OLLAMA_BASE_URL: Ollama HTTP host only (default http://127.0.0.1:11434); /v1 is appended automatically for the OpenAI-compatible routes.
- REGBOT_OLLAMA_API_KEY: Sent as the Bearer/API key to Ollama’s shim (default ollama; ignored by Ollama).
- REGBOT_STORE: On-disk store directory (default ./data/regbot_store).
- REGBOT_EMBEDDING_MODEL: SentenceTransformers model id (default sentence-transformers/all-MiniLM-L6-v2).
- HF_HUB_DOWNLOAD_TIMEOUT: Hugging Face Hub download timeout in seconds (embedding model on first use). The app sets a higher default when unset; increase if you see read timeouts.
- REGBOT_HF_ENDPOINT: If set, copied to HF_ENDPOINT (e.g. https://hf-mirror.com where Hub mirrors are used).
- REGBOT_MIN_TOKEN_OVERLAP: On the LLM path, minimum token recall between each recommendation and cited chunk texts (default 0.06). Set to 0 to disable dropping low-overlap rows.
- REGBOT_CHROMA_ANONYMIZED_TELEMETRY: Set to 1 to enable Chroma client telemetry; default is off (0).
- REGBOT_OPENAI_MAX_RETRIES: Retries for the OpenAI Python client (used for both OpenAI API and Ollama’s compatible endpoint; default 3).
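To show how the LLM-related variables above fit together, here is a minimal sketch that resolves them into one settings dictionary. The function name and the returned keys are hypothetical and this is not RegBot's actual configuration code; only the variable names and defaults are taken from the list above.

```python
import os
from typing import Mapping, Optional


def resolve_llm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: map the documented variables to client settings."""
    env = os.environ if env is None else env
    provider = env.get("REGBOT_LLM_PROVIDER", "ollama").strip().lower()
    max_retries = int(env.get("REGBOT_OPENAI_MAX_RETRIES", "3"))

    if provider == "openai":
        return {
            "provider": "openai",
            "model": env.get("REGBOT_LLM_MODEL", "gpt-4o-mini"),
            "base_url": None,  # use the hosted API's default endpoint
            "api_key": env.get("OPENAI_API_KEY"),  # None means no key was set
            "max_retries": max_retries,
        }

    # Ollama path: the variable holds the host only, so /v1 is appended here.
    host = env.get("REGBOT_OLLAMA_BASE_URL", "http://127.0.0.1:11434").rstrip("/")
    return {
        "provider": "ollama",
        "model": env.get("REGBOT_OLLAMA_MODEL", "llama3"),
        "base_url": host + "/v1",
        "api_key": env.get("REGBOT_OLLAMA_API_KEY", "ollama"),
        "max_retries": max_retries,
    }


if __name__ == "__main__":
    print(resolve_llm_settings())
```

Passing a plain dictionary instead of reading os.environ directly keeps the sketch easy to exercise in tests without touching the real environment.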
Architecture (implemented vs planned)
- Core: Python 3, package under src/regbot/ (ingest, hybrid retrieval, compliance, optional local embedding download helpers).
- Embeddings: sentence-transformers + Hugging Face Hub (minimal file set; ONNX-heavy artifacts skipped where possible).
- Vector store: Chroma persistent files under REGBOT_STORE/chroma plus manifest.json for BM25 text.
- Retrieval: cosine similarity in Chroma + rank-bm25, fused via reciprocal rank fusion; optional metadata category filter.
- LLM: Default: Ollama (llama3 or REGBOT_OLLAMA_MODEL) via OpenAI-compatible chat completions + JSON parsing. Optional: REGBOT_LLM_PROVIDER=openai with OPENAI_API_KEY. Fallback: keyword heuristic if OpenAI is selected without a key, or after LLM errors (e.g. Ollama not running).
- UI: FastAPI + Next.js web UI (frontend/); legacy Streamlit UI (src/streamlit_app.py).
- Optional / roadmap: LangChain or LlamaIndex adapters on top of the same stores (not required by the current code); richer offline evaluation (Ragas, human labels); structured per-recommendation evidence (e.g. quotes).
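The fusion step named under Retrieval can be sketched as follows. This is a generic reciprocal rank fusion over ranked lists of chunk ids, not RegBot's own implementation: the function name is hypothetical, and the constant k=60 is the value commonly used in the literature, assumed here rather than taken from the project.

```python
from collections import defaultdict
from typing import Iterable, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[str]],
    k: int = 60,
    top_k: Optional[int] = None,
) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused if top_k is None else fused[:top_k]


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]  # e.g. ids ordered by embedding similarity
    bm25_hits = ["b", "d", "a"]   # e.g. ids ordered by BM25 score
    print(reciprocal_rank_fusion([dense_hits, bm25_hits]))
```

Because only ranks are used, the two retrievers' raw scores (cosine similarity and BM25) never need to be put on a common scale, which is the usual reason for choosing this kind of fusion.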