
Commit dfb95c2

feat: Ollama-default LLM, resilient embeddings/compliance, doc and sample updates
1 parent 4155ab6 commit dfb95c2

14 files changed: 274 additions & 47 deletions

CONTRIBUTING.md

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@ pre-commit install
 
 Run `pre-commit run --all-files` before pushing if you use the hook.
 
+Unit tests do **not** require Ollama or `OPENAI_API_KEY`; the pipeline test forces the offline path.
+
 ## Tests
 
 ```bash
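The added note says the unit tests force the offline path rather than requiring Ollama or `OPENAI_API_KEY`. A minimal sketch of that testing pattern follows; the `compliance_pass` stand-in is hypothetical (it is not the repo's pipeline function), and only the environment variable names come from the README:

```python
import unittest


def compliance_pass(consent_text: str, env: dict[str, str]) -> dict[str, str]:
    # Hypothetical stand-in for the real pipeline: per the README, the
    # keyword-heuristic fallback runs when OpenAI is selected without a key.
    if env.get("REGBOT_LLM_PROVIDER") == "openai" and not env.get("OPENAI_API_KEY"):
        return {"mode": "fallback"}
    return {"mode": "llm"}


class OfflinePathTest(unittest.TestCase):
    def test_forces_offline_path(self) -> None:
        # Select OpenAI but deliberately omit OPENAI_API_KEY: the offline
        # fallback must still produce a result, so the test needs no network.
        env = {"REGBOT_LLM_PROVIDER": "openai"}
        self.assertEqual(compliance_pass("sample consent text", env)["mode"], "fallback")
```

The same trick (forcing the provider into its no-credentials branch) keeps CI hermetic without mocking an HTTP client.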

README.md

Lines changed: 29 additions & 15 deletions
@@ -1,16 +1,16 @@
 GA4GH-RegBot: Compliance Assistant
-Status: **MVP available** — ingest, hybrid retrieval, optional LLM compliance + programmatic citation checks, CLI, Streamlit, and a small PDF eval harness. Ongoing work: real-corpus evaluation, stricter schemas, and contributor tooling.
+Status: **MVP available** — ingest, hybrid retrieval, **local-first LLM** (Ollama / Llama 3 by default) or optional OpenAI, programmatic citation checks, CLI, Streamlit, and a small PDF eval harness. Ongoing work: real-corpus evaluation, stricter evidence objects, and contributor tooling.
 
 Overview
 RegBot is an LLM-powered tool designed to help researchers map their consent forms against GA4GH regulatory frameworks. It uses RAG (Retrieval-Augmented Generation) to flag compliance gaps automatically.
 
 What works today
 - **Ingest** policy PDFs or `.txt` files into a local **Chroma** store plus a JSON manifest (chunk ids, page hints, source metadata).
 - **Hybrid retrieval**: embedding search + **BM25**, merged with reciprocal rank fusion.
-- **Compliance pass**: one OpenAI JSON call when `OPENAI_API_KEY` is set; otherwise a small keyword gap heuristic that still returns chunk citations.
+- **Compliance pass**: JSON-mode LLM via **[Ollama](https://ollama.com) by default** (e.g. `llama3`, configurable with `REGBOT_OLLAMA_MODEL`). Set `REGBOT_LLM_PROVIDER=openai` and `OPENAI_API_KEY` to use OpenAI instead. If no LLM is reachable (or on API failure), a **keyword heuristic fallback** still returns grounded chunk ids.
 - **Streamlit UI** for upload + paste flows (`src/streamlit_app.py`).
 - **CLI**: `python -m src.main …` (see below).
-- **Citation grounding (programmatic):** Each `recommendations[]` item must be `{ "text": "...", "evidence_chunk_ids": ["..."] }` with ids taken **only** from retrieved chunks; optional `citations[]` must also respect the same allow-list. Failed checks trigger **one automatic rewrite request** with the allow-list.
+- **Citation grounding (programmatic):** Each `recommendations[]` item must be `{ "text": "...", "evidence_chunk_ids": ["..."] }` with ids taken **only** from retrieved chunks; optional `citations[]` must also respect the same allow-list. Failed checks trigger **automatic rewrite requests** with the allow-list; optional **token-overlap** filtering on the LLM path (`REGBOT_MIN_TOKEN_OVERLAP`).
 - **PDF eval harness:** `eval` subcommand ingests a real GA4GH PDF and prints retrieval hits for built-in or custom queries (for manual review / building a gold set later).
 
 Quickstart (Development)
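The hybrid retrieval bullet in the hunk above merges embedding and BM25 rankings with reciprocal rank fusion. As a rough sketch of that technique (not the repo's actual implementation; its constant and any per-list weights may differ), each list contributes `1/(k + rank)` per chunk id:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each appearance scores 1/(k + rank).

    `rankings` holds one ranked id list per retriever (e.g. embeddings, BM25).
    k=60 is the constant from the original RRF formulation.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


embed_hits = ["c3", "c1", "c7"]   # illustrative chunk ids
bm25_hits = ["c3", "c9", "c1"]
fused = reciprocal_rank_fusion([embed_hits, bm25_hits])
print(fused)  # "c3" ranks first: it tops both lists
```

Ids found by both retrievers accumulate score from each list, which is why the fused order can differ from either input ranking.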
@@ -28,6 +28,12 @@ python -m pip install -r requirements.txt
 - Export variables in your shell (recommended)
 - If you use a local `.env`, keep it private and do not commit it
 
+- **LLM (default: local Ollama)**
+  Install [Ollama](https://ollama.com), run `ollama pull llama3` (or another tag you set in `REGBOT_OLLAMA_MODEL`), and keep the daemon running (`ollama serve` or `brew services start ollama` on macOS). No `OPENAI_API_KEY` is required for this path.
+
+- **Embeddings (first ingest)**
+  The embedding model is downloaded from Hugging Face on first use. If downloads are slow or fail, try a longer timeout (`HF_HUB_DOWNLOAD_TIMEOUT`, seconds) or a mirror (`REGBOT_HF_ENDPOINT=https://hf-mirror.com` — sets `HF_ENDPOINT` for the Hub client).
+
 - Ingest a policy file into `./data/regbot_store` (use `--reset` when reloading the same corpus):
 
 ```bash
@@ -52,6 +58,8 @@ python -m streamlit run src/streamlit_app.py
 python examples/run_demo.py
 ```
 
+More detail: **`examples/DEMO.md`**.
+
 Evaluate retrieval on a **real** GA4GH PDF (resets the store by default if you pass `--reset`):
 
 ```bash
@@ -77,27 +85,33 @@ python -m unittest discover -s tests -p "test*.py" -v
 ```
 
 Environment Variables
-- `OPENAI_API_KEY`: Optional; enables the JSON LLM compliance pass via `REGBOT_LLM_MODEL` (default `gpt-4o-mini`).
-- `REGBOT_STORE`: Optional override for the on-disk store directory (default `./data/regbot_store`).
-- `REGBOT_EMBEDDING_MODEL`: Optional SentenceTransformers model id (default `sentence-transformers/all-MiniLM-L6-v2`).
-- `REGBOT_MIN_TOKEN_OVERLAP`: For the LLM path, minimum **token recall** between each recommendation and the cited chunk texts (default `0.06`). Set to `0` to disable dropping rows for low overlap (scores may still be attached).
+- `REGBOT_LLM_PROVIDER`: **`ollama` (default)** — local LLM via Ollama’s OpenAI-compatible HTTP API (no OpenAI key). Set to **`openai`** to use OpenAI’s hosted API instead.
+- `OPENAI_API_KEY`: Required only when `REGBOT_LLM_PROVIDER=openai`. Model: `REGBOT_LLM_MODEL` (default `gpt-4o-mini`).
+- `REGBOT_OLLAMA_MODEL`: Tag known to Ollama (default `llama3`). Examples: `llama3`, `mistral`, `mistral:latest`.
+- `REGBOT_OLLAMA_BASE_URL`: Ollama HTTP host only (default `http://127.0.0.1:11434`); `/v1` is appended automatically for the OpenAI-compatible routes.
+- `REGBOT_OLLAMA_API_KEY`: Sent as the Bearer/API key to Ollama’s shim (default `ollama`; ignored by Ollama).
+- `REGBOT_STORE`: On-disk store directory (default `./data/regbot_store`).
+- `REGBOT_EMBEDDING_MODEL`: SentenceTransformers model id (default `sentence-transformers/all-MiniLM-L6-v2`).
+- `HF_HUB_DOWNLOAD_TIMEOUT`: Hugging Face Hub download timeout in seconds (embedding model on first use). The app sets a higher default when unset; increase if you see read timeouts.
+- `REGBOT_HF_ENDPOINT`: If set, copied to `HF_ENDPOINT` (e.g. `https://hf-mirror.com` where Hub mirrors are used).
+- `REGBOT_MIN_TOKEN_OVERLAP`: On the LLM path, minimum **token recall** between each recommendation and cited chunk texts (default `0.06`). Set to `0` to disable dropping low-overlap rows.
 - `REGBOT_CHROMA_ANONYMIZED_TELEMETRY`: Set to `1` to enable Chroma client telemetry; default is off (`0`).
-- `REGBOT_OPENAI_MAX_RETRIES`: Maximum retries for transient OpenAI API errors (default `3`).
+- `REGBOT_OPENAI_MAX_RETRIES`: Retries for the **OpenAI Python client** (used for both OpenAI API and Ollama’s compatible endpoint; default `3`).
 
 Architecture (implemented vs planned)
-- **Core:** Python 3, modular package under `src/regbot/` (ingest, hybrid retrieval, compliance).
-- **Embeddings:** `sentence-transformers` (default `all-MiniLM-L6-v2`).
-- **Vector store:** Chroma persistent store under `REGBOT_STORE/chroma` plus `manifest.json` for BM25 text.
+- **Core:** Python 3, package under `src/regbot/` (ingest, hybrid retrieval, compliance, optional local embedding download helpers).
+- **Embeddings:** `sentence-transformers` + Hugging Face Hub (minimal file set; ONNX-heavy artifacts skipped where possible).
+- **Vector store:** Chroma persistent files under `REGBOT_STORE/chroma` plus `manifest.json` for BM25 text.
 - **Retrieval:** cosine similarity in Chroma + `rank-bm25`, fused via reciprocal rank fusion; optional metadata category filter.
-- **LLM:** OpenAI Chat Completions JSON mode when `OPENAI_API_KEY` is set; offline keyword-style fallback otherwise.
+- **LLM:** **Default:** Ollama (`llama3` or `REGBOT_OLLAMA_MODEL`) via OpenAI-compatible chat completions + JSON parsing. **Optional:** `REGBOT_LLM_PROVIDER=openai` with `OPENAI_API_KEY`. **Fallback:** keyword heuristic if OpenAI is selected without a key, or after LLM errors (e.g. Ollama not running).
 - **UI:** Streamlit (`src/streamlit_app.py`).
-- **Optional / roadmap:** optional LangChain/LlamaIndex adapters on top of the same stores; richer offline evaluation (Ragas, human labels); structured per-recommendation evidence fields.
+- **Optional / roadmap:** LangChain or LlamaIndex adapters on top of the same stores (not required by the current code); richer offline evaluation (Ragas, human labels); structured per-recommendation evidence (e.g. quotes).
 
 Next steps (suggested priorities)
 1. **Real GA4GH corpus**: ingest official PDFs, tune chunk size/overlap and hybrid fusion weights using `eval` + a small **gold query → chunk_id** list (manual or semi-automated).
-2. **Stricter outputs:** `evidence_chunk_ids[]` plus programmatic ID checks, token-overlap filtering on the LLM path (`REGBOT_MIN_TOKEN_OVERLAP`), and retries when grounding/overlap fails. **Next:** richer evidence objects (e.g. optional quotes), stricter refusal when excerpts are insufficient.
+2. **Richer evidence:** optional quoted spans, stricter refusal when retrieved excerpts are insufficient (grounding and token-overlap checks are already in place for the LLM path).
 3. **Contributor experience**: **Done in-repo:** separate **Lint** workflow (Ruff check + format check), `CONTRIBUTING.md`, `.pre-commit-config.yaml`, `pyproject.toml`, `requirements-dev.txt`. **Still open:** optional CI `mypy`, broader type hints, Black-only rules if the team wants them.
-4. **Operational hardening**: **Done in-repo:** Chroma telemetry off by default (`REGBOT_CHROMA_ANONYMIZED_TELEMETRY`), OpenAI client `max_retries` via `REGBOT_OPENAI_MAX_RETRIES`, clear `ValueError` when a PDF yields no extractable text. **Next:** optional request timeouts, Chroma/OpenAI observability hooks.
+4. **Operational hardening**: **Done in-repo:** Chroma telemetry off by default (`REGBOT_CHROMA_ANONYMIZED_TELEMETRY`), client `max_retries` (`REGBOT_OPENAI_MAX_RETRIES`), clear `ValueError` when a PDF yields no extractable text. **Next:** optional request timeouts, observability hooks.
 
 Contributing
 - See **`CONTRIBUTING.md`** for venv setup, **Ruff** lint/format, optional **pre-commit**, and tests.
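The `REGBOT_MIN_TOKEN_OVERLAP` gate described in the README's environment variables measures token recall between a recommendation and the chunk texts it cites. A simplified sketch of that metric (whitespace tokenization; the repo's tokenizer and threshold handling may differ):

```python
def token_recall(recommendation: str, cited_texts: list[str]) -> float:
    """Fraction of the recommendation's distinct tokens that appear in the cited chunks."""
    rec_tokens = set(recommendation.lower().split())
    if not rec_tokens:
        return 0.0
    cited_tokens = set(" ".join(cited_texts).lower().split())
    return len(rec_tokens & cited_tokens) / len(rec_tokens)


# Per the README, a recommendation row would be dropped on the LLM path when
# its recall falls below REGBOT_MIN_TOKEN_OVERLAP (default 0.06).
score = token_recall(
    "data may cross borders",
    ["Some data may cross national borders under contractual safeguards."],
)
```

Recall (rather than precision) is the natural direction here: it asks how much of the *claim* is supported by the evidence, not how much of the evidence was used.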

examples/DEMO.md

Lines changed: 9 additions & 3 deletions
@@ -1,6 +1,6 @@
 # Demo (local)
 
-From the repository root, with a virtualenv activated and dependencies installed:
+Run the steps below from the repository root, with a virtualenv activated and dependencies installed (`pip install -r requirements.txt`). For the compliance step, **embeddings** download on first ingest (Hugging Face); the **default LLM is local Ollama** (`llama3` unless `REGBOT_OLLAMA_MODEL` is set): install [Ollama](https://ollama.com), run `ollama pull llama3`, and keep the daemon running.
 
 1. Ingest the bundled synthetic policy text (resets the local store):
 
@@ -14,10 +14,16 @@ python -m src.main --store ./data/regbot_store ingest --path examples/data/sampl
 python -m src.main --store ./data/regbot_store check --consent examples/data/sample_consent_short.txt
 ```
 
-3. Optional UI:
+3. Or run both steps via the helper script (same as 1+2):
+
+```bash
+python examples/run_demo.py
+```
+
+4. Optional UI:
 
 ```bash
 python -m streamlit run src/streamlit_app.py
 ```
 
-Set `OPENAI_API_KEY` in your environment for JSON output from the configured chat model (`REGBOT_LLM_MODEL`, default `gpt-4o-mini`). Without a key, the tool still retrieves policy chunks and returns a small keyword-style gap summary.
+**LLM behavior:** With **Ollama** running and default settings (`REGBOT_LLM_PROVIDER` defaults to `ollama`), you get JSON compliance output from the local model without `OPENAI_API_KEY`. To use **OpenAI** instead: `export REGBOT_LLM_PROVIDER=openai` and set `OPENAI_API_KEY`; optional `REGBOT_LLM_MODEL` (default `gpt-4o-mini`). If no LLM is reachable, the tool still retrieves policy chunks and uses a **keyword-style fallback** with grounded chunk ids.
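The provider switching described in the DEMO.md hunk above could be modeled roughly as follows. The function name and return shape are hypothetical (the repo's helpers may look different); only the environment variable names and defaults come from the docs, including the detail that `/v1` is appended to the Ollama base URL for its OpenAI-compatible routes:

```python
def resolve_llm(env: dict[str, str]) -> dict[str, str]:
    """Sketch of provider resolution: Ollama by default, OpenAI opt-in, fallback otherwise."""
    provider = env.get("REGBOT_LLM_PROVIDER", "ollama")
    if provider == "openai":
        if not env.get("OPENAI_API_KEY"):
            # No key: the keyword-heuristic fallback path would run instead.
            return {"provider": "fallback"}
        return {"provider": "openai", "model": env.get("REGBOT_LLM_MODEL", "gpt-4o-mini")}
    # Default: local Ollama via its OpenAI-compatible endpoint.
    base = env.get("REGBOT_OLLAMA_BASE_URL", "http://127.0.0.1:11434")
    return {
        "provider": "ollama",
        "model": env.get("REGBOT_OLLAMA_MODEL", "llama3"),
        "base_url": base.rstrip("/") + "/v1",  # /v1 appended automatically
    }


print(resolve_llm({}))  # default config: local llama3, no key required
```

Passing the environment in as a dict keeps the logic trivially testable without mutating `os.environ`.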
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+Study consent and data use agreement — SYNTHETIC EXAMPLE (not a real legal document)
+
+Title: Example Regional Biobank — Cardiometabolic Genomics Cohort (fictional)
+
+1. Purpose
+You are invited to take part in a research study led by Example University School of Medicine and affiliated hospitals. The study aims to understand genetic and environmental contributors to cardiometabolic disease. If you agree, we will collect biological samples (blood or saliva, as described at your visit), extract DNA, and generate genomic and related laboratory data.
+
+2. What we will do with your data
+Genomic data may be combined with health information from your medical records (diagnoses, medications, and routine labs) using coded identifiers. Analysts within the study team will use these data to test scientific hypotheses described in the study protocol. We may also use aggregated or summary statistics that do not identify individuals in scientific publications.
+
+3. Sharing beyond the core study team
+De-identified genomic and phenotypic datasets may be shared with approved researchers at other academic institutions for research that is consistent with this study’s scientific aims. We do not sell your personal information. Commercial partners may receive only aggregated results unless you separately agree to a specific collaboration described in an addendum.
+
+4. International data flows
+Some analysis may occur on servers located in the United States and the European Union. Where data cross borders, we apply contractual safeguards and access controls that align with our institutional policies. Detailed country-by-country transfer mechanisms are available upon request from the study office.
+
+5. Secondary use and future research
+We may seek permission to use leftover samples or data for future research that is not yet fully specified, provided it falls within broad categories approved by our research ethics committee and is compatible with the scope you authorize below. You may decline future-use options without affecting your medical care.
+
+6. Recontact
+We may contact you again if we need updated health information or if we discover findings that might be important for your health and are clinically actionable, where such return of results is offered and you have opted in.
+
+7. Withdrawal
+You may withdraw from the study at any time. If you withdraw, we will stop collecting new information. We will not be able to retrieve data that have already been shared with external researchers in de-identified form, because we cannot track those copies; we will stop new sharing where feasible.
+
+8. Identifiers and re-identification risk
+Your name and direct identifiers will be stored separately from research datasets. Research datasets will be coded. Absolute anonymity cannot be guaranteed for genomic data; we describe residual re-identification risks in the full participant information sheet.
+
+9. Cloud and IT systems
+Some processing and storage may use certified cloud services operated under agreements that require security measures and breach notification procedures. A list of subprocessors is maintained by the sponsor and updated periodically.
+
+10. Contact
+Questions may be directed to the study coordinator at the address in your participant packet.

examples/run_demo.py

Lines changed: 5 additions & 1 deletion
@@ -1,5 +1,9 @@
 #!/usr/bin/env python3
-"""Run ingest + check using the bundled sample files (no Streamlit)."""
+"""Run ingest + check using the bundled sample files (no Streamlit).
+
+Uses the same CLI as README/DEMO.md. Compliance analysis follows REGBOT_LLM_PROVIDER
+(default: ollama + local llama3) or OpenAI when REGBOT_LLM_PROVIDER=openai and OPENAI_API_KEY is set.
+"""
 
 from __future__ import annotations
 

src/main.py

Lines changed: 7 additions & 2 deletions
@@ -24,7 +24,8 @@
 class RegBot:
     """
     GA4GH-oriented compliance assistant: ingest policy text, retrieve hybrid context,
-    then analyze consent / data-use language with optional OpenAI JSON output.
+    then analyze consent / data-use language with optional LLM JSON output
+    (OpenAI API or local Ollama: Llama 3, Mistral, etc.).
     """
 
     def __init__(
@@ -183,7 +184,11 @@ def _cmd_eval(args: argparse.Namespace) -> int:
 
 def build_parser() -> argparse.ArgumentParser:
     p = argparse.ArgumentParser(
-        description="GA4GH-RegBot: ingest policy docs and check consent text.",
+        description=(
+            "GA4GH-RegBot: ingest policy docs and check consent text. "
+            "LLM defaults to local Ollama (see README); set REGBOT_LLM_PROVIDER=openai and "
+            "OPENAI_API_KEY for OpenAI."
+        ),
     )
     p.add_argument(
         "--store",

src/regbot/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-"""GA4GH RegBot: ingestion, hybrid retrieval, and compliance helpers."""
+"""GA4GH RegBot: ingestion, hybrid retrieval, and compliance helpers (default LLM: local Ollama)."""
 
 from src.regbot.config import DEFAULT_COLLECTION, DEFAULT_EMBEDDING_MODEL, MIN_TOKEN_OVERLAP
 from src.regbot.types import ChunkRecord, GroundingAudit, RecommendationItem
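The citation-grounding rule in the README diff restricts every `evidence_chunk_ids` entry to the allow-list of retrieved chunks. A minimal sketch of such a check, assuming plain-dict recommendations (the repo's `RecommendationItem`/`GroundingAudit` types are richer, and the function name here is illustrative):

```python
def audit_grounding(
    recommendations: list[dict],
    allowed_chunk_ids: list[str],
) -> tuple[list[dict], list[dict]]:
    """Split recommendations into (grounded, rejected) against the allow-list."""
    allowed = set(allowed_chunk_ids)
    grounded: list[dict] = []
    rejected: list[dict] = []
    for rec in recommendations:
        ids = rec.get("evidence_chunk_ids", [])
        if ids and all(cid in allowed for cid in ids):
            grounded.append(rec)
        else:
            # Per the README, a failure like this would trigger an automatic
            # rewrite request that includes the allow-list.
            rejected.append(rec)
    return grounded, rejected
```

Because the check is purely structural (id membership, not semantics), it can run after every LLM call at negligible cost, leaving semantic checks to the separate token-overlap filter.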
