
Commit dfb95c2

feat: Ollama-default LLM, resilient embeddings/compliance, doc and sample updates
1 parent 4155ab6 commit dfb95c2

14 files changed: 274 additions & 47 deletions

CONTRIBUTING.md

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@ pre-commit install
 
 Run `pre-commit run --all-files` before pushing if you use the hook.
 
+Unit tests do **not** require Ollama or `OPENAI_API_KEY`; the pipeline test forces the offline path.
+
 ## Tests
 
 ```bash
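The added note says the unit tests force the offline path rather than requiring Ollama or `OPENAI_API_KEY`. A minimal sketch of that testing pattern follows; the `compliance_pass` stand-in is hypothetical (it is not the repo's pipeline function), and only the environment variable names come from the README:

```python
import unittest


def compliance_pass(consent_text: str, env: dict[str, str]) -> dict[str, str]:
    # Hypothetical stand-in for the real pipeline: per the README, the
    # keyword-heuristic fallback runs when OpenAI is selected without a key.
    if env.get("REGBOT_LLM_PROVIDER") == "openai" and not env.get("OPENAI_API_KEY"):
        return {"mode": "fallback"}
    return {"mode": "llm"}


class OfflinePathTest(unittest.TestCase):
    def test_forces_offline_path(self) -> None:
        # Select OpenAI but deliberately omit OPENAI_API_KEY: the offline
        # fallback must still produce a result, so the test needs no network.
        env = {"REGBOT_LLM_PROVIDER": "openai"}
        self.assertEqual(compliance_pass("sample consent text", env)["mode"], "fallback")
```

The same trick (forcing the provider into its no-credentials branch) keeps CI hermetic without mocking an HTTP client.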

README.md

Lines changed: 29 additions & 15 deletions
@@ -1,16 +1,16 @@
 GA4GH-RegBot: Compliance Assistant
-Status: **MVP available** — ingest, hybrid retrieval, optional LLM compliance + programmatic citation checks, CLI, Streamlit, and a small PDF eval harness. Ongoing work: real-corpus evaluation, stricter schemas, and contributor tooling.
+Status: **MVP available** — ingest, hybrid retrieval, **local-first LLM** (Ollama / Llama 3 by default) or optional OpenAI, programmatic citation checks, CLI, Streamlit, and a small PDF eval harness. Ongoing work: real-corpus evaluation, stricter evidence objects, and contributor tooling.
 
 Overview
 RegBot is an LLM-powered tool designed to help researchers map their consent forms against GA4GH regulatory frameworks. It uses RAG (Retrieval-Augmented Generation) to flag compliance gaps automatically.
 
 What works today
 - **Ingest** policy PDFs or `.txt` files into a local **Chroma** store plus a JSON manifest (chunk ids, page hints, source metadata).
 - **Hybrid retrieval**: embedding search + **BM25**, merged with reciprocal rank fusion.
-- **Compliance pass**: one OpenAI JSON call when `OPENAI_API_KEY` is set; otherwise a small keyword gap heuristic that still returns chunk citations.
+- **Compliance pass**: JSON-mode LLM via **[Ollama](https://ollama.com) by default** (e.g. `llama3`, configurable with `REGBOT_OLLAMA_MODEL`). Set `REGBOT_LLM_PROVIDER=openai` and `OPENAI_API_KEY` to use OpenAI instead. If no LLM is reachable (or on API failure), a **keyword heuristic fallback** still returns grounded chunk ids.
 - **Streamlit UI** for upload + paste flows (`src/streamlit_app.py`).
 - **CLI**: `python -m src.main …` (see below).
-- **Citation grounding (programmatic):** Each `recommendations[]` item must be `{ "text": "...", "evidence_chunk_ids": ["..."] }` with ids taken **only** from retrieved chunks; optional `citations[]` must also respect the same allow-list. Failed checks trigger **one automatic rewrite request** with the allow-list.
+- **Citation grounding (programmatic):** Each `recommendations[]` item must be `{ "text": "...", "evidence_chunk_ids": ["..."] }` with ids taken **only** from retrieved chunks; optional `citations[]` must also respect the same allow-list. Failed checks trigger **automatic rewrite requests** with the allow-list; optional **token-overlap** filtering on the LLM path (`REGBOT_MIN_TOKEN_OVERLAP`).
 - **PDF eval harness:** `eval` subcommand ingests a real GA4GH PDF and prints retrieval hits for built-in or custom queries (for manual review / building a gold set later).
 
 Quickstart (Development)
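The hybrid retrieval bullet in the hunk above merges embedding and BM25 rankings with reciprocal rank fusion. As a rough sketch of that technique (not the repo's actual implementation; its constant and any per-list weights may differ), each list contributes `1/(k + rank)` per chunk id:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each appearance scores 1/(k + rank).

    `rankings` holds one ranked id list per retriever (e.g. embeddings, BM25).
    k=60 is the constant from the original RRF formulation.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


embed_hits = ["c3", "c1", "c7"]   # illustrative chunk ids
bm25_hits = ["c3", "c9", "c1"]
fused = reciprocal_rank_fusion([embed_hits, bm25_hits])
print(fused)  # "c3" ranks first: it tops both lists
```

Ids found by both retrievers accumulate score from each list, which is why the fused order can differ from either input ranking.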
@@ -28,6 +28,12 @@ python -m pip install -r requirements.txt
 - Export variables in your shell (recommended)
 - If you use a local `.env`, keep it private and do not commit it
 
+- **LLM (default: local Ollama)**
+  Install [Ollama](https://ollama.com), run `ollama pull llama3` (or another tag you set in `REGBOT_OLLAMA_MODEL`), and keep the daemon running (`ollama serve` or `brew services start ollama` on macOS). No `OPENAI_API_KEY` is required for this path.
+
+- **Embeddings (first ingest)**
+  The embedding model is downloaded from Hugging Face on first use. If downloads are slow or fail, try a longer timeout (`HF_HUB_DOWNLOAD_TIMEOUT`, seconds) or a mirror (`REGBOT_HF_ENDPOINT=https://hf-mirror.com` — sets `HF_ENDPOINT` for the Hub client).
+
 - Ingest a policy file into `./data/regbot_store` (use `--reset` when reloading the same corpus):
 
 ```bash
@@ -52,6 +58,8 @@ python -m streamlit run src/streamlit_app.py
 python examples/run_demo.py
 ```
 
+More detail: **`examples/DEMO.md`**.
+
 Evaluate retrieval on a **real** GA4GH PDF (resets the store by default if you pass `--reset`):
 
 ```bash
@@ -77,27 +85,33 @@ python -m unittest discover -s tests -p "test*.py" -v
 ```
 
 Environment Variables
-- `OPENAI_API_KEY`: Optional; enables the JSON LLM compliance pass via `REGBOT_LLM_MODEL` (default `gpt-4o-mini`).
-- `REGBOT_STORE`: Optional override for the on-disk store directory (default `./data/regbot_store`).
-- `REGBOT_EMBEDDING_MODEL`: Optional SentenceTransformers model id (default `sentence-transformers/all-MiniLM-L6-v2`).
-- `REGBOT_MIN_TOKEN_OVERLAP`: For the LLM path, minimum **token recall** between each recommendation and the cited chunk texts (default `0.06`). Set to `0` to disable dropping rows for low overlap (scores may still be attached).
+- `REGBOT_LLM_PROVIDER`: **`ollama` (default)** — local LLM via Ollama’s OpenAI-compatible HTTP API (no OpenAI key). Set to **`openai`** to use OpenAI’s hosted API instead.
+- `OPENAI_API_KEY`: Required only when `REGBOT_LLM_PROVIDER=openai`. Model: `REGBOT_LLM_MODEL` (default `gpt-4o-mini`).
+- `REGBOT_OLLAMA_MODEL`: Tag known to Ollama (default `llama3`). Examples: `llama3`, `mistral`, `mistral:latest`.
+- `REGBOT_OLLAMA_BASE_URL`: Ollama HTTP host only (default `http://127.0.0.1:11434`); `/v1` is appended automatically for the OpenAI-compatible routes.
+- `REGBOT_OLLAMA_API_KEY`: Sent as the Bearer/API key to Ollama’s shim (default `ollama`; ignored by Ollama).
+- `REGBOT_STORE`: On-disk store directory (default `./data/regbot_store`).
+- `REGBOT_EMBEDDING_MODEL`: SentenceTransformers model id (default `sentence-transformers/all-MiniLM-L6-v2`).
+- `HF_HUB_DOWNLOAD_TIMEOUT`: Hugging Face Hub download timeout in seconds (embedding model on first use). The app sets a higher default when unset; increase if you see read timeouts.
+- `REGBOT_HF_ENDPOINT`: If set, copied to `HF_ENDPOINT` (e.g. `https://hf-mirror.com` where Hub mirrors are used).
+- `REGBOT_MIN_TOKEN_OVERLAP`: On the LLM path, minimum **token recall** between each recommendation and cited chunk texts (default `0.06`). Set to `0` to disable dropping low-overlap rows.
 - `REGBOT_CHROMA_ANONYMIZED_TELEMETRY`: Set to `1` to enable Chroma client telemetry; default is off (`0`).
-- `REGBOT_OPENAI_MAX_RETRIES`: Maximum retries for transient OpenAI API errors (default `3`).
+- `REGBOT_OPENAI_MAX_RETRIES`: Retries for the **OpenAI Python client** (used for both OpenAI API and Ollama’s compatible endpoint; default `3`).
 
 Architecture (implemented vs planned)
-- **Core:** Python 3, modular package under `src/regbot/` (ingest, hybrid retrieval, compliance).
-- **Embeddings:** `sentence-transformers` (default `all-MiniLM-L6-v2`).
-- **Vector store:** Chroma persistent store under `REGBOT_STORE/chroma` plus `manifest.json` for BM25 text.
+- **Core:** Python 3, package under `src/regbot/` (ingest, hybrid retrieval, compliance, optional local embedding download helpers).
+- **Embeddings:** `sentence-transformers` + Hugging Face Hub (minimal file set; ONNX-heavy artifacts skipped where possible).
+- **Vector store:** Chroma persistent files under `REGBOT_STORE/chroma` plus `manifest.json` for BM25 text.
 - **Retrieval:** cosine similarity in Chroma + `rank-bm25`, fused via reciprocal rank fusion; optional metadata category filter.
-- **LLM:** OpenAI Chat Completions JSON mode when `OPENAI_API_KEY` is set; offline keyword-style fallback otherwise.
+- **LLM:** **Default:** Ollama (`llama3` or `REGBOT_OLLAMA_MODEL`) via OpenAI-compatible chat completions + JSON parsing. **Optional:** `REGBOT_LLM_PROVIDER=openai` with `OPENAI_API_KEY`. **Fallback:** keyword heuristic if OpenAI is selected without a key, or after LLM errors (e.g. Ollama not running).
 - **UI:** Streamlit (`src/streamlit_app.py`).
-- **Optional / roadmap:** optional LangChain/LlamaIndex adapters on top of the same stores; richer offline evaluation (Ragas, human labels); structured per-recommendation evidence fields.
+- **Optional / roadmap:** LangChain or LlamaIndex adapters on top of the same stores (not required by the current code); richer offline evaluation (Ragas, human labels); structured per-recommendation evidence (e.g. quotes).
 
 Next steps (suggested priorities)
 1. **Real GA4GH corpus**: ingest official PDFs, tune chunk size/overlap and hybrid fusion weights using `eval` + a small **gold query → chunk_id** list (manual or semi-automated).
-2. **Stricter outputs:** `evidence_chunk_ids[]` plus programmatic ID checks, token-overlap filtering on the LLM path (`REGBOT_MIN_TOKEN_OVERLAP`), and retries when grounding/overlap fails. **Next:** richer evidence objects (e.g. optional quotes), stricter refusal when excerpts are insufficient.
+2. **Richer evidence:** optional quoted spans, stricter refusal when retrieved excerpts are insufficient (grounding and token-overlap checks are already in place for the LLM path).
 3. **Contributor experience**: **Done in-repo:** separate **Lint** workflow (Ruff check + format check), `CONTRIBUTING.md`, `.pre-commit-config.yaml`, `pyproject.toml`, `requirements-dev.txt`. **Still open:** optional CI `mypy`, broader type hints, Black-only rules if the team wants them.
-4. **Operational hardening**: **Done in-repo:** Chroma telemetry off by default (`REGBOT_CHROMA_ANONYMIZED_TELEMETRY`), OpenAI client `max_retries` via `REGBOT_OPENAI_MAX_RETRIES`, clear `ValueError` when a PDF yields no extractable text. **Next:** optional request timeouts, Chroma/OpenAI observability hooks.
+4. **Operational hardening**: **Done in-repo:** Chroma telemetry off by default (`REGBOT_CHROMA_ANONYMIZED_TELEMETRY`), client `max_retries` (`REGBOT_OPENAI_MAX_RETRIES`), clear `ValueError` when a PDF yields no extractable text. **Next:** optional request timeouts, observability hooks.
 
 Contributing
 - See **`CONTRIBUTING.md`** for venv setup, **Ruff** lint/format, optional **pre-commit**, and tests.
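The `REGBOT_MIN_TOKEN_OVERLAP` gate described in the README's environment variables measures token recall between a recommendation and the chunk texts it cites. A simplified sketch of that metric (whitespace tokenization; the repo's tokenizer and threshold handling may differ):

```python
def token_recall(recommendation: str, cited_texts: list[str]) -> float:
    """Fraction of the recommendation's distinct tokens that appear in the cited chunks."""
    rec_tokens = set(recommendation.lower().split())
    if not rec_tokens:
        return 0.0
    cited_tokens = set(" ".join(cited_texts).lower().split())
    return len(rec_tokens & cited_tokens) / len(rec_tokens)


# Per the README, a recommendation row would be dropped on the LLM path when
# its recall falls below REGBOT_MIN_TOKEN_OVERLAP (default 0.06).
score = token_recall(
    "data may cross borders",
    ["Some data may cross national borders under contractual safeguards."],
)
```

Recall (rather than precision) is the natural direction here: it asks how much of the *claim* is supported by the evidence, not how much of the evidence was used.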

examples/DEMO.md

Lines changed: 9 additions & 3 deletions
@@ -1,6 +1,6 @@
 # Demo (local)
 
-From the repository root, with a virtualenv activated and dependencies installed:
+Run the steps below from the repository root, with a virtualenv activated and dependencies installed (`pip install -r requirements.txt`). For the compliance step, **embeddings** download on first ingest (Hugging Face); the **default LLM is local Ollama** (`llama3` unless `REGBOT_OLLAMA_MODEL` is set): install [Ollama](https://ollama.com), run `ollama pull llama3`, and keep the daemon running.
 
 1. Ingest the bundled synthetic policy text (resets the local store):
 
@@ -14,10 +14,16 @@ python -m src.main --store ./data/regbot_store ingest --path examples/data/sampl
 python -m src.main --store ./data/regbot_store check --consent examples/data/sample_consent_short.txt
 ```
 
-3. Optional UI:
+3. Or run both steps via the helper script (same as 1+2):
+
+```bash
+python examples/run_demo.py
+```
+
+4. Optional UI:
 
 ```bash
 python -m streamlit run src/streamlit_app.py
 ```
 
-Set `OPENAI_API_KEY` in your environment for JSON output from the configured chat model (`REGBOT_LLM_MODEL`, default `gpt-4o-mini`). Without a key, the tool still retrieves policy chunks and returns a small keyword-style gap summary.
+**LLM behavior:** With **Ollama** running and default settings (`REGBOT_LLM_PROVIDER` defaults to `ollama`), you get JSON compliance output from the local model without `OPENAI_API_KEY`. To use **OpenAI** instead: `export REGBOT_LLM_PROVIDER=openai` and set `OPENAI_API_KEY`; optional `REGBOT_LLM_MODEL` (default `gpt-4o-mini`). If no LLM is reachable, the tool still retrieves policy chunks and uses a **keyword-style fallback** with grounded chunk ids.
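The provider switching described in the DEMO.md hunk above could be modeled roughly as follows. The function name and return shape are hypothetical (the repo's helpers may look different); only the environment variable names and defaults come from the docs, including the detail that `/v1` is appended to the Ollama base URL for its OpenAI-compatible routes:

```python
def resolve_llm(env: dict[str, str]) -> dict[str, str]:
    """Sketch of provider resolution: Ollama by default, OpenAI opt-in, fallback otherwise."""
    provider = env.get("REGBOT_LLM_PROVIDER", "ollama")
    if provider == "openai":
        if not env.get("OPENAI_API_KEY"):
            # No key: the keyword-heuristic fallback path would run instead.
            return {"provider": "fallback"}
        return {"provider": "openai", "model": env.get("REGBOT_LLM_MODEL", "gpt-4o-mini")}
    # Default: local Ollama via its OpenAI-compatible endpoint.
    base = env.get("REGBOT_OLLAMA_BASE_URL", "http://127.0.0.1:11434")
    return {
        "provider": "ollama",
        "model": env.get("REGBOT_OLLAMA_MODEL", "llama3"),
        "base_url": base.rstrip("/") + "/v1",  # /v1 appended automatically
    }


print(resolve_llm({}))  # default config: local llama3, no key required
```

Passing the environment in as a dict keeps the logic trivially testable without mutating `os.environ`.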
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+Study consent and data use agreement — SYNTHETIC EXAMPLE (not a real legal document)
+
+Title: Example Regional Biobank — Cardiometabolic Genomics Cohort (fictional)
+
+1. Purpose
+You are invited to take part in a research study led by Example University School of Medicine and affiliated hospitals. The study aims to understand genetic and environmental contributors to cardiometabolic disease. If you agree, we will collect biological samples (blood or saliva, as described at your visit), extract DNA, and generate genomic and related laboratory data.
+
+2. What we will do with your data
+Genomic data may be combined with health information from your medical records (diagnoses, medications, and routine labs) using coded identifiers. Analysts within the study team will use these data to test scientific hypotheses described in the study protocol. We may also use aggregated or summary statistics that do not identify individuals in scientific publications.
+
+3. Sharing beyond the core study team
+De-identified genomic and phenotypic datasets may be shared with approved researchers at other academic institutions for research that is consistent with this study’s scientific aims. We do not sell your personal information. Commercial partners may receive only aggregated results unless you separately agree to a specific collaboration described in an addendum.
+
+4. International data flows
+Some analysis may occur on servers located in the United States and the European Union. Where data cross borders, we apply contractual safeguards and access controls that align with our institutional policies. Detailed country-by-country transfer mechanisms are available upon request from the study office.
+
+5. Secondary use and future research
+We may seek permission to use leftover samples or data for future research that is not yet fully specified, provided it falls within broad categories approved by our research ethics committee and is compatible with the scope you authorize below. You may decline future-use options without affecting your medical care.
+
+6. Recontact
+We may contact you again if we need updated health information or if we discover findings that might be important for your health and are clinically actionable, where such return of results is offered and you have opted in.
+
+7. Withdrawal
+You may withdraw from the study at any time. If you withdraw, we will stop collecting new information. We will not be able to retrieve data that have already been shared with external researchers in de-identified form, because we cannot track those copies; we will stop new sharing where feasible.
+
+8. Identifiers and re-identification risk
+Your name and direct identifiers will be stored separately from research datasets. Research datasets will be coded. Absolute anonymity cannot be guaranteed for genomic data; we describe residual re-identification risks in the full participant information sheet.
+
+9. Cloud and IT systems
+Some processing and storage may use certified cloud services operated under agreements that require security measures and breach notification procedures. A list of subprocessors is maintained by the sponsor and updated periodically.
+
+10. Contact
+Questions may be directed to the study coordinator at the address in your participant packet.

examples/run_demo.py

Lines changed: 5 additions & 1 deletion
@@ -1,5 +1,9 @@
 #!/usr/bin/env python3
-"""Run ingest + check using the bundled sample files (no Streamlit)."""
+"""Run ingest + check using the bundled sample files (no Streamlit).
+
+Uses the same CLI as README/DEMO.md. Compliance analysis follows REGBOT_LLM_PROVIDER
+(default: ollama + local llama3) or OpenAI when REGBOT_LLM_PROVIDER=openai and OPENAI_API_KEY is set.
+"""
 
 from __future__ import annotations
 

src/main.py

Lines changed: 7 additions & 2 deletions
@@ -24,7 +24,8 @@
 class RegBot:
     """
     GA4GH-oriented compliance assistant: ingest policy text, retrieve hybrid context,
-    then analyze consent / data-use language with optional OpenAI JSON output.
+    then analyze consent / data-use language with optional LLM JSON output
+    (OpenAI API or local Ollama: Llama 3, Mistral, etc.).
     """
 
     def __init__(
@@ -183,7 +184,11 @@ def _cmd_eval(args: argparse.Namespace) -> int:
 
 def build_parser() -> argparse.ArgumentParser:
     p = argparse.ArgumentParser(
-        description="GA4GH-RegBot: ingest policy docs and check consent text.",
+        description=(
+            "GA4GH-RegBot: ingest policy docs and check consent text. "
+            "LLM defaults to local Ollama (see README); set REGBOT_LLM_PROVIDER=openai and "
+            "OPENAI_API_KEY for OpenAI."
+        ),
     )
     p.add_argument(
         "--store",

src/regbot/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-"""GA4GH RegBot: ingestion, hybrid retrieval, and compliance helpers."""
+"""GA4GH RegBot: ingestion, hybrid retrieval, and compliance helpers (default LLM: local Ollama)."""
 
 from src.regbot.config import DEFAULT_COLLECTION, DEFAULT_EMBEDDING_MODEL, MIN_TOKEN_OVERLAP
 from src.regbot.types import ChunkRecord, GroundingAudit, RecommendationItem
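The citation-grounding rule in the README diff restricts every `evidence_chunk_ids` entry to the allow-list of retrieved chunks. A minimal sketch of such a check, assuming plain-dict recommendations (the repo's `RecommendationItem`/`GroundingAudit` types are richer, and the function name here is illustrative):

```python
def audit_grounding(
    recommendations: list[dict],
    allowed_chunk_ids: list[str],
) -> tuple[list[dict], list[dict]]:
    """Split recommendations into (grounded, rejected) against the allow-list."""
    allowed = set(allowed_chunk_ids)
    grounded: list[dict] = []
    rejected: list[dict] = []
    for rec in recommendations:
        ids = rec.get("evidence_chunk_ids", [])
        if ids and all(cid in allowed for cid in ids):
            grounded.append(rec)
        else:
            # Per the README, a failure like this would trigger an automatic
            # rewrite request that includes the allow-list.
            rejected.append(rec)
    return grounded, rejected
```

Because the check is purely structural (id membership, not semantics), it can run after every LLM call at negligible cost, leaving semantic checks to the separate token-overlap filter.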
