Changes from all commits
29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,29 @@
name: CI

on:
push:
pull_request:

jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt

- name: Run unit tests
run: |
python -m unittest discover -s tests -p "test*.py" -v

32 changes: 32 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,32 @@
name: Lint

on:
push:
pull_request:

jobs:
ruff:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"

- name: Install Ruff
run: |
python -m pip install --upgrade pip
pip install ruff==0.8.4

- name: Ruff check
run: ruff check src tests

- name: Ruff format (check only)
run: ruff format --check src tests

- name: Mypy (package src.regbot)
run: |
pip install mypy==1.13.0
python -m mypy -p src.regbot
11 changes: 11 additions & 0 deletions .gitignore
@@ -0,0 +1,11 @@
.env
.venv/
.venv*/
__pycache__/
*.pyc
.pytest_cache/
.DS_Store

# Local vector store + manifests
data/regbot_store/

7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,7 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.8.4
hooks:
- id: ruff
args: [--fix]
- id: ruff-format
59 changes: 59 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,59 @@
# Contributing to GA4GH-RegBot

## Environment

- Use **Python 3.10–3.12** (3.11 matches CI). Avoid 3.14 for the full ML/Chroma stack until wheels catch up.
- Create a venv and install runtime deps:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
```

- Optional dev tools (lint + pre-commit):

```bash
pip install -r requirements-dev.txt
pre-commit install
```

Run `pre-commit run --all-files` before pushing if you use the hook.

## Tests

```bash
python -m unittest discover -s tests -p "test*.py" -v
```

## Lint

```bash
ruff check src tests
ruff format --check src tests
```

Auto-format:

```bash
ruff format src tests
```

## Type check (optional)

```bash
pip install -r requirements-dev.txt
python -m mypy -p src.regbot
```

This type-checks the `src.regbot` package (same as CI).

## Secrets and local data

- Do **not** commit `.env`, API keys, or your local vector store under `data/regbot_store/`.
- Keep PRs focused: one logical change per PR, update tests when behavior changes.

## Where to start

- See **Next steps** in `README.md` for suggested features (gold eval set, stricter JSON schema, ops hardening).
103 changes: 93 additions & 10 deletions README.md
@@ -1,21 +1,104 @@
GA4GH-RegBot: Compliance Assistant
Status: Proposal Stage for GSoC 2026
Status: **MVP available** — ingest, hybrid retrieval, optional LLM compliance + programmatic citation checks, CLI, Streamlit, and a small PDF eval harness. Ongoing work: real-corpus evaluation, stricter schemas, and contributor tooling.

Overview
RegBot is an LLM-powered tool designed to help researchers map their consent forms against GA4GH regulatory frameworks. It uses RAG (Retrieval-Augmented Generation) to flag compliance gaps automatically.

Architecture (Planned)
Core: Python
What works today
- **Ingest** policy PDFs or `.txt` files into a local **Chroma** store plus a JSON manifest (chunk ids, page hints, source metadata).
- **Hybrid retrieval**: embedding search + **BM25**, merged with reciprocal rank fusion.
- **Compliance pass**: one OpenAI JSON call when `OPENAI_API_KEY` is set; otherwise a small keyword gap heuristic that still returns chunk citations.
- **Streamlit UI** for upload + paste flows (`src/streamlit_app.py`).
- **CLI**: `python -m src.main …` (see below).
- **Citation grounding (programmatic):** Each `recommendations[]` item must be `{ "text": "...", "evidence_chunk_ids": ["..."] }` with ids taken **only** from retrieved chunks; optional `citations[]` must also respect the same allow-list. Failed checks trigger **one automatic rewrite request** with the allow-list.
- **PDF eval harness:** `eval` subcommand ingests a real GA4GH PDF and prints retrieval hits for built-in or custom queries (for manual review / building a gold set later).
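For illustration, the citation allow-list check can be sketched in plain Python (hypothetical helper name and shapes; the repo's actual validator may differ):

```python
def validate_recommendations(recommendations, retrieved_chunk_ids):
    """Check that every cited chunk id comes from the retrieved allow-list.

    Returns (ok, offending_ids); any offending id would trigger the
    single automatic rewrite request described above.
    """
    allowed = set(retrieved_chunk_ids)
    offending = []
    for rec in recommendations:
        # Assumes `citations` lives on each recommendation item; the
        # real schema may attach it elsewhere.
        cited = rec.get("evidence_chunk_ids", []) + rec.get("citations", [])
        for chunk_id in cited:
            if chunk_id not in allowed:
                offending.append(chunk_id)
    return (not offending, offending)
```

A failed check would carry `offending_ids` plus the allow-list into the rewrite prompt.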

LLM Framework: LangChain / LlamaIndex
Quickstart (Development)
- Prerequisites: **Python 3.10–3.12** (CI uses 3.11). Python 3.14 is not supported yet for the full stack (native wheels for parts of the ML/Chroma toolchain often lag).
- Create a virtual environment and install dependencies:

Vector Store: ChromaDB / FAISS
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```

UI: Streamlit
- Configure environment variables:
- Export variables in your shell (recommended)
- If you use a local `.env`, keep it private and do not commit it

Roadmap
Phase 1: Ingest GA4GH "Framework for Responsible Sharing" policy documents.
- Ingest a policy file into `./data/regbot_store` (use `--reset` when reloading the same corpus):

Phase 2: Build RAG pipeline for clause extraction.
```bash
python -m src.main ingest --path path/to/policy.pdf --reset
```

Phase 3: Develop Streamlit frontend for user uploads.
- Check a consent / data-use text file:

```bash
python -m src.main check --consent path/to/consent.txt
```

- Run the Streamlit UI from the repo root:

```bash
python -m streamlit run src/streamlit_app.py
```

- End-to-end sample (synthetic policy + consent under `examples/`):

```bash
python examples/run_demo.py
```

Evaluate retrieval on a **real** GA4GH PDF (pass `--reset` to rebuild the store from scratch):

```bash
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --top-k 8
```

Use your own query list (one line per query):

```bash
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --queries-file examples/eval/queries_ga4gh.txt
```

Optionally append a full compliance JSON report for a consent file:

```bash
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --consent path/to/consent.txt
```

Run tests

```bash
python -m unittest discover -s tests -p "test*.py" -v
```

Environment Variables
- `OPENAI_API_KEY`: Optional; enables the JSON LLM compliance pass via `REGBOT_LLM_MODEL` (default `gpt-4o-mini`).
- `REGBOT_STORE`: Optional override for the on-disk store directory (default `./data/regbot_store`).
- `REGBOT_EMBEDDING_MODEL`: Optional SentenceTransformers model id (default `sentence-transformers/all-MiniLM-L6-v2`).
- `REGBOT_MIN_TOKEN_OVERLAP`: For the LLM path, minimum **token recall** between each recommendation and the cited chunk texts (default `0.06`). Set to `0` to disable dropping rows for low overlap (scores may still be attached).
- `REGBOT_CHROMA_ANONYMIZED_TELEMETRY`: Set to `1` to enable Chroma client telemetry; default is off (`0`).
- `REGBOT_OPENAI_MAX_RETRIES`: Maximum retries for transient OpenAI API errors (default `3`).
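As a rough illustration of the token-recall gate behind `REGBOT_MIN_TOKEN_OVERLAP` (a simplified sketch; the tool's actual tokenization may differ):

```python
import re


def token_recall(recommendation: str, evidence_texts: list[str]) -> float:
    """Fraction of the recommendation's word tokens that also appear in
    the cited chunk texts; 0.0 for an empty recommendation."""
    rec_tokens = set(re.findall(r"[a-z0-9]+", recommendation.lower()))
    if not rec_tokens:
        return 0.0
    evidence_tokens: set[str] = set()
    for text in evidence_texts:
        evidence_tokens.update(re.findall(r"[a-z0-9]+", text.lower()))
    return len(rec_tokens & evidence_tokens) / len(rec_tokens)
```

A recommendation scoring below the threshold (default `0.06`) would be dropped on the LLM path.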

Architecture (implemented vs planned)
- **Core:** Python 3, modular package under `src/regbot/` (ingest, hybrid retrieval, compliance).
- **Embeddings:** `sentence-transformers` (default `all-MiniLM-L6-v2`).
- **Vector store:** Chroma persistent store under `REGBOT_STORE/chroma` plus `manifest.json` for BM25 text.
- **Retrieval:** cosine similarity in Chroma + `rank-bm25`, fused via reciprocal rank fusion; optional metadata category filter.
- **LLM:** OpenAI Chat Completions JSON mode when `OPENAI_API_KEY` is set; offline keyword-style fallback otherwise.
- **UI:** Streamlit (`src/streamlit_app.py`).
- **Roadmap (optional):** LangChain/LlamaIndex adapters on top of the same stores; richer offline evaluation (Ragas, human labels); structured per-recommendation evidence fields.
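The reciprocal rank fusion step is small enough to sketch; this uses the common `k = 60` constant, which may not match the repo's choice:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each id accumulates 1 / (k + rank) for each
    list it appears in, then ids are returned best-first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of RRF here is that it needs no score calibration between the embedding and BM25 rankers: only ranks matter.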

Next steps (suggested priorities)
1. **Real GA4GH corpus**: ingest official PDFs, tune chunk size/overlap and hybrid fusion weights using `eval` + a small **gold query → chunk_id** list (manual or semi-automated).
2. **Stricter outputs:** `evidence_chunk_ids[]` plus programmatic ID checks, token-overlap filtering on the LLM path (`REGBOT_MIN_TOKEN_OVERLAP`), and retries when grounding/overlap fails. **Next:** richer evidence objects (e.g. optional quotes), stricter refusal when excerpts are insufficient.
3. **Contributor experience**: **Done in-repo:** separate **Lint** workflow (Ruff check + format check), `CONTRIBUTING.md`, `.pre-commit-config.yaml`, `pyproject.toml`, `requirements-dev.txt`. **Still open:** optional CI `mypy`, broader type hints, Black-only rules if the team wants them.
4. **Operational hardening**: **Done in-repo:** Chroma telemetry off by default (`REGBOT_CHROMA_ANONYMIZED_TELEMETRY`), OpenAI client `max_retries` via `REGBOT_OPENAI_MAX_RETRIES`, clear `ValueError` when a PDF yields no extractable text. **Next:** optional request timeouts, Chroma/OpenAI observability hooks.

Contributing
- See **`CONTRIBUTING.md`** for venv setup, **Ruff** lint/format, optional **pre-commit**, and tests.
- Open PRs against the upstream repo; keep changes scoped and tested (`python -m unittest discover -s tests -p "test*.py" -v`). Do not commit `.env`, API keys, or local `data/regbot_store/`.
23 changes: 23 additions & 0 deletions examples/DEMO.md
@@ -0,0 +1,23 @@
# Demo (local)

From the repository root, with a virtualenv activated and dependencies installed:

1. Ingest the bundled synthetic policy text (resets the local store):

```bash
python -m src.main --store ./data/regbot_store ingest --path examples/data/sample_ga4gh_policy_stub.txt --reset
```

2. Run a check against the sample consent:

```bash
python -m src.main --store ./data/regbot_store check --consent examples/data/sample_consent_short.txt
```

3. Optional UI:

```bash
python -m streamlit run src/streamlit_app.py
```

Set `OPENAI_API_KEY` in your environment for JSON output from the configured chat model (`REGBOT_LLM_MODEL`, default `gpt-4o-mini`). Without a key, the tool still retrieves policy chunks and returns a small keyword-style gap summary.
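The offline fallback is deliberately simple; conceptually it resembles the following (the topic/keyword table here is invented for illustration, not the tool's actual list):

```python
# Hypothetical topic -> trigger keywords; the real heuristic's list differs.
GAP_TOPICS = {
    "withdrawal": ["withdraw"],
    "international transfer": ["international", "cross-border"],
    "secondary use": ["secondary use", "future research"],
}


def keyword_gaps(consent_text: str) -> list[str]:
    """Return topics whose trigger keywords never appear in the consent."""
    lowered = consent_text.lower()
    return [
        topic
        for topic, keywords in GAP_TOPICS.items()
        if not any(keyword in lowered for keyword in keywords)
    ]
```

Each flagged topic is then paired with the retrieved policy chunks so the summary still carries citations.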
3 changes: 3 additions & 0 deletions examples/data/sample_consent_short.txt
@@ -0,0 +1,3 @@
Study consent excerpt (synthetic)

We will collect blood samples for genomic analysis related to diabetes risk. Samples will be stored at Example University Biobank. Data may be shared with qualified researchers for the primary study aims described in the participant information sheet. We will not sell data. Participants may withdraw from the study at any time.
10 changes: 10 additions & 0 deletions examples/data/sample_ga4gh_policy_stub.txt
@@ -0,0 +1,10 @@
GA4GH-style policy excerpt (synthetic, for demos only)

Section A — Responsible sharing
Researchers must document the purpose of the study, the categories of genomic data involved, and the geographic scope of sharing. Data use should be limited to the purposes described in the informed consent or data use agreement.

Section B — Transparency and participant rights
Participants should be informed about secondary uses, recontact policies, and withdrawal of consent. Where data are shared broadly, the consent should describe any international transfers and the safeguards applied (including access controls and re-identification risk management).

Section C — Security
Appropriate technical and organizational measures must protect data at rest and in transit. Cloud processing is allowed only when explicitly covered by the consent and compatible with applicable obligations.
5 changes: 5 additions & 0 deletions examples/eval/queries_ga4gh.txt
@@ -0,0 +1,5 @@
responsible sharing framework consent withdrawal
secondary use of genomic data limitations
international data transfer safeguards
security measures for genomic data
participant transparency and recontact
40 changes: 40 additions & 0 deletions examples/run_demo.py
@@ -0,0 +1,40 @@
#!/usr/bin/env python3
"""Run ingest + check using the bundled sample files (no Streamlit)."""

from __future__ import annotations

import subprocess
import sys
from pathlib import Path

ROOT = Path(__file__).resolve().parents[1]


def main() -> int:
store = ROOT / "data" / "regbot_store"
policy = ROOT / "examples" / "data" / "sample_ga4gh_policy_stub.txt"
consent = ROOT / "examples" / "data" / "sample_consent_short.txt"
py = sys.executable
subprocess.check_call(
[
py,
"-m",
"src.main",
"--store",
str(store),
"ingest",
"--path",
str(policy),
"--reset",
],
cwd=str(ROOT),
)
subprocess.check_call(
[py, "-m", "src.main", "--store", str(store), "check", "--consent", str(consent)],
cwd=str(ROOT),
)
return 0


if __name__ == "__main__":
raise SystemExit(main())
32 changes: 32 additions & 0 deletions pyproject.toml
@@ -0,0 +1,32 @@
[project]
name = "ga4gh-regbot"
version = "0.0.0"
requires-python = ">=3.10,<3.14"
readme = "README.md"
description = "GA4GH-RegBot compliance assistant (MVP)"

[tool.ruff]
target-version = "py310"
line-length = 100
src = ["src", "tests"]

[tool.ruff.lint]
select = ["E", "F", "I", "W"]
ignore = ["E501"]

[tool.ruff.lint.per-file-ignores]
# sys.path bootstrap must run before project imports
"src/main.py" = ["E402"]
"src/streamlit_app.py" = ["E402"]

[tool.ruff.format]
quote-style = "double"

[tool.mypy]
python_version = "3.10"
ignore_missing_imports = true
warn_unused_ignores = true
check_untyped_defs = true
disallow_untyped_defs = false
no_implicit_optional = true
explicit_package_bases = true
5 changes: 5 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,5 @@
# Development-only tools (not required to run the app)
-r requirements.txt
ruff==0.8.4
pre-commit==3.6.0
mypy==1.13.0