Changes from all commits
29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,29 @@
name: CI

on:
push:
pull_request:

jobs:
test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt

- name: Run unit tests
run: |
python -m unittest discover -s tests -p "test*.py" -v

32 changes: 32 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,32 @@
name: Lint

on:
push:
pull_request:

jobs:
ruff:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"

- name: Install Ruff
run: |
python -m pip install --upgrade pip
pip install ruff==0.8.4

- name: Ruff check
run: ruff check src tests

- name: Ruff format (check only)
run: ruff format --check src tests

- name: Mypy (package src.regbot)
run: |
pip install mypy==1.13.0
python -m mypy -p src.regbot
11 changes: 11 additions & 0 deletions .gitignore
@@ -0,0 +1,11 @@
.env
.venv/
.venv*/
__pycache__/
*.pyc
.pytest_cache/
.DS_Store

# Local vector store + manifests
data/regbot_store/

7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,7 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.8.4
hooks:
- id: ruff
args: [--fix]
- id: ruff-format
59 changes: 59 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,59 @@
# Contributing to GA4GH-RegBot

## Environment

- Use **Python 3.10–3.12** (3.11 matches CI). Avoid 3.14 for the full ML/Chroma stack until wheels catch up.
- Create a venv and install runtime deps:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
```

- Optional dev tools (lint + pre-commit):

```bash
pip install -r requirements-dev.txt
pre-commit install
```

Run `pre-commit run --all-files` before pushing if you use the hook.

## Tests

```bash
python -m unittest discover -s tests -p "test*.py" -v
```

## Lint

```bash
ruff check src tests
ruff format --check src tests
```

Auto-format:

```bash
ruff format src tests
```

## Type check (optional)

```bash
pip install -r requirements-dev.txt
python -m mypy -p src.regbot
```

This type-checks the `src.regbot` package (same as CI).

## Secrets and local data

- Do **not** commit `.env`, API keys, or your local vector store under `data/regbot_store/`.
- Keep PRs focused: one logical change per PR, update tests when behavior changes.

## Where to start

- See **Next steps** in `README.md` for suggested features (gold eval set, stricter JSON schema, ops hardening).
103 changes: 93 additions & 10 deletions README.md
@@ -1,21 +1,104 @@
GA4GH-RegBot: Compliance Assistant
Status: Proposal Stage for GSoC 2026
Status: **MVP available** — ingest, hybrid retrieval, optional LLM compliance + programmatic citation checks, CLI, Streamlit, and a small PDF eval harness. Ongoing work: real-corpus evaluation, stricter schemas, and contributor tooling.

Overview
RegBot is an LLM-powered tool designed to help researchers map their consent forms against GA4GH regulatory frameworks. It uses RAG (Retrieval-Augmented Generation) to flag compliance gaps automatically.

Architecture (Planned)
Core: Python
What works today
- **Ingest** policy PDFs or `.txt` files into a local **Chroma** store plus a JSON manifest (chunk ids, page hints, source metadata).
- **Hybrid retrieval**: embedding search + **BM25**, merged with reciprocal rank fusion.
- **Compliance pass**: one OpenAI JSON call when `OPENAI_API_KEY` is set; otherwise a small keyword gap heuristic that still returns chunk citations.
- **Streamlit UI** for upload + paste flows (`src/streamlit_app.py`).
- **CLI**: `python -m src.main …` (see below).
- **Citation grounding (programmatic):** Each `recommendations[]` item must be `{ "text": "...", "evidence_chunk_ids": ["..."] }` with ids taken **only** from retrieved chunks; optional `citations[]` must also respect the same allow-list. Failed checks trigger **one automatic rewrite request** with the allow-list.
- **PDF eval harness:** `eval` subcommand ingests a real GA4GH PDF and prints retrieval hits for built-in or custom queries (for manual review / building a gold set later).
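For illustration, the citation allow-list check can be sketched in plain Python (hypothetical helper name and shapes; the repo's actual validator may differ):

```python
def validate_recommendations(recommendations, retrieved_chunk_ids):
    """Check that every cited chunk id comes from the retrieved allow-list.

    Returns (ok, offending_ids); any offending id would trigger the
    single automatic rewrite request described above.
    """
    allowed = set(retrieved_chunk_ids)
    offending = []
    for rec in recommendations:
        # Assumes `citations` lives on each recommendation item; the
        # real schema may attach it elsewhere.
        cited = rec.get("evidence_chunk_ids", []) + rec.get("citations", [])
        for chunk_id in cited:
            if chunk_id not in allowed:
                offending.append(chunk_id)
    return (not offending, offending)
```

A failed check would carry `offending_ids` plus the allow-list into the rewrite prompt.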

LLM Framework: LangChain / LlamaIndex
Quickstart (Development)
- Prerequisites: **Python 3.10–3.12** (CI uses 3.11). Python 3.14 is not supported yet for the full stack (native wheels for parts of the ML/Chroma toolchain often lag).
- Create a virtual environment and install dependencies:

Vector Store: ChromaDB / FAISS
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```

UI: Streamlit
- Configure environment variables:
- Export variables in your shell (recommended)
- If you use a local `.env`, keep it private and do not commit it

Roadmap
Phase 1: Ingest GA4GH "Framework for Responsible Sharing" policy documents.
- Ingest a policy file into `./data/regbot_store` (use `--reset` when reloading the same corpus):

Phase 2: Build RAG pipeline for clause extraction.
```bash
python -m src.main ingest --path path/to/policy.pdf --reset
```

Phase 3: Develop Streamlit frontend for user uploads.
- Check a consent / data-use text file:

```bash
python -m src.main check --consent path/to/consent.txt
```

- Run the Streamlit UI from the repo root:

```bash
python -m streamlit run src/streamlit_app.py
```

- End-to-end sample (synthetic policy + consent under `examples/`):

```bash
python examples/run_demo.py
```

Evaluate retrieval on a **real** GA4GH PDF (pass `--reset` to rebuild the store from scratch):

```bash
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --top-k 8
```

Use your own query list (one line per query):

```bash
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --queries-file examples/eval/queries_ga4gh.txt
```

Optionally append a full compliance JSON report for a consent file:

```bash
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --consent path/to/consent.txt
```

Run tests

```bash
python -m unittest discover -s tests -p "test*.py" -v
```

Environment Variables
- `OPENAI_API_KEY`: Optional; enables the JSON LLM compliance pass via `REGBOT_LLM_MODEL` (default `gpt-4o-mini`).
- `REGBOT_STORE`: Optional override for the on-disk store directory (default `./data/regbot_store`).
- `REGBOT_EMBEDDING_MODEL`: Optional SentenceTransformers model id (default `sentence-transformers/all-MiniLM-L6-v2`).
- `REGBOT_MIN_TOKEN_OVERLAP`: For the LLM path, minimum **token recall** between each recommendation and the cited chunk texts (default `0.06`). Set to `0` to disable dropping rows for low overlap (scores may still be attached).
- `REGBOT_CHROMA_ANONYMIZED_TELEMETRY`: Set to `1` to enable Chroma client telemetry; default is off (`0`).
- `REGBOT_OPENAI_MAX_RETRIES`: Maximum retries for transient OpenAI API errors (default `3`).
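As a rough illustration of the token-recall gate behind `REGBOT_MIN_TOKEN_OVERLAP` (a simplified sketch; the tool's actual tokenization may differ):

```python
import re


def token_recall(recommendation: str, evidence_texts: list[str]) -> float:
    """Fraction of the recommendation's word tokens that also appear in
    the cited chunk texts; 0.0 for an empty recommendation."""
    rec_tokens = set(re.findall(r"[a-z0-9]+", recommendation.lower()))
    if not rec_tokens:
        return 0.0
    evidence_tokens: set[str] = set()
    for text in evidence_texts:
        evidence_tokens.update(re.findall(r"[a-z0-9]+", text.lower()))
    return len(rec_tokens & evidence_tokens) / len(rec_tokens)
```

A recommendation scoring below the threshold (default `0.06`) would be dropped on the LLM path.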

Architecture (implemented vs planned)
- **Core:** Python 3, modular package under `src/regbot/` (ingest, hybrid retrieval, compliance).
- **Embeddings:** `sentence-transformers` (default `all-MiniLM-L6-v2`).
- **Vector store:** Chroma persistent store under `REGBOT_STORE/chroma` plus `manifest.json` for BM25 text.
- **Retrieval:** cosine similarity in Chroma + `rank-bm25`, fused via reciprocal rank fusion; optional metadata category filter.
- **LLM:** OpenAI Chat Completions JSON mode when `OPENAI_API_KEY` is set; offline keyword-style fallback otherwise.
- **UI:** Streamlit (`src/streamlit_app.py`).
- **Roadmap (optional):** LangChain/LlamaIndex adapters on top of the same stores; richer offline evaluation (Ragas, human labels); structured per-recommendation evidence fields.
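The reciprocal rank fusion step is small enough to sketch; this uses the common `k = 60` constant, which may not match the repo's choice:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each id accumulates 1 / (k + rank) for each
    list it appears in, then ids are returned best-first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of RRF here is that it needs no score calibration between the embedding and BM25 rankers: only ranks matter.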

Next steps (suggested priorities)
1. **Real GA4GH corpus**: ingest official PDFs, tune chunk size/overlap and hybrid fusion weights using `eval` + a small **gold query → chunk_id** list (manual or semi-automated).
2. **Stricter outputs:** `evidence_chunk_ids[]` plus programmatic ID checks, token-overlap filtering on the LLM path (`REGBOT_MIN_TOKEN_OVERLAP`), and retries when grounding/overlap fails. **Next:** richer evidence objects (e.g. optional quotes), stricter refusal when excerpts are insufficient.
3. **Contributor experience**: **Done in-repo:** separate **Lint** workflow (Ruff check + format check), `CONTRIBUTING.md`, `.pre-commit-config.yaml`, `pyproject.toml`, `requirements-dev.txt`. **Still open:** optional CI `mypy`, broader type hints, Black-only rules if the team wants them.
4. **Operational hardening**: **Done in-repo:** Chroma telemetry off by default (`REGBOT_CHROMA_ANONYMIZED_TELEMETRY`), OpenAI client `max_retries` via `REGBOT_OPENAI_MAX_RETRIES`, clear `ValueError` when a PDF yields no extractable text. **Next:** optional request timeouts, Chroma/OpenAI observability hooks.

Contributing
- See **`CONTRIBUTING.md`** for venv setup, **Ruff** lint/format, optional **pre-commit**, and tests.
- Open PRs against the upstream repo; keep changes scoped and tested (`python -m unittest discover -s tests -p "test*.py" -v`). Do not commit `.env`, API keys, or local `data/regbot_store/`.
23 changes: 23 additions & 0 deletions examples/DEMO.md
@@ -0,0 +1,23 @@
# Demo (local)

From the repository root, with a virtualenv activated and dependencies installed:

1. Ingest the bundled synthetic policy text (resets the local store):

```bash
python -m src.main --store ./data/regbot_store ingest --path examples/data/sample_ga4gh_policy_stub.txt --reset
```

2. Run a check against the sample consent:

```bash
python -m src.main --store ./data/regbot_store check --consent examples/data/sample_consent_short.txt
```

3. Optional UI:

```bash
python -m streamlit run src/streamlit_app.py
```

Set `OPENAI_API_KEY` in your environment for JSON output from the configured chat model (`REGBOT_LLM_MODEL`, default `gpt-4o-mini`). Without a key, the tool still retrieves policy chunks and returns a small keyword-style gap summary.
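The offline fallback is deliberately simple; conceptually it resembles the following (the topic/keyword table here is invented for illustration, not the tool's actual list):

```python
# Hypothetical topic -> trigger keywords; the real heuristic's list differs.
GAP_TOPICS = {
    "withdrawal": ["withdraw"],
    "international transfer": ["international", "cross-border"],
    "secondary use": ["secondary use", "future research"],
}


def keyword_gaps(consent_text: str) -> list[str]:
    """Return topics whose trigger keywords never appear in the consent."""
    lowered = consent_text.lower()
    return [
        topic
        for topic, keywords in GAP_TOPICS.items()
        if not any(keyword in lowered for keyword in keywords)
    ]
```

Each flagged topic is then paired with the retrieved policy chunks so the summary still carries citations.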
3 changes: 3 additions & 0 deletions examples/data/sample_consent_short.txt
@@ -0,0 +1,3 @@
Study consent excerpt (synthetic)

We will collect blood samples for genomic analysis related to diabetes risk. Samples will be stored at Example University Biobank. Data may be shared with qualified researchers for the primary study aims described in the participant information sheet. We will not sell data. Participants may withdraw from the study at any time.
10 changes: 10 additions & 0 deletions examples/data/sample_ga4gh_policy_stub.txt
@@ -0,0 +1,10 @@
GA4GH-style policy excerpt (synthetic, for demos only)

Section A — Responsible sharing
Researchers must document the purpose of the study, the categories of genomic data involved, and the geographic scope of sharing. Data use should be limited to the purposes described in the informed consent or data use agreement.

Section B — Transparency and participant rights
Participants should be informed about secondary uses, recontact policies, and withdrawal of consent. Where data are shared broadly, the consent should describe any international transfers and the safeguards applied (including access controls and re-identification risk management).

Section C — Security
Appropriate technical and organizational measures must protect data at rest and in transit. Cloud processing is allowed only when explicitly covered by the consent and compatible with applicable obligations.
5 changes: 5 additions & 0 deletions examples/eval/queries_ga4gh.txt
@@ -0,0 +1,5 @@
responsible sharing framework consent withdrawal
secondary use of genomic data limitations
international data transfer safeguards
security measures for genomic data
participant transparency and recontact
40 changes: 40 additions & 0 deletions examples/run_demo.py
@@ -0,0 +1,40 @@
#!/usr/bin/env python3
"""Run ingest + check using the bundled sample files (no Streamlit)."""

from __future__ import annotations

import subprocess
import sys
from pathlib import Path

ROOT = Path(__file__).resolve().parents[1]


def main() -> int:
store = ROOT / "data" / "regbot_store"
policy = ROOT / "examples" / "data" / "sample_ga4gh_policy_stub.txt"
consent = ROOT / "examples" / "data" / "sample_consent_short.txt"
py = sys.executable
subprocess.check_call(
[
py,
"-m",
"src.main",
"--store",
str(store),
"ingest",
"--path",
str(policy),
"--reset",
],
cwd=str(ROOT),
)
subprocess.check_call(
[py, "-m", "src.main", "--store", str(store), "check", "--consent", str(consent)],
cwd=str(ROOT),
)
return 0


if __name__ == "__main__":
raise SystemExit(main())
32 changes: 32 additions & 0 deletions pyproject.toml
@@ -0,0 +1,32 @@
[project]
name = "ga4gh-regbot"
version = "0.0.0"
requires-python = ">=3.10,<3.14"
readme = "README.md"
description = "GA4GH-RegBot compliance assistant (MVP)"

[tool.ruff]
target-version = "py310"
line-length = 100
src = ["src", "tests"]

[tool.ruff.lint]
select = ["E", "F", "I", "W"]
ignore = ["E501"]

[tool.ruff.lint.per-file-ignores]
# sys.path bootstrap must run before project imports
"src/main.py" = ["E402"]
"src/streamlit_app.py" = ["E402"]

[tool.ruff.format]
quote-style = "double"

[tool.mypy]
python_version = "3.10"
ignore_missing_imports = true
warn_unused_ignores = true
check_untyped_defs = true
disallow_untyped_defs = false
no_implicit_optional = true
explicit_package_bases = true
5 changes: 5 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,5 @@
# Development-only tools (not required to run the app)
-r requirements.txt
ruff==0.8.4
pre-commit==3.6.0
mypy==1.13.0