feat: RegBot MVP #36
Open
ringochen06 wants to merge 4 commits into ga4gh:main
Conversation
- Document a copy/pasteable local setup + test command in README
- Add a minimal GitHub Actions workflow that runs unit tests on PRs
- Add smoke tests to keep the repo in a runnable state while core features are built
- Ignore local env/venv + common Python artifacts
- Avoid install failures on Python 3.14 by not forcing tiktoken there
…mlit, eval harness
- Filter recommendations by token recall vs cited chunks; env MIN_TOKEN_OVERLAP
- Compliance retry when overlap drops all rows; offline path skips overlap
- Add types module, overlap tests, README and CONTRIBUTING updates
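For reviewers skimming this commit, here is a minimal sketch of what the token-recall filter could look like. The function and variable names, the 0.5 default, and the whitespace tokenizer are illustrative assumptions, not the actual src/regbot implementation; only the REGBOT_MIN_TOKEN_OVERLAP env var comes from this PR.

```python
import os
import re

# Assumed default threshold -- the real default lives in src/regbot.
MIN_TOKEN_OVERLAP = float(os.getenv("REGBOT_MIN_TOKEN_OVERLAP", "0.5"))

def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens; a stand-in for whatever tokenizer RegBot uses.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_recall(recommendation: str, cited_chunk: str) -> float:
    # Fraction of the recommendation's tokens that also occur in the cited chunk.
    rec = _tokens(recommendation)
    if not rec:
        return 0.0
    return len(rec & _tokens(cited_chunk)) / len(rec)

def filter_by_overlap(rows: list[dict], chunks_by_id: dict[str, str]) -> list[dict]:
    # Keep only rows whose token recall vs their cited chunk clears the
    # threshold; if this drops every row, the caller can retry the
    # compliance pass (the "compliance retry" bullet above).
    return [
        row for row in rows
        if token_recall(row["text"], chunks_by_id.get(row["citation_id"], ""))
        >= MIN_TOKEN_OVERLAP
    ]
```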
Author
The latest commit adds token overlap (REGBOT_MIN_TOKEN_OVERLAP), so saying “quote overlap is still future work” is outdated. Worth updating to: ID allow-list + token alignment against cited chunks (tunable threshold) are in; finer quote-level checks / eval can stay as follow-ups.
Summary:
This PR turns GA4GH-RegBot from a proposal-stage stub into a runnable MVP: policy ingestion (PDF / text), hybrid retrieval (dense + BM25 + RRF), an optional OpenAI JSON compliance pass with programmatic citation grounding (allow-list + retry), a Streamlit UI, a small evaluation harness for real GA4GH PDFs, plus CI + tests and README updates.
Motivation:
Contributors need a reproducible path to clone, install, run, and verify behavior while the core RAG/compliance features evolve. The codebase also needs guardrails so LLM outputs cannot silently cite chunks that were never retrieved.
What changed:
Product / features
- src/regbot/: ingestion → Chroma + manifest.json; hybrid retrieval (see the RRF sketch after this list); study-type heuristics; compliance analysis; grounding audit utilities.
- src/main.py: RegBot facade + CLI subcommands: ingest, check, status, eval (ingest a PDF and run default/custom retrieval queries; optional --consent to append a full compliance report).
- src/streamlit_app.py: upload policy, paste consent, download JSON report.
- examples/: synthetic policy + consent, run_demo.py, eval query list under examples/eval/.
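Reciprocal Rank Fusion (RRF) is the standard way to merge ranked lists from heterogeneous retrievers: each document scores the sum over rankers of 1 / (k + rank). The sketch below shows the textbook formula; the function name and k = 60 constant are conventional assumptions, and the real src/regbot code may differ.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings holds one ordered list of chunk ids per retriever
    # (e.g. dense similarity and BM25). Each appearance contributes
    # 1 / (k + rank); fused order is by descending total score.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse dense and BM25 result lists.
fused = rrf_fuse([["c3", "c1", "c2"], ["c1", "c4", "c3"]])
```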
Engineering
- .github/workflows/ci.yml: install deps + unittest on PR/push (Python 3.11).
- Tests: smoke, pipeline (mocked embeddings), text utils, grounding audit.
- .gitignore: .env, virtualenv variants, local vector store dir.
- requirements.txt: Chroma bumped for pydantic compatibility; pins for the httpx/huggingface_hub/transformers stack; tiktoken made conditional on very new Python (e.g. via a PEP 508 marker such as `tiktoken; python_version < "3.14"`).
Grounding (citation discipline)
- After the LLM returns JSON, citations are checked against the retrieved-chunk id allow-list.
- Default strictness: at least as many citations as recommendations.
- On failure, the client issues one corrective regeneration with explicit allow-list JSON (tracked via grounding + grounding_attempts in the response); a sketch of this loop follows.
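A compressed sketch of the audit-and-retry loop described above, assuming a hypothetical `client.generate()` API; only the allow-list check, the citations-vs-recommendations cardinality rule, the single retry, and the grounding / grounding_attempts fields come from this PR.

```python
def audit_citations(report: dict, allowed_ids: set[str]) -> bool:
    # The two guardrails above: every citation must come from the retrieved
    # allow-list, and there must be at least as many citations as recommendations.
    citations = report.get("citations", [])
    recs = report.get("recommendations", [])
    return all(c in allowed_ids for c in citations) and len(citations) >= len(recs)

def grounded_compliance_pass(client, prompt: str, allowed_ids: set[str]) -> dict:
    report = client.generate(prompt)  # first LLM pass (hypothetical client API)
    report["grounding_attempts"] = 1
    if not audit_citations(report, allowed_ids):
        # One corrective regeneration with the allow-list spelled out explicitly.
        retry_prompt = f"{prompt}\n\nCite ONLY these chunk ids: {sorted(allowed_ids)}"
        report = client.generate(retry_prompt)
        report["grounding_attempts"] = 2
    report["grounding"] = audit_citations(report, allowed_ids)
    return report
```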
Limitations / follow-ups:
“Hard grounding” here means id allow-list + token alignment against cited chunks (tunable REGBOT_MIN_TOKEN_OVERLAP threshold) + cardinality checks + retry; deeper alignment (e.g., quote-level checks per recommendation, structured recommendations[].evidence_chunk_ids) is future work.
Full stack is validated on Python 3.10–3.12; 3.14 may still fail on native ML/tokenizer wheels.
Evaluation harness is intentionally lightweight (manual review / future gold labels), not a full benchmark suite yet.