
feat: RegBot MVP#36

Open
ringochen06 wants to merge 4 commits into ga4gh:main from ringochen06:feat/regbot-mvp

Conversation

@ringochen06

Summary:
This PR turns GA4GH-RegBot from a proposal-stage stub into a runnable MVP: policy ingestion (PDF / text), hybrid retrieval (dense + BM25 + RRF), an optional OpenAI JSON compliance pass with programmatic citation grounding (allow-list + retry), a Streamlit UI, a small evaluation harness for real GA4GH PDFs, plus CI + tests and README updates.

Motivation:
Contributors need a reproducible path to clone, install, run, and verify behavior while the core RAG/compliance features evolve. The codebase also needs guardrails so LLM outputs cannot silently cite chunks that were never retrieved.

What changed:
Product / features

src/regbot/: ingestion → Chroma + manifest.json; hybrid retrieval; study-type heuristics; compliance analysis; grounding audit utilities.
src/main.py: RegBot facade + CLI subcommands: ingest, check, status, eval (ingest a PDF and run default/custom retrieval queries; optional --consent to append a full compliance report).
src/streamlit_app.py: upload policy, paste consent, download JSON report.
examples/: synthetic policy + consent, run_demo.py, eval query list under examples/eval/.
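One design point worth calling out: ingestion writes a manifest.json next to the Chroma store, which is what later lets the grounding audit check citations against exactly what was ingested. A hypothetical sketch of such a manifest (field names, chunk-size default, and the fixed-size chunker are assumptions, not the repo's actual schema):

```python
import hashlib

def chunk_text(text, size=500):
    """Naive fixed-size chunking; the real pipeline may chunk by structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_manifest(doc_name, text):
    """Record stable chunk ids and content hashes so later compliance
    runs can audit citations against exactly what was ingested."""
    return {
        "document": doc_name,
        "chunks": [
            {
                "id": f"{doc_name}#{i}",  # hypothetical id scheme
                "sha256": hashlib.sha256(chunk.encode()).hexdigest(),
                "n_chars": len(chunk),
            }
            for i, chunk in enumerate(chunk_text(text))
        ],
    }

manifest = build_manifest("policy.pdf", "policy text " * 100)  # 1200 chars
```

The stable ids are the important part: they are what the citation allow-list is built from.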
Engineering

.github/workflows/ci.yml: install deps + unittest on PR/push (Python 3.11).
Tests: smoke, pipeline (mocked embeddings), text utils, grounding audit.
.gitignore: .env, virtualenv variants, local vector store dir.
requirements.txt: Chroma bumped for pydantic compatibility; pins for httpx/huggingface_hub/transformers stack; conditional tiktoken on very new Python.
Grounding (citation discipline)

After the LLM returns JSON, citations are checked against the retrieved chunk id allow-list.
Default strictness: at least as many citations as recommendations.
On failure, the client issues one corrective regeneration with explicit allow-list JSON (tracked via grounding + grounding_attempts in the response).
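The allow-list check plus single corrective retry described above can be sketched as follows (function names and the corrective-payload shape are illustrative; only the `grounding` / `grounding_attempts` response fields come from the PR):

```python
def audit_citations(report, allowed_ids):
    """Reject citations outside the retrieved-chunk allow-list, and
    require at least as many citations as recommendations."""
    citations = report.get("citations", [])
    recommendations = report.get("recommendations", [])
    bad = [c for c in citations if c not in allowed_ids]
    ok = not bad and len(citations) >= len(recommendations)
    return ok, bad

def grounded_completion(generate, allowed_ids):
    """Call the model once; on a failed audit, issue one corrective
    regeneration that spells out the allow-list explicitly."""
    report = generate(corrective=None)
    attempts = 1
    ok, bad = audit_citations(report, allowed_ids)
    if not ok:
        corrective = {
            "allowed_chunk_ids": sorted(allowed_ids),
            "rejected_citations": bad,
        }
        report = generate(corrective=corrective)
        attempts += 1
        ok, _ = audit_citations(report, allowed_ids)
    report["grounding"] = "pass" if ok else "fail"
    report["grounding_attempts"] = attempts
    return report
```

Keeping the audit programmatic (set membership plus a cardinality check) means a hallucinated citation can never pass silently, regardless of how confident the model's JSON looks.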

Limitations / follow-ups:
“Hard grounding” here means id allow-list + token-overlap alignment against cited chunks (tunable via REGBOT_MIN_TOKEN_OVERLAP) + cardinality checks + retry; finer quote-level checks and structured recommendations[].evidence_chunk_ids are future work.
Full stack is validated on Python 3.10–3.12; 3.14 may still fail on native ML/tokenizer wheels.
Evaluation harness is intentionally lightweight (manual review / future gold labels), not a full benchmark suite yet.
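To make "intentionally lightweight" concrete: the harness only needs to run each query from the list under examples/eval/ and dump the top chunk ids for manual review. A sketch along those lines (`run_eval` and the fake retriever are hypothetical stand-ins, not the repo's actual eval code):

```python
import json

def run_eval(queries, retrieve, top_k=5):
    """Run each eval query and collect top chunk ids for manual review
    (no gold labels yet, matching the PR's stated scope)."""
    return [
        {"query": q, "chunk_ids": retrieve(q)[:top_k]}
        for q in queries
    ]

def fake_retrieve(query):
    """Stand-in for the real hybrid dense+BM25+RRF pipeline."""
    return [f"chunk-{abs(hash((query, i))) % 100}" for i in range(10)]

report = run_eval(["consent withdrawal", "data sharing"], fake_retrieve, top_k=3)
print(json.dumps(report, indent=2))
```

Swapping in gold labels later would only mean comparing `chunk_ids` against an expected set per query, so the lightweight shape doesn't block a future benchmark.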

- Document a copy/pasteable local setup + test command in README
- Add a minimal GitHub Actions workflow that runs unit tests on PRs
- Add smoke tests to keep the repo in a runnable state while core features are built
- Ignore local env/venv + common Python artifacts
- Avoid install failures on Python 3.14 by not forcing tiktoken there
- Filter recommendations by token recall vs cited chunks; env MIN_TOKEN_OVERLAP
- Compliance retry when overlap drops all rows; offline path skips overlap
- Add types module, overlap tests, README and CONTRIBUTING updates
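The token-recall filter named in the commits above can be sketched like this (helper names and the 0.5 default are illustrative; only the REGBOT_MIN_TOKEN_OVERLAP env var comes from the PR):

```python
import os
import re

def _tokens(text):
    """Lowercased alphanumeric tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_recall(recommendation, cited_chunks):
    """Fraction of the recommendation's tokens found in its cited chunks."""
    rec = _tokens(recommendation)
    if not rec:
        return 0.0
    pool = set()
    for chunk in cited_chunks:
        pool |= _tokens(chunk)
    return len(rec & pool) / len(rec)

# tunable threshold, mirroring the REGBOT_MIN_TOKEN_OVERLAP env var
MIN_TOKEN_OVERLAP = float(os.environ.get("REGBOT_MIN_TOKEN_OVERLAP", "0.5"))

def filter_recommendations(recs, chunks, threshold=MIN_TOKEN_OVERLAP):
    """Drop recommendations whose token recall vs cited chunks is too low."""
    return [
        r for r in recs
        if token_recall(r["text"], [chunks[i] for i in r["chunk_ids"]])
        >= threshold
    ]
```

Set-based recall deliberately ignores word order and morphology ("obtained" vs "obtain" won't match), which is why the commits also add a compliance retry when the overlap check drops every row.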
@ringochen06
Author

The latest commit adds token overlap (REGBOT_MIN_TOKEN_OVERLAP), so saying “quote overlap is still future work” is outdated. Worth updating to: ID allow-list + token alignment against cited chunks (tunable threshold) are in; finer quote-level checks / eval can stay as follow-ups.

@ringochen06 ringochen06 changed the title feat: RegBot MVP (ingest, hybrid retrieval, citation grounding, CLI, Streamlit, eval harness) feat: RegBot MVP Mar 29, 2026
