feat: RegBot MVP #36
Open
ringochen06 wants to merge 4 commits into ga4gh:main
Conversation
- Document a copy/pasteable local setup + test command in README
- Add a minimal GitHub Actions workflow that runs unit tests on PRs
- Add smoke tests to keep the repo in a runnable state while core features are built
- Ignore local env/venv + common Python artifacts
- Avoid install failures on Python 3.14 by not forcing tiktoken there
…mlit, eval harness
- Filter recommendations by token recall vs cited chunks; env MIN_TOKEN_OVERLAP
- Compliance retry when overlap drops all rows; offline path skips overlap
- Add types module, overlap tests, README and CONTRIBUTING updates
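For reviewers skimming this commit, here is a minimal sketch of what the token-recall filter could look like. The function and variable names, the 0.5 default, and the whitespace tokenizer are illustrative assumptions, not the actual src/regbot implementation; only the REGBOT_MIN_TOKEN_OVERLAP env var comes from this PR.

```python
import os
import re

# Assumed default threshold -- the real default lives in src/regbot.
MIN_TOKEN_OVERLAP = float(os.getenv("REGBOT_MIN_TOKEN_OVERLAP", "0.5"))

def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens; a stand-in for whatever tokenizer RegBot uses.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_recall(recommendation: str, cited_chunk: str) -> float:
    # Fraction of the recommendation's tokens that also occur in the cited chunk.
    rec = _tokens(recommendation)
    if not rec:
        return 0.0
    return len(rec & _tokens(cited_chunk)) / len(rec)

def filter_by_overlap(rows: list[dict], chunks_by_id: dict[str, str]) -> list[dict]:
    # Keep only rows whose token recall vs their cited chunk clears the
    # threshold; if this drops every row, the caller can retry the
    # compliance pass (the "compliance retry" bullet above).
    return [
        row for row in rows
        if token_recall(row["text"], chunks_by_id.get(row["citation_id"], ""))
        >= MIN_TOKEN_OVERLAP
    ]
```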
Author
The latest commit adds token overlap (REGBOT_MIN_TOKEN_OVERLAP), so saying “quote overlap is still future work” is outdated. Worth updating to: ID allow-list + token alignment against cited chunks (tunable threshold) are in; finer quote-level checks / eval can stay as follow-ups.
Summary:
This PR turns GA4GH-RegBot from a proposal-stage stub into a runnable MVP: policy ingestion (PDF / text), hybrid retrieval (dense + BM25 + RRF), an optional OpenAI JSON compliance pass with programmatic citation grounding (allow-list + retry), a Streamlit UI, a small evaluation harness for real GA4GH PDFs, plus CI + tests and README updates.
Motivation:
Contributors need a reproducible path to clone, install, run, and verify behavior while the core RAG/compliance features evolve. The codebase also needs guardrails so LLM outputs cannot silently cite chunks that were never retrieved.
What changed:
Product / features
- src/regbot/: ingestion → Chroma + manifest.json; hybrid retrieval (see the RRF sketch after this list); study-type heuristics; compliance analysis; grounding audit utilities.
- src/main.py: RegBot facade + CLI subcommands: ingest, check, status, eval (ingest a PDF and run default/custom retrieval queries; optional --consent to append a full compliance report).
- src/streamlit_app.py: upload policy, paste consent, download JSON report.
- examples/: synthetic policy + consent, run_demo.py, eval query list under examples/eval/.
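Reciprocal Rank Fusion (RRF) is the standard way to merge ranked lists from heterogeneous retrievers: each document scores the sum over rankers of 1 / (k + rank). The sketch below shows the textbook formula; the function name and k = 60 constant are conventional assumptions, and the real src/regbot code may differ.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings holds one ordered list of chunk ids per retriever
    # (e.g. dense similarity and BM25). Each appearance contributes
    # 1 / (k + rank); fused order is by descending total score.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse dense and BM25 result lists.
fused = rrf_fuse([["c3", "c1", "c2"], ["c1", "c4", "c3"]])
```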
Engineering
- .github/workflows/ci.yml: install deps + unittest on PR/push (Python 3.11).
- Tests: smoke, pipeline (mocked embeddings), text utils, grounding audit.
- .gitignore: .env, virtualenv variants, local vector store dir.
- requirements.txt: Chroma bumped for pydantic compatibility; pins for the httpx/huggingface_hub/transformers stack; tiktoken made conditional on very new Python (e.g. via a PEP 508 marker such as `tiktoken; python_version < "3.14"`).
Grounding (citation discipline)
- After the LLM returns JSON, citations are checked against the retrieved-chunk id allow-list.
- Default strictness: at least as many citations as recommendations.
- On failure, the client issues one corrective regeneration with explicit allow-list JSON (tracked via grounding + grounding_attempts in the response); a sketch of this loop follows.
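A compressed sketch of the audit-and-retry loop described above, assuming a hypothetical `client.generate()` API; only the allow-list check, the citations-vs-recommendations cardinality rule, the single retry, and the grounding / grounding_attempts fields come from this PR.

```python
def audit_citations(report: dict, allowed_ids: set[str]) -> bool:
    # The two guardrails above: every citation must come from the retrieved
    # allow-list, and there must be at least as many citations as recommendations.
    citations = report.get("citations", [])
    recs = report.get("recommendations", [])
    return all(c in allowed_ids for c in citations) and len(citations) >= len(recs)

def grounded_compliance_pass(client, prompt: str, allowed_ids: set[str]) -> dict:
    report = client.generate(prompt)  # first LLM pass (hypothetical client API)
    report["grounding_attempts"] = 1
    if not audit_citations(report, allowed_ids):
        # One corrective regeneration with the allow-list spelled out explicitly.
        retry_prompt = f"{prompt}\n\nCite ONLY these chunk ids: {sorted(allowed_ids)}"
        report = client.generate(retry_prompt)
        report["grounding_attempts"] = 2
    report["grounding"] = audit_citations(report, allowed_ids)
    return report
```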
Limitations / follow-ups:
“Hard grounding” here means id allow-list + token alignment against cited chunks (tunable REGBOT_MIN_TOKEN_OVERLAP threshold) + cardinality checks + retry; deeper alignment (e.g., quote-level checks per recommendation, structured recommendations[].evidence_chunk_ids) is future work.
Full stack is validated on Python 3.10–3.12; 3.14 may still fail on native ML/tokenizer wheels.
Evaluation harness is intentionally lightweight (manual review / future gold labels), not a full benchmark suite yet.