ingest, hybrid RAG, citation grounding, CLI/Streamlit, and CI #35
Open
ringochen06 wants to merge 6 commits into ga4gh:main
- Document a copy/pasteable local setup + test command in README
- Add a minimal GitHub Actions workflow that runs unit tests on PRs
- Add smoke tests to keep the repo in a runnable state while core features are built
- Ignore local env/venv + common Python artifacts
- Avoid install failures on Python 3.14 by not forcing tiktoken there
…mlit, eval harness
- Filter recommendations by token recall vs cited chunks; env MIN_TOKEN_OVERLAP
- Compliance retry when overlap drops all rows; offline path skips overlap
- Add types module, overlap tests, README and CONTRIBUTING updates
Pushed latest: ops hardening (Chroma telemetry env, REGBOT_OPENAI_MAX_RETRIES, empty-PDF ValueError) + README/test updates. Ready for another look.
62c7f0d to 4155ab6
- Add compliance_report_and_chunks() to return report + chunks in one pass
- Add chat_followup_policy_qa() with policy+consent context and STUDY_TYPE_GUESS
- Ask follow-up tab: session chat, form+Send (chat_input unsupported in tabs)
Summary:
This PR started as contributor bootstrap (CI, tests, packaging) and has grown into a working MVP: policy ingest into a local vector store, hybrid retrieval, LLM-assisted compliance with programmatic citation grounding, plus CLI and Streamlit. Follow-up commits add token-overlap filtering, shared types/dev tooling, and small operational improvements (Chroma telemetry opt-in, OpenAI retries, clearer errors on empty PDF text).
What’s included:
Ingestion & retrieval: Chunk policy PDFs/text into Chroma plus a manifest; retrieval combines embedding search and BM25 via reciprocal rank fusion.
Compliance: OpenAI JSON path when OPENAI_API_KEY is set; keyword fallback otherwise.
Grounding: every entry in recommendations[] must cite evidence_chunk_ids drawn from the retrieved chunks; an allow-list audit checks this and triggers a rewrite on failure; recommendations are also filtered by token recall against their cited chunks (REGBOT_MIN_TOKEN_OVERLAP; 0 disables the filter).
UX: CLI (python -m src.main …) and Streamlit app; PDF eval harness for retrieval review.
Repo hygiene: GitHub Actions (tests), .gitignore, README/CONTRIBUTING, Ruff/pre-commit/lint workflow, pyproject.toml / dev requirements where added.
Ops: REGBOT_CHROMA_ANONYMIZED_TELEMETRY, REGBOT_OPENAI_MAX_RETRIES, ValueError when a PDF yields no extractable text.
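For reviewers unfamiliar with the fusion step named under Ingestion & retrieval, here is a minimal sketch of reciprocal rank fusion over two ranked lists of chunk ids. It is illustrative only and is not the code in this PR: the function name `reciprocal_rank_fusion`, the chunk ids, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) for every id it contains,
    where rank starts at 1. k=60 is an assumed default, not a value
    taken from this PR.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two retrievers.
embedding_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([embedding_hits, bm25_hits])
print(fused)
```

Chunks that appear in both lists accumulate two contributions, so they tend to rise above chunks found by only one retriever, without needing to put the embedding and BM25 scores on a shared scale.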
(Not legal advice; outputs depend on corpus quality and model behavior. Happy to split docs vs code in follow-ups if the maintainers prefer smaller PRs.)
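The grounding checks described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the recommendation shape (a dict with a hypothetical `text` field alongside `evidence_chunk_ids`), the tokenizer, and the helper names are invented for the example, and the sketch simply drops failing recommendations where the PR describes a rewrite on audit failure.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens (an assumed, simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_recall(recommendation_text, cited_texts):
    """Fraction of the recommendation's tokens found in its cited chunks."""
    rec = tokens(recommendation_text)
    if not rec:
        return 0.0
    cited = set()
    for text in cited_texts:
        cited |= tokens(text)
    return len(rec & cited) / len(rec)


def filter_recommendations(recommendations, chunks, min_overlap):
    """Keep recommendations that cite only retrieved chunks and overlap enough.

    chunks maps chunk id -> chunk text (the allow-list of citable ids).
    min_overlap mirrors REGBOT_MIN_TOKEN_OVERLAP; 0 disables the overlap check.
    """
    kept = []
    for rec in recommendations:
        ids = rec.get("evidence_chunk_ids", [])
        # Allow-list audit: every cited id must be a retrieved chunk.
        if not ids or any(cid not in chunks for cid in ids):
            continue
        if min_overlap > 0:
            cited_texts = [chunks[cid] for cid in ids]
            if token_recall(rec["text"], cited_texts) < min_overlap:
                continue
        kept.append(rec)
    return kept


# Hypothetical retrieved chunks and model output.
chunks = {
    "c1": "Consent forms must describe data sharing with third parties.",
    "c2": "Samples are stored for ten years.",
}
recs = [
    {"text": "Describe data sharing with third parties in consent forms.",
     "evidence_chunk_ids": ["c1"]},
    {"text": "Encrypt all laptops.", "evidence_chunk_ids": ["c2"]},
    {"text": "Store samples for ten years.", "evidence_chunk_ids": ["c99"]},
]
strict = filter_recommendations(recs, chunks, 0.5)
audit_only = filter_recommendations(recs, chunks, 0)
```

With a threshold of 0.5, only the first recommendation survives: the second cites a real chunk but shares no tokens with it, and the third cites an id that was never retrieved. With the threshold set to 0, the overlap check is skipped and only the allow-list audit applies.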