ingest, hybrid RAG, citation grounding, CLI/Streamlit, and CI #35

Open
ringochen06 wants to merge 6 commits into ga4gh:main from ringochen06:main

ringochen06 commented Mar 26, 2026

Summary:
This PR started as contributor bootstrap (CI, tests, packaging) and has grown into a working MVP: policy ingest into a local vector store, hybrid retrieval, LLM-assisted compliance with programmatic citation grounding, plus CLI and Streamlit. Follow-up commits add token-overlap filtering, shared types/dev tooling, and small operational improvements (Chroma telemetry opt-in, OpenAI retries, clearer errors on empty PDF text).

What’s included:
Ingestion & retrieval: Chunk policy PDFs/text into Chroma + manifest; embedding + BM25 with reciprocal rank fusion.
Compliance: OpenAI JSON path when OPENAI_API_KEY is set; keyword fallback otherwise.
Grounding: every entry in recommendations[] must cite evidence_chunk_ids drawn from the retrieved chunks; an allow-list audit checks the cited IDs, with a rewrite on failure; recommendations are also filtered by token recall against their cited chunks (REGBOT_MIN_TOKEN_OVERLAP; 0 disables the filter).
UX: CLI (python -m src.main …) and Streamlit app; PDF eval harness for retrieval review.
Repo hygiene: GitHub Actions (tests), .gitignore, README/CONTRIBUTING, Ruff/pre-commit/lint workflow, pyproject.toml / dev requirements where added.
Ops: REGBOT_CHROMA_ANONYMIZED_TELEMETRY, REGBOT_OPENAI_MAX_RETRIES, ValueError when a PDF yields no extractable text.
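
For reviewers unfamiliar with reciprocal rank fusion, here is a minimal sketch of how an embedding ranking and a BM25 ranking could be merged. It is not taken from this PR's code: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ordering.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    k=60 is an assumed default, not a value confirmed by this PR.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs returned by the two retrievers.
embedding_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([embedding_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists (c1, c3) rise above chunks found by only one retriever, which is the main reason to fuse ranks rather than raw scores: the two retrievers' scores are not on a comparable scale.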

(Not legal advice; outputs depend on corpus quality and model behavior. Happy to split docs vs code in follow-ups if the maintainers prefer smaller PRs.)

- Document a copy/pasteable local setup + test command in README
- Add a minimal GitHub Actions workflow that runs unit tests on PRs
- Add smoke tests to keep the repo in a runnable state while core features are built
- Ignore local env/venv + common Python artifacts
- Avoid install failures on Python 3.14 by not forcing tiktoken there
ringochen06 reopened this Mar 26, 2026
- Filter recommendations by token recall vs cited chunks; env MIN_TOKEN_OVERLAP
- Compliance retry when overlap drops all rows; offline path skips overlap
- Add types module, overlap tests, README and CONTRIBUTING updates
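
The grounding and overlap filtering described in these commits can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the "text" field, the tokenizer, and the definition of recall (the fraction of recommendation tokens that also occur in the cited chunks) are hypothetical, and the sketch simply drops failing rows where the PR describes a rewrite or retry.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens (an assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_recall(recommendation_text, cited_texts):
    """Share of recommendation tokens that also occur in the cited chunks."""
    rec = tokens(recommendation_text)
    if not rec:
        return 0.0
    cited = set()
    for chunk_text in cited_texts:
        cited |= tokens(chunk_text)
    return len(rec & cited) / len(rec)


def ground_recommendations(recommendations, retrieved, min_overlap):
    """Keep recommendations that cite only retrieved chunks and overlap enough.

    retrieved maps chunk ID to chunk text and acts as the allow-list.
    min_overlap == 0 disables the overlap check, as the PR describes.
    """
    kept = []
    for rec in recommendations:
        ids = rec.get("evidence_chunk_ids", [])
        # Allow-list audit: every cited ID must come from retrieval.
        if not ids or any(cid not in retrieved for cid in ids):
            continue
        if min_overlap > 0:
            recall = token_recall(rec["text"], [retrieved[cid] for cid in ids])
            if recall < min_overlap:
                continue
        kept.append(rec)
    return kept


# Hypothetical retrieved chunks and model output.
retrieved = {
    "c1": "Data must be encrypted at rest",
    "c2": "Consent forms require a signature",
}
recs = [
    {"text": "Data must be encrypted", "evidence_chunk_ids": ["c1"]},
    {"text": "Publish results in open journals", "evidence_chunk_ids": ["c2"]},
    {"text": "Consent forms require a signature", "evidence_chunk_ids": ["c99"]},
]

strict = ground_recommendations(recs, retrieved, min_overlap=0.5)
lenient = ground_recommendations(recs, retrieved, min_overlap=0)
print(len(strict), len(lenient))  # 1 2
```

With the threshold at 0.5 only the first recommendation survives; with the overlap check disabled, the second is kept as well, while the third is always rejected because it cites a chunk that was never retrieved.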
ringochen06 (Author) commented

Pushed latest: ops hardening (Chroma telemetry env, REGBOT_OPENAI_MAX_RETRIES, empty-PDF ValueError) + README/test updates. Ready for another look.
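
As a rough picture of what this kind of ops hardening can look like, the sketch below reads the two environment variables and raises on an empty PDF. The helper names, the default values, and the accepted truthy strings are assumptions, not the PR's actual code; only the variable names and the ValueError behavior come from the description above.

```python
import os


def env_int(name, default, environ=None):
    """Read an integer setting; fall back to default when unset or blank."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


def env_flag(name, default=False, environ=None):
    """Read an opt-in boolean flag; the accepted spellings are an assumption."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def require_pdf_text(pages, source):
    """Join per-page text and fail loudly when nothing was extracted."""
    text = "\n".join(page for page in pages if page).strip()
    if not text:
        raise ValueError(f"No extractable text in {source}")
    return text


# Default of 2 retries is hypothetical; telemetry stays off unless opted in.
max_retries = env_int("REGBOT_OPENAI_MAX_RETRIES", 2)
telemetry = env_flag("REGBOT_CHROMA_ANONYMIZED_TELEMETRY")
```

Failing at ingest time with a clear message is easier to diagnose than an empty collection surfacing later as poor retrieval, for example when a PDF contains only scanned images.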

ringochen06 changed the title from "Bootstrap contributor workflow: CI, tests, gitignore, and install notes" to "feat: RegBot MVP — ingest, hybrid RAG, citation grounding, CLI/Streamlit, and CI" Mar 29, 2026
ringochen06 changed the title from "feat: RegBot MVP — ingest, hybrid RAG, citation grounding, CLI/Streamlit, and CI" to "ingest, hybrid RAG, citation grounding, CLI/Streamlit, and CI" Mar 29, 2026
ringochen06 force-pushed the main branch 2 times, most recently from 62c7f0d to 4155ab6 April 8, 2026 01:15
- Add compliance_report_and_chunks() to return report + chunks in one pass
- Add chat_followup_policy_qa() with policy+consent context and STUDY_TYPE_GUESS
- Ask follow-up tab: session chat, form+Send (chat_input unsupported in tabs)