diff --git a/.env.example b/.env.example
new file mode 100644
index 0000000..8ddc3ed
--- /dev/null
+++ b/.env.example
@@ -0,0 +1,14 @@
+# OpenAI Configuration
+OPENAI_API_KEY=your_openai_api_key_here
+
+# Optional: LLM Model
+LLM_MODEL=gpt-3.5-turbo
+
+# Optional: Logging Level
+LOG_LEVEL=INFO
+
+# Optional: Vector Store Path
+VECTOR_STORE_PATH=./vector_store
+
+# Optional: Embedding Model
+EMBEDDING_MODEL=all-MiniLM-L6-v2
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..a7bc29c
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,48 @@
+\# Contributing to GA4GH-RegBot
+
+
+
+Thank you for your interest in contributing to GA4GH-RegBot!
+
+
+
+\## Getting Started
+
+
+
+1\. \*\*Fork the repository\*\* on GitHub
+
+2\. \*\*Clone your fork\*\* locally
+
+3\. \*\*Create a feature branch\*\* for your changes
+
+4\. \*\*Follow the development workflow\*\* below
+
+5\. \*\*Submit a pull request\*\* with a clear description
+
+
+
+\## Development Workflow
+
+
+
+\### 1. Fork and Clone
+
+
+
+```bash
+
+\# Fork on GitHub, then clone your fork
+
+git clone https://github.com/YOUR-USERNAME/GA4GH-RegBot.git
+
+cd GA4GH-RegBot
+
+
+
+\# Add upstream remote for staying updated
+
+git remote add upstream https://github.com/ga4gh/GA4GH-RegBot.git
+
+
+
diff --git a/README.md b/README.md
index 38cd9d1..54d90ea 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,40 @@
-GA4GH-RegBot: Compliance Assistant
-Status: Proposal Stage for GSoC 2026
+\# GA4GH-RegBot: Compliance Assistant
-Overview
-RegBot is an LLM-powered tool designed to help researchers map their consent forms against GA4GH regulatory frameworks. It uses RAG (Retrieval-Augmented Generation) to flag compliance gaps automatically.
-Architecture (Planned)
-Core: Python
-LLM Framework: LangChain / LlamaIndex
+\*\*Status:\*\* Proposal Stage for GSoC 2026
-Vector Store: ChromaDB / FAISS
-UI: Streamlit
-Roadmap
-Phase 1: Ingest GA4GH "Framework for Responsible Sharing" policy documents.
+GA4GH-RegBot is an LLM-powered tool designed to help researchers map their consent forms against GA4GH regulatory frameworks. It uses RAG (Retrieval-Augmented Generation) to flag compliance gaps automatically.
+
+
+
+\## Quick Start (5 minutes)
+
+
+
+```bash
+
+git clone https://github.com/ga4gh/GA4GH-RegBot.git
+
+cd GA4GH-RegBot
+
+python -m venv venv
+
+venv\\Scripts\\activate # Windows
+
+source venv/bin/activate # Mac/Linux
+
+pip install -r requirements.txt
+
+copy .env.example .env # Windows
+
+cp .env.example .env # Mac/Linux
+
+\# Edit .env with your OpenAI API key
+
+python src/main.py
+
-Phase 2: Build RAG pipeline for clause extraction.
-Phase 3: Develop Streamlit frontend for user uploads.
diff --git a/SETUP.md b/SETUP.md
new file mode 100644
index 0000000..62667f2
--- /dev/null
+++ b/SETUP.md
@@ -0,0 +1,60 @@
+\# GA4GH-RegBot: Development Setup Guide
+
+
+
+Welcome! This guide will help you set up GA4GH-RegBot for local development and testing.
+
+
+
+\## Prerequisites
+
+
+
+\- \*\*Python 3.8+\*\*
+
+\- \*\*pip\*\* (comes with Python)
+
+\- \*\*Git\*\*
+
+
+
+\## Quick Start (5 minutes)
+
+
+
+```bash
+
+\# 1. Clone the repository
+
+git clone https://github.com/ga4gh/GA4GH-RegBot.git
+
+cd GA4GH-RegBot
+
+
+
+\# 2. Create a virtual environment
+
+python -m venv venv
+
+venv\\Scripts\\activate # Windows
+
+
+
+\# 3. Install dependencies
+
+pip install -r requirements.txt
+
+
+
+\# 4. Set up environment
+
+copy .env.example .env
+
+
+
+\# 5. Verify
+
+python -c "import langchain; import chromadb; print('OK')"
+
+
+
diff --git a/requirements.txt b/requirements.txt
index 33ba494..ad761cc 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,9 +1,29 @@
+# LangChain: LLM orchestration and RAG framework
 langchain==0.1.0
+
+# LangChain community integrations
 langchain-community==0.0.10
+
+# ChromaDB: Vector database for semantic search
 chromadb==0.4.22
+
+# OpenAI: LLM provider
 openai==1.7.0
+
+# Streamlit: Web UI framework (for Phase 3)
 streamlit==1.30.0
-pypdf==3.17.4
+
+# Python-dotenv: Environment variable management
 python-dotenv==1.0.0
+
+# Tiktoken: Token counting for LLM
 tiktoken==0.5.2
+
+# Sentence-Transformers: Embedding models
 sentence-transformers==2.2.2
+
+# PyPDF: PDF document processing
+PyPDF==3.17.1
+
+# Pydantic: Data validation
+pydantic==2.5.0