A Clinical Decision Support System and Conversational Agent powered by Google Vertex AI (Gemini 2.0 Flash) and RAG over the NICE NG12 guidelines.
- Risk Assessment: Evaluates patient symptoms against NG12 guidelines to determine referral urgency.
- Evidence-Based: Uses a RAG pipeline to retrieve and cite specific sections of the NG12 PDF.
- Conversational Interface: Chat with the guidelines to ask follow-up questions.
- Modular Architecture: FastAPI backend, ChromaDB vector store, and a clean static HTML/JS frontend.
- Python 3.11+
- Google Cloud Project with Vertex AI enabled.
- Valid `GOOGLE_APPLICATION_CREDENTIALS` (or `gcloud auth application-default login`).
- Environment Setup (Windows): A virtual environment is included with the project. Activate it, or run commands via its path:

  ```
  .\venv\Scripts\activate
  ```
- Google Cloud Auth: Authenticate with your specific Google Cloud project:

  ```
  gcloud auth application-default login --project <YOUR_PROJECT_ID>
  ```
- Data Ingestion: Run the ingestion script using the virtual environment's Python:

  ```
  .\venv\Scripts\python -m app.services.ingestion_service
  ```

  Note: Ensure your project has the Vertex AI API enabled.
- Run the Application:

  ```
  .\venv\Scripts\uvicorn app.main:app --reload
  ```

- Access the UI: Open http://localhost:8000 in your browser.
```
docker build -t ng12-assessor .
docker run -p 8080:8080 -e GOOGLE_APPLICATION_CREDENTIALS=/path/to/creds.json ng12-assessor
```

- `app/api`: FastAPI routes.
- `app/services`: Business logic (Agent, RAG, Patient data).
- `app/data`: Local storage for PDF and Vector DB.
- `app/static`: Frontend HTML/JS.
- Choice: Switched from Gemini 1.5 Pro to Gemini 2.0 Flash.
- Reason: 2.0 Flash offers extremely low latency and a massive context window (1M+ tokens), making it ideal for interactive chat and processing large guidelines.
- Tradeoff: Slightly less "deep reasoning" capability than the Ultra/Opus-class models, but for guideline retrieval, low latency and a large context window matter more than raw reasoning depth.
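For illustration, a minimal sketch of initialising the model through the Vertex AI SDK; the project ID, region, and prompt below are placeholders, not values from this repo:

```python
# Sketch: initialise Gemini 2.0 Flash via the Vertex AI SDK.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-2.0-flash")

response = model.generate_content("Summarise the NG12 referral criteria for lung cancer.")
print(response.text)
```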
- Choice: Built on FastAPI with `uvicorn`.
- Reason: LLM and RAG operations are I/O bound. FastAPI's native `async`/`await` support allows handling multiple concurrent chat requests without blocking, unlike Flask.
- Tradeoff: Slightly more boilerplate than Flask, but essential for scalable AI apps.
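As a sketch of the pattern (the route path, request model, and `answer_with_rag` helper are illustrative, not the repo's actual code):

```python
# Sketch: an async chat endpoint that awaits I/O-bound RAG/LLM work.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

async def answer_with_rag(message: str) -> str:
    # Placeholder for the real retrieval + generation pipeline.
    return f"(stubbed answer for: {message})"

@app.post("/chat")
async def chat(req: ChatRequest):
    # Awaiting here frees the event loop for other requests while
    # Vertex AI / ChromaDB calls are in flight.
    answer = await answer_with_rag(req.message)
    return {"answer": answer}
```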
- Choice: Used ChromaDB with local file persistence.
- Reason: "Batteries-included" solution that requires no external infrastructure or API keys (unlike Pinecone), making the project easy to clone and run.
- Tradeoff: Not suitable for production scaling to millions of documents. For production, we would migrate to Vertex AI Vector Search.
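A minimal sketch of the local persistence setup (the storage path and collection name are assumptions):

```python
# Sketch: file-persisted vector store with ChromaDB, no external services.
import chromadb

client = chromadb.PersistentClient(path="app/data/chroma")  # assumed path
collection = client.get_or_create_collection(name="ng12_guidelines")

# Store guideline chunks with pre-computed embeddings.
collection.add(
    ids=["chunk-0"],
    documents=["Refer adults using a suspected cancer pathway referral if ..."],
    embeddings=[[0.01] * 768],  # 768-dim vector from text-embedding-004
)

# Retrieve the most similar chunks for a query embedding.
results = collection.query(query_embeddings=[[0.01] * 768], n_results=3)
```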
- Choice: Implemented a "Condense Question" step where the LLM rewrites user queries based on history (e.g., "And for lung?" -> "What are the referral criteria for lung cancer?").
- Reason: Essential for multi-turn chat. Without it, RAG fails on follow-up questions that lack explicit keywords.
- Tradeoff: Adds a small latency overhead (one extra LLM call per turn), but drastically improves answer quality.
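A condense-question step could look like the following sketch (the prompt wording and function are assumptions):

```python
# Sketch: rewrite a follow-up question into a standalone query before retrieval.
from vertexai.generative_models import GenerativeModel

CONDENSE_PROMPT = """Given the chat history and a follow-up question, rewrite the
follow-up as a standalone question that contains all the context it needs.

Chat history:
{history}

Follow-up question: {question}

Standalone question:"""

def condense_question(model: GenerativeModel, history: str, question: str) -> str:
    # One extra, cheap LLM call per turn; the rewritten query feeds the vector search.
    prompt = CONDENSE_PROMPT.format(history=history, question=question)
    return model.generate_content(prompt).text.strip()

# e.g. history about colorectal criteria + "And for lung?"
# -> "What are the NG12 referral criteria for lung cancer?"
```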
- Choice: Implemented manual batching (100 items/batch).
- Reason: Vertex AI Embedding API has a hard limit of 250 instances per request.
- Tradeoff: Slightly more code complexity in exchange for requests that reliably stay within the API limit.
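The batching itself is straightforward; a generic sketch (the helper name and `embed_fn` callback are illustrative):

```python
# Sketch: embed texts in batches of 100 to stay under the 250-instance request cap.
from typing import Callable

BATCH_SIZE = 100  # comfortable margin below the API limit

def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],  # e.g. a Vertex AI embedding call
) -> list[list[float]]:
    vectors: list[list[float]] = []
    for start in range(0, len(texts), BATCH_SIZE):
        vectors.extend(embed_fn(texts[start:start + BATCH_SIZE]))
    return vectors
```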
- Choice: Used `text-embedding-004`.
- Reason: Latest stable embedding model, offering improved semantic representations compared to the older `gecko` models.
- Tradeoff: Specific regional availability (`us-central1`), requiring explicit location configuration.
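For example, loading the embedding model with an explicit region (project and region values are placeholders):

```python
# Sketch: text-embedding-004 initialised with an explicit Vertex AI location.
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")

[embedding] = embedding_model.get_embeddings(["unexplained haemoptysis in adults aged 40 and over"])
print(len(embedding.values))  # 768-dimensional vector
```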
- Choice: Split PDF into 500 character chunks (with 200 overlap).
- Reason: Smaller chunks provide more precise context retrieval for specific medical criteria, reducing noise in the LLM prompt.
- Tradeoff: Risk of splitting a long sentence or list across chunks, handled partially by the 200-character overlap.
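A minimal character-based chunker matching these parameters (a sketch; the project may use a library text splitter instead):

```python
# Sketch: split text into 500-character chunks with 200 characters of overlap.
CHUNK_SIZE = 500
CHUNK_OVERLAP = 200

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    step = size - overlap  # advance 300 characters per chunk
    return [text[i:i + size] for i in range(0, len(text), step)]

# Each chunk shares its last 200 characters with the start of the next one,
# so criteria split across a boundary still appear intact in at least one chunk.
```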