Commit 428e239

cleaned up notebook (#28)
1 parent 500a1f6 commit 428e239

127 files changed: +110292 −40214 lines


.pre-commit-config.yaml (+1 −1)

```diff
@@ -17,7 +17,7 @@ repos:
     rev: v1.4.0
     hooks:
       - id: detect-secrets
-        exclude: "notebooks"
+        exclude: "notebooks|experiments"
   - repo: local
     hooks:
       - id: clean
```
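For context on the one-line change above: pre-commit's `exclude` key is a single Python regular expression matched against each staged file's path (with `re.search` semantics), so alternation with `|` is how the commit skips a second directory. A minimal sketch of that matching behavior (the `paths` list is made up for illustration; pre-commit does the real matching):

```python
import re

# pre-commit treats `exclude` as one Python regex; a file is skipped
# when the pattern matches anywhere in its path (re.search semantics).
pattern = re.compile(r"notebooks|experiments")

paths = [
    "notebooks/rag.ipynb",             # skipped: matches "notebooks"
    "experiments/responses/out.json",  # skipped: matches "experiments"
    "app/config.py",                   # still scanned by detect-secrets
]
skipped = [p for p in paths if pattern.search(p)]
print(skipped)
```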

README.md (+37 −110)

````diff
@@ -1,25 +1,27 @@
 # LLM Applications
 
-An end-to-end guide for scaling and serving LLM application in production.
-
-This repo currently contains one such application: a retrieval-augmented generation (RAG)
-app for answering questions about supplied information. By default, the app uses
-the [Ray documentation](https://docs.ray.io/en/master/) as the source of information.
-This app first [indexes](./app/index.py) the documentation in a vector database
-and then uses an LLM to generate responses for questions that got augmented with
-relevant info retrieved from the index.
+An end-to-end guide for scaling and serving LLM applications in production. This repo currently contains one such application: a retrieval-augmented generation (RAG) app for answering questions about supplied information.
 
 ## Setup
 
+### API keys
+We'll be using [OpenAI](https://platform.openai.com/docs/models/) to access ChatGPT models like `gpt-3.5-turbo`, `gpt-4`, etc. and [Anyscale Endpoints](https://endpoints.anyscale.com/) to access OSS LLMs like `Llama-2-70b`. Be sure to create your accounts for both and have your credentials ready.
+
 ### Compute
-- Start a new [Anyscale workspace on staging](https://console.anyscale-staging.com/o/anyscale-internal/workspaces)
-using an [`g3.8xlarge`](https://instances.vantage.sh/aws/ec2/g3.8xlarge) head node on an AWS cloud.
+- Start a new [Anyscale workspace on staging](https://console.anyscale-staging.com/o/anyscale-internal/workspaces) using a [`g3.8xlarge`](https://instances.vantage.sh/aws/ec2/g3.8xlarge) head node (you can also add GPU worker nodes to run the workloads faster).
 - Use the [`default_cluster_env_2.6.2_py39`](https://docs.anyscale.com/reference/base-images/ray-262/py39#ray-2-6-2-py39) cluster environment.
+- Use the `us-east-1` region if you'd like to use the artifacts in our shared storage (source docs, vector DB dumps, etc.).
 
 ### Repository
+```bash
+git clone https://github.com/ray-project/llm-applications.git .  # git checkout -b goku origin/goku
+git config --global user.name <GITHUB-USERNAME>
+git config --global user.email <EMAIL-ADDRESS>
+```
 
-First, clone this repository.
-
+### Data
+Our data is already ready at `/efs/shared_storage/goku/docs.ray.io/en/master/` (on Staging, `us-east-1`) but if you wanted to load it yourself, run this bash command (change `/desired/output/directory`, but make sure it's on the shared storage,
+so that it's accessible to the workers).
 
 ```bash
 git clone https://github.com/ray-project/llm-applications.git .
 ```
@@ -30,116 +32,41 @@ Then set up the environment correctly by specifying the values in your `.env` file
 and installing the dependencies:
 
 ```bash
-cp ./envs/.env_template .envs
-source .envs
 pip install --user -r requirements.txt
+export PYTHONPATH=$PYTHONPATH:$PWD
 pre-commit install
 pre-commit autoupdate
 ```
 
-### Data
-
-Our data is already ready at `/efs/shared_storage/pcmoritz/docs.ray.io/en/master/`
-(on Staging) but if you wanted to load it yourself, run this bash command:
-
+### Variables
 ```bash
-bash scrape-docs.sh
+touch .env
+# Add environment variables to .env
+OPENAI_API_BASE="https://api.openai.com/v1"
+OPENAI_API_KEY=""  # https://platform.openai.com/account/api-keys
+ANYSCALE_API_BASE="https://api.endpoints.anyscale.com/v1"
+ANYSCALE_API_KEY=""  # https://app.endpoints.anyscale.com/credentials
+DB_CONNECTION_STRING="dbname=postgres user=postgres host=localhost password=postgres"
+source .env
 ```
 
-### Vector DB
-
-<details>
-<summary>Local installation with brew on MacOS</summary>
+## Steps
 
+1. Open [rag.ipynb](notebooks/rag.ipynb) to interactively go through all the concepts and run experiments.
+2. Use the best configuration (in `serve.py`) from the notebook experiments to serve the LLM.
 ```bash
-brew install postgresql
-brew install pgvector
-psql -c "CREATE USER postgres WITH SUPERUSER;"
-# pragma: allowlist nextline secret
-psql -c "ALTER USER postgres with password 'postgres';"
-psql -c "CREATE EXTENSION vector;"
-psql -f migrations/initial.sql
-python app/index.py create-index
+python app/main.py
 ```
-</details>
-
-```bash
-bash setup-pgvector.sh
-sudo -u postgres psql -f migrations/initial.sql
-python app/index.py create-index
-```
-
-### Query
-Just a sample and uses the current index that's been created.
+3. Query your service.
 ```python
 import json
-from app.query import QueryAgent
-query = "What is the default batch size for map_batches?"
-system_content = "Your job is to answer a question using the additional context provided."
-agent = QueryAgent(
-    embedding_model="thenlper/gte-base",
-    llm="meta-llama/Llama-2-7b-chat-hf",
-    max_context_length=4096,
-    system_content=system_content,
-)
-result = agent.get_response(query=query)
-print(json.dumps(result, indent=2))
+import requests
+data = {"query": "What is the default batch size for map_batches?"}
+response = requests.post("http://127.0.0.1:8000/query", json=data)
+print(response.text)
 ```
-
-### Experiments
-
-#### Generate responses
-
-```bash
-python app/main.py generate-responses \
-    --system-content "Answer the {query} using the additional {context} provided."
-```
-
-#### Evaluate responses
-
-```bash
-python app/main.py evaluate-responses \
-    --system-content """
-    Your job is to rate the quality of our generated answer {generated_answer}
-    given a query {query} and a reference answer {reference_answer}.
-    Your score has to be between 1 and 5.
-    You must return your response in a line with only the score.
-    Do not return answers in any other format.
-    On a separate line provide your reasoning for the score as well.
-    """
-```
-
-### Dashboard
-```bash
-export APP_PORT=8501
-echo https://$APP_PORT-port-$ANYSCALE_SESSION_DOMAIN
-streamlit run dashboard/Home.py
+4. Shut down the service.
+```python
+from ray import serve
+serve.shutdown()
 ```
-
-### TODO
-- [x] notebook cleanup
-- [x] evaluator (ex. GPT4) response script
-- [x] DB dump & load
-- [ ] experiments (in order, fixing choices along the way)
-    - Evaluator
-        - [ ] GPT-4 best experiment
-        - [ ] Llama-70b consistency with GPT4
-    - [ ] OSS vs. Closed (gpt-3.5 vs. llama)
-    - [ ] w/ and w/out context (value of RAG)
-    - [ ] # of chunks to use in context
-        - Does using more resources help/harm?
-        - 1, 5, 10 (will all fit in the smallest context length of 4K)
-    - [ ] Chunking size/overlap
-        - related to # of chunks + context length, but we'll treat as independent variable
-    - [ ] Embedding (top 3 in leaderboard)
-        - global leaderboard may not be your leaderboard (empirically validate)
-    - Later
-        - [ ] Commercial Assistant evaluation
-        - [ ] Human Assistant evaluation
-        - [ ] Data sources
-    - Much later
-        - [ ] Prompt
-        - [ ] Prompt-tuning on query
-        - [ ] Embedding vs. LLM for retrieval
-        - [ ] Ray Tune to tweak a subset of components
-- [ ] CI/CD workflows
````
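The README's new "Variables" section keeps credentials as flat `KEY="value"` lines that get `source`d in bash. If the same values are needed inside Python without going through the shell, a tiny parser suffices. This is a sketch under the assumption that every value is a double-quoted one-liner without embedded `#`; the `load_env` helper is illustrative, not the repo's code (`python-dotenv`'s `load_dotenv` is the usual off-the-shelf choice):

```python
def load_env(text: str) -> dict:
    """Parse KEY="value" lines (comments after # are ignored)."""
    env = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        # Split on the FIRST "=" only, so values containing "=" survive
        # (e.g. the postgres connection string below).
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

sample = '''
OPENAI_API_BASE="https://api.openai.com/v1"
DB_CONNECTION_STRING="dbname=postgres user=postgres host=localhost password=postgres"
'''
env = load_env(sample)
print(env["OPENAI_API_BASE"])  # https://api.openai.com/v1
```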

app/config.py (+17 −40)

```diff
@@ -1,44 +1,21 @@
-import os
 from pathlib import Path
 
 # Directories
+EFS_DIR = Path("/efs/shared_storage/goku")
 ROOT_DIR = Path(__file__).parent.parent.absolute()
-
-
-DB_CONNECTION_STRING = os.environ.get("DB_CONNECTION_STRING")
-DOCS_PATH = os.environ.get("DOCS_PATH")
-
-# Credentials
-OPENAI_API_BASE = os.environ.get("OPENAI_API_BASE", "https://api.endpoints.anyscale.com/v1")
-OPENAI_API_KEY = os.environ.get(
-    "OPENAI_API_KEY", ""
-)  # https://app.endpoints.anyscale.com/credentials
-
-# Indexing and model properties
-DEVICE = os.environ.get("DEVICE", "cuda")
-EMBEDDING_BATCH_SIZE = os.environ.get("EMBEDDING_BATCH_SIZE", 100)
-EMBEDDING_ACTORS = os.environ.get("EMBEDDING_ACTORS", 2)
-NUM_GPUS = os.environ.get("NUM_GPUS", 1)
-INDEXING_ACTORS = os.environ.get("INDEXING_ACTORS", 20)
-INDEXING_BATCH_SIZE = os.environ.get("INDEXING_BATCH_SIZE", 128)
-
-# Response generation properties
-EXPERIMENT_NAME = os.environ.get("EXPERIMENT_NAME", "llama-2-7b-gtebase")
-DATA_PATH = os.environ.get("DATA_PATH", "datasets/eval-dataset-v1.jsonl")
-CHUNK_SIZE = os.environ.get("CHUNK_SIZE", 300)
-CHUNK_OVERLAP = os.environ.get("CHUNK_OVERLAP", 50)
-EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "thenlper/gte-base")
-LLM = os.environ.get("LLM", "meta-llama/Llama-2-7b-chat-hf")
-TEMPERATURE = os.environ.get("TEMPERATURE", 0)
-MAX_CONTEXT_LENGTH = os.environ.get("MAX_CONTEXT_LENGTH", 4096)
-
-# Evaluation properties
-REFERENCE_LOC = os.environ.get("REFERENCE_LOC", "experiments/responses/gpt-4-with-source.json")
-RESPONSE_LOC = os.environ.get("RESPONSE_LOC", "experiments/responses/$EXPERIMENT_NAME.json")
-EVALUATOR = os.environ.get("EVALUATOR", "meta-llama/Llama-2-70b-chat-hf")
-EVALUATOR_TEMPERATURE = os.environ.get("EVALUATOR_TEMPERATURE", 0)
-EVALUATOR_MAX_CONTEXT_LENGTH = os.environ.get("EVALUATOR_MAX_CONTEXT_LENGTH", 4096)
-
-# Slack bot integration
-SLACK_APP_TOKEN = os.environ.get("SLACK_APP_TOKEN", "")
-SLACK_BOT_TOKEN = os.environ.get("SLACK_BOT_TOKEN", "")
+EXPERIMENTS_DIR = Path(ROOT_DIR, "experiments")
+
+# Mappings
+EMBEDDING_DIMENSIONS = {
+    "thenlper/gte-base": 768,
+    "BAAI/bge-large-en": 1024,
+    "text-embedding-ada-002": 1536,
+}
+MAX_CONTEXT_LENGTHS = {
+    "gpt-4": 8192,
+    "gpt-3.5-turbo": 4096,
+    "gpt-3.5-turbo-16k": 16384,
+    "meta-llama/Llama-2-7b-chat-hf": 4096,
+    "meta-llama/Llama-2-13b-chat-hf": 4096,
+    "meta-llama/Llama-2-70b-chat-hf": 4096,
+}
```
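The rewritten `app/config.py` replaces dozens of environment-variable settings with two lookup tables. A sketch of how such mappings are typically consumed when budgeting retrieved chunks against a model's context window; the dict contents are copied (in part) from the diff, while `max_chunks`, `prompt_overhead`, and the token-sized-chunk assumption are illustrative, not the repo's logic:

```python
# Lookup tables from the new app/config.py (subset).
EMBEDDING_DIMENSIONS = {
    "thenlper/gte-base": 768,
    "BAAI/bge-large-en": 1024,
    "text-embedding-ada-002": 1536,
}
MAX_CONTEXT_LENGTHS = {
    "gpt-4": 8192,
    "gpt-3.5-turbo": 4096,
    "meta-llama/Llama-2-70b-chat-hf": 4096,
}

def max_chunks(llm: str, chunk_size: int, prompt_overhead: int = 512) -> int:
    """How many retrieved chunks (treated as chunk_size tokens each)
    fit in the model's window after reserving prompt_overhead tokens."""
    return (MAX_CONTEXT_LENGTHS[llm] - prompt_overhead) // chunk_size

n = max_chunks("meta-llama/Llama-2-70b-chat-hf", chunk_size=300)
print(n)  # (4096 - 512) // 300 = 11
```

`EMBEDDING_DIMENSIONS` plays the analogous role on the indexing side, e.g. sizing the pgvector column to the chosen embedding model.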

0 commit comments