12 changes: 12 additions & 0 deletions .env.example
@@ -23,3 +23,15 @@ OLLAMA_DEBUG=false # Debug mode for Ollama service
OLLAMA_KEEP_ALIVE="5m" # Duration models stay loaded, default 5 minutes, can be set to e.g., "24h"
OLLAMA_MAX_LOADED_MODELS=1 # Maximum number of models loaded simultaneously, default to 1
OLLAMA_NUM_PARALLEL=1 # Maximum number of allocated contexts (parallel requests). Manage resources efficiently: if OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=3, the total context requirement can be up to 12 (4x3)

# Petals-specific settings

# how to get your huggingface token: https://huggingface.co/settings/tokens
HUGGINGFACE_TOKEN=your-huggingface-token-here
# if you host 10+ blocks, you can show your name on the swarm monitor page: https://health.petals.dev/
PUBLIC_NAME=put_your_name_here
# change this if you want to use a different model; check Hugging Face for available models
#PETAL_MODEL_NAME=bigscience/bloom-560m
PETAL_MODEL_NAME=codellama/CodeLlama-7b-Instruct-hf
# limit the amount of disk space used by the model cache
PETAL_MAX_DISK_SPACE=10GB
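The variables above are read from a local `.env` file that is created by copying `.env.example` (see the README changes below). A minimal sketch of bootstrapping it and catching a forgotten placeholder token — the temp-dir setup exists only to make the snippet self-contained; in the real repository you would run the `cp` and `grep` lines at the repo root:

```shell
# Self-contained demo workspace (stand-in for the repository root)
workdir=$(mktemp -d); cd "$workdir"
printf 'HUGGINGFACE_TOKEN=your-huggingface-token-here\n' > .env.example

# Create your local .env from the template (only if it does not exist yet)
[ -f .env ] || cp .env.example .env

# Warn if the Hugging Face token is still the template placeholder
if grep -q '^HUGGINGFACE_TOKEN=your-huggingface-token-here' .env; then
  echo "HUGGINGFACE_TOKEN is still the placeholder - edit .env first"
else
  echo "HUGGINGFACE_TOKEN looks configured"
fi
```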
1 change: 1 addition & 0 deletions .gitignore
@@ -7,3 +7,4 @@ __pycache__/
/cat/**
/ollama/*
.env
petals-cache
116 changes: 116 additions & 0 deletions README.md
@@ -7,6 +7,122 @@
> - **Technical Expertise Required:** Setting up and running local-cat requires some technical know-how.
> - **Hardware Requirements:** Performance may be slow without a recent GPU or NPU.

## What is Petals?

Petals lets you run large language models at home, BitTorrent-style. See the [homepage](https://petals.dev/) and the [GitHub](https://github.com/bigscience-workshop/petals) page for details.

## Prerequisites

1. A Hugging Face account — sign up [here](https://huggingface.co/join)

## What you need to do

1. Check Petals' [health status](https://health.petals.dev/) page and choose your preferred model.
2. Request access to the Hugging Face weights for the model you want to use; find more [here](https://huggingface.co/docs/hub/models-gated#gated-models).
3. Generate your Hugging Face [token](https://huggingface.co/settings/tokens)
(access is usually granted within a few minutes). When selecting token permissions, check the `Read access to contents of all public gated repos you can access` option under the `Repositories` group.
4. Once your token is created, create a `.env` file by duplicating `.env.example` and fill `HUGGINGFACE_TOKEN` with your token.
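Before starting the containers, it can save a failed startup to sanity-check the token value in `.env`. A hypothetical sketch (`check_hf_token` is an illustration, not part of the project; the "tokens start with `hf_`" rule is a heuristic based on the usual Hugging Face user-access-token format):

```python
# Hypothetical sanity check for the HUGGINGFACE_TOKEN line in a .env file.
# Hugging Face user access tokens normally start with "hf_"; treat anything
# else (or the template placeholder) as suspicious.
def check_hf_token(env_text: str) -> str:
    for line in env_text.splitlines():
        if line.startswith("HUGGINGFACE_TOKEN="):
            token = line.split("=", 1)[1].strip().strip('"')
            if not token or token == "your-huggingface-token-here":
                return "placeholder - paste your real token"
            if not token.startswith("hf_"):
                return "unexpected format - double-check the token"
            return "looks ok"
    return "HUGGINGFACE_TOKEN not set in .env"

print(check_hf_token("HUGGINGFACE_TOKEN=hf_abc123"))  # → looks ok
```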

### Setup Instructions

The Ollama container is removed because we use Petals as a replacement.
Currently, you have to build the cheshire-cat container locally.
The Petals container both runs the chosen model and shares some of your GPU resources with other users.

To run local-cat with Petals, follow these steps:

> [!IMPORTANT]
> Don't enable DEBUG mode in the `.env` file, otherwise it will interfere with Petals

1. Fill the `.env` file with the desired settings
2. Run `docker compose -f compose.petals.yml up -d`

### Petals container setup
1. The Petals container should start. In its log you will see that it is loading model blocks:

```
...
2024-10-27 12:31:59 petals | Login successful
2024-10-27 12:32:00 petals | /home/petals/src/petals/server/block_functions.py:165: SyntaxWarning: assertion is always true, perhaps remove parentheses?
2024-10-27 12:32:00 petals | assert (
2024-10-27 12:32:00 petals | Oct 27 11:32:00.715 [INFO] Running Petals 2.3.0.dev2
2024-10-27 12:32:01 petals | Oct 27 11:32:01.116 [INFO] Make sure you follow the Llama terms of use: https://llama.meta.com/llama3/license, https://llama.meta.com/llama2/license
2024-10-27 12:32:01 petals | Oct 27 11:32:01.116 [INFO] Using DHT prefix: CodeLlama-7b-Instruct-hf
2024-10-27 12:32:12 petals | Oct 27 11:32:12.757 [INFO] This server is accessible via relays
2024-10-27 12:32:14 petals | Oct 27 11:32:14.059 [INFO] Connecting to the public swarm
2024-10-27 12:32:14 petals | Oct 27 11:32:14.059 [INFO] Running a server on ['/ip4/127.0.0.1/tcp/31330/p2p/12D3KooWEbeDea5LSiaJEBuWLYh85h8kPgTXM8UNYo7bUCRpvP7X', '/ip4/172.18.0.2/tcp/31330/p2p/12D3KooWEbeDea5LSiaJEBuWLYh85h8kPgTXM8UNYo7bUCRpvP7X', '/ip6/::1/tcp/31330/p2p/12D3KooWEbeDea5LSiaJEBuWLYh85h8kPgTXM8UNYo7bUCRpvP7X']
2024-10-27 12:32:14 petals | Oct 27 11:32:14.219 [INFO] Model weights are loaded in bfloat16, quantized to nf4 format
2024-10-27 12:32:14 petals | Oct 27 11:32:14.219 [INFO] Attention cache for all blocks will consume up to 2.00 GiB
2024-10-27 12:32:14 petals | Oct 27 11:32:14.220 [INFO] Loading throughput info
2024-10-27 12:32:14 petals | Oct 27 11:32:14.239 [INFO] Reporting throughput: 467.9 tokens/sec for 32 blocks
2024-10-27 12:32:18 petals | Oct 27 11:32:18.175 [INFO] Announced that blocks [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] are joining
2024-10-27 12:34:02 petals | Oct 27 11:34:02.638 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 0
2024-10-27 12:34:08 petals | Oct 27 11:34:08.157 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 1
2024-10-27 12:34:13 petals | Oct 27 11:34:13.249 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 2
2024-10-27 12:34:17 petals | Oct 27 11:34:17.088 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 3
2024-10-27 12:34:21 petals | Oct 27 11:34:21.051 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 4
2024-10-27 12:34:24 petals | Oct 27 11:34:24.946 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 5
2024-10-27 12:34:28 petals | Oct 27 11:34:28.609 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 6
2024-10-27 12:34:32 petals | Oct 27 11:34:32.714 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 7
2024-10-27 12:34:36 petals | Oct 27 11:34:36.334 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 8
2024-10-27 12:34:40 petals | Oct 27 11:34:40.247 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 9
2024-10-27 12:34:43 petals | Oct 27 11:34:43.998 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 10
2024-10-27 12:34:49 petals | Oct 27 11:34:49.203 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 11
2024-10-27 12:34:54 petals | Oct 27 11:34:54.436 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 12
2024-10-27 12:34:59 petals | Oct 27 11:34:59.354 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 13
2024-10-27 12:35:04 petals | Oct 27 11:35:04.460 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 14
2024-10-27 12:35:09 petals | Oct 27 11:35:09.608 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 15
2024-10-27 12:35:14 petals | Oct 27 11:35:14.763 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 16
2024-10-27 12:35:19 petals | Oct 27 11:35:19.843 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 17
2024-10-27 12:35:24 petals | Oct 27 11:35:24.982 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 18
2024-10-27 12:35:29 petals | Oct 27 11:35:29.732 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 19
2024-10-27 12:35:34 petals | Oct 27 11:35:34.720 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 20
2024-10-27 12:35:39 petals | Oct 27 11:35:39.750 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 21
2024-10-27 12:35:44 petals | Oct 27 11:35:44.851 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 22
2024-10-27 12:35:49 petals | Oct 27 11:35:49.791 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 23
2024-10-27 12:36:46 petals | Oct 27 11:36:46.508 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 24
2024-10-27 12:36:50 petals | Oct 27 11:36:50.304 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 25
2024-10-27 12:36:54 petals | Oct 27 11:36:54.170 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 26
2024-10-27 12:36:57 petals | Oct 27 11:36:57.812 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 27
2024-10-27 12:37:01 petals | Oct 27 11:37:01.571 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 28
2024-10-27 12:37:05 petals | Oct 27 11:37:05.411 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 29
2024-10-27 12:37:09 petals | Oct 27 11:37:09.223 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 30
2024-10-27 12:37:13 petals | Oct 27 11:37:13.155 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 31
2024-10-27 12:37:17 petals | Oct 27 11:37:17.658 [INFO] Server is reachable from the Internet. It will appear at https://health.petals.dev soon
2024-10-27 12:37:17 petals | Oct 27 11:37:17.862 [INFO] Started
```
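Since all 32 blocks take a few minutes to load, a small script can summarize progress from the log instead of scrolling it. A hypothetical sketch (`loading_progress` is an illustration; in practice you would feed it the output of `docker logs petals`):

```python
import re

# Parse Petals server log lines like the excerpt above and report how many
# distinct blocks have finished loading out of the announced total.
LOADED = re.compile(r"\[INFO\] Loaded \S+ block (\d+)")

def loading_progress(log_lines, total_blocks=32):
    loaded = {int(m.group(1)) for line in log_lines if (m := LOADED.search(line))}
    return len(loaded), total_blocks

sample = [
    "Oct 27 11:34:02.638 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 0",
    "Oct 27 11:34:08.157 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 1",
]
print(loading_progress(sample))  # → (2, 32)
```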

2. Once loading has finished, check the Petals [health status](https://health.petals.dev/) page to see if your model is ready.

![image](pictures/petals_status.png)


### Cheshire Cat Setup
1. Log into cheshire-cat as admin, open the settings, and select `Petals` as the model
2. Put the model name into the model name field and save (use the copy button on the Hugging Face model page)

![image](pictures/petals_settings.png)

3. The Cat will start downloading the model's weights. This may take a while; wait until the progress bar in the log reaches 100% before using the Cat.

```
...
2024-10-27 12:53:48 Oct 27 11:53:48.703 [INFO] Make sure you follow the Llama terms of use: https://llama.meta.com/llama3/license, https://llama.meta.com/llama2/license
2024-10-27 12:53:48 Oct 27 11:53:48.703 [INFO] Using DHT prefix: CodeLlama-7b-Instruct-hf
2024-10-27 12:56:54
Downloading shards: 0%| | 0/2 [00:00<?, ?it/s]
Downloading shards: 50%|█████ | 1/2 [02:17<02:17, 137.13s/it]
Downloading shards: 100%|██████████| 2/2 [03:05<00:00, 84.74s/it]
Downloading shards: 100%|██████████| 2/2 [03:05<00:00, 92.60s/it]
2024-10-27 12:56:56
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:00<00:00, 2.09it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 3.12it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 2.91it/s]
```

This download happens every time the Cat is started.

## Ollama Setup

> [!IMPORTANT]
48 changes: 0 additions & 48 deletions cat/data/metadata.json

This file was deleted.

75 changes: 75 additions & 0 deletions compose.petals.yml
@@ -0,0 +1,75 @@
services:
cheshire-cat-core:
image: ghcr.io/cheshire-cat-ai/core:1.6.2
container_name: cheshire_cat_core
build:
context: ./docker
depends_on:
- cheshire-cat-vector-memory
environment:
PYTHONUNBUFFERED: "1"
WATCHFILES_FORCE_POLLING: "true"
CORE_HOST: ${CORE_HOST:-localhost}
CORE_PORT: ${CORE_PORT:-1865}
QDRANT_HOST: ${QDRANT_HOST:-cheshire_cat_vector_memory}
QDRANT_PORT: ${QDRANT_PORT:-6333}
CORE_USE_SECURE_PROTOCOLS: ${CORE_USE_SECURE_PROTOCOLS:-false}
API_KEY: ${API_KEY:-}
LOG_LEVEL: ${LOG_LEVEL:-WARNING}
DEBUG: ${DEBUG:-false}
SAVE_MEMORY_SNAPSHOTS: ${SAVE_MEMORY_SNAPSHOTS:-false}
HUGGINGFACE_TOKEN: ${HUGGINGFACE_TOKEN:-}
# we need DEBUG=false, otherwise the file watcher will interfere with Petals and Hivemind
ports:
- "${CORE_PORT:-1865}:80"
# This adds an entry to the container's /etc/hosts file mapping host.docker.internal to the host machine IP, allowing the container to access services running on the host — not only on Windows and macOS but also on Linux.
# See https://docs.docker.com/desktop/networking/#i-want-to-connect-from-a-container-to-a-service-on-the-host and https://docs.docker.com/reference/cli/docker/container/run/#add-host
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
- ./cat/static:/app/cat/static
- ./cat/plugins:/app/cat/plugins
- ./cat/data:/app/cat/data
restart: unless-stopped

cheshire-cat-vector-memory:
image: qdrant/qdrant:v1.9.1
container_name: cheshire_cat_vector_memory
environment:
LOG_LEVEL: ${LOG_LEVEL:-WARNING}
expose:
- ${QDRANT_PORT:-6333}
volumes:
- ./cat/long_term_memory/vector:/qdrant/storage
restart: unless-stopped

petals:
image: learningathome/petals:main
container_name: petals
command:
- /bin/bash
- -c
- |
set -e
huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential
python -m petals.cli.run_server \
--public_name ${PUBLIC_NAME:-cheshirecat_user} \
--port 31330 \
--balance_quality 0.2 \
--num_blocks ${PETAL_NUM_BLOCKS:-36} \
--max_disk_space ${PETAL_MAX_DISK_SPACE:-30GB} \
${PETAL_MODEL_NAME:-codellama/CodeLlama-7b-Instruct-hf}
# ${PETAL_MODEL_NAME:-bigscience/bloom-560m}
# ${PETAL_MODEL_NAME:-meta-llama/Meta-Llama-3.1-405B-Instruct}
ipc: host
ports:
- 31330:31330
volumes:
- ./petals-cache:/cache
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [ gpu ]
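The compose file above leans heavily on `${VAR:-default}` interpolation (e.g. `${PETAL_NUM_BLOCKS:-36}`): Docker Compose substitutes these from the environment or the `.env` file, falling back to the default when the variable is unset. The semantics mirror POSIX shell parameter expansion, so the behavior can be illustrated directly in a shell (`PETAL_NUM_BLOCKS` here is just reused as an example name):

```shell
# Unset variable -> the default after ":-" is used
unset PETAL_NUM_BLOCKS
echo "blocks: ${PETAL_NUM_BLOCKS:-36}"   # prints "blocks: 36"

# Set variable -> its value wins over the default
PETAL_NUM_BLOCKS=18
echo "blocks: ${PETAL_NUM_BLOCKS:-36}"   # prints "blocks: 18"
```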
31 changes: 31 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,31 @@
FROM ghcr.io/cheshire-cat-ai/core:1.7.1

### ENVIRONMENT VARIABLES ###
ENV PYTHONUNBUFFERED=1
ENV WATCHFILES_FORCE_POLLING=true

### SYSTEM SETUP ###
RUN apt-get -y update && apt-get install -y git && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

### INSTALL PYTHON DEPENDENCIES (Core) ###
WORKDIR /app

RUN pip install -U pip
RUN pip install --no-cache-dir --upgrade -v "fastembed==0.3.6"
RUN pip install --no-cache-dir --upgrade -v "typing-extensions>=4.9.0"
RUN pip install --no-cache-dir --upgrade -v "qdrant_client==1.11.0"
RUN pip install --no-cache-dir --upgrade -v "protobuf==4.25.5"
RUN pip install --no-cache-dir --upgrade -v "protobuf==4.25.5"
RUN pip install --no-cache-dir --upgrade -v "pydantic>=2.4.2"
RUN pip install --no-cache-dir --upgrade -v "huggingface-hub>=0.20.3"
RUN pip install --no-cache-dir --upgrade -v "unstructured>=0.12.6"
RUN pip install --no-cache-dir -v "petals @ git+https://github.com/bigscience-workshop/petals"
RUN pip install --no-cache-dir -v "hivemind @ git+https://github.com/learning-at-home/hivemind.git@213bff98a62accb91f254e2afdccbf1d69ebdea9"

# fix for https://github.com/tensorflow/models/issues/11192
RUN pip install --upgrade protobuf

### FINISH ###
CMD python3 -m cat.main
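Note that the final `pip install --upgrade protobuf` deliberately moves protobuf past its own `4.25.5` pin. A hypothetical sketch of a sanity check you could run inside the built image to see what each pin actually resolved to (`check_pin` is an illustration, not part of the project):

```python
# Hypothetical check that a package pinned in the Dockerfile resolved to the
# expected version in the running image.
from importlib.metadata import version, PackageNotFoundError

def check_pin(name: str, expected: str) -> str:
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name}: not installed"
    status = "ok" if installed == expected else f"differs from pin {expected}"
    return f"{name}: {installed} ({status})"

print(check_pin("qdrant-client", "1.11.0"))
```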
Empty file added petals-cache/.keep
Empty file.
Binary file added pictures/petals_settings.png
Binary file added pictures/petals_status.png