12 changes: 12 additions & 0 deletions .env.example
@@ -23,3 +23,15 @@ OLLAMA_DEBUG=false # Debug mode for Ollama service
OLLAMA_KEEP_ALIVE="5m" # Duration models stay loaded, default 5 minutes, can be set to e.g., "24h"
OLLAMA_MAX_LOADED_MODELS=1 # Maximum number of models loaded simultaneously, default to 1
OLLAMA_NUM_PARALLEL=1 # Maximum number of allocated contexts (parallel requests). Manage resources efficiently: if OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=3, the total context requirement can be up to 12 (4x3)

# Petals-specific settings

# how to get your huggingface token: https://huggingface.co/settings/tokens
HUGGINGFACE_TOKEN=your-huggingface-token-here
# if you host 10+ blocks, you can show your name on the swarm monitor page: https://health.petals.dev/
PUBLIC_NAME=put_your_name_here
# change this if you want to use a different model; check Hugging Face for available models
#PETAL_MODEL_NAME=bigscience/bloom-560m
PETAL_MODEL_NAME=codellama/CodeLlama-7b-Instruct-hf
# limit the amount of disk space used by the model cache
PETAL_MAX_DISK_SPACE=10GB
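The variables above are read from a local `.env` file that is created by copying `.env.example` (see the README changes below). A minimal sketch of bootstrapping it and catching a forgotten placeholder token — the temp-dir setup exists only to make the snippet self-contained; in the real repository you would run the `cp` and `grep` lines at the repo root:

```shell
# Self-contained demo workspace (stand-in for the repository root)
workdir=$(mktemp -d); cd "$workdir"
printf 'HUGGINGFACE_TOKEN=your-huggingface-token-here\n' > .env.example

# Create your local .env from the template (only if it does not exist yet)
[ -f .env ] || cp .env.example .env

# Warn if the Hugging Face token is still the template placeholder
if grep -q '^HUGGINGFACE_TOKEN=your-huggingface-token-here' .env; then
  echo "HUGGINGFACE_TOKEN is still the placeholder - edit .env first"
else
  echo "HUGGINGFACE_TOKEN looks configured"
fi
```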
1 change: 1 addition & 0 deletions .gitignore
@@ -7,3 +7,4 @@ __pycache__/
/cat/**
/ollama/*
.env
petals-cache
116 changes: 116 additions & 0 deletions README.md
@@ -7,6 +7,122 @@
> - **Technical Expertise Required:** Setting up and running local-cat requires some technical know-how.
> - **Hardware Requirements:** Performance may be slow without a recent GPU or NPU.

## What is Petals?

Petals lets you run large language models at home, BitTorrent-style. See the [homepage](https://petals.dev/) and the [GitHub](https://github.com/bigscience-workshop/petals) page for details.

## Prerequisites

1. A Hugging Face account — sign up [here](https://huggingface.co/join)

## What you need to do

1. Check Petals' [health status](https://health.petals.dev/) page and choose your preferred model.
2. Request access to the Hugging Face weights for the model you want to use; find more [here](https://huggingface.co/docs/hub/models-gated#gated-models).
3. Generate your Hugging Face [token](https://huggingface.co/settings/tokens)
(access is usually granted within a few minutes). When selecting token permissions, check the `Read access to contents of all public gated repos you can access` option under the `Repositories` group.
4. Once your token is created, create a `.env` file by duplicating `.env.example` and fill `HUGGINGFACE_TOKEN` with your token.
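Before starting the containers, it can save a failed startup to sanity-check the token value in `.env`. A hypothetical sketch (`check_hf_token` is an illustration, not part of the project; the "tokens start with `hf_`" rule is a heuristic based on the usual Hugging Face user-access-token format):

```python
# Hypothetical sanity check for the HUGGINGFACE_TOKEN line in a .env file.
# Hugging Face user access tokens normally start with "hf_"; treat anything
# else (or the template placeholder) as suspicious.
def check_hf_token(env_text: str) -> str:
    for line in env_text.splitlines():
        if line.startswith("HUGGINGFACE_TOKEN="):
            token = line.split("=", 1)[1].strip().strip('"')
            if not token or token == "your-huggingface-token-here":
                return "placeholder - paste your real token"
            if not token.startswith("hf_"):
                return "unexpected format - double-check the token"
            return "looks ok"
    return "HUGGINGFACE_TOKEN not set in .env"

print(check_hf_token("HUGGINGFACE_TOKEN=hf_abc123"))  # → looks ok
```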

### Setup Instructions

The Ollama container is removed because we use Petals as a replacement.
Currently, you have to build the cheshire-cat container locally.
The Petals container both runs the chosen model and shares some of your GPU resources with other users.

To run local-cat with Petals, follow these steps:

> [!IMPORTANT]
> Don't enable DEBUG mode in the `.env` file, otherwise it will interfere with Petals

1. Fill the `.env` file with the desired settings
2. Run `docker compose -f compose.petals.yml up -d`

### Petals container setup
1. The Petals container should start. In its log you will see that it is loading model blocks:

```
...
2024-10-27 12:31:59 petals | Login successful
2024-10-27 12:32:00 petals | /home/petals/src/petals/server/block_functions.py:165: SyntaxWarning: assertion is always true, perhaps remove parentheses?
2024-10-27 12:32:00 petals | assert (
2024-10-27 12:32:00 petals | Oct 27 11:32:00.715 [INFO] Running Petals 2.3.0.dev2
2024-10-27 12:32:01 petals | Oct 27 11:32:01.116 [INFO] Make sure you follow the Llama terms of use: https://llama.meta.com/llama3/license, https://llama.meta.com/llama2/license
2024-10-27 12:32:01 petals | Oct 27 11:32:01.116 [INFO] Using DHT prefix: CodeLlama-7b-Instruct-hf
2024-10-27 12:32:12 petals | Oct 27 11:32:12.757 [INFO] This server is accessible via relays
2024-10-27 12:32:14 petals | Oct 27 11:32:14.059 [INFO] Connecting to the public swarm
2024-10-27 12:32:14 petals | Oct 27 11:32:14.059 [INFO] Running a server on ['/ip4/127.0.0.1/tcp/31330/p2p/12D3KooWEbeDea5LSiaJEBuWLYh85h8kPgTXM8UNYo7bUCRpvP7X', '/ip4/172.18.0.2/tcp/31330/p2p/12D3KooWEbeDea5LSiaJEBuWLYh85h8kPgTXM8UNYo7bUCRpvP7X', '/ip6/::1/tcp/31330/p2p/12D3KooWEbeDea5LSiaJEBuWLYh85h8kPgTXM8UNYo7bUCRpvP7X']
2024-10-27 12:32:14 petals | Oct 27 11:32:14.219 [INFO] Model weights are loaded in bfloat16, quantized to nf4 format
2024-10-27 12:32:14 petals | Oct 27 11:32:14.219 [INFO] Attention cache for all blocks will consume up to 2.00 GiB
2024-10-27 12:32:14 petals | Oct 27 11:32:14.220 [INFO] Loading throughput info
2024-10-27 12:32:14 petals | Oct 27 11:32:14.239 [INFO] Reporting throughput: 467.9 tokens/sec for 32 blocks
2024-10-27 12:32:18 petals | Oct 27 11:32:18.175 [INFO] Announced that blocks [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] are joining
2024-10-27 12:34:02 petals | Oct 27 11:34:02.638 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 0
2024-10-27 12:34:08 petals | Oct 27 11:34:08.157 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 1
2024-10-27 12:34:13 petals | Oct 27 11:34:13.249 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 2
2024-10-27 12:34:17 petals | Oct 27 11:34:17.088 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 3
2024-10-27 12:34:21 petals | Oct 27 11:34:21.051 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 4
2024-10-27 12:34:24 petals | Oct 27 11:34:24.946 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 5
2024-10-27 12:34:28 petals | Oct 27 11:34:28.609 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 6
2024-10-27 12:34:32 petals | Oct 27 11:34:32.714 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 7
2024-10-27 12:34:36 petals | Oct 27 11:34:36.334 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 8
2024-10-27 12:34:40 petals | Oct 27 11:34:40.247 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 9
2024-10-27 12:34:43 petals | Oct 27 11:34:43.998 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 10
2024-10-27 12:34:49 petals | Oct 27 11:34:49.203 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 11
2024-10-27 12:34:54 petals | Oct 27 11:34:54.436 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 12
2024-10-27 12:34:59 petals | Oct 27 11:34:59.354 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 13
2024-10-27 12:35:04 petals | Oct 27 11:35:04.460 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 14
2024-10-27 12:35:09 petals | Oct 27 11:35:09.608 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 15
2024-10-27 12:35:14 petals | Oct 27 11:35:14.763 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 16
2024-10-27 12:35:19 petals | Oct 27 11:35:19.843 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 17
2024-10-27 12:35:24 petals | Oct 27 11:35:24.982 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 18
2024-10-27 12:35:29 petals | Oct 27 11:35:29.732 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 19
2024-10-27 12:35:34 petals | Oct 27 11:35:34.720 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 20
2024-10-27 12:35:39 petals | Oct 27 11:35:39.750 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 21
2024-10-27 12:35:44 petals | Oct 27 11:35:44.851 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 22
2024-10-27 12:35:49 petals | Oct 27 11:35:49.791 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 23
2024-10-27 12:36:46 petals | Oct 27 11:36:46.508 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 24
2024-10-27 12:36:50 petals | Oct 27 11:36:50.304 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 25
2024-10-27 12:36:54 petals | Oct 27 11:36:54.170 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 26
2024-10-27 12:36:57 petals | Oct 27 11:36:57.812 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 27
2024-10-27 12:37:01 petals | Oct 27 11:37:01.571 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 28
2024-10-27 12:37:05 petals | Oct 27 11:37:05.411 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 29
2024-10-27 12:37:09 petals | Oct 27 11:37:09.223 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 30
2024-10-27 12:37:13 petals | Oct 27 11:37:13.155 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 31
2024-10-27 12:37:17 petals | Oct 27 11:37:17.658 [INFO] Server is reachable from the Internet. It will appear at https://health.petals.dev soon
2024-10-27 12:37:17 petals | Oct 27 11:37:17.862 [INFO] Started
```
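Since all 32 blocks take a few minutes to load, a small script can summarize progress from the log instead of scrolling it. A hypothetical sketch (`loading_progress` is an illustration; in practice you would feed it the output of `docker logs petals`):

```python
import re

# Parse Petals server log lines like the excerpt above and report how many
# distinct blocks have finished loading out of the announced total.
LOADED = re.compile(r"\[INFO\] Loaded \S+ block (\d+)")

def loading_progress(log_lines, total_blocks=32):
    loaded = {int(m.group(1)) for line in log_lines if (m := LOADED.search(line))}
    return len(loaded), total_blocks

sample = [
    "Oct 27 11:34:02.638 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 0",
    "Oct 27 11:34:08.157 [INFO] Loaded codellama/CodeLlama-7b-Instruct-hf block 1",
]
print(loading_progress(sample))  # → (2, 32)
```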

2. Once loading has finished, check the Petals [health status](https://health.petals.dev/) page to see if your model is ready.

![image](pictures/petals_status.png)


### Cheshire Cat Setup
1. Log into cheshire-cat as admin, open the settings, and select `Petals` as the model
2. Put the model name into the model name field and save (use the copy button on the Hugging Face model page)

![image](pictures/petals_settings.png)

3. The Cat will start downloading the model's weights. This may take a while; wait until the progress bar in the log reaches 100% before using the Cat.

```
...
2024-10-27 12:53:48 Oct 27 11:53:48.703 [INFO] Make sure you follow the Llama terms of use: https://llama.meta.com/llama3/license, https://llama.meta.com/llama2/license
2024-10-27 12:53:48 Oct 27 11:53:48.703 [INFO] Using DHT prefix: CodeLlama-7b-Instruct-hf
2024-10-27 12:56:54
Downloading shards: 0%| | 0/2 [00:00<?, ?it/s]
Downloading shards: 50%|█████ | 1/2 [02:17<02:17, 137.13s/it]
Downloading shards: 100%|██████████| 2/2 [03:05<00:00, 84.74s/it]
Downloading shards: 100%|██████████| 2/2 [03:05<00:00, 92.60s/it]
2024-10-27 12:56:56
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:00<00:00, 2.09it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 3.12it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 2.91it/s]
```

This download happens every time the Cat is started.

## Ollama Setup

> [!IMPORTANT]
48 changes: 0 additions & 48 deletions cat/data/metadata.json

This file was deleted.

75 changes: 75 additions & 0 deletions compose.petals.yml
@@ -0,0 +1,75 @@
services:
cheshire-cat-core:
image: ghcr.io/cheshire-cat-ai/core:1.6.2
container_name: cheshire_cat_core
build:
context: ./docker
depends_on:
- cheshire-cat-vector-memory
environment:
PYTHONUNBUFFERED: "1"
WATCHFILES_FORCE_POLLING: "true"
CORE_HOST: ${CORE_HOST:-localhost}
CORE_PORT: ${CORE_PORT:-1865}
QDRANT_HOST: ${QDRANT_HOST:-cheshire_cat_vector_memory}
QDRANT_PORT: ${QDRANT_PORT:-6333}
CORE_USE_SECURE_PROTOCOLS: ${CORE_USE_SECURE_PROTOCOLS:-false}
API_KEY: ${API_KEY:-}
LOG_LEVEL: ${LOG_LEVEL:-WARNING}
DEBUG: ${DEBUG:-false}
SAVE_MEMORY_SNAPSHOTS: ${SAVE_MEMORY_SNAPSHOTS:-false}
HUGGINGFACE_TOKEN: ${HUGGINGFACE_TOKEN:-}
# we need DEBUG=false, otherwise the file watcher will interfere with Petals and Hivemind
ports:
- "${CORE_PORT:-1865}:80"
# This adds an entry to the container's /etc/hosts file mapping host.docker.internal to the host machine IP, allowing the container to access services running on the host — not only on Windows and macOS but also on Linux.
# See https://docs.docker.com/desktop/networking/#i-want-to-connect-from-a-container-to-a-service-on-the-host and https://docs.docker.com/reference/cli/docker/container/run/#add-host
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
- ./cat/static:/app/cat/static
- ./cat/plugins:/app/cat/plugins
- ./cat/data:/app/cat/data
restart: unless-stopped

cheshire-cat-vector-memory:
image: qdrant/qdrant:v1.9.1
container_name: cheshire_cat_vector_memory
environment:
LOG_LEVEL: ${LOG_LEVEL:-WARNING}
expose:
- ${QDRANT_PORT:-6333}
volumes:
- ./cat/long_term_memory/vector:/qdrant/storage
restart: unless-stopped

petals:
image: learningathome/petals:main
container_name: petals
command:
- /bin/bash
- -c
- |
set -e
huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential
python -m petals.cli.run_server \
--public_name ${PUBLIC_NAME:-cheshirecat_user} \
--port 31330 \
--balance_quality 0.2 \
--num_blocks ${PETAL_NUM_BLOCKS:-36} \
--max_disk_space ${PETAL_MAX_DISK_SPACE:-30GB} \
${PETAL_MODEL_NAME:-codellama/CodeLlama-7b-Instruct-hf}
# ${PETAL_MODEL_NAME:-bigscience/bloom-560m}
# ${PETAL_MODEL_NAME:-meta-llama/Meta-Llama-3.1-405B-Instruct}
ipc: host
ports:
- 31330:31330
volumes:
- ./petals-cache:/cache
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [ gpu ]
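The compose file above leans heavily on `${VAR:-default}` interpolation (e.g. `${PETAL_NUM_BLOCKS:-36}`): Docker Compose substitutes these from the environment or the `.env` file, falling back to the default when the variable is unset. The semantics mirror POSIX shell parameter expansion, so the behavior can be illustrated directly in a shell (`PETAL_NUM_BLOCKS` here is just reused as an example name):

```shell
# Unset variable -> the default after ":-" is used
unset PETAL_NUM_BLOCKS
echo "blocks: ${PETAL_NUM_BLOCKS:-36}"   # prints "blocks: 36"

# Set variable -> its value wins over the default
PETAL_NUM_BLOCKS=18
echo "blocks: ${PETAL_NUM_BLOCKS:-36}"   # prints "blocks: 18"
```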
31 changes: 31 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,31 @@
FROM ghcr.io/cheshire-cat-ai/core:1.7.1

### ENVIRONMENT VARIABLES ###
ENV PYTHONUNBUFFERED=1
ENV WATCHFILES_FORCE_POLLING=true

### SYSTEM SETUP ###
RUN apt-get -y update && apt-get install -y git && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

### INSTALL PYTHON DEPENDENCIES (Core) ###
WORKDIR /app

RUN pip install -U pip
RUN pip install --no-cache-dir --upgrade -v "fastembed==0.3.6"
RUN pip install --no-cache-dir --upgrade -v "typing-extensions>=4.9.0"
RUN pip install --no-cache-dir --upgrade -v "qdrant_client==1.11.0"
RUN pip install --no-cache-dir --upgrade -v "protobuf==4.25.5"
RUN pip install --no-cache-dir --upgrade -v "protobuf==4.25.5"
RUN pip install --no-cache-dir --upgrade -v "pydantic>=2.4.2"
RUN pip install --no-cache-dir --upgrade -v "huggingface-hub>=0.20.3"
RUN pip install --no-cache-dir --upgrade -v "unstructured>=0.12.6"
RUN pip install --no-cache-dir -v "petals @ git+https://github.com/bigscience-workshop/petals"
RUN pip install --no-cache-dir -v "hivemind @ git+https://github.com/learning-at-home/hivemind.git@213bff98a62accb91f254e2afdccbf1d69ebdea9"

# fix for https://github.com/tensorflow/models/issues/11192
RUN pip install --upgrade protobuf

### FINISH ###
CMD python3 -m cat.main
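Note that the final `pip install --upgrade protobuf` deliberately moves protobuf past its own `4.25.5` pin. A hypothetical sketch of a sanity check you could run inside the built image to see what each pin actually resolved to (`check_pin` is an illustration, not part of the project):

```python
# Hypothetical check that a package pinned in the Dockerfile resolved to the
# expected version in the running image.
from importlib.metadata import version, PackageNotFoundError

def check_pin(name: str, expected: str) -> str:
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name}: not installed"
    status = "ok" if installed == expected else f"differs from pin {expected}"
    return f"{name}: {installed} ({status})"

print(check_pin("qdrant-client", "1.11.0"))
```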
Empty file added petals-cache/.keep
Empty file.
Binary file added pictures/petals_settings.png
Binary file added pictures/petals_status.png