[Docs][Serve] Speed up weights loading by AMI and Docker Image #3073

Open · wants to merge 13 commits into base: master
82 changes: 81 additions & 1 deletion docs/source/serving/sky-serve.rst
@@ -488,7 +488,7 @@ You can see the controller with :code:`sky status` and refresh its status by usi
.. _customizing-sky-serve-controller-resources:

Customizing SkyServe controller resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to customize the resources of the SkyServe controller for several reasons:

@@ -521,3 +521,83 @@ The :code:`resources` field has the same spec as a normal SkyPilot job; see `her
These settings will not take effect if you have an existing controller (either
stopped or live). For them to take effect, tear down the existing controller
first, which requires all services to be terminated.

Speed Up Weight Loading When Serving Large Models
-------------------------------------------------

When serving large models, the model weights are loaded from cloud storage or the public internet onto the VMs. This process can take a long time, especially for large models. To speed up weight loading, you can use the following methods:
Collaborator

It seems the benefit of using a machine image or docker image is more about reducing setup time than model weight download time: they are mostly for packaging the dependencies, and do not necessarily speed up downloading the image or the model, which should be limited by network bandwidth instead.

Should we rewrite the section as reducing the overhead of environment setup?

Collaborator Author

Good point! Let me rephrase that

Collaborator Author

I guess there is some speedup if using a machine image? That should be optimized by the cloud provider, which makes it faster than a plain network download?

Collaborator

Machine images are intended to be used within a single region. One can be used to launch a VM in another region, but that involves data transfer from one region to another. Cloud providers have likely optimized this, but we probably want to be careful about the wording to avoid giving the impression that all the benefit comes from weight loading.


Use a Machine Image
~~~~~~~~~~~~~~~~~~~

You can use a machine image that has the weights preloaded. To build one, create a VM, download the weights onto it, and then create a machine image from that VM. You can then use the machine image to launch the VMs for serving:
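
The image-creation steps above can be sketched with the ``gcloud`` CLI. This is a hypothetical sketch: the instance name, zone, and download command are illustrative, and the weights could equally be copied in from a cloud bucket:

.. code-block:: bash

   # Launch a VM and download the weights onto it (names and zone are illustrative).
   gcloud compute instances create weights-loader --zone us-central1-a
   gcloud compute ssh weights-loader --zone us-central1-a \
     --command 'pip install "huggingface_hub[cli]" && huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1'

   # Capture the VM's disk, including the cached weights, as a machine image.
   gcloud compute machine-images create image-with-model-weights \
     --source-instance weights-loader --source-instance-zone us-central1-a

   # The VM is no longer needed once the image exists.
   gcloud compute instances delete weights-loader --zone us-central1-a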

.. code-block:: yaml
:emphasize-lines: 7-8

service:
replicas: 2
readiness_probe: /v1/models

resources:
ports: 8080
accelerators: {L4:8, A10g:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
cloud: gcp
image_id: projects/my-project/global/machineImages/image-with-model-weights

setup: |
conda create -n vllm python=3.9 -y
conda activate vllm
pip install vllm

run: |
conda activate vllm
python -m vllm.entrypoints.openai.api_server \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--host 0.0.0.0 --port 8080 \
--model mistralai/Mixtral-8x7B-Instruct-v0.1
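
Assuming the YAML above is saved as :code:`service.yaml` (the filename is illustrative), the service can then be launched and monitored as usual:

.. code-block:: console

   $ sky serve up service.yaml
   $ sky serve status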

Notice that the :code:`cloud` field must be specified when :code:`image_id` is used. To support multiple clouds, use :code:`any_of` in :code:`resources`:

.. code-block:: yaml
:emphasize-lines: 7-13

service:
replicas: 2
readiness_probe: /v1/models

resources:
ports: 8080
accelerators: {L4:8, A10g:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
any_of:
- cloud: gcp
image_id: projects/my-project/global/machineImages/image-with-model-weights
- cloud: aws
image_id:
us-east-1: ami-0729d913a335efca7
us-west-2: ami-050814f384259894c

# Here goes the setup and run commands...

Notice that AWS needs a per-region image, since an AMI is only valid within a single region. This requires configuring a number of images on the cloud console, but it can significantly speed up weight loading.
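
Rather than configuring every region by hand in the console, per-region AMIs can be produced by copying one image across regions with the AWS CLI (a sketch; the AMI ID here is the us-east-1 image from the example above):

.. code-block:: bash

   # Copy the us-east-1 AMI to us-west-2; AWS assigns a new region-specific AMI ID,
   # which then goes into the per-region image_id mapping.
   aws ec2 copy-image \
     --source-region us-east-1 \
     --source-image-id ami-0729d913a335efca7 \
     --region us-west-2 \
     --name image-with-model-weights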

Use a Docker Container as the Runtime Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also :ref:`use Docker containers as runtime environments <docker-containers-as-runtime-environments>`. Build a Docker image that has the weights preloaded, then use it to launch the VMs for serving:

.. code-block:: yaml
:emphasize-lines: 7

service:
replicas: 2
readiness_probe: /v1/models

resources:
ports: 8080
accelerators: {L4:8, A10g:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
image_id: docker:docker-image-with-model-weights

# Here goes the setup and run commands...
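
As a sketch, the Dockerfile for such an image could install the dependencies and download the weights at build time, so that replicas skip both steps at startup. The base image and model below are illustrative assumptions, not a prescribed setup:

.. code-block:: dockerfile

   # Hypothetical sketch: pick a CUDA-enabled base image of your choice.
   FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

   # Install Python, vLLM, and the Hugging Face CLI.
   RUN apt-get update && apt-get install -y python3-pip && \
       pip3 install vllm "huggingface_hub[cli]"

   # Pre-download the model weights into the image so serving skips the download.
   RUN huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1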

Collaborator

We could mention how this docker image should be built; in particular, it could have the SkyPilot runtime pre-built. Something like the following would be useful (could you help give a concrete example of how to install vllm and download the model in the Dockerfile below for a better reference, i.e., replacing the line # Your dependencies installation and model download code goes here with actual workable commands for serving vllm + Mistral?):

Your docker image can have all SkyPilot dependencies pre-installed to further reduce the setup time; you could try building your docker image from our base image. The `Dockerfile` could look like the following:
```Dockerfile
FROM berkeleyskypilot/skypilot-k8s-gpu:latest

# Your dependencies installation and model download code goes here
```

This is easier to configure than a machine image, but it may have a longer startup time, since the docker image needs to be pulled from the registry.
Member

Have we actually timed these two methods?

Collaborator Author

I feel like it is hard to make a fair comparison - it largely depends on the base docker/machine image used... Though I'll try to run some benchmarks and see the results 🫡