[Docs][Serve] Speed up weights loading by AMI and Docker Image #3073
.. _customizing-sky-serve-controller-resources:

Customizing SkyServe controller resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to customize the resources of the SkyServe controller for several reasons:
These settings will not take effect if you have an existing controller (either
stopped or live). For them to take effect, tear down the existing controller
first, which requires all services to be terminated.

Speeding Up Weights Loading in Large Model Serving
--------------------------------------------------

When serving large models, the model weights are loaded from cloud storage or the public internet onto the VMs, which can take a long time. Much of the time saved by the methods below comes from packaging the runtime dependencies ahead of time, in addition to any speedup in fetching the weights themselves. To reduce this startup overhead, you can use the following methods:
Use Machine Image
~~~~~~~~~~~~~~~~~

You can use a machine image that has the weights preloaded. To build one, create a VM, load the weights onto it, and then create a machine image from that VM. You can then use the machine image to launch the VMs for serving:

.. code-block:: yaml
   :emphasize-lines: 8-9

   service:
     replicas: 2
     readiness_probe: /v1/models

   resources:
     ports: 8080
     accelerators: {L4:8, A10g:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
     cloud: gcp
     image_id: projects/my-project/global/machineImages/image-with-model-weights

   setup: |
     conda create -n vllm python=3.9 -y
     conda activate vllm
     pip install vllm

   run: |
     conda activate vllm
     python -m vllm.entrypoints.openai.api_server \
       --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
       --host 0.0.0.0 --port 8080 \
       --model mistralai/Mixtral-8x7B-Instruct-v0.1
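
As a concrete illustration of the steps above, on GCP the machine image can be created from a VM that already has the weights downloaded. This is a hypothetical sketch; the instance, zone, and project names are placeholders, and the exact flags may vary with your setup:

.. code-block:: bash

   # After launching a VM and downloading the weights onto it
   # (e.g., with huggingface-cli), snapshot it as a machine image.
   gcloud compute machine-images create image-with-model-weights \
     --source-instance my-weights-vm \
     --source-instance-zone us-central1-a \
     --project my-project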

Notice that the :code:`cloud` field must be specified when :code:`image_id` is used. You can use :code:`any_of` in :code:`resources` to support multiple clouds:

.. code-block:: yaml
   :emphasize-lines: 8-14

   service:
     replicas: 2
     readiness_probe: /v1/models

   resources:
     ports: 8080
     accelerators: {L4:8, A10g:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
     any_of:
     - cloud: gcp
       image_id: projects/my-project/global/machineImages/image-with-model-weights
     - cloud: aws
       image_id:
         us-east-1: ami-0729d913a335efca7
         us-west-2: ami-050814f384259894c

   # Here goes the setup and run commands...

Notice that AWS needs a per-region image, since an AMI is only valid within the region where it was created. This requires configuring an image for every region you might use, but it can significantly reduce startup time.
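
Since AMIs are region-scoped, the per-region IDs can be produced by copying one source AMI into each target region with the AWS CLI. A hypothetical sketch (the regions are placeholders; repeat per target region):

.. code-block:: bash

   # Copy the source AMI from us-east-1 into us-west-2.
   aws ec2 copy-image \
     --source-image-id ami-0729d913a335efca7 \
     --source-region us-east-1 \
     --region us-west-2 \
     --name image-with-model-weights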

Use Docker Container as Runtime Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also :ref:`use Docker containers as the runtime environment <docker-containers-as-runtime-environments>`. To do this, build a Docker image that has the weights preloaded, and then use that image to launch the VMs for serving:

.. code-block:: yaml
   :emphasize-lines: 8

   service:
     replicas: 2
     readiness_probe: /v1/models

   resources:
     ports: 8080
     accelerators: {L4:8, A10g:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
     image_id: docker:docker-image-with-model-weights

   # Here goes the setup and run commands...

Review comment: We could mention how this Docker image should be built; in particular, it could have the SkyPilot runtime pre-built. A concrete example of installing vLLM and downloading the weights in the Dockerfile would be a useful reference.

This is easier to configure than machine images, but it may have a longer startup time than machine images, since the Docker image needs to be pulled from the registry.
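
In response to the review comment above, a Docker image with preloaded weights could be built along these lines. This is a hypothetical sketch: the base image and tag are placeholders, and downloading gated models may require passing a Hugging Face token at build time:

.. code-block:: dockerfile

   # Start from a CUDA base image so vLLM can use the GPUs.
   FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

   RUN apt-get update && apt-get install -y python3 python3-pip && \
       pip install vllm "huggingface_hub[cli]"

   # Bake the model weights into the image so replicas skip the download.
   RUN huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1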

Review discussion:

- Have we actually timed these two methods?
- I feel it is hard to make a fair comparison: it largely depends on the base Docker/machine image used. Though I'll try to run some benchmarks and see the results 🫡
- It seems the benefit of using a machine image or Docker image is more about reducing setup time than model weight downloading time: they mostly package the dependencies, and do not necessarily speed up the download of the image or the model, which should be limited by network bandwidth instead. Should we rewrite the section as reducing the overhead of environment setup?
- Good point! Let me rephrase that.
- I guess there are some speedups if using a machine image? That should be optimized by the cloud provider, which makes it faster than a plain network download.
- Machine images are intended to be used in a single region. One can be used to launch a VM in another region, but that involves data transfer between regions. The cloud provider has likely optimized this, but we should be careful with the wording to avoid giving the impression that all of the benefit comes from faster weight loading.