Hi folks,
I wanted to get some perspective on how best to add vLLM integration to Allycat.
It seems trickier than I initially expected, since I am on a CPU-only machine. Here is my analysis so far on integrating vLLM:
For CPU-only machines, we would need to build vLLM from source (probably within the Allycat Docker image), since there are no prebuilt CPU-only binaries, and then run the vLLM inference server inside the container based on MY_CONFIG. For GPU machines, we could use the same Docker build file, except we would not compile from source (prebuilt vLLM wheels already exist for GPU) and would only slightly modify the Docker run command so vLLM has access to the GPUs.
So overall, I think it would be easier to implement vLLM support only within Docker (for the sake of setup simplicity, especially for CPU-only machines).
In terms of the Allycat source code, it seems straightforward to add the LiteLLM + vLLM code to the three files (4_query.py, app_chainlit.py, and app_flask.py), since LiteLLM is essentially a proxy. Then we just need to add and update the documentation for customizing Allycat and using vLLM.
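For illustration, here is a minimal sketch of what that LiteLLM call could look like against vLLM's OpenAI-compatible server; the model name, port, and prompt are placeholders, not Allycat's actual MY_CONFIG values:

```python
# Minimal sketch (not Allycat's actual code): LiteLLM routed to a vLLM server
# exposing the OpenAI-compatible API on its default port 8000.
from litellm import completion

response = completion(
    model="openai/meta-llama/Llama-3.2-1B-Instruct",  # placeholder model id
    api_base="http://localhost:8000/v1",              # assumed vLLM server address
    api_key="not-needed",                             # vLLM does not check the key by default
    messages=[{"role": "user", "content": "Hello from Allycat"}],
)
print(response.choices[0].message.content)
```

Because the call goes through LiteLLM's generic OpenAI-compatible route, swapping the backend between Ollama and vLLM is mostly a matter of changing the model string and api_base.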
What are your thoughts? Do you think it would be worthwhile to integrate vLLM into the project?
If most users are running Allycat on CPU-only machines, the vLLM source build would add significant overhead (I have read it can take over 20 minutes) unless we make it optional somehow.
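On the application side, keeping vLLM opt-in could look roughly like the sketch below; the LLM_BACKEND and LLM_MODEL names are hypothetical, not the real MY_CONFIG fields:

```python
# Hypothetical sketch: config-driven backend selection with made-up field names.
# Ollama stays the default; vLLM is only used when explicitly requested.
from litellm import completion

def llm_complete(config, messages):
    backend = getattr(config, "LLM_BACKEND", "ollama")
    if backend == "vllm":
        # Assumes a vLLM OpenAI-compatible server is already running in the container.
        return completion(
            model=f"openai/{config.LLM_MODEL}",
            api_base="http://localhost:8000/v1",
            api_key="not-needed",  # vLLM does not check the key by default
            messages=messages,
        )
    # Default path: Ollama, which LiteLLM also proxies out of the box.
    return completion(model=f"ollama/{config.LLM_MODEL}", messages=messages)
```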
Besides, I think Ollama is already very good for LLM inference, especially on local CPU-only machines.
Alternatively, I could integrate vLLM support for GPUs only and do the development/testing work on Kaggle. Supporting GPUs only would be much easier to implement.