Hi folks,
I wanted to get some perspective on how best to add vLLM integration to Allycat.
It seems trickier than I initially expected, since I am on a CPU-only machine. Here is my analysis so far on integrating vLLM:
For CPU-only machines, we would need to build vLLM from source (probably within the Allycat Docker image), since there are no prebuilt CPU-only binaries, and then run the vLLM inference server inside the container based on MY_CONFIG. For GPU machines, we could use the same Docker build file, except we would not compile from source (prebuilt vLLM wheels already exist for GPU) and would only slightly modify the Docker run command so vLLM has access to the GPUs.
So overall, I think it would be easier to implement vLLM support only within Docker (for the sake of setup simplicity, especially for CPU-only machines).
In terms of the Allycat source code, it seems straightforward to add the LiteLLM + vLLM code to the three files (4_query.py, app_chainlit.py, and app_flask.py), since LiteLLM is essentially a proxy. Then we just need to add and update the documentation for customizing Allycat and using vLLM.
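For illustration, here is a minimal sketch of what that LiteLLM call could look like against vLLM's OpenAI-compatible server; the model name, port, and prompt are placeholders, not Allycat's actual MY_CONFIG values:

```python
# Minimal sketch (not Allycat's actual code): LiteLLM routed to a vLLM server
# exposing the OpenAI-compatible API on its default port 8000.
from litellm import completion

response = completion(
    model="openai/meta-llama/Llama-3.2-1B-Instruct",  # placeholder model id
    api_base="http://localhost:8000/v1",              # assumed vLLM server address
    api_key="not-needed",                             # vLLM does not check the key by default
    messages=[{"role": "user", "content": "Hello from Allycat"}],
)
print(response.choices[0].message.content)
```

Because the call goes through LiteLLM's generic OpenAI-compatible route, swapping the backend between Ollama and vLLM is mostly a matter of changing the model string and api_base.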
What are your thoughts? Do you think it would be worthwhile to integrate vLLM into the project?
If most users are running Allycat on CPU-only machines, the vLLM source build would add significant overhead (I have read it can take over 20 minutes) unless we make it optional somehow.
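On the application side, keeping vLLM opt-in could look roughly like the sketch below; the LLM_BACKEND and LLM_MODEL names are hypothetical, not the real MY_CONFIG fields:

```python
# Hypothetical sketch: config-driven backend selection with made-up field names.
# Ollama stays the default; vLLM is only used when explicitly requested.
from litellm import completion

def llm_complete(config, messages):
    backend = getattr(config, "LLM_BACKEND", "ollama")
    if backend == "vllm":
        # Assumes a vLLM OpenAI-compatible server is already running in the container.
        return completion(
            model=f"openai/{config.LLM_MODEL}",
            api_base="http://localhost:8000/v1",
            api_key="not-needed",  # vLLM does not check the key by default
            messages=messages,
        )
    # Default path: Ollama, which LiteLLM also proxies out of the box.
    return completion(model=f"ollama/{config.LLM_MODEL}", messages=messages)
```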
Besides, I think Ollama is already very good for LLM inference, especially on local CPU-only machines.
Alternatively, I could integrate vLLM support for GPUs only and do the development/testing work on Kaggle. Supporting GPUs only would be much easier to implement.