This project deploys a private Retrieval-Augmented Generation (RAG) API using LLaMA 3.2 and vLLM.
✅ Serverless (scale to zero) ✅ Private API ✅ Your own infrastructure ✅ Multi-GPU support
- Clone this repository:

  ```bash
  git clone <your-repo-url>
  cd <your-repo-directory>
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Ensure these modules are in your project directory:
  - `ingestion.py`
  - `retriever.py`
  - `prompt_template.py`
- Download the LLaMA model weights from [appropriate source].
- Place the weights in [appropriate directory].
- Update `model_name` in `rag.py` if necessary.
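  For reference, loading a model with vLLM usually looks like the sketch below. The variable names and the `meta-llama/Llama-3.2-3B-Instruct` identifier are illustrative assumptions, not necessarily what `rag.py` actually uses:

  ```python
  from vllm import LLM, SamplingParams

  # Illustrative only: substitute the model name or local weights path
  # that rag.py expects.
  model_name = "meta-llama/Llama-3.2-3B-Instruct"

  llm = LLM(model=model_name)  # also accepts a local directory of weights
  params = SamplingParams(temperature=0.2, max_tokens=256)

  outputs = llm.generate(["What does this project do?"], params)
  print(outputs[0].outputs[0].text)
  ```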
- Add the documents you want to chat with to the `./docs` folder.
- Start the server:

  ```bash
  python server.py
  ```
- Use the API:

  ```bash
  python client.py --query "Your question here"
  ```
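  If you prefer raw HTTP, LitServe serves requests on a `/predict` endpoint by default. The port and the `query` field below are assumptions that depend on how `server.py` configures the server:

  ```bash
  # Assumes LitServe's default port (8000) and a JSON body with a "query" field.
  curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"query": "Your question here"}'
  ```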
- Expose the server to the internet (authentication optional)
- Enable "auto start" for serverless operation
- Optimize performance with LitServe features such as batching and multi-GPU serving (see the sketch below)
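As a rough illustration of those LitServe knobs, a minimal server might look like the following. The `RAGAPI` class, its placeholder pipeline, and the parameter values are assumptions for illustration, not the actual contents of `server.py`:

```python
import litserve as ls

class RAGAPI(ls.LitAPI):
    def setup(self, device):
        # Hypothetical: load the retriever and the vLLM engine here.
        self.pipeline = lambda query: f"answer for: {query}"  # placeholder

    def decode_request(self, request):
        return request["query"]

    def predict(self, queries):
        # With max_batch_size > 1, LitServe hands predict a list of
        # decoded requests and splits the returned list back out.
        return [self.pipeline(q) for q in queries]

    def encode_response(self, output):
        return {"response": output}

if __name__ == "__main__":
    server = ls.LitServer(
        RAGAPI(),
        accelerator="gpu",   # or "auto"
        devices=2,           # multi-GPU
        max_batch_size=8,    # enable dynamic batching
        batch_timeout=0.05,  # seconds to wait while filling a batch
    )
    server.run(port=8000)
```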
This project utilizes:
- RAG (Retrieval-Augmented Generation)
- vLLM for efficient LLM serving
- Vector database (self-hosted Qdrant; see the Docker sketch below)
- LitServe for scalable inference
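If you need to stand up the Qdrant instance yourself, the stock Docker image is the usual route. The port mapping below uses Qdrant's defaults and assumes the ingestion and retriever code point at `localhost:6333`:

```bash
# Run a local Qdrant instance on its default ports (REST 6333, gRPC 6334),
# persisting data to ./qdrant_storage.
docker run -p 6333:6333 -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage" \
  qdrant/qdrant
```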
For more details on these components, refer to the full documentation.