RAG API with LLaMA 3.2

This project deploys a private Retrieval-Augmented Generation (RAG) API built on the LLaMA 3.2 model and served with vLLM for performance and scalability. The API combines advanced language modeling with efficient retrieval to provide strong question-answering and document-generation capabilities.

Features

✅ Serverless (scale to zero)
✅ Private API
✅ Your own infrastructure
✅ Multi-GPU support

Installation

  1. Clone this repository:

    git clone https://github.com/learnagi/vllm-rag-api.git
    cd vllm-rag-api
    
  2. Install required packages:

    pip install -r requirements.txt
    
  3. Ensure these modules are in your project directory:

    • ingestion.py
    • retriever.py
    • prompt_template.py
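
These modules implement the document ingestion, retrieval, and prompting steps of the pipeline. As a rough, hypothetical sketch of what prompt_template.py might contain (the actual template in this repository may differ):

    # prompt_template.py (hypothetical sketch; see the repository for the real template)
    RAG_PROMPT = """Answer the question using only the context below.

    Context:
    {context}

    Question: {question}

    Answer:"""

    def build_prompt(context: str, question: str) -> str:
        """Fill the template with retrieved context and the user's question."""
        return RAG_PROMPT.format(context=context, question=question)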

LLaMA Model Setup

  1. Download the LLaMA 3.2 model weights (for example, from the meta-llama organization on Hugging Face, after accepting Meta's license).
  2. Place the weights in your local model directory.
  3. Update model_name in rag.py if necessary.
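
For reference, vLLM can load weights either from a local path or from a Hugging Face model ID. A minimal sketch of what the model_name setting drives (the model ID below is an example; the actual structure of rag.py may differ):

    # Sketch of how rag.py might hand model_name to vLLM.
    from vllm import LLM, SamplingParams

    model_name = "meta-llama/Llama-3.2-3B-Instruct"  # example ID; use a local path if weights are on disk

    llm = LLM(model=model_name)  # loads the weights and builds the inference engine
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Hello!"], params)
    print(outputs[0].outputs[0].text)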

Usage

  1. Add the documents you want to chat with to the ./docs folder.

  2. Start the server:

    python server.py
    
  3. Use the API:

    python client.py --query "Your question here"
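
You can also call the API directly over HTTP. LitServe servers expose a /predict endpoint on port 8000 by default; the sketch below assumes the request body is JSON with a "query" field (check server.py for the actual schema):

    # Direct HTTP call (sketch); the "query" field name is an assumption -- see server.py.
    import requests

    resp = requests.post(
        "http://localhost:8000/predict",
        json={"query": "Your question here"},
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json())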
    

Deployment

  • Expose the server to the internet (authentication optional)
  • Enable "auto start" for serverless operation
  • Optimize performance with LitServe features (batching, multi-GPU, etc.)
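
Batching and multi-GPU serving are configured when the LitServer is constructed. A minimal sketch, with a stub LitAPI standing in for the real RAG pipeline in server.py (class and field names here are illustrative):

    # Sketch: dynamic batching and multi-GPU via LitServer arguments.
    import litserve as ls

    class RAGLitAPI(ls.LitAPI):
        def setup(self, device):
            # load the vLLM engine and retriever here (omitted in this sketch)
            pass

        def decode_request(self, request):
            return request["query"]

        def predict(self, queries):
            # with batching enabled, predict receives a list of decoded requests
            return [f"echo: {q}" for q in queries]

        def encode_response(self, output):
            return {"answer": output}

    if __name__ == "__main__":
        server = ls.LitServer(
            RAGLitAPI(),
            accelerator="gpu",
            devices=2,           # spread workers across two GPUs
            max_batch_size=8,    # batch up to 8 concurrent requests
            batch_timeout=0.05,  # wait up to 50 ms to fill a batch
        )
        server.run(port=8000)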

Background

This project utilizes:

  • RAG (Retrieval-Augmented Generation)
  • vLLM for efficient LLM serving
  • Vector database (self-hosted Qdrant)
  • LitServe for scalable inference
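
For orientation, a self-hosted Qdrant instance is typically queried as below; the collection name and embedding model are assumptions for illustration (see retriever.py for what the project actually does):

    # Qdrant similarity search (sketch); collection and model names are assumptions.
    from qdrant_client import QdrantClient
    from sentence_transformers import SentenceTransformer

    client = QdrantClient(host="localhost", port=6333)  # default self-hosted Qdrant port
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model

    query_vector = encoder.encode("Your question here").tolist()
    hits = client.search(
        collection_name="docs",   # assumed collection name
        query_vector=query_vector,
        limit=5,
    )
    for hit in hits:
        print(hit.score, hit.payload)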

For more details on these components, refer to the full documentation.
