Local RAG Server

A fully local retrieval-augmented generation (RAG) system for querying PDF and DOCX documents using a local LLM. Designed for multi-user deployment on corporate networks — no cloud, no API keys.

PDF / DOCX files
      |
      v
 Text extraction & cleaning
      |
      v
 Chunking (overlapping, paragraph-aware)
      |
      v
 all-MiniLM-L6-v2 embeddings (384-dim, CPU friendly)
      |
      v
 Qdrant (local vector store)
      |
      v
 Hybrid retrieval (Dense + BM25 + optional cross-encoder reranker)
      |
      v
 FastAPI + llama.cpp  →  Browser UI

graph TD
    %% Node Definitions
    User((User))
    UI[Web Interface]
    API[FastAPI Backend]

    subgraph Storage [Data Storage]
        Raw[data_raw folder]
        Qdrant[(Qdrant Vector DB)]
    end

    subgraph Logic [Processing Engine]
        Parse[Document Parser]
        Check{GPU Enabled?}
        LLM[llama.cpp Server]
    end

    %% Flow
    User -->|1. Place Docs| Raw
    User -->|2. Configure| Env[.env]
    User -->|3. Chat| UI

    UI <-->|HTTP| API
    API -->|Scan| Raw
    Raw --> Parse -->|Embed| Qdrant

    API -->|Hybrid Retrieve| Qdrant
    Qdrant -->|Context| LLM

    Env -->|Read| Check
    Check -->|Yes| GPU[NVIDIA CUDA]
    Check -->|No| CPU[System CPU]

    GPU --> LLM
    CPU --> LLM
    LLM -->|Response| UI

    %% Styling
    style User fill:#f9f,stroke:#333,stroke-width:2px
    style Qdrant fill:#00d2ff,stroke:#333,stroke-width:2px
    style GPU fill:#76b900,stroke:#333,stroke-width:2px,color:#fff
    style UI fill:#fff,stroke:#333,stroke-dasharray: 5 5

Web UI

Once running, the web UI is accessible at http://localhost:8000 from the machine running the stack.

Accessing from other devices on your network: If you're running Local RAG on a server or another machine, access it via the host's IP address:

http://192.168.x.x:8000

Exposing to the internet: Use a reverse proxy like Caddy or nginx in front of port 8000, or use a tunnel like Cloudflare Tunnel for zero-config public access.

Warning: There is no authentication built in — if exposing publicly, put it behind a password-protected reverse proxy.

Screenshot

Platform Support

Platform	Docker	Native
Linux	`./run.sh --docker`	`./run.sh`
macOS	`./run.sh --docker`	`./run.sh`
Windows	`start.bat`	WSL2 only

Windows users: Just install Docker Desktop and double-click start.bat — no WSL2, Python, or manual setup needed.

Linux/macOS native: Requires Docker only for Qdrant. Everything else runs directly on your machine via ./run.sh.

GPU Acceleration

Hardware	Expected Speed
CPU only (4-8B model)	3–15 tok/s
NVIDIA GPU (4070 Ti)	100–150 tok/s
NVIDIA GPU (3090/4090)	150–200 tok/s

GPU setup instructions are in the Enabling GPU section below.

Requirements

Python 3.10+
Docker
Git, CMake, and a C++ compiler (native workflow only)

On Ubuntu:

sudo apt install build-essential cmake git docker.io python3-venv python3-full
sudo usermod -aG docker $USER && newgrp docker

Quick Start

1. Clone the repo

git clone https://github.com/burnoutmonk/local-rag-server.git
cd local-rag-server

2. Add your documents

Place PDF and DOCX files in the data_raw/ folder.

3. Configure

cp .env.example .env

Open .env and set at minimum:

LLM_MODEL_FILE — filename of your GGUF model (downloaded automatically)
LLM_MODEL_REPO — HuggingFace repo to download from
LLM_THREADS — number of CPU cores to use
LLM_GPU_LAYERS — set to -1 to use GPU, 0 for CPU only

4. Start everything

Option A — Windows (Docker Desktop)

Install Docker Desktop then double-click start.bat or run:

start.bat

This starts all services in the background, waits until everything is ready, and opens your browser automatically at http://localhost:8000.

To stop:

docker compose down

To enable GPU:

# in .env
CUDA_AVAILABLE=true
LLM_GPU_LAYERS=-1

Then re-run start.bat.

Option B — Linux / macOS (Docker)

cp .env.example .env    # adjust if needed
./run.sh --docker

Opens your browser automatically when ready. Same GPU settings as above via .env.

To stop:

docker compose down

Option C — Linux / macOS (Native)

Runs everything directly on your machine without Docker (except Qdrant). Easier to debug and faster iteration during development.

chmod +x run.sh
./run.sh

This automatically:

Creates a Python virtual environment and installs dependencies
Builds llama.cpp from source (first run only — takes 10–20 minutes)
Downloads the model from HuggingFace
Starts Qdrant via Docker
Ingests documents from data_raw/
Measures LLM speed and updates config.py
Starts the web UI at http://localhost:8000

Press Ctrl+C to stop all services.

Useful flags:

./run.sh --skip-ingest   # skip ingestion if documents haven't changed
./run.sh --skip-llm      # skip LLM server if already running

Startup Optimizations

The Docker stack is optimized for fast restarts when nothing has changed:

Service	First run	Subsequent restarts
`model_downloader`	Downloads GGUF from HuggingFace	Shell `test -f` — exits in <1s
`ingest`	Parses, chunks, and embeds all documents	Hash check only — skips embedding model load entirely
`benchmark`	Waits for LLM, measures tok/s	Sees `.benchmarked_cpu` or `.benchmarked_gpu` marker — exits in <1s

The benchmark uses separate marker files for CPU and GPU modes so switching LLM_GPU_LAYERS always triggers a re-measurement on the next startup.

To force a fresh benchmark, delete the appropriate marker file from the project root:

# GPU mode
rm .benchmarked_gpu
# CPU mode
rm .benchmarked_cpu

Configuration

All settings live in config.py (native) and .env (Docker). They share the same values — environment variables in .env override the defaults in config.py.

LLM & Model

Setting	Default	Description
`LLM_MODEL_FILE`	`Llama-3.2-3B-Instruct-Q4_K_M.gguf`	GGUF model filename
`LLM_MODEL_REPO`	`bartowski/Llama-3.2-3B-Instruct-GGUF`	HuggingFace repo
`LLM_THREADS`	`8`	CPU threads for inference
`LLM_CONTEXT`	`4096`	Context window size (GPU: use 32768+)
`LLM_GPU_LAYERS`	`0`	GPU layers (`-1` = all, `0` = CPU only)
`LLM_TEMPERATURE`	`0.7`	Sampling temperature
`LLM_TOP_P`	`0.8`	Nucleus sampling threshold
`LLM_TOP_K`	`20`	Top-K sampling
`LLM_MIN_P`	`0.0`	Min-P sampling
`MAX_TOKENS`	`500`	Max output tokens
`MIN_TOKENS`	`150`	Min output tokens (timeout estimation floor)
`TOKENS_PER_SECOND`	`10.0`	Measured hardware speed (set by benchmark)
`CUDA_AVAILABLE`	`false`	Build llama.cpp with CUDA support

Chunking

Setting	Default	Description
`MAX_CHARS`	`1000`	Max chars per chunk
`OVERLAP_CHARS`	`100`	Chunk overlap in chars
`EMBED_MODEL_NAME`	`sentence-transformers/all-MiniLM-L6-v2`	Embedding model (changing requires full re-ingest)

Chat Memory

Setting	Default	Description
`CHAT_MEMORY_TURNS`	`3`	Previous Q&A exchanges to include per session (0 to disable)

Hybrid Retrieval

Setting	Default	Description
`BM25_WEIGHT`	`0.5`	BM25 fusion weight (0 = dense only, 1 = BM25 only)
`RETRIEVAL_MULTIPLIER`	`4`	Candidates fetched = `top_k × multiplier` before reranking
`RERANKER_MODEL`	`cross-encoder/ms-marco-MiniLM-L-6-v2`	Cross-encoder model for reranking
`RERANKER_ENABLED`	`true`	Allow users to enable reranking (set to `false` to disable globally)

Note: LLM sampling parameters (temperature, top_p, top_k, min_p) and max tokens can be adjusted per-query from the web UI sidebar. The .env values are the defaults.

Hybrid Search

Retrieval uses a two-stage pipeline with user-selectable modes:

Dense — Qdrant cosine similarity search only (fastest).

Hybrid (default) — Dense search retrieves top_k × RETRIEVAL_MULTIPLIER candidates, then BM25 scores are computed on those candidates and fused with dense scores using BM25_WEIGHT.

Hybrid + Rerank — Adds a cross-encoder pass after Hybrid. The cross-encoder scores all candidates as (query, text) pairs and re-orders them. Adds 1–3 seconds of latency but improves precision for complex queries.

Users select the search mode via the three buttons in the UI sidebar. The rerank button is greyed out if RERANKER_ENABLED=false on the server.

Chat Memory

Each browser tab gets an independent session (via sessionStorage). The server keeps the last CHAT_MEMORY_TURNS exchanges (default 3) in memory per session, prepending them to each LLM prompt for conversational context.

Sessions expire automatically after 2 hours of inactivity. Clicking "Clear chat" in the UI starts a new session and discards prior history.

Enabling GPU

Using a GPU massively improves generation speed — expect 100–150 tok/s on a 4070 Ti vs 5–15 tok/s on CPU.

Which CUDA version? Install CUDA 12 — it is stable, widely supported, and what llama.cpp officially targets.

Windows (Docker — recommended)

Tested and working with a 4070 Ti on Windows 11 + Docker Desktop.

Make sure you have an up to date NVIDIA driver — download from nvidia.com/drivers
Verify: open CMD and run nvidia-smi
Set in .env:

CUDA_AVAILABLE=true
LLM_GPU_LAYERS=-1

Delete .benchmarked_gpu if it exists (to force re-measurement)
Run start.bat — Docker pulls the CUDA 12 toolkit automatically inside the container, no local CUDA install needed

Linux / WSL2 (Native)

Install CUDA 12 toolkit:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_amd64.deb
sudo dpkg -i cuda-keyring_1.1-1_amd64.deb
sudo apt update && sudo apt install cuda-toolkit-12-6
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

Verify: nvcc --version
Delete old llama.cpp build so it rebuilds with CUDA:

rm -rf llama.cpp/build

Set in config.py: LLM_GPU_LAYERS = -1
Run ./run.sh

WSL2 users: First install the NVIDIA WSL2 driver on Windows, then follow the steps above inside WSL.

Linux / WSL2 (Docker)

Follow the CUDA toolkit install steps above
Set in .env:

CUDA_AVAILABLE=true
LLM_GPU_LAYERS=-1

Delete .benchmarked_gpu to re-measure speed with GPU
Run ./run.sh --docker

Managing the Docker Stack

Common operations

docker compose -f docker/docker-compose.yml up -d        # start all services
docker compose -f docker/docker-compose.yml down         # stop all services
docker compose -f docker/docker-compose.yml logs -f      # follow logs from all services
docker compose -f docker/docker-compose.yml logs -f api  # follow logs from a specific service

Or use start.bat / run.sh --docker which handle the -f paths automatically.

When to rebuild

Most changes do NOT require a rebuild — just restart:

Change	Action
`.env` parameters (tokens, threads, context)	Restart: `start.bat` or `./run.sh --docker`
New model	Update `.env`, then restart
New/changed documents in `data_raw/`	Restart — ingest detects changes automatically
Python code changes	`docker compose -f docker/docker-compose.yml build api && start.bat`
Switch GPU layers on/off	Set `LLM_GPU_LAYERS` in `.env`, delete `.benchmarked_gpu` or `.benchmarked_cpu`, restart
Switch CPU build to GPU build	Set `CUDA_AVAILABLE=true` in `.env`, then `start.bat` (full rebuild)
Change embedding model	Update `EMBED_MODEL_NAME` in `.env`, then `docker volume rm local_rag_ingest_state` and restart

Switching models

Edit .env and set the new LLM_MODEL_FILE and LLM_MODEL_REPO
Delete .benchmarked_gpu or .benchmarked_cpu (new model = different speed)
Restart — model_downloader will automatically fetch the new model

Re-ingesting documents

Ingestion runs automatically on startup and only processes changed or new files (tracked via MD5 hashes). Deleted files are automatically detected and removed from the vector store. To force a full re-ingest:

docker volume rm local_rag_ingest_state

API Endpoints

`POST /answer`

Retrieves relevant chunks via hybrid search and generates an answer with the LLM.

curl -X POST http://localhost:8000/answer \
  -H "Content-Type: application/json" \
  -d '{"query": "what is X", "top_k": 5, "mode": "answer", "timeout": 60}'

Full request options:

{
  "query": "what is X",
  "top_k": 5,
  "mode": "answer",
  "timeout": 60,
  "max_tokens": 500,
  "temperature": 0.7,
  "top_p": 0.8,
  "llm_top_k": 20,
  "min_p": 0.0,
  "session_id": "optional-uuid-for-chat-memory",
  "use_bm25": true,
  "use_reranker": false
}

`POST /answer` (search mode)

Returns raw retrieved chunks without calling the LLM. Useful for debugging retrieval.

curl -X POST http://localhost:8000/answer \
  -H "Content-Type: application/json" \
  -d '{"query": "what is X", "top_k": 5, "mode": "search"}'

`POST /search`

Dedicated search endpoint — returns scored chunks with metadata.

`GET /metrics`

Returns per-query analytics history for the current server session (tok/s, prompt tokens, response tokens, response time).

`GET /health`

Returns server status including GPU/CUDA state, model info, measured tok/s, and full server configuration.

Project Structure

.
├── data_raw/                  # Place your PDF and DOCX files here
├── models/                    # GGUF model downloaded here automatically
├── docker/
│   ├── docker-compose.yml     # Full stack orchestration
│   ├── docker-compose.gpu.yml # GPU overlay (merged by start.bat/run.sh)
│   ├── Dockerfile             # CPU image for all Python services
│   └── Dockerfile.cuda        # GPU image (CUDA 12, nvidia base)
├── scripts/
│   ├── benchmark.py           # LLM speed benchmark (Docker)
│   ├── test_speed.py          # Manual LLM speed test (native)
│   └── qdrant_test.py         # Qdrant connectivity test
├── templates/
│   └── index.html             # Browser UI (single file, no build step)
├── config.py                  # All settings (native workflow)
├── .env.example               # All settings (Docker workflow) — copy to .env
├── run.sh                     # Linux/macOS entry point (native or --docker)
├── start.bat                  # Windows entry point (Docker)
├── start.py                   # Native launcher (Qdrant + ingest + LLM + API)
├── ingest.py                  # Parse, chunk, embed, and upsert into Qdrant
├── rag_api.py                 # FastAPI retrieval + LLM answer API
├── download_model.py          # Auto model downloader (Docker)
└── requirements.txt

Testing & Benchmarking RAG Accuracy

Evaluate Retrieval + LLM Quality

The rag_test.py script measures how well your RAG system performs on real documents:

Generates test questions from random chunks in your ingested documents
Tests all 3 search modes (Dense / Hybrid / Hybrid+Rerank) on the same questions
Scores retrieval accuracy (did it find the right source file?) and answer accuracy (is the answer correct?)
Reports results showing retrieval %, answer %, and response time per mode

This is useful for:

Tuning chunking size, overlap, and retrieval settings
Comparing search mode performance on your specific documents
Validating that configuration changes improve quality

Run the test:

Windows (Docker):

test.bat

Linux/macOS (Docker):

chmod +x test.sh
./test.sh

Docker directly:

docker compose -f docker/docker-compose.yml --profile test run rag_test

The script:

Samples random chunks from your collection
Generates 20 Q&A pairs (customizable: --questions 100)
Tests all search modes
Scores answers using the LLM as a judge
Saves results to test_results.json

Sample output:

Note: The test must run against a running stack (start with start.bat or run.sh --docker first). Scores depend heavily on your documents, model, and settings — the screenshot above shows a real test run on sample documents.

Supported Document Formats

Ingest.py supports the following file types:

Format	Handling
PDF	Text extraction per page, cleaned for OCR artifacts
DOCX	Paragraphs grouped by heading styles
PPTX	One section per slide, text from all shapes
XLSX / XLS	One section per sheet, rows formatted as text
CSV	Rows formatted as `key=value` (wide CSVs) or pipe-separated (normal)
JSON	Flattened to `path.to.key: value` for structure-aware search
TXT / MD	Read as-is, chunked by paragraphs

Place files in data_raw/ — they're automatically detected on startup.

Docker & Dependency Changes

Do I need to rebuild Docker if I change requirements.txt?

Yes. If you add or update Python dependencies in requirements.txt:

# Windows
docker compose -f docker/docker-compose.yml build api ingest benchmark
start.bat

# Linux/macOS
docker compose -f docker/docker-compose.yml build api ingest benchmark
./run.sh --docker

The build command rebuilds the images with the new dependencies. You only need to rebuild the services that use the updated code (usually api, ingest, benchmark). The qdrant and model images don't need rebuilding.

Why? Docker images are immutable. Dependencies are baked in at build time. Code changes don't require a rebuild (Docker can reload them), but dependency changes do.

Tips

Search method: Hybrid (Dense + BM25) is the default. Switch to Dense for the fastest results or Hybrid + Rerank for the best precision on complex queries.
Retrieval quality: top_k=5 is a good default. Increase if answers feel incomplete.
Chunk size: MAX_CHARS=1000 works well for most documents. Reduce to 600–700 for documents with many short definitions.
Incremental ingestion: Only changed or new files are re-ingested. Deleted files are automatically removed from the vector store.
LLM sampling: Adjust temperature, top_p, top_k, and min_p from the sidebar in the web UI — changes apply per-query without restarting.
Max response tokens: Use the slider in the UI to control response length. The admin-configured MAX_TOKENS value is the ceiling.
Chat memory: Each browser tab maintains its own session. "Clear chat" in the UI starts a fresh session. Set CHAT_MEMORY_TURNS=0 in .env to disable memory entirely.
Analytics: The sidebar tracks tok/s and response time per query with live sparkline graphs. System stats (CPU, RAM, GPU) show peak values since page load.
Config panel: The UI sidebar shows a read-only view of server admin settings (context window, threads, GPU layers, chunk size, reranker model, etc.).
CPU context warning: If running in CPU mode with a context window above 4096, the UI shows a warning — large contexts are very slow on CPU.
GPU toggle: Set LLM_GPU_LAYERS=-1 or 0 in .env and restart. The benchmark auto-detects the mode and re-runs automatically when you switch.
Re-benchmark: Delete .benchmarked_gpu or .benchmarked_cpu from the project root to force a speed measurement on next startup.

License

MIT — see LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
data_raw		data_raw
data_raw_test		data_raw_test
docker		docker
docs		docs
models		models
scripts		scripts
templates		templates
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.py		config.py
download_model.py		download_model.py
ingest.py		ingest.py
rag_api.py		rag_api.py
requirements.txt		requirements.txt
run.sh		run.sh
start.bat		start.bat
start.py		start.py
test.bat		test.bat
test.sh		test.sh

Folders and files

Latest commit

History

Repository files navigation

Local RAG Server

Web UI

Screenshot

Platform Support

GPU Acceleration

Requirements

Quick Start

1. Clone the repo

2. Add your documents

3. Configure

4. Start everything

Option A — Windows (Docker Desktop)

Option B — Linux / macOS (Docker)

Option C — Linux / macOS (Native)

Startup Optimizations

Configuration

LLM & Model

Chunking

Chat Memory

Hybrid Retrieval

Hybrid Search

Chat Memory

Enabling GPU

Windows (Docker — recommended)

Linux / WSL2 (Native)

Linux / WSL2 (Docker)

Managing the Docker Stack

Common operations

When to rebuild

Switching models

Re-ingesting documents

API Endpoints

POST /answer

POST /answer (search mode)

POST /search

GET /metrics

GET /health

Project Structure

Testing & Benchmarking RAG Accuracy

Evaluate Retrieval + LLM Quality

Supported Document Formats

Docker & Dependency Changes

Do I need to rebuild Docker if I change requirements.txt?

Tips

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`POST /answer`

`POST /answer` (search mode)

`POST /search`

`GET /metrics`

`GET /health`

Packages