A fully local retrieval-augmented generation (RAG) system for querying PDF and DOCX documents using a local LLM. Designed for multi-user deployment on corporate networks — no cloud, no API keys.
PDF / DOCX files
|
v
Text extraction & cleaning
|
v
Chunking (overlapping, paragraph-aware)
|
v
all-MiniLM-L6-v2 embeddings (384-dim, CPU friendly)
|
v
Qdrant (local vector store)
|
v
Hybrid retrieval (Dense + BM25 + optional cross-encoder reranker)
|
v
FastAPI + llama.cpp → Browser UI
graph TD
%% Node Definitions
User((User))
UI[Web Interface]
API[FastAPI Backend]
subgraph Storage [Data Storage]
Raw[data_raw folder]
Qdrant[(Qdrant Vector DB)]
end
subgraph Logic [Processing Engine]
Parse[Document Parser]
Check{GPU Enabled?}
LLM[llama.cpp Server]
end
%% Flow
User -->|1. Place Docs| Raw
User -->|2. Configure| Env[.env]
User -->|3. Chat| UI
UI <-->|HTTP| API
API -->|Scan| Raw
Raw --> Parse -->|Embed| Qdrant
API -->|Hybrid Retrieve| Qdrant
Qdrant -->|Context| LLM
Env -->|Read| Check
Check -->|Yes| GPU[NVIDIA CUDA]
Check -->|No| CPU[System CPU]
GPU --> LLM
CPU --> LLM
LLM -->|Response| UI
%% Styling
style User fill:#f9f,stroke:#333,stroke-width:2px
style Qdrant fill:#00d2ff,stroke:#333,stroke-width:2px
style GPU fill:#76b900,stroke:#333,stroke-width:2px,color:#fff
style UI fill:#fff,stroke:#333,stroke-dasharray: 5 5
Once running, the web UI is accessible at http://localhost:8000 from the machine running the stack.
Accessing from other devices on your network: If you're running Local RAG on a server or another machine, access it via the host's IP address:
http://192.168.x.x:8000
Exposing to the internet: Use a reverse proxy like Caddy or nginx in front of port 8000, or use a tunnel like Cloudflare Tunnel for zero-config public access.
Warning: There is no authentication built in — if exposing publicly, put it behind a password-protected reverse proxy.
| Platform | Docker | Native |
|---|---|---|
| Linux | ./run.sh --docker | ./run.sh |
| macOS | ./run.sh --docker | ./run.sh |
| Windows | start.bat | WSL2 only |
Windows users: Just install Docker Desktop and double-click start.bat — no WSL2, Python, or manual setup needed.
Linux/macOS native: Requires Docker only for Qdrant. Everything else runs directly on your machine via ./run.sh.
| Hardware | Expected Speed |
|---|---|
| CPU only (4-8B model) | 3–15 tok/s |
| NVIDIA GPU (4070 Ti) | 100–150 tok/s |
| NVIDIA GPU (3090/4090) | 150–200 tok/s |
GPU setup instructions are in the Enabling GPU section below.
- Python 3.10+
- Docker
- Git, CMake, and a C++ compiler (native workflow only)
On Ubuntu:
sudo apt install build-essential cmake git docker.io python3-venv python3-full
sudo usermod -aG docker $USER && newgrp docker
git clone https://github.com/burnoutmonk/local-rag-server.git
cd local-rag-server
Place PDF and DOCX files in the data_raw/ folder.
cp .env.example .env
Open .env and set at minimum:
- LLM_MODEL_FILE: filename of your GGUF model (downloaded automatically)
- LLM_MODEL_REPO: HuggingFace repo to download from
- LLM_THREADS: number of CPU cores to use
- LLM_GPU_LAYERS: set to -1 to use GPU, 0 for CPU only
Install Docker Desktop then double-click start.bat or run:
start.bat
This starts all services in the background, waits until everything is ready, and opens your browser automatically at http://localhost:8000.
To stop:
docker compose down
To enable GPU:
# in .env
CUDA_AVAILABLE=true
LLM_GPU_LAYERS=-1
Then re-run start.bat.
cp .env.example .env # adjust if needed
./run.sh --docker
Opens your browser automatically when ready. Same GPU settings as above via .env.
To stop:
docker compose down
Runs everything directly on your machine without Docker (except Qdrant). Easier to debug and faster iteration during development.
chmod +x run.sh
./run.sh
This automatically:
- Creates a Python virtual environment and installs dependencies
- Builds llama.cpp from source (first run only; takes 10–20 minutes)
- Downloads the model from HuggingFace
- Starts Qdrant via Docker
- Ingests documents from data_raw/
- Measures LLM speed and updates config.py
- Starts the web UI at http://localhost:8000
Press Ctrl+C to stop all services.
Useful flags:
./run.sh --skip-ingest # skip ingestion if documents haven't changed
./run.sh --skip-llm # skip LLM server if already running
The Docker stack is optimized for fast restarts when nothing has changed:
| Service | First run | Subsequent restarts |
|---|---|---|
| model_downloader | Downloads GGUF from HuggingFace | Shell test -f; exits in <1s |
| ingest | Parses, chunks, and embeds all documents | Hash check only; skips embedding model load entirely |
| benchmark | Waits for LLM, measures tok/s | Sees .benchmarked_cpu or .benchmarked_gpu marker; exits in <1s |
The benchmark uses separate marker files for CPU and GPU modes so switching LLM_GPU_LAYERS always triggers a re-measurement on the next startup.
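As a rough illustration of the marker-file idea, the check could look like the sketch below. The function names are hypothetical. The rule that LLM_GPU_LAYERS=0 means CPU mode and any other value means GPU mode is an assumption taken from the setting's description, not from benchmark.py itself.

```python
from pathlib import Path


def marker_name(gpu_layers: int) -> str:
    """Pick the marker file for the current mode.

    Assumption: 0 GPU layers means CPU mode, anything else means GPU mode.
    """
    return ".benchmarked_cpu" if gpu_layers == 0 else ".benchmarked_gpu"


def needs_benchmark(project_root: Path, gpu_layers: int) -> bool:
    """True when no marker exists yet for the active mode."""
    return not (project_root / marker_name(gpu_layers)).exists()


def mark_benchmarked(project_root: Path, gpu_layers: int) -> None:
    """Create the marker so later startups in this mode skip the benchmark."""
    (project_root / marker_name(gpu_layers)).touch()
```

Because each mode has its own file, a finished GPU benchmark does not hide a missing CPU measurement, and the reverse also holds.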
To force a fresh benchmark, delete the appropriate marker file from the project root:
# GPU mode
rm .benchmarked_gpu
# CPU mode
rm .benchmarked_cpu
All settings live in config.py (native) and .env (Docker). They share the same values: environment variables in .env override the defaults in config.py.
| Setting | Default | Description |
|---|---|---|
| LLM_MODEL_FILE | Llama-3.2-3B-Instruct-Q4_K_M.gguf | GGUF model filename |
| LLM_MODEL_REPO | bartowski/Llama-3.2-3B-Instruct-GGUF | HuggingFace repo |
| LLM_THREADS | 8 | CPU threads for inference |
| LLM_CONTEXT | 4096 | Context window size (GPU: use 32768+) |
| LLM_GPU_LAYERS | 0 | GPU layers (-1 = all, 0 = CPU only) |
| LLM_TEMPERATURE | 0.7 | Sampling temperature |
| LLM_TOP_P | 0.8 | Nucleus sampling threshold |
| LLM_TOP_K | 20 | Top-K sampling |
| LLM_MIN_P | 0.0 | Min-P sampling |
| MAX_TOKENS | 500 | Max output tokens |
| MIN_TOKENS | 150 | Min output tokens (timeout estimation floor) |
| TOKENS_PER_SECOND | 10.0 | Measured hardware speed (set by benchmark) |
| CUDA_AVAILABLE | false | Build llama.cpp with CUDA support |
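The override rule (environment variables win over the defaults in config.py) can be sketched as follows. This is a minimal illustration and not the project's actual config.py. The DEFAULTS dict and the load_setting helper are hypothetical, and only a few settings from the table are included.

```python
import os

# A subset of the defaults from the table above (illustrative only).
DEFAULTS = {
    "LLM_THREADS": 8,
    "LLM_CONTEXT": 4096,
    "LLM_GPU_LAYERS": 0,
    "LLM_TEMPERATURE": 0.7,
    "CUDA_AVAILABLE": False,
}


def load_setting(name, environ=os.environ):
    """Return the environment value for `name` if set, else the default.

    The environment string is converted to the type of the default.
    """
    default = DEFAULTS[name]
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    # bool must be checked first: bool("false") would be True.
    if isinstance(default, bool):
        return raw.strip().lower() in ("1", "true", "yes")
    return type(default)(raw)
```

For example, load_setting("LLM_THREADS", {"LLM_THREADS": "16"}) returns 16, while an empty mapping falls back to 8.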
| Setting | Default | Description |
|---|---|---|
| MAX_CHARS | 1000 | Max chars per chunk |
| OVERLAP_CHARS | 100 | Chunk overlap in chars |
| EMBED_MODEL_NAME | sentence-transformers/all-MiniLM-L6-v2 | Embedding model (changing requires full re-ingest) |
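To make MAX_CHARS and OVERLAP_CHARS concrete, here is one way paragraph-aware, overlapping chunking could work. It is a sketch under assumptions and not the code in ingest.py. It assumes paragraphs are separated by blank lines and hard-splits oversized paragraphs so the overlap always fits. The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap_chars: int = 100) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars characters.

    Each new chunk starts with the last overlap_chars characters of the
    previous chunk, so context is carried across chunk boundaries.
    """
    sep = "\n\n"
    # Largest piece that still leaves room for the overlap plus a separator.
    budget = max_chars - overlap_chars - len(sep)
    if overlap_chars < 0 or budget <= 0:
        raise ValueError("max_chars must exceed overlap_chars + 2")

    # Split on blank lines; hard-split any paragraph longer than the budget.
    pieces: list[str] = []
    for para in (p.strip() for p in text.split(sep)):
        pieces.extend(para[i:i + budget] for i in range(0, len(para), budget))

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The chunk is full: emit it and seed the next one with its tail.
        chunks.append(current)
        tail = current[-overlap_chars:] if overlap_chars else ""
        current = f"{tail}{sep}{piece}" if tail else piece
    if current:
        chunks.append(current)
    return chunks
```

With the defaults, three 300-character paragraphs fit in one chunk, and a fourth starts a new chunk that begins with the last 100 characters of the previous one.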
| Setting | Default | Description |
|---|---|---|
| CHAT_MEMORY_TURNS | 3 | Previous Q&A exchanges to include per session (0 to disable) |
| Setting | Default | Description |
|---|---|---|
| BM25_WEIGHT | 0.5 | BM25 fusion weight (0 = dense only, 1 = BM25 only) |
| RETRIEVAL_MULTIPLIER | 4 | Candidates fetched = top_k × multiplier before reranking |
| RERANKER_MODEL | cross-encoder/ms-marco-MiniLM-L-6-v2 | Cross-encoder model for reranking |
| RERANKER_ENABLED | true | Allow users to enable reranking (set to false to disable globally) |
Note: LLM sampling parameters (temperature, top_p, top_k, min_p) and max tokens can be adjusted per-query from the web UI sidebar. The .env values are the defaults.
Retrieval uses a two-stage pipeline with user-selectable modes:
Dense — Qdrant cosine similarity search only (fastest).
Hybrid (default) — Dense search retrieves top_k × RETRIEVAL_MULTIPLIER candidates, then BM25 scores are computed on those candidates and fused with dense scores using BM25_WEIGHT.
Hybrid + Rerank — Adds a cross-encoder pass after Hybrid. The cross-encoder scores all candidates as (query, text) pairs and re-orders them. Adds 1–3 seconds of latency but improves precision for complex queries.
Users select the search mode via the three buttons in the UI sidebar. The rerank button is greyed out if RERANKER_ENABLED=false on the server.
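The Hybrid fusion step can be sketched as a weighted sum of dense and BM25 scores over the candidate set. With the defaults, top_k=5 and RETRIEVAL_MULTIPLIER=4 give 20 candidates. The min-max normalization below is an assumption, since the text does not say how rag_api.py puts the two score scales on a common footing. The function names are hypothetical.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; identical scores all map to 0.0."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense_scores: list[float], bm25_scores: list[float],
         bm25_weight: float = 0.5, top_k: int = 5) -> list[int]:
    """Return candidate indices ranked by the fused score, best first.

    bm25_weight=0 ranks by dense score only; bm25_weight=1 by BM25 only.
    """
    if len(dense_scores) != len(bm25_scores):
        raise ValueError("both score lists must cover the same candidates")
    dense = min_max(dense_scores)
    bm25 = min_max(bm25_scores)
    fused = [(1 - bm25_weight) * d + bm25_weight * b for d, b in zip(dense, bm25)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return ranked[:top_k]
```

Normalizing first matters because cosine similarities and raw BM25 scores live on different scales. Without it, the weight would not behave like the 0-to-1 dial described in the settings table.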
Each browser tab gets an independent session (via sessionStorage). The server keeps the last CHAT_MEMORY_TURNS exchanges (default 3) in memory per session, prepending them to each LLM prompt for conversational context.
Sessions expire automatically after 2 hours of inactivity. Clicking "Clear chat" in the UI starts a new session and discards prior history.
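A minimal in-memory store with these properties (a bounded history per session and expiry after inactivity) might look like the sketch below. The class and method names are hypothetical, and the server's real data structures may differ. Here only add_turn counts as activity, which is an assumption.

```python
import time
from collections import deque


class SessionMemory:
    """Keeps the last `max_turns` (question, answer) pairs per session."""

    def __init__(self, max_turns: int = 3, ttl_seconds: float = 2 * 60 * 60,
                 clock=time.monotonic):
        self.max_turns = max_turns
        self.ttl = ttl_seconds
        self.clock = clock
        self._sessions = {}  # session_id -> (last_seen, deque of turns)

    def _purge(self) -> None:
        """Drop sessions that have been idle for longer than the TTL."""
        now = self.clock()
        expired = [sid for sid, (seen, _) in self._sessions.items()
                   if now - seen > self.ttl]
        for sid in expired:
            del self._sessions[sid]

    def history(self, session_id: str) -> list:
        """Turns to prepend to the next prompt, oldest first."""
        self._purge()
        entry = self._sessions.get(session_id)
        return list(entry[1]) if entry else []

    def add_turn(self, session_id: str, question: str, answer: str) -> None:
        """Record one exchange; max_turns=0 disables memory entirely."""
        self._purge()
        if self.max_turns <= 0:
            return
        _, turns = self._sessions.get(
            session_id, (None, deque(maxlen=self.max_turns)))
        turns.append((question, answer))  # deque drops the oldest when full
        self._sessions[session_id] = (self.clock(), turns)
```

"Clear chat" needs no server-side call in this model: the client generates a fresh session ID, and the old entry ages out on its own.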
Using a GPU speeds up generation by roughly an order of magnitude: expect 100–150 tok/s on a 4070 Ti versus 3–15 tok/s on CPU.
Which CUDA version? Install CUDA 12 — it is stable, widely supported, and what llama.cpp officially targets.
Tested and working with a 4070 Ti on Windows 11 + Docker Desktop.
- Make sure you have an up-to-date NVIDIA driver (download from nvidia.com/drivers)
- Verify: open CMD and run nvidia-smi
- Set in .env:
CUDA_AVAILABLE=true
LLM_GPU_LAYERS=-1
- Delete .benchmarked_gpu if it exists (to force re-measurement)
- Run start.bat. Docker pulls the CUDA 12 toolkit automatically inside the container, so no local CUDA install is needed
- Install CUDA 12 toolkit:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_amd64.deb
sudo dpkg -i cuda-keyring_1.1-1_amd64.deb
sudo apt update && sudo apt install cuda-toolkit-12-6
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
- Verify: nvcc --version
- Delete old llama.cpp build so it rebuilds with CUDA:
rm -rf llama.cpp/build
- Set in config.py: LLM_GPU_LAYERS = -1
- Run ./run.sh
WSL2 users: First install the NVIDIA WSL2 driver on Windows, then follow the steps above inside WSL.
- Follow the CUDA toolkit install steps above
- Set in .env:
CUDA_AVAILABLE=true
LLM_GPU_LAYERS=-1
- Delete .benchmarked_gpu to re-measure speed with GPU
- Run ./run.sh --docker
docker compose -f docker/docker-compose.yml up -d # start all services
docker compose -f docker/docker-compose.yml down # stop all services
docker compose -f docker/docker-compose.yml logs -f # follow logs from all services
docker compose -f docker/docker-compose.yml logs -f api # follow logs from a specific service
Or use start.bat / run.sh --docker which handle the -f paths automatically.
Most changes do NOT require a rebuild — just restart:
| Change | Action |
|---|---|
| .env parameters (tokens, threads, context) | Restart: start.bat or ./run.sh --docker |
| New model | Update .env, then restart |
| New/changed documents in data_raw/ | Restart; ingest detects changes automatically |
| Python code changes | docker compose -f docker/docker-compose.yml build api && start.bat |
| Switch GPU layers on/off | Set LLM_GPU_LAYERS in .env, delete .benchmarked_gpu or .benchmarked_cpu, restart |
| Switch CPU build to GPU build | Set CUDA_AVAILABLE=true in .env, then start.bat (full rebuild) |
| Change embedding model | Update EMBED_MODEL_NAME in .env, then docker volume rm local_rag_ingest_state and restart |
- Edit .env and set the new LLM_MODEL_FILE and LLM_MODEL_REPO
- Delete .benchmarked_gpu or .benchmarked_cpu (new model = different speed)
- Restart; model_downloader will automatically fetch the new model
Ingestion runs automatically on startup and only processes changed or new files (tracked via MD5 hashes). Deleted files are automatically detected and removed from the vector store. To force a full re-ingest:
docker volume rm local_rag_ingest_state
Retrieves relevant chunks via hybrid search and generates an answer with the LLM.
curl -X POST http://localhost:8000/answer \
-H "Content-Type: application/json" \
-d '{"query": "what is X", "top_k": 5, "mode": "answer", "timeout": 60}'
Full request options:
{
"query": "what is X",
"top_k": 5,
"mode": "answer",
"timeout": 60,
"max_tokens": 500,
"temperature": 0.7,
"top_p": 0.8,
"llm_top_k": 20,
"min_p": 0.0,
"session_id": "optional-uuid-for-chat-memory",
"use_bm25": true,
"use_reranker": false
}
Returns raw retrieved chunks without calling the LLM. Useful for debugging retrieval.
curl -X POST http://localhost:8000/answer \
-H "Content-Type: application/json" \
-d '{"query": "what is X", "top_k": 5, "mode": "search"}'
Dedicated search endpoint that returns scored chunks with metadata.
Returns per-query analytics history for the current server session (tok/s, prompt tokens, response tokens, response time).
Returns server status including GPU/CUDA state, model info, measured tok/s, and full server configuration.
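For scripted use, the /answer request shown earlier can also be sent from Python using only the standard library. This is a sketch. The helper names are hypothetical, and because the response schema is not documented here, the parsed JSON is returned as-is rather than unpacked.

```python
import json
import urllib.request


def build_answer_request(query: str, base_url: str = "http://localhost:8000",
                         **options) -> urllib.request.Request:
    """Build a POST /answer request; `options` are extra request fields
    such as use_reranker, max_tokens, or session_id."""
    payload = {"query": query, "top_k": 5, "mode": "answer", **options}
    return urllib.request.Request(
        f"{base_url}/answer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, timeout: int = 60, **options) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_answer_request(query, timeout=timeout, **options)
    # Allow a little slack beyond the server-side timeout.
    with urllib.request.urlopen(request, timeout=timeout + 5) as response:
        return json.load(response)
```

Passing mode="search" to ask() mirrors the second curl example and skips the LLM call.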
.
├── data_raw/ # Place your PDF and DOCX files here
├── models/ # GGUF model downloaded here automatically
├── docker/
│ ├── docker-compose.yml # Full stack orchestration
│ ├── docker-compose.gpu.yml # GPU overlay (merged by start.bat/run.sh)
│ ├── Dockerfile # CPU image for all Python services
│ └── Dockerfile.cuda # GPU image (CUDA 12, nvidia base)
├── scripts/
│ ├── benchmark.py # LLM speed benchmark (Docker)
│ ├── test_speed.py # Manual LLM speed test (native)
│ └── qdrant_test.py # Qdrant connectivity test
├── templates/
│ └── index.html # Browser UI (single file, no build step)
├── config.py # All settings (native workflow)
├── .env.example # All settings (Docker workflow) — copy to .env
├── run.sh # Linux/macOS entry point (native or --docker)
├── start.bat # Windows entry point (Docker)
├── start.py # Native launcher (Qdrant + ingest + LLM + API)
├── ingest.py # Parse, chunk, embed, and upsert into Qdrant
├── rag_api.py # FastAPI retrieval + LLM answer API
├── download_model.py # Auto model downloader (Docker)
└── requirements.txt
The rag_test.py script measures how well your RAG system performs on real documents:
- Generates test questions from random chunks in your ingested documents
- Tests all 3 search modes (Dense / Hybrid / Hybrid+Rerank) on the same questions
- Scores retrieval accuracy (did it find the right source file?) and answer accuracy (is the answer correct?)
- Reports results showing retrieval %, answer %, and response time per mode
This is useful for:
- Tuning chunking size, overlap, and retrieval settings
- Comparing search mode performance on your specific documents
- Validating that configuration changes improve quality
Run the test:
Windows (Docker):
test.bat
Linux/macOS (Docker):
chmod +x test.sh
./test.sh
Docker directly:
docker compose -f docker/docker-compose.yml --profile test run rag_test
The script:
- Samples random chunks from your collection
- Generates 20 Q&A pairs (customizable: --questions 100)
- Tests all search modes
- Scores answers using the LLM as a judge
- Saves results to test_results.json
Sample output:
Note: The test must run against a running stack (start with start.bat or run.sh --docker first). Scores depend heavily on your documents, model, and settings — the screenshot above shows a real test run on sample documents.
ingest.py supports the following file types:
| Format | Handling |
|---|---|
| PDF | Text extraction per page, cleaned for OCR artifacts |
| DOCX | Paragraphs grouped by heading styles |
| PPTX | One section per slide, text from all shapes |
| XLSX / XLS | One section per sheet, rows formatted as text |
| CSV | Rows formatted as key=value (wide CSVs) or pipe-separated (normal) |
| JSON | Flattened to path.to.key: value for structure-aware search |
| TXT / MD | Read as-is, chunked by paragraphs |
Place files in data_raw/ — they're automatically detected on startup.
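The JSON row in the table above describes flattening to path.to.key: value lines. A sketch of that transformation follows. Using list indices as path components is an assumption, since the table does not say how arrays are handled, and flatten_json is a hypothetical name.

```python
def flatten_json(value, prefix: str = "") -> list[str]:
    """Flatten nested JSON data into 'path.to.key: value' lines."""
    lines: list[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            lines.extend(flatten_json(child, path))
    elif isinstance(value, list):
        # Assumption: list items are addressed by their index.
        for index, child in enumerate(value):
            path = f"{prefix}.{index}" if prefix else str(index)
            lines.extend(flatten_json(child, path))
    else:
        lines.append(f"{prefix}: {value}")
    return lines
```

Keeping the full key path on every line means each chunk still carries its position in the document structure after the text is split up.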
Yes. If you add or update Python dependencies in requirements.txt:
# Windows
docker compose -f docker/docker-compose.yml build api ingest benchmark
start.bat
# Linux/macOS
docker compose -f docker/docker-compose.yml build api ingest benchmark
./run.sh --docker
The build command rebuilds the images with the new dependencies. You only need to rebuild the services that use the updated code (usually api, ingest, benchmark). The qdrant and model images don't need rebuilding.
Why? Docker images are immutable. Dependencies are baked in at build time. Code changes don't require a rebuild (Docker can reload them), but dependency changes do.
- Search method: Hybrid (Dense + BM25) is the default. Switch to Dense for the fastest results or Hybrid + Rerank for the best precision on complex queries.
- Retrieval quality: top_k=5 is a good default. Increase if answers feel incomplete.
- Chunk size: MAX_CHARS=1000 works well for most documents. Reduce to 600–700 for documents with many short definitions.
- Incremental ingestion: Only changed or new files are re-ingested. Deleted files are automatically removed from the vector store.
- LLM sampling: Adjust temperature, top_p, top_k, and min_p from the sidebar in the web UI; changes apply per-query without restarting.
- Max response tokens: Use the slider in the UI to control response length. The admin-configured MAX_TOKENS value is the ceiling.
- Chat memory: Each browser tab maintains its own session. "Clear chat" in the UI starts a fresh session. Set CHAT_MEMORY_TURNS=0 in .env to disable memory entirely.
- Analytics: The sidebar tracks tok/s and response time per query with live sparkline graphs. System stats (CPU, RAM, GPU) show peak values since page load.
- Config panel: The UI sidebar shows a read-only view of server admin settings (context window, threads, GPU layers, chunk size, reranker model, etc.).
- CPU context warning: If running in CPU mode with a context window above 4096, the UI shows a warning, because large contexts are very slow on CPU.
- GPU toggle: Set LLM_GPU_LAYERS=-1 or 0 in .env and restart. The benchmark auto-detects the mode and re-runs automatically when you switch.
- Re-benchmark: Delete .benchmarked_gpu or .benchmarked_cpu from the project root to force a speed measurement on next startup.
MIT — see LICENSE

