Bandwidth-friendly Ollama proxy with HuggingFace integration and download queuing.
Queue model downloads for off-peak hours. Supports both Ollama library models and HuggingFace GGUF models with automatic conversion.
- 🕐 Scheduled Downloads - Queue models for off-peak download (default: 10 PM)
- 🤗 HuggingFace Integration - Download GGUF models directly from HuggingFace
- 🔄 Auto-Conversion - Automatically converts HuggingFace models to Ollama format
- 📊 Interactive CLI - User-friendly menu for managing models
- 🔒 Rate Limiting - Prevent abuse with per-IP daily limits
- 💾 Disk Monitoring - Automatic disk space checks before downloads
- 🔌 Transparent Proxy - Drop-in replacement for Ollama API
git clone https://github.com/wildwasser/ohhhllama.git
cd ohhhllama
sudo ./install.sh

The installer will:
- Install Docker (if not present)
- Set up Ollama in a Docker container
- Install the ohhhllama proxy service
- Set up the download queue timer
- Install the HuggingFace integration module
ohhhllama

This opens an interactive menu where you can:
- View system status
- Queue Ollama models
- Queue HuggingFace models
- View/manage the download queue
- List and remove installed models
- View logs
ohhhllama --status

Ollama models:
curl http://localhost:11434/api/pull -d '{"name": "llama3:8b"}'

HuggingFace models:
curl http://localhost:11434/api/hf/queue -d '{"repo_id": "TheBloke/Mistral-7B-v0.1-GGUF"}'

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client/App    │────▶│ ohhhllama Proxy │────▶│ Ollama (Docker) │
│  (port 11434)   │     │  (port 11434)   │     │  (port 11435)   │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │  SQLite Queue   │
                        │    Database     │
                        └────────┬────────┘
                                 │
                                 ▼ (scheduled)
                        ┌─────────────────┐
                        │ Queue Processor │
                        │ (systemd timer) │
                        └─────────────────┘
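
The diagram suggests a simple pattern: the proxy records a requested model in SQLite and returns immediately, and a scheduled job drains the table later. The following is a minimal sketch of that pattern only. The table name, columns, and function names are assumptions made for illustration; the actual layout of queue.db may differ.

```python
import sqlite3

# Hypothetical schema for illustration; not taken from the project.
SCHEMA = """
CREATE TABLE IF NOT EXISTS queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'pending'
)
"""


def open_queue(path=":memory:"):
    """Open (or create) the queue database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, model):
    """Record a model for later download; return True if it was newly queued."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO queue (model) VALUES (?)", (model,)
        )
    return cur.rowcount == 1


def pending(conn):
    """Return pending model names in the order they were queued."""
    rows = conn.execute(
        "SELECT model FROM queue WHERE status = 'pending' ORDER BY id"
    )
    return [row[0] for row in rows]


def mark_done(conn, model):
    """Mark a model as processed so the next scheduled run skips it."""
    with conn:
        conn.execute(
            "UPDATE queue SET status = 'done' WHERE model = ?", (model,)
        )
```

The UNIQUE constraint combined with INSERT OR IGNORE makes repeated requests for the same model harmless, which is one plausible way to keep a queue free of duplicates.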
Configuration file: /opt/ohhhllama/ohhhllama.conf
# Ollama backend URL (internal)
OLLAMA_BACKEND=http://127.0.0.1:11435
# Proxy listen port
LISTEN_PORT=11434
# Queue database path
DB_PATH=/var/lib/ohhhllama/queue.db
# Rate limit (requests per IP per day)
RATE_LIMIT=5
# Disk monitoring
DISK_PATH=/data/ollama
DISK_THRESHOLD=90
# HuggingFace settings
HF_CACHE_DIR=/data/huggingface

All standard Ollama API endpoints are proxied transparently:
- GET /api/tags - List models
- POST /api/generate - Generate text
- POST /api/chat - Chat completion
- POST /api/pull - Pull model (queued for off-peak)
- DELETE /api/delete - Delete model
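
The list above implies one routing decision: POST /api/pull is deferred to the queue, while everything else is forwarded to the Ollama backend. A rough sketch of that decision is shown here. The function names and the path normalization are assumptions, not the logic in proxy.py; the backend URL is the value from the sample configuration.

```python
OLLAMA_BACKEND = "http://127.0.0.1:11435"  # from the sample configuration


def classify(method, path):
    """Return 'queue' for requests deferred to off-peak, else 'forward'."""
    # Assumption: ignore any query string and trailing slash when matching.
    clean_path = path.split("?", 1)[0].rstrip("/")
    if (method.upper(), clean_path) == ("POST", "/api/pull"):
        return "queue"
    return "forward"


def backend_url(path):
    """Build the URL a forwarded request would be sent to."""
    return OLLAMA_BACKEND.rstrip("/") + path
```

For example, classify("GET", "/api/tags") yields "forward" and the request would go to backend_url("/api/tags").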
GET /api/queue
Returns queue status and pending downloads.

GET /api/health
Returns system health including disk space and service status.
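
The health endpoint reports disk space, and the configuration file provides DISK_PATH and DISK_THRESHOLD. The sketch below shows one way such a check could work: parse the KEY=VALUE configuration, then compare disk usage against the threshold. It assumes DISK_THRESHOLD is a percentage of space used, which the configuration does not state explicitly, and the function names are hypothetical.

```python
import shutil


def parse_config(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def percent_used(total, used):
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def disk_ok(path, threshold):
    """Return True while usage of the filesystem at path is below threshold.

    Assumption: threshold is a percentage of space used.
    """
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) < threshold
```

With the sample configuration, the check would be called as disk_ok(conf["DISK_PATH"], float(conf["DISK_THRESHOLD"])), where conf is the result of parse_config.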
POST /api/hf/queue
Content-Type: application/json
{
"repo_id": "TheBloke/Llama-2-7B-GGUF",
"quant": "Q4_K_M", # Optional, default: Q4_K_M
"name": "my-llama" # Optional, custom Ollama model name
}

- GGUF Repositories (recommended)
  - Pre-quantized models ready for Ollama
  - Providers: TheBloke, bartowski, QuantFactory, mradermacher
  - Example: TheBloke/Mistral-7B-v0.1-GGUF
- Standard HuggingFace Models
  - Automatically converted to GGUF
  - Requires supported architecture
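
The POST /api/hf/queue request described above can also be built from Python. This sketch uses only the standard library and sends the optional fields only when they are set, so the body is valid JSON without the explanatory comments. The helper name and the base_url default are assumptions; the response format is not documented here, so the sketch stops at building the request.

```python
import json
import urllib.request


def build_hf_queue_request(repo_id, quant=None, name=None,
                           base_url="http://localhost:11434"):
    """Build (but do not send) a POST /api/hf/queue request."""
    payload = {"repo_id": repo_id}
    if quant is not None:
        payload["quant"] = quant  # the proxy defaults to Q4_K_M when omitted
    if name is not None:
        payload["name"] = name  # custom Ollama model name
    return urllib.request.Request(
        base_url + "/api/hf/queue",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with urllib.request.urlopen(build_hf_queue_request("TheBloke/Llama-2-7B-GGUF", name="my-llama")).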
Models with these architectures can be converted:
- LlamaForCausalLM (Llama, Llama 2, Llama 3)
- MistralForCausalLM, MixtralForCausalLM
- Qwen2ForCausalLM
- PhiForCausalLM, Phi3ForCausalLM
- GemmaForCausalLM, Gemma2ForCausalLM
- FalconForCausalLM
- GPT2LMHeadModel, GPTNeoXForCausalLM
- StableLmForCausalLM
- OlmoForCausalLM
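
HuggingFace model repositories usually declare their architecture in the "architectures" list of config.json, so a pre-flight check against the list above might look like the sketch below. How the project itself performs this check is not shown here, so the function name and approach are assumptions.

```python
import json

# Architecture names copied from the list above.
SUPPORTED_ARCHITECTURES = {
    "LlamaForCausalLM",
    "MistralForCausalLM", "MixtralForCausalLM",
    "Qwen2ForCausalLM",
    "PhiForCausalLM", "Phi3ForCausalLM",
    "GemmaForCausalLM", "Gemma2ForCausalLM",
    "FalconForCausalLM",
    "GPT2LMHeadModel", "GPTNeoXForCausalLM",
    "StableLmForCausalLM",
    "OlmoForCausalLM",
}


def is_convertible(config_json_text):
    """Return True if config.json names at least one supported architecture."""
    config = json.loads(config_json_text)
    declared = config.get("architectures") or []
    return any(arch in SUPPORTED_ARCHITECTURES for arch in declared)
```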
| Type | Bits | Quality | Size | Use Case |
|---|---|---|---|---|
| Q8_0 | 8 | Best | Large | Maximum quality |
| Q5_K_M | 5.5 | Better | Medium | Quality-focused |
| Q4_K_M | 4.5 | Good | Small | Recommended default |
| Q3_K_M | 3.4 | Lower | Smaller | Memory constrained |
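
The Bits column gives a quick way to estimate download size: parameters × bits per weight ÷ 8 bytes. The sketch below applies that arithmetic using the values from the table. It is a rough estimate that ignores file metadata and any tensors stored at a different precision, and the function name is hypothetical.

```python
# Approximate bits per weight, taken from the table above.
BITS_PER_WEIGHT = {"Q8_0": 8.0, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q3_K_M": 3.4}


def estimate_size_gb(n_params, quant="Q4_K_M"):
    """Estimate file size in GB (10^9 bytes) for a parameter count."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9
```

For a 7-billion-parameter model, estimate_size_gb(7e9, "Q4_K_M") gives roughly 3.9 GB, compared with 7 GB for Q8_0.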
/opt/ohhhllama/
├── proxy.py # Main proxy server
├── ohhhllama.conf # Configuration
├── scripts/
│ └── process-queue.sh # Queue processor
├── huggingface/
│ ├── hf_backend.py # HuggingFace module
│ ├── requirements.txt
│ └── .venv/ # Python environment
└── ...
/data/
├── ollama/ # Ollama model storage
│ ├── models/
│ └── modelfiles/
└── huggingface/ # HuggingFace cache
└── gguf/ # Downloaded GGUF files
/var/lib/ohhhllama/
└── queue.db # SQLite queue database
# Proxy service
sudo systemctl status ollama-proxy
sudo systemctl restart ollama-proxy
sudo journalctl -u ollama-proxy -f
# Queue timer
sudo systemctl list-timers ollama-queue.timer
sudo systemctl start ollama-queue.service # Process now
# Queue processor logs
sudo journalctl -u ollama-queue.service -n 50

By default, queued downloads run at 10 PM daily. To change:
sudo nano /etc/systemd/system/ollama-queue.timer
sudo systemctl daemon-reload
sudo systemctl restart ollama-queue.timer

Timer format uses systemd calendar syntax:
- OnCalendar=*-*-* 22:00:00 - Daily at 10 PM
- OnCalendar=*-*-* 03:00:00 - Daily at 3 AM
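
To reason about when a daily schedule such as *-*-* 22:00:00 fires next, the calculation can be sketched in a few lines. This is an illustration of the daily case only, not a parser for systemd calendar expressions; it uses naive local datetimes and ignores daylight saving changes.

```python
from datetime import datetime, timedelta


def next_daily_run(now, hour=22, minute=0):
    """Return the next occurrence of hour:minute strictly after now."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed
    return candidate
```

For example, next_daily_run(datetime(2024, 1, 1, 23, 30)) falls on 2 January at 22:00. On the host itself, systemd-analyze calendar "*-*-* 22:00:00" reports how systemd interprets an expression and when it next elapses.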
- Check queue status: ohhhllama → View queue
- Check logs: sudo journalctl -u ollama-queue.service -n 50
- Process manually: sudo systemctl start ollama-queue.service

- Verify venv exists: ls /opt/ohhhllama/huggingface/.venv
- Check disk space: df -h /data
- Test manually:
  /opt/ohhhllama/huggingface/.venv/bin/python3 \
    /opt/ohhhllama/huggingface/hf_backend.py \
    TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF

- Check service: sudo systemctl status ollama-proxy
- Check Ollama container: sudo docker ps | grep ollama
- Restart: sudo systemctl restart ollama-proxy

cd /path/to/ohhhllama
sudo ./uninstall.sh

MIT License - see LICENSE

Contributions welcome! Please open an issue or PR on GitHub.