Run AI models locally for private, offline development without cloud dependencies.
This guide covers Ollama, LM Studio, Open WebUI, local models, and complete offline AI workflows.
- Why Offline Vibe Coding?
- Benefits of Local AI
- Ollama Setup Guide
- LM Studio Guide
- Open WebUI Installation
- Best Local Models
- IDE Integration
- Offline Agent Workflows
- Performance Optimization
- Privacy & Security
- Troubleshooting
- Next Steps
Offline vibe coding means running AI models on your local hardware instead of relying on cloud APIs like Claude or ChatGPT.
| Scenario | Cloud AI | Local AI |
|---|---|---|
| Privacy-sensitive code | ❌ Risk | ✅ Perfect |
| No internet connection | ❌ Won't work | ✅ Works |
| High volume usage | 💰 Expensive | ✅ Free after setup |
| Custom fine-tuning | ❌ Limited | ✅ Full control |
| Latency critical | 🌐 Network delay | ✅ Instant |
| Learning/experimentation | 💰 Cost adds up | ✅ Unlimited |
Your codebase never leaves your machine. Critical for:
- Proprietary algorithms
- Customer data handling
- Security-sensitive projects
- NDAs and compliance requirements
Cloud services have rate limits and costs:
Cloud AI (Claude Pro): $20/month → ~1000 messages
Local AI: One-time hardware cost → Unlimited messages
- Fine-tune models on your codebase
- Create custom system prompts
- Modify model behavior
- Build specialized agents
Work from anywhere:
- Airplanes ✈️
- Remote locations 🏔️
- Network outages ⚡
Break-even calculation:
RTX 4060 Ti 16GB: $400
Monthly cloud AI: $20-200
Break-even: 2-20 months
After that: Pure savings!
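The break-even arithmetic above can be sketched in a few lines of Python. The function name `break_even_months` is illustrative, the figures are the ones quoted above, and running costs such as electricity are ignored.

```python
def break_even_months(hardware_cost: float, monthly_cloud_cost: float) -> float:
    """Months until a one-time hardware purchase equals cumulative cloud spend."""
    if monthly_cloud_cost <= 0:
        raise ValueError("monthly_cloud_cost must be positive")
    return hardware_cost / monthly_cloud_cost


if __name__ == "__main__":
    # $400 GPU compared against the low and high ends of the monthly cloud range
    for monthly in (20, 200):
        months = break_even_months(400, monthly)
        print(f"${monthly}/month cloud spend: break-even after {months:.0f} months")
```

The lighter your cloud usage, the longer the hardware takes to pay for itself, so the 2-month figure only applies to heavy users.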
Ollama is the easiest way to run LLMs locally. It handles:
- Model downloading
- Quantization
- GPU acceleration
- API server
- Simple CLI interface
# macOS
brew install ollama
# Or download from https://ollama.ai
# Windows: download the installer from https://ollama.ai/download
# and run OllamaSetup.exe
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Verify the installation
ollama --version
ollama list
# Start with a small, fast model
ollama pull llama3.2:3b
# Medium-sized general purpose
ollama pull llama3.1:8b
# Coding specialist
ollama pull codellama:7b
# Larger, more capable model (needs 16GB+ VRAM)
ollama pull qwen2.5:14b
# Interactive chat
ollama run llama3.1:8b
# One-shot query
ollama run llama3.1:8b "Explain quantum computing in 3 sentences"
# With role instructions in the prompt (for a persistent system prompt, use a Modelfile)
ollama run llama3.1:8b "You are a senior Python developer. Review this code:"
# Pass code via stdin
cat main.py | ollama run llama3.1:8b "Find bugs in this code"
# Start server (runs on http://localhost:11434)
ollama serve
# API usage
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello!",
  "stream": false
}'

| Model | Size | Min RAM | Speed | Quality | Best For |
|---|---|---|---|---|---|
| Phi-3-mini | 3.8B | 4GB | ⚡⚡⚡ | ⭐⭐⭐ | Quick tasks |
| Llama 3.2 3B | 3B | 4GB | ⚡⚡⚡ | ⭐⭐⭐ | General use |
| Llama 3.1 8B | 8B | 8GB | ⚡⚡ | ⭐⭐⭐⭐ | Balanced |
| CodeLlama 7B | 7B | 8GB | ⚡⚡ | ⭐⭐⭐⭐ | Code generation |
| DeepSeek Coder 6.7B | 6.7B | 8GB | ⚡⚡ | ⭐⭐⭐⭐⭐ | Best coding |
| Qwen 2.5 Coder 7B | 7B | 8GB | ⚡⚡ | ⭐⭐⭐⭐⭐ | Excellent |
| Qwen 2.5 14B | 14B | 16GB | ⚡ | ⭐⭐⭐⭐⭐ | Complex tasks |
| Mistral Large | 123B | 64GB+ | 🐌 | ⭐⭐⭐⭐⭐ | Maximum power |
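The `/api/generate` request shown with curl earlier can also be sent from Python using only the standard library. This is a minimal sketch, assuming the server runs at the default address from this guide and that a non-streaming reply carries the text in a `response` field. The helper names `build_generate_request` and `generate` are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default address used in this guide


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With `ollama serve` running and the model pulled, `print(generate("llama3.2:3b", "Hello!"))` prints the reply. Building the request in its own function lets you inspect the payload without a server.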
# List downloaded models
ollama list
# Show model info
ollama show llama3.1:8b
# Delete a model
ollama rm llama3.1:8b
# Copy a model under a new name
ollama cp llama3.1:8b my-coding-model
# Note: ollama cp copies within Ollama's model store only; it does not export a GGUF file

Create customized versions with a Modelfile:
FROM llama3.1:8b
SYSTEM """
You are an expert software developer specializing in Python and JavaScript.
Always provide clean, production-ready code with comments.
Explain your reasoning step-by-step.
"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9

Build it:
ollama create my-dev-assistant -f ./Modelfile

LM Studio is a user-friendly GUI for running local LLMs with:
- Beautiful interface
- Model discovery built-in
- No command line needed
- Cross-platform support
- Download from https://lmstudio.ai/
- Install for your OS (Windows/Mac/Linux)
- Launch the application
- Click "Discover" tab
- Search for models (e.g., "llama", "codellama")
- Choose quantization level:
  - Q4_K_M: Good balance (recommended)
  - Q5_K_M: Better quality, larger size
  - Q8_0: Best quality, maximum size
- Click Download
- Go to "Chat" tab
- Load your model from dropdown
- Set system prompt in settings
- Start chatting!
- Multiple model comparison: Run side-by-side chats
- Presets: Save different configurations
- Local API server: Expose as OpenAI-compatible endpoint
- Vision models: Support for image analysis
- Context customization: Adjust context window size
LM Studio can act as an OpenAI drop-in replacement:
from openai import OpenAI

# Point the OpenAI client at LM Studio's local server
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed",  # any placeholder works; the client just requires a value
)
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a coding assistant"},
        {"role": "user", "content": "Write a Python function to sort a list"},
    ],
)
print(response.choices[0].message.content)

Open WebUI provides a ChatGPT-like interface for local models with:
- Beautiful web UI
- Conversation history
- Multiple model support
- RAG capabilities
- User management
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Access at: http://localhost:3000

Or install with pip (serves on http://localhost:8080 by default):
pip install open-webui
open-webui serve

- Open WebUI auto-detects Ollama on localhost:11434
- If not, go to Settings → Connections
- Add Ollama URL: http://host.docker.internal:11434 (Docker)
- Refresh models
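If no models appear after refreshing, it can help to confirm that Ollama itself is reachable and has models installed. The sketch below assumes Ollama's `/api/tags` endpoint returns a JSON object with a `models` list whose entries have a `name` field. The helper names are illustrative, and the script is meant to run on the host rather than inside the Open WebUI container.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def fetch_model_names(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> list:
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.loads(resp.read()))
```

An empty list from `fetch_model_names()` means the server is up but no model has been pulled yet. A connection error points at the URL or at Ollama not running.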
- Chat Interface: ChatGPT-like experience
- Document Upload: RAG with PDFs, text files
- Web Search: Enable internet access
- Image Generation: DALL-E integration
- Voice Input/Output: Speech-to-text support
- Multi-user: Role-based access control
- Model Switching: Easy model selection
DeepSeek Coder:
- Sizes: 1.3B, 6.7B, 33B (V1); 16B, 236B (DeepSeek-Coder-V2)
- Strengths: Best overall coding performance
- Languages: 338 programming languages (V2)
- Context: Up to 128K tokens (V2)
ollama pull deepseek-coder:6.7b

Qwen 2.5 Coder:
- Sizes: 1.5B, 7B, 14B, 32B
- Strengths: Excellent code understanding
- Special: Strong in competitive programming
ollama pull qwen2.5-coder:7b

CodeLlama:
- Sizes: 7B, 13B, 34B, 70B
- Strengths: Meta's official coding model
- Variants: Python, Instruct, Base
ollama pull codellama:7b-instruct

Llama 3.2:
- Sizes: 1B, 3B (text); 11B, 90B (vision)
- Strengths: General purpose, good coding
- Best for: Mixed coding + explanation tasks
ollama pull llama3.2:3b

Phi-3 Mini:
- Size: 3.8B
- Strengths: Fastest, runs everywhere
- Best for: Quick tasks, low-end hardware
ollama pull phi3:mini

| Hardware | Recommended Model | Expected Speed |
|---|---|---|
| 4GB RAM | Phi-3-mini, TinyLlama | 30-50 tokens/s |
| 8GB RAM | Llama 3.1 8B, DeepSeek 6.7B | 20-40 tokens/s |
| 16GB RAM | Qwen 2.5 14B, Qwen 2.5 Coder 14B | 10-25 tokens/s |
| 24GB VRAM | Mixtral 8x7B, CodeLlama 34B | 5-15 tokens/s |
| 32GB+ RAM | Llama 3 70B (quantized) | 2-5 tokens/s |
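To sanity-check a row in the table above against your own machine, a rough memory estimate can be computed from parameter count and quantization width. This is a back-of-the-envelope sketch, not a measurement: the 1.2 overhead factor (for the KV cache and runtime buffers) is an assumption, and real quantization formats such as Q4_K_M use slightly more than 4 bits per weight.

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: float,
                             overhead: float = 1.2) -> float:
    """Rough memory estimate: weight bytes times an assumed overhead factor."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits is about 1 GB
    return weight_gb * overhead


if __name__ == "__main__":
    for label, params, bits in [("7B at 4-bit", 7, 4), ("7B at 8-bit", 7, 8), ("14B at 4-bit", 14, 4)]:
        print(f"{label}: ~{estimate_model_memory_gb(params, bits):.1f} GB")
```

If the estimate is close to your total RAM or VRAM, expect swapping or a failed load and pick a smaller model or a lower-bit quantization.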
Continue is the best open-source AI extension for VS Code.
# In VS Code Extensions
Search: "Continue"
Install: Continue - Edit code with AI

~/.continue/config.json:
{
  "models": [
    {
      "title": "Ollama",
      "provider": "ollama",
      "model": "llama3.1:8b"
    },
    {
      "title": "LM Studio",
      "provider": "openai",
      "apiBase": "http://localhost:1234/v1",
      "model": "local-model"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder:6.7b"
  }
}

- Tab Autocomplete: Inline code suggestions
- Chat Sidebar: Ask questions about code
- Edit Commands: Select code → Cmd+I → Describe change
- Code Review: Automatic PR reviews
- Documentation: Generate docs from code
Cursor now supports local models via Ollama:
- Settings → AI → Advanced
- Enable "Use Ollama"
- Set model name
- Enjoy AI features with local models!
For terminal enthusiasts:
require('copilot').setup({
  servers = {
    {
      url = "http://localhost:11434",
      model = "deepseek-coder:6.7b"
    }
  }
})

With local models, you can build agents that:
- Access your file system
- Run commands
- Make API calls
- Work completely offline
Create agent.py:
from ollama import chat


def review_code(file_path):
    """Ask a local model to review the file at file_path."""
    # Read the code
    with open(file_path, 'r') as f:
        code = f.read()
    # Prompt for review
    prompt = f"""Review this code for:
1. Bugs and errors
2. Security vulnerabilities
3. Performance issues
4. Code style problems
5. Suggestions for improvement

Code:
{code}

Provide actionable feedback with examples."""
    # Get AI review
    response = chat(model='llama3.1:8b', messages=[
        {'role': 'user', 'content': prompt}
    ])
    return response['message']['content']


# Usage
review = review_code('main.py')
print(review)


def refactor_file(file_path):
    """Replace the file at file_path with the model's refactored version."""
    with open(file_path, 'r') as f:
        code = f.read()
    prompt = f"""Refactor this code to:
- Improve readability
- Follow SOLID principles
- Add type hints
- Extract functions if too long
- Add error handling

Code:
{code}

Return only the refactored code, no explanations."""
    response = chat(model='codellama:7b', messages=[
        {'role': 'user', 'content': prompt}
    ])
    # Write back (this overwrites the original, so commit or back up first)
    with open(file_path, 'w') as f:
        f.write(response['message']['content'])

Create specialized agents:
from ollama import generate

# Map each role to the model that handles it
agents = {
    'architect': 'qwen2.5:14b',      # High-level design
    'coder': 'deepseek-coder:6.7b',  # Implementation
    'reviewer': 'qwen2.5-coder:7b',  # Code review
    'tester': 'codellama:7b',        # Test generation
}


def ask(role, prompt):
    """Send a prompt to the model assigned to a role and return its text."""
    return generate(model=agents[role], prompt=prompt)['response']


def build_feature(requirement):
    # Architect designs
    design = ask('architect', f"Design: {requirement}")
    # Coder implements
    code = ask('coder', f"Implement: {design}")
    # Reviewer checks
    feedback = ask('reviewer', f"Review: {code}")
    # Tester creates tests
    tests = ask('tester', f"Test this: {code}")
    return {
        'design': design,
        'code': code,
        'feedback': feedback,
        'tests': tests,
    }

# Q4 quantization (recommended)
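ollama pull llama3.1:8b-instruct-q4_K_M
# Q3 for speed
ollama pull llama3.1:8b-instruct-q3_K_M
# Q5 for quality
ollama pull llama3.1:8b-instruct-q5_K_M

Ensure GPU is being used:
# Check during inference
nvidia-smi # Should show ollama process

# /set is typed inside an interactive session, not passed as a prompt
ollama run llama3.1:8b
>>> /set parameter num_ctx 2048
# A smaller context window uses less memory and runs faster; the default depends on your Ollama version

Batch requests instead of making multiple calls: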
# Bad: Sequential
for question in questions:
    ask(question)
# Good: Batch
ask_all(questions)

# Linux
watch -n 1 nvidia-smi
# macOS
sudo powermetrics --samplers gpu_power -i 1000

# Stop Ollama: press Ctrl+C in the terminal running "ollama serve"
# Drop the Linux page cache (frees system RAM, not GPU memory)
echo 3 | sudo tee /proc/sys/vm/drop_caches

import time
from ollama import chat


def benchmark(model_name, prompt="Write a quicksort implementation"):
    """Print and return generation speed in tokens per second."""
    start = time.time()
    response = chat(model=model_name, messages=[
        {'role': 'user', 'content': prompt}
    ])
    elapsed = time.time() - start  # includes model load and prompt processing
    # Ollama reports generated tokens as eval_count; counting
    # whitespace-separated words would undercount tokens
    tokens = response['eval_count']
    speed = tokens / elapsed
    print(f"{model_name}: {speed:.1f} tokens/s")
    return speed


# Test different models
benchmark('phi3:mini')
benchmark('llama3.1:8b')
benchmark('deepseek-coder:6.7b')

| Aspect | Cloud AI | Local AI |
|---|---|---|
| Data transmission | ❌ Sent over internet | ✅ Stays local |
| Training on your data | ❌ Possible | ✅ Never |
| Logging | ❌ Provider logs | ✅ No logs |
| Compliance | ❌ Depends on provider | ✅ Full control |
| Air-gapped | ❌ Impossible | ✅ Possible |
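The "stays local" column only holds if your tools actually point at a local endpoint. A small guard can be run before any code is sent to a model. This is a sketch assuming the endpoint is configured as a URL string, and the function name `is_loopback_url` is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name other than localhost is treated as non-local
```

For example, `is_loopback_url("http://localhost:11434")` is true, while a LAN address such as `http://192.168.1.5:11434` is rejected even though it may be on your own network.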
# Only allow localhost access (ufw applies the first matching rule, so allow before deny)
sudo ufw allow from 127.0.0.1 to any port 11434
sudo ufw deny 11434

# Verify SHA256 checksums
sha256sum ~/.ollama/models/blobs/*
# Each blob is named sha256-<digest>, so the computed hash should match the file name

# Run in container
docker run --gpus all -p 127.0.0.1:11434:11434 ollama/ollama

Enable logging:
OLLAMA_DEBUG=1 ollama serve

Solutions:
- Use smaller quantization (Q4 instead of Q8)
- Close other applications
- Upgrade RAM
- Use faster SSD
Solutions:
# Reduce context (type /set inside the interactive session)
ollama run llama3.1:8b
>>> /set parameter num_ctx 1024
# Use smaller model
ollama pull phi3:mini
# Close other GPU applications

Solutions:
- Ensure GPU is being used (nvidia-smi)
- Try different model (some are optimized better)
- Update Ollama to latest version
- Check thermal throttling
Solutions:
# Check if port is in use
lsof -i :11434
# Kill existing process
killall ollama
# Restart service
sudo systemctl restart ollama

Solutions:
- Use coding-specific models (DeepSeek, CodeLlama)
- Improve your prompts
- Increase temperature slightly (0.7-0.8)
- Provide more context in prompts
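"Improve your prompts" and "provide more context" can be made routine with a small prompt builder. The sketch below is one possible layout, not a required format. The function name `build_code_prompt` and the section labels are illustrative choices.

```python
def build_code_prompt(task: str, code: str, language: str = "python", context: str = "") -> str:
    """Assemble a prompt that states the language, the task, and any extra context."""
    parts = [f"You are working with {language} code.", f"Task: {task}"]
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Code:\n{code}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_code_prompt("Find bugs", "x = 1", context="Part of a CLI tool"))
```

Smaller local models tend to benefit from explicit structure like this, since they infer less from a bare code paste than large cloud models do.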
- Ollama Docs: https://ollama.ai/docs
- GitHub Issues: https://github.com/ollama/ollama/issues
- Discord: https://discord.gg/ollama
- Reddit: r/Ollama
Now that you have local AI set up:
- ✅ Practice: Use local models for daily coding tasks
- ✅ Integrate: Set up Continue extension in VS Code
- ✅ Experiment: Try different models for different tasks
- ✅ Optimize: Benchmark and tune for your hardware
- ✅ Build: Create custom agents for your workflow
- 📚 Read 22-AI Agents for advanced agent patterns
- 🔧 Explore 24-Automation for workflow automation
- 📖 Check 05-AI Tools for tool comparisons
- 💡 Browse 26-Awesome Prompts for prompt examples
💡 Pro Tip: Start with a small model (3-7B) for speed, use larger models (14B+) only when needed. Most coding tasks work great with 7B models!
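The tip above can be turned into a simple routing rule. This is only a sketch: the 4000-character threshold and the `needs_deep_reasoning` flag are assumptions to tune for your own workload, and the default model tags are examples that must already be pulled.

```python
def pick_model(prompt: str, needs_deep_reasoning: bool = False,
               small: str = "qwen2.5-coder:7b", large: str = "qwen2.5-coder:14b",
               max_small_chars: int = 4000) -> str:
    """Prefer the small model; switch to the large one for long or flagged tasks."""
    if needs_deep_reasoning or len(prompt) > max_small_chars:
        return large
    return small
```

The return value can be passed as the `model` argument of whichever client you use, so everyday requests stay fast and only the flagged ones pay for the larger model.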
Made with ❤️ by the Vibe Coding Community