A powerful Go-based OpenAI-compatible API wrapper around llama-server from llama.cpp, designed as a replacement for Ollama. It supports AI Skills, Tools, MCP (Model Context Protocol) servers, streaming responses, and automatic model discovery.
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints
- Automatic Model Discovery: Scans ~/.cache/llama.cpp (or a custom directory) for GGUF models
- Streaming Support: Full SSE streaming for chat completions
- Multi-Model: Automatically manages multiple llama-server instances
- CORS Enabled: Ready for browser-based applications
- API Authentication: Optional API key authentication (default: NOOLLAMA)
- Model Auto-Unloading: Configurable timeout for unused model cleanup (default: 5 minutes)
- Multimodal Support: Vision models with image input capabilities
- 🎯 AI Skills System: Domain-specific skill injection with vector search routing
- Create custom skills in Markdown with frontmatter metadata
- Automatic skill selection via semantic similarity + LLM routing
- Skills inject specialized instructions into system prompts
- Works seamlessly alongside MCP tools
- 🔧 Tools Support: Native function calling with tool definitions
- 🔌 MCP Servers: Integration with Model Context Protocol servers for extended capabilities
- 🔒 Security Features: Input validation, CORS restrictions, and secure MCP tool execution
- Go 1.21+: Install from go.dev
- llama-server: From llama.cpp
  # Build from source
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  make -j
  # llama-server binary will be in the build directory
- GGUF Models: Place your models in ~/.cache/llama.cpp or specify a custom directory
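If you don't already have a model, one way to get a GGUF file into the default directory is to download it directly from Hugging Face. The repository and file names below are placeholders; substitute whatever model you want to serve:

```bash
# Create the default models directory
mkdir -p ~/.cache/llama.cpp

# Download a GGUF model into it (placeholder URL; substitute a real repository and file)
curl -L -o ~/.cache/llama.cpp/my-model.gguf \
  "https://huggingface.co/<org>/<repo>/resolve/main/<file>.gguf"
```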
Download the latest binary for your platform from the Releases page.
# Clone the repository
git clone https://github.com/soymh/noollama
cd noollama
# Download dependencies
go mod download
# Build
go build -o noollama
# Run
./noollama

./noollama --port 8080 --models-dir ~/.cache/llama.cpp

-port int
Port to run the API server on (default 8080)
-host string
Host to bind the API server to (default "0.0.0.0")
-models-dir string
Directory containing GGUF models (default "~/.cache/llama.cpp")
-llama-server string
Path to llama-server binary (default "llama-server")
-default-model string
Default model to use if none specified
-mcp-config string
Path to MCP servers configuration file
-api-key string
API key for authentication (default "NOOLLAMA")
-model-timeout duration
Time after which unused models are unloaded (default 5m0s)
-skills-dir string
Directory containing skills (optional)
-db-path string
Path to skills database file (default: ~/.cache/noollama/skills.db)
-embed-url string
URL for embedding model (required for skills)
-embed-model string
Embedding model name (required for skills)
-chat-url string
URL for chat model (required for skills)
-chat-model string
Chat model name (required for skills)
# Basic usage with default settings
./noollama
# Custom port and models directory
./noollama --port 9000 --models-dir /path/to/models
# With MCP support
./noollama --mcp-config ./mcp-config.json
# With Skills support (requires external embedding/chat models)
./noollama \
--skills-dir ./skills \
--embed-url http://localhost:11434/v1 \
--embed-model nomic-embed-text \
--chat-url http://localhost:11434/v1 \
--chat-model llama-3.2-3b
# All options together
./noollama \
--port 8080 \
--host 0.0.0.0 \
--models-dir /mnt/models/gguf \
--llama-server /opt/llama.cpp/llama-server \
--default-model llama-3.2-3b \
--mcp-config ./mcp-servers.json \
--api-key your-secret-key \
--model-timeout 10m \
--skills-dir ./my-skills \
--embed-url http://localhost:11434/v1 \
--embed-model nomic-embed-text \
--chat-url http://localhost:11434/v1 \
  --chat-model llama-3.2-3b

The Skills system allows you to create domain-specific instruction sets that are automatically injected based on user queries.
Skills are defined in Markdown files with YAML frontmatter:
---
name: my-skill-name
description: A brief description of when to use this skill
user-invocable: true
---
# Skill Title
Detailed instructions for the skill...

Example skill:

---
name: terminal-automation
description: "Execute terminal commands via MCP with step-by-step guidance"
user-invocable: true
---
# Terminal Automation Skill
When this skill is active, you are an expert DevOps Engineer.
## Instructions
1. Plan the command sequence
2. Execute commands one at a time using MCP tools
3. Capture screenshots after each command
4. Report results to the user
## Safety Guidelines
- Never run destructive commands without confirmation
- Always show commands before executing

How skill routing works:

- Vector Search: User query is embedded and matched against skill descriptions
- LLM Routing: Top candidates are presented to a router model as tools
- Skill Injection: Selected skill instructions are added to the system prompt
- Execution: Model responds with skill-enhanced behavior while retaining access to MCP tools
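Skill selection is handled server-side, so a request that should trigger a skill is just an ordinary chat completion call. Assuming the terminal-automation skill shown above is installed and the server was started with the skills flags, a request like this would be routed through it automatically:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [
      {"role": "user", "content": "Create a Python virtual environment and install requests"}
    ]
  }'
```

On disk, skills are discovered from the following directory layout: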
skills/
└── .claude/
    └── skills/
        ├── explain-code/
        │   └── SKILL.md
        └── terminal-automation/
            └── SKILL.md
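A quick way to scaffold that layout from the shell, reusing the terminal-automation example above (adjust the name and content for your own skill):

```bash
mkdir -p skills/.claude/skills/terminal-automation
cat > skills/.claude/skills/terminal-automation/SKILL.md <<'EOF'
---
name: terminal-automation
description: "Execute terminal commands via MCP with step-by-step guidance"
user-invocable: true
---
# Terminal Automation Skill
When this skill is active, you are an expert DevOps Engineer.
EOF
```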
Create an mcp-config.json file to enable MCP server integration:
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp/allowed"],
"env": {}
},
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": {
"GITHUB_PERSONAL_ACCESS_TOKEN": "your-token"
}
},
"fetch": {
"command": "uvx",
"args": ["mcp-server-fetch"],
"env": {}
}
}
}

Popular MCP servers:

- Filesystem: @modelcontextprotocol/server-filesystem
- GitHub: @modelcontextprotocol/server-github
- PostgreSQL: @modelcontextprotocol/server-postgres
- Git: @modelcontextprotocol/server-git
- Fetch: mcp-server-fetch
- Brave Search: @modelcontextprotocol/server-brave-search
Install MCP servers via npm:
npm install -g @modelcontextprotocol/server-filesystem
npm install -g @modelcontextprotocol/server-github

The server supports optional API key authentication. By default, it uses "NOOLLAMA" as the API key.
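To use a different key, pass --api-key at startup and send the same value with every request, for example:

```bash
# Start the server with your own key
./noollama --api-key your-secret-key

# Every request must now present the same key
curl -H "Authorization: Bearer your-secret-key" http://localhost:8080/v1/models
```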
Authorization Header:
curl -H "Authorization: Bearer NOOLLAMA" http://localhost:8080/v1/modelsX-API-Key Header:
curl -H "X-API-Key: NOOLLAMA" http://localhost:8080/v1/modelsPython Client:
from openai import OpenAI

client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="NOOLLAMA"
)

List models:

curl http://localhost:8080/v1/models

Chat completion:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'

Streaming chat completion:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
"messages": [{"role": "user", "content": "Tell me a story"}],
"stream": true
}'

Chat completion with tools:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}
}]
}'

Python example:

from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="NOOLLAMA"
)
response = client.chat.completions.create(
model=" mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

JavaScript example:

import OpenAI from 'openai';
const openai = new OpenAI({
baseURL: 'http://localhost:8080/v1',
apiKey: 'NOOLLAMA'
});
const response = await openai.chat.completions.create({
model: 'mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);βββββββββββββββββββ
β HTTP Server β
β (Gin Framework)β
ββββββββββ¬βββββββββ
β
ββββββ΄βββββ
β Manager β
ββββββ¬βββββ
βββββββββββββββ¬βββββββββββββββ
ββββββ΄βββββ βββββββ΄ββββββ ββββββ΄βββββ
βllama- β β MCP β β Skills β
βserver β β Clients β β Manager β
βInstancesβ β β β β
βββββββββββ βββββββββββββ ββββββ¬βββββ
β
ββββββββ΄βββββββ
β Vector DB + β
β Router β
βββββββββββββββ
- Model Auto-Unloading: Unused models automatically unload after 5 minutes
- Keep models warm: Make a test request after startup (see the example after this list)
- Use Q4_K_M quantization: Best balance of size/speed/quality
- SSD storage: Faster model loading
- Sufficient RAM: Ensure enough memory for model + context + skills
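For the warm-up tip above, a minimal request is enough to load a model and keep it resident; this uses the model name from the examples above, but any installed model works:

```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [{"role": "user", "content": "ping"}]
  }' > /dev/null
```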
Model not loading:

- Verify the model file exists with a .gguf extension
- Check read permissions on the model file

Skills not loading:

- Ensure all required flags are provided (--embed-url, --embed-model, --chat-url, --chat-model)
- Check the skills directory structure: skills/.claude/skills/<name>/SKILL.md
- Verify the YAML frontmatter is valid (use quotes for descriptions containing colons)

MCP servers not connecting:

- Verify the MCP server command is installed (npx, uvx)
- Check network/firewall settings
- Review logs for specific errors

Streaming not working:

- Ensure your client supports SSE (Server-Sent Events)
- Disable proxy buffering if applicable
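To verify streaming end to end from the command line, curl's -N flag disables output buffering so you can watch the SSE chunks arrive as they are generated:

```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```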
Contributions welcome! Please feel free to:
- Submit issues for bugs or feature requests
- Create pull requests with improvements
- Share your custom skills in the community
MIT License - See LICENSE file for details
- llama.cpp for llama-server
- Model Context Protocol for MCP specification
- OpenAI for API compatibility
- Claude for skills system inspiration
- NOOLLAMA - This project
- llama.cpp - The underlying inference engine
- MCP Specification - Model Context Protocol