A powerful Go-based OpenAI-compatible API wrapper around llama-server from llama.cpp, designed as a replacement for Ollama. It supports AI Skills, Tools, MCP (Model Context Protocol) servers, streaming responses, and automatic model discovery.
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints
- Automatic Model Discovery: Scans ~/.cache/llama.cpp (or a custom directory) for GGUF models
- Streaming Support: Full SSE streaming for chat completions
- Multi-Model: Automatically manages multiple llama-server instances
- CORS Enabled: Ready for browser-based applications
- API Authentication: Optional API key authentication (default: NOOLLAMA)
- Model Auto-Unloading: Configurable timeout for unused model cleanup (default: 5 minutes)
- Multimodal Support: Vision models with image input capabilities
- 🎯 AI Skills System: Domain-specific skill injection with vector search routing
- Create custom skills in Markdown with frontmatter metadata
- Automatic skill selection via semantic similarity + LLM routing
- Skills inject specialized instructions into system prompts
- Works seamlessly alongside MCP tools
- 🔧 Tools Support: Native function calling with tool definitions
- 🔌 MCP Servers: Integration with Model Context Protocol servers for extended capabilities
- 🔒 Security Features: Input validation, CORS restrictions, and secure MCP tool execution
- Go 1.21+: Install from go.dev
- llama-server: From llama.cpp
  # Build from source
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  make -j
  # llama-server binary will be in the build directory
- GGUF Models: Place your models in ~/.cache/llama.cpp or specify a custom directory
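If you don't already have a model, one way to get a GGUF file into the default directory is to download it directly from Hugging Face. The repository and file names below are placeholders; substitute whatever model you want to serve:

```bash
# Create the default models directory
mkdir -p ~/.cache/llama.cpp

# Download a GGUF model into it (placeholder URL; substitute a real repository and file)
curl -L -o ~/.cache/llama.cpp/my-model.gguf \
  "https://huggingface.co/<org>/<repo>/resolve/main/<file>.gguf"
```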
Download the latest binary for your platform from the Releases page.
# Clone the repository
git clone https://github.com/soymh/noollama
cd noollama
# Download dependencies
go mod download
# Build
go build -o noollama
# Run
./noollama

./noollama --port 8080 --models-dir ~/.cache/llama.cpp

-port int
Port to run the API server on (default 8080)
-host string
Host to bind the API server to (default "0.0.0.0")
-models-dir string
Directory containing GGUF models (default "~/.cache/llama.cpp")
-llama-server string
Path to llama-server binary (default "llama-server")
-default-model string
Default model to use if none specified
-mcp-config string
Path to MCP servers configuration file
-api-key string
API key for authentication (default "NOOLLAMA")
-model-timeout duration
Time after which unused models are unloaded (default 5m0s)
-skills-dir string
Directory containing skills (optional)
-db-path string
Path to skills database file (default: ~/.cache/noollama/skills.db)
-embed-url string
URL for embedding model (required for skills)
-embed-model string
Embedding model name (required for skills)
-chat-url string
URL for chat model (required for skills)
-chat-model string
Chat model name (required for skills)
# Basic usage with default settings
./noollama
# Custom port and models directory
./noollama --port 9000 --models-dir /path/to/models
# With MCP support
./noollama --mcp-config ./mcp-config.json
# With Skills support (requires external embedding/chat models)
./noollama \
--skills-dir ./skills \
--embed-url http://localhost:11434/v1 \
--embed-model nomic-embed-text \
--chat-url http://localhost:11434/v1 \
--chat-model llama-3.2-3b
# All options together
./noollama \
--port 8080 \
--host 0.0.0.0 \
--models-dir /mnt/models/gguf \
--llama-server /opt/llama.cpp/llama-server \
--default-model llama-3.2-3b \
--mcp-config ./mcp-servers.json \
--api-key your-secret-key \
--model-timeout 10m \
--skills-dir ./my-skills \
--embed-url http://localhost:11434/v1 \
--embed-model nomic-embed-text \
--chat-url http://localhost:11434/v1 \
  --chat-model llama-3.2-3b

The Skills system allows you to create domain-specific instruction sets that are automatically injected based on user queries.
Skills are defined in Markdown files with YAML frontmatter:
---
name: my-skill-name
description: A brief description of when to use this skill
user-invocable: true
---
# Skill Title
Detailed instructions for the skill...

Example skill:

---
name: terminal-automation
description: "Execute terminal commands via MCP with step-by-step guidance"
user-invocable: true
---
# Terminal Automation Skill
When this skill is active, you are an expert DevOps Engineer.
## Instructions
1. Plan the command sequence
2. Execute commands one at a time using MCP tools
3. Capture screenshots after each command
4. Report results to the user
## Safety Guidelines
- Never run destructive commands without confirmation
- Always show commands before executing

How skill routing works:

- Vector Search: User query is embedded and matched against skill descriptions
- LLM Routing: Top candidates are presented to a router model as tools
- Skill Injection: Selected skill instructions are added to the system prompt
- Execution: Model responds with skill-enhanced behavior while retaining access to MCP tools
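Skill selection is handled server-side, so a request that should trigger a skill is just an ordinary chat completion call. Assuming the terminal-automation skill shown above is installed and the server was started with the skills flags, a request like this would be routed through it automatically:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [
      {"role": "user", "content": "Create a Python virtual environment and install requests"}
    ]
  }'
```

On disk, skills are discovered from the following directory layout: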
skills/
└── .claude/
    └── skills/
        ├── explain-code/
        │   └── SKILL.md
        └── terminal-automation/
            └── SKILL.md
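A quick way to scaffold that layout from the shell, reusing the terminal-automation example above (adjust the name and content for your own skill):

```bash
mkdir -p skills/.claude/skills/terminal-automation
cat > skills/.claude/skills/terminal-automation/SKILL.md <<'EOF'
---
name: terminal-automation
description: "Execute terminal commands via MCP with step-by-step guidance"
user-invocable: true
---
# Terminal Automation Skill
When this skill is active, you are an expert DevOps Engineer.
EOF
```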
Create an mcp-config.json file to enable MCP server integration:
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp/allowed"],
"env": {}
},
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": {
"GITHUB_PERSONAL_ACCESS_TOKEN": "your-token"
}
},
"fetch": {
"command": "uvx",
"args": ["mcp-server-fetch"],
"env": {}
}
}
}

Popular MCP servers:

- Filesystem: @modelcontextprotocol/server-filesystem
- GitHub: @modelcontextprotocol/server-github
- PostgreSQL: @modelcontextprotocol/server-postgres
- Git: @modelcontextprotocol/server-git
- Fetch: mcp-server-fetch
- Brave Search: @modelcontextprotocol/server-brave-search
Install MCP servers via npm:
npm install -g @modelcontextprotocol/server-filesystem
npm install -g @modelcontextprotocol/server-github

The server supports optional API key authentication. By default, it uses "NOOLLAMA" as the API key.
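To use a different key, pass --api-key at startup and send the same value with every request, for example:

```bash
# Start the server with your own key
./noollama --api-key your-secret-key

# Every request must now present the same key
curl -H "Authorization: Bearer your-secret-key" http://localhost:8080/v1/models
```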
Authorization Header:
curl -H "Authorization: Bearer NOOLLAMA" http://localhost:8080/v1/modelsX-API-Key Header:
curl -H "X-API-Key: NOOLLAMA" http://localhost:8080/v1/modelsPython Client:
from openai import OpenAI

client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="NOOLLAMA"
)

List models:

curl http://localhost:8080/v1/models

Chat completion:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'

Streaming chat completion:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
"messages": [{"role": "user", "content": "Tell me a story"}],
"stream": true
}'

Chat completion with tools:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}
}]
}'

Python example:

from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="NOOLLAMA"
)
response = client.chat.completions.create(
model=" mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

JavaScript example:

import OpenAI from 'openai';
const openai = new OpenAI({
baseURL: 'http://localhost:8080/v1',
apiKey: 'NOOLLAMA'
});
const response = await openai.chat.completions.create({
model: 'mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);βββββββββββββββββββ
β HTTP Server β
β (Gin Framework)β
ββββββββββ¬βββββββββ
β
ββββββ΄βββββ
β Manager β
ββββββ¬βββββ
βββββββββββββββ¬βββββββββββββββ
ββββββ΄βββββ βββββββ΄ββββββ ββββββ΄βββββ
βllama- β β MCP β β Skills β
βserver β β Clients β β Manager β
βInstancesβ β β β β
βββββββββββ βββββββββββββ ββββββ¬βββββ
β
ββββββββ΄βββββββ
β Vector DB + β
β Router β
βββββββββββββββ
- Model Auto-Unloading: Unused models automatically unload after 5 minutes
- Keep models warm: Make a test request after startup (see the example after this list)
- Use Q4_K_M quantization: Best balance of size/speed/quality
- SSD storage: Faster model loading
- Sufficient RAM: Ensure enough memory for model + context + skills
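For the warm-up tip above, a minimal request is enough to load a model and keep it resident; this uses the model name from the examples above, but any installed model works:

```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [{"role": "user", "content": "ping"}]
  }' > /dev/null
```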
Model not loading:

- Verify the model file exists with a .gguf extension
- Check read permissions on the model file

Skills not loading:

- Ensure all required flags are provided (--embed-url, --embed-model, --chat-url, --chat-model)
- Check the skills directory structure: skills/.claude/skills/<name>/SKILL.md
- Verify the YAML frontmatter is valid (use quotes for descriptions containing colons)

MCP servers not connecting:

- Verify the MCP server command is installed (npx, uvx)
- Check network/firewall settings
- Review logs for specific errors

Streaming not working:

- Ensure your client supports SSE (Server-Sent Events)
- Disable proxy buffering if applicable
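To verify streaming end to end from the command line, curl's -N flag disables output buffering so you can watch the SSE chunks arrive as they are generated:

```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```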
Contributions welcome! Please feel free to:
- Submit issues for bugs or feature requests
- Create pull requests with improvements
- Share your custom skills in the community
MIT License - See LICENSE file for details
- llama.cpp for llama-server
- Model Context Protocol for MCP specification
- OpenAI for API compatibility
- Claude for skills system inspiration
- NOOLLAMA - This project
- llama.cpp - The underlying inference engine
- MCP Specification - Model Context Protocol