NOOLLAMA - Llama OpenAI Wrapper with Skills & MCP


A powerful Go-based OpenAI-compatible API wrapper around llama-server from llama.cpp, designed as a replacement for Ollama. It supports AI Skills, Tools, MCP (Model Context Protocol) servers, streaming responses, and automatic model discovery.

✨ Features

Core Features

  • OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints
  • Automatic Model Discovery: Scans ~/.cache/llama.cpp (or custom directory) for GGUF models
  • Streaming Support: Full SSE streaming for chat completions
  • Multi-Model: Automatically manages multiple llama-server instances
  • CORS Enabled: Ready for browser-based applications
  • API Authentication: Optional API key authentication (default: NOOLLAMA)
  • Model Auto-Unloading: Configurable timeout for unused model cleanup (default: 5 minutes)
  • Multimodal Support: Vision models with image input capabilities

Advanced Features

  • 🎯 AI Skills System: Domain-specific skill injection with vector search routing
    • Create custom skills in Markdown with frontmatter metadata
    • Automatic skill selection via semantic similarity + LLM routing
    • Skills inject specialized instructions into system prompts
    • Works seamlessly alongside MCP tools
  • 🔧 Tools Support: Native function calling with tool definitions
  • 🔌 MCP Servers: Integration with Model Context Protocol servers for extended capabilities
  • 🔒 Security Features: Input validation, CORS restrictions, and secure MCP tool execution

🚀 Quick Start

Prerequisites

  1. Go 1.21+: Install from go.dev

  2. llama-server: From llama.cpp

    # Build from source (llama.cpp now builds with CMake)
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build
    cmake --build build --config Release -j
    # the llama-server binary will be in build/bin/
  3. GGUF Models: Place your models in ~/.cache/llama.cpp or specify custom directory

Installation

Option 1: Download Pre-built Binaries

Download the latest binary for your platform from the Releases page.

Option 2: Build from Source

# Clone the repository
git clone https://github.com/soymh/noollama
cd noollama

# Download dependencies
go mod download

# Build
go build -o noollama

# Run
./noollama

📚 Documentation

Basic Usage

./noollama --port 8080 --models-dir ~/.cache/llama.cpp

Command Line Options

-port int
    Port to run the API server on (default 8080)
-host string
    Host to bind the API server to (default "0.0.0.0")
-models-dir string
    Directory containing GGUF models (default "~/.cache/llama.cpp")
-llama-server string
    Path to llama-server binary (default "llama-server")
-default-model string
    Default model to use if none specified
-mcp-config string
    Path to MCP servers configuration file
-api-key string
    API key for authentication (default "NOOLLAMA")
-model-timeout duration
    Time after which unused models are unloaded (default 5m0s)
-skills-dir string
    Directory containing skills (optional)
-db-path string
    Path to skills database file (default "~/.cache/noollama/skills.db")
-embed-url string
    URL for embedding model (required for skills)
-embed-model string
    Embedding model name (required for skills)
-chat-url string
    URL for chat model (required for skills)
-chat-model string
    Chat model name (required for skills)

Examples

# Basic usage with default settings
./noollama

# Custom port and models directory
./noollama --port 9000 --models-dir /path/to/models

# With MCP support
./noollama --mcp-config ./mcp-config.json

# With Skills support (requires external embedding/chat models)
./noollama \
  --skills-dir ./skills \
  --embed-url http://localhost:11434/v1 \
  --embed-model nomic-embed-text \
  --chat-url http://localhost:11434/v1 \
  --chat-model llama-3.2-3b

# All options together
./noollama \
  --port 8080 \
  --host 0.0.0.0 \
  --models-dir /mnt/models/gguf \
  --llama-server /opt/llama.cpp/llama-server \
  --default-model llama-3.2-3b \
  --mcp-config ./mcp-servers.json \
  --api-key your-secret-key \
  --model-timeout 10m \
  --skills-dir ./my-skills \
  --embed-url http://localhost:11434/v1 \
  --embed-model nomic-embed-text \
  --chat-url http://localhost:11434/v1 \
  --chat-model llama-3.2-3b

🎯 Skills System

The Skills system allows you to create domain-specific instruction sets that are automatically injected based on user queries.

Creating a Skill

Skills are defined in Markdown files with YAML frontmatter:

---
name: my-skill-name
description: A brief description of when to use this skill
user-invocable: true
---

# Skill Title

Detailed instructions for the skill...

Example: Terminal Automation Skill

---
name: terminal-automation
description: "Execute terminal commands via MCP with step-by-step guidance"
user-invocable: true
---

# Terminal Automation Skill

When this skill is active, you are an expert DevOps Engineer.

## Instructions
1. Plan the command sequence
2. Execute commands one at a time using MCP tools
3. Capture screenshots after each command
4. Report results to the user

## Safety Guidelines
- Never run destructive commands without confirmation
- Always show commands before executing
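
A SKILL.md file like the one above splits cleanly into frontmatter metadata and a Markdown body. A minimal, illustrative Python sketch of that split (hand-rolled key/value parsing instead of a real YAML library, and not noollama's actual Go code):

```python
def parse_skill(text):
    """Split a SKILL.md document into a frontmatter dict and a Markdown body."""
    # The file starts with '---', so splitting on '---\n' twice yields:
    # leading empty string, frontmatter block, body.
    _, frontmatter, body = text.split("---\n", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")  # first colon only; values may contain colons
        meta[key.strip()] = value.strip().strip('"')
    return meta, body.strip()

sample = """---
name: terminal-automation
description: "Execute terminal commands via MCP with step-by-step guidance"
user-invocable: true
---

# Terminal Automation Skill
...
"""
meta, body = parse_skill(sample)
print(meta["name"])  # terminal-automation
```

Note that this sketch keeps all values as strings (`user-invocable` stays `"true"`); a real implementation would use a YAML parser and typed fields.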

How Skills Work

  1. Vector Search: User query is embedded and matched against skill descriptions
  2. LLM Routing: Top candidates are presented to a router model as tools
  3. Skill Injection: Selected skill instructions are added to the system prompt
  4. Execution: Model responds with skill-enhanced behavior while retaining access to MCP tools
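
Step 1 above can be sketched in a few lines. This is an illustrative Python sketch of the vector-search stage only, not the project's Go implementation; the toy 2-d embeddings and the `cosine` helper are invented for the example:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route_skill(query_embedding, skills, top_k=3):
    """Rank skills by similarity of their description embedding to the query.

    In noollama the resulting shortlist is then offered to a router LLM as
    tools (step 2); here we simply return the ranked candidates.
    """
    ranked = sorted(
        skills,
        key=lambda s: cosine(query_embedding, s["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy 2-d embeddings, invented for illustration.
skills = [
    {"name": "terminal-automation", "embedding": [0.9, 0.1]},
    {"name": "explain-code", "embedding": [0.1, 0.9]},
]
shortlist = route_skill([0.8, 0.2], skills, top_k=1)
print(shortlist[0]["name"])  # terminal-automation
```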

Skills Directory Structure

skills/
└── .claude/
    └── skills/
        ├── explain-code/
        │   └── SKILL.md
        └── terminal-automation/
            └── SKILL.md

🔌 MCP Server Configuration

Create an mcp-config.json file to enable MCP server integration:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp/allowed"],
      "env": {}
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "your-token"
      }
    },
    "fetch": {
      "command": "uvx",
      "args": ["mcp-server-fetch"],
      "env": {}
    }
  }
}
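
A client consuming this file would typically parse it and sanity-check each entry before spawning servers. A minimal sketch (the field names follow the example above; the validation rules are assumptions, not noollama's actual code):

```python
import json

def load_mcp_config(text):
    """Parse an mcp-config.json document and check that each server entry
    has the fields the example above uses: command, args, env."""
    config = json.loads(text)
    servers = config.get("mcpServers", {})
    for name, spec in servers.items():
        if "command" not in spec:
            raise ValueError(f"server {name!r} is missing 'command'")
        if not isinstance(spec.get("args", []), list):
            raise ValueError(f"server {name!r}: 'args' must be a list")
        spec.setdefault("env", {})  # env is optional; default to empty
    return servers

sample = '''{
  "mcpServers": {
    "fetch": {"command": "uvx", "args": ["mcp-server-fetch"]}
  }
}'''
servers = load_mcp_config(sample)
print(servers["fetch"]["command"])  # uvx
```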

Popular MCP Servers

  • Filesystem: @modelcontextprotocol/server-filesystem
  • GitHub: @modelcontextprotocol/server-github
  • PostgreSQL: @modelcontextprotocol/server-postgres
  • Git: @modelcontextprotocol/server-git
  • Fetch: mcp-server-fetch
  • Brave Search: @modelcontextprotocol/server-brave-search

Install MCP servers via npm:

npm install -g @modelcontextprotocol/server-filesystem
npm install -g @modelcontextprotocol/server-github

🔐 Authentication

The server supports optional API key authentication. By default, it uses "NOOLLAMA" as the API key.

Using API Keys

Authorization Header:

curl -H "Authorization: Bearer NOOLLAMA" http://localhost:8080/v1/models

X-API-Key Header:

curl -H "X-API-Key: NOOLLAMA" http://localhost:8080/v1/models

Python Client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="NOOLLAMA"
)

📑 API Endpoints

Models

List Available Models

curl http://localhost:8080/v1/models

Chat Completions

Standard Request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Streaming Request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
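
With `"stream": true`, the response arrives as Server-Sent Events: `data: {json}` lines ending with `data: [DONE]`. A self-contained sketch of assembling the text from such a stream (the sample chunks below are invented in the shape the endpoint streams; a real client would read lines from the HTTP response body instead):

```python
import json

def collect_stream(lines):
    """Assemble the assistant text from OpenAI-style SSE lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))  # first chunk may carry only the role
    return "".join(text)

# Invented sample chunks for illustration.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Once upon"}}]}',
    'data: {"choices": [{"delta": {"content": " a time"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Once upon a time
```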

With Tools

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": " mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"]
        }
      }
    }]
  }'
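
When the model decides to use `get_weather`, the response carries a `tool_calls` entry instead of text; the client executes the function and sends the result back as a `tool` message in a follow-up request. A hedged sketch of that round trip (the `tool_call` dict is a hand-written sample in the OpenAI shape, and `lookup_weather` is an invented stand-in for a real weather lookup):

```python
import json

def lookup_weather(location):
    # Invented stand-in for a real weather API call.
    return {"location": location, "temp_c": 21, "sky": "clear"}

def handle_tool_call(tool_call):
    """Execute one tool call and build the 'tool' message to send back."""
    args = json.loads(tool_call["function"]["arguments"])
    result = lookup_weather(args["location"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }

# Hand-written sample in the shape the API returns.
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
}
reply = handle_tool_call(tool_call)
print(reply["content"])
```

The `reply` dict is appended to `messages` (after the assistant message containing the tool call) and the request is re-sent so the model can phrase a final answer.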

💻 Client Examples

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="NOOLLAMA"
)

response = client.chat.completions.create(
    model=" mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

Node.js (OpenAI SDK)

import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'NOOLLAMA'
});

const response = await openai.chat.completions.create({
  model: 'mradermacher_Qwen3-4B-Instruct-2507-heretic-GGUF.gguf',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   HTTP Server   β”‚
β”‚   (Gin Framework)β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
    β”‚  Manager β”‚
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
         β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
    β”‚llama-   β”‚  β”‚   MCP     β”‚  β”‚ Skills  β”‚
    β”‚server   β”‚  β”‚  Clients  β”‚  β”‚ Manager β”‚
    β”‚Instancesβ”‚  β”‚           β”‚  β”‚         β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
                                     β”‚
                              β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”
                              β”‚ Vector DB + β”‚
                              β”‚   Router    β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

⚡ Performance Tips

  1. Model Auto-Unloading: Unused models unload automatically after 5 minutes by default (tune with --model-timeout)
  2. Keep models warm: Make a test request after startup
  3. Use Q4_K_M quantization: Best balance of size/speed/quality
  4. SSD storage: Faster model loading
  5. Sufficient RAM: Ensure enough memory for model + context + skills

πŸ› Troubleshooting

Model not found

  • Verify the model file exists with .gguf extension
  • Check read permissions on the model file

Skills not loading

  • Ensure all required flags are provided (--embed-url, --embed-model, --chat-url, --chat-model)
  • Check skills directory structure: skills/.claude/skills/<name>/SKILL.md
  • Verify YAML frontmatter is valid (use quotes for descriptions with colons)

MCP servers not connecting

  • Verify MCP server command is installed (npx, uvx)
  • Check network/firewall settings
  • Review logs for specific errors

Streaming not working

  • Ensure client supports SSE (Server-Sent Events)
  • Disable proxy buffering if applicable

🤝 Contributing

Contributions welcome! Please feel free to:

  • Submit issues for bugs or feature requests
  • Create pull requests with improvements
  • Share your custom skills in the community

📄 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

📦 Related Projects
