feat(vendor): add llama.cpp vendor plugin #2131

Open
xb17 wants to merge 1 commit into danielmiessler:main from xb17:feat/llamacpp-vendor

xb17 commented May 28, 2026

What

Adds a dedicated vendor plugin for llama.cpp server.

Background

Issue #2072 requested llama.cpp support. Currently users have to route through the openai_compatible driver pointed at http://localhost:8080/v1, which works but requires manual setup and doesn't expose llama.cpp-specific behaviour.

Why a dedicated driver

llama.cpp's server is OpenAI-compatible but differs in a few meaningful ways:

| | openai_compatible | llama.cpp |
| --- | --- | --- |
| Default URL | varies | http://localhost:8080/v1 |
| API key | required field | optional (local server) |
| cache_prompt | not sent | ✓ sent; reuses KV cache for repeated system prompts |
| SDK dependency | OpenAI Go SDK | none (hand-rolled HTTP) |

The cache_prompt: true field is particularly valuable for Fabric usage patterns, where the same system prompt (pattern) is sent repeatedly across requests. llama.cpp reuses the KV cache for the matching prefix, significantly reducing time-to-first-token on subsequent requests.

Implementation

Follows the same hand-rolled HTTP approach as the LM Studio plugin. Implements ListModels, SendStream, and Send. API key is optional — the Authorization header is only set when a key is configured.
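The optional-key behaviour could look roughly like the sketch below. The helper name newModelsRequest, the defaultBaseURL constant, the /models path, and the Bearer scheme are assumptions made for illustration; the PR text only states that the Authorization header is set when a key is configured.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// defaultBaseURL mirrors the default stated in this PR; the constant name is
// hypothetical.
const defaultBaseURL = "http://localhost:8080/v1"

// newModelsRequest builds a GET request for the model listing endpoint.
// The Authorization header is attached only when apiKey is non-empty.
// Assumption: the key is sent as a Bearer token.
func newModelsRequest(baseURL, apiKey string) (*http.Request, error) {
	if baseURL == "" {
		baseURL = defaultBaseURL
	}
	url := strings.TrimRight(baseURL, "/") + "/models"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	// No key configured: the request carries no Authorization header.
	noKey, err := newModelsRequest("", "")
	if err != nil {
		panic(err)
	}
	fmt.Println(noKey.URL.String(), noKey.Header.Get("Authorization") == "")

	// Key configured: the header is present.
	withKey, err := newModelsRequest("", "secret")
	if err != nil {
		panic(err)
	}
	fmt.Println(withKey.Header.Get("Authorization") != "")
}
```

Keeping the header conditional means an unauthenticated local server never receives an empty or placeholder credential.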

New and modified files:

  • internal/plugins/ai/llamacpp/llamacpp.go
  • i18n keys in internal/i18n/locales/en.json
  • Registration in internal/core/plugin_registry.go

Closes #2072

Adds a dedicated vendor plugin for llama.cpp server
(https://github.com/ggml-org/llama.cpp).

llama.cpp exposes an OpenAI-compatible REST API but has behaviour that
differs from the generic openai_compatible driver:

- Default base URL: http://localhost:8080/v1 (llama.cpp default port)
- No API key required (auth header is sent only when a key is configured)
- Supports cache_prompt to reuse the KV cache across requests that share
  a common prefix (e.g. the same system prompt), reducing latency
- Does not use LM Studio-specific extensions (chat_template_kwargs, etc.)

The driver follows the same hand-rolled HTTP pattern as the LM Studio
plugin and implements ListModels, SendStream, and Send.

Closes danielmiessler#2072

Development

Successfully merging this pull request may close these issues.

[Feature request]: Support for local Llama.cpp architecture
