This guide explains how Heimdall manages context and token allocation with the default qwen3-0.6b-instruct model.
Heimdall uses a single-shot command architecture: each request is independent, and no conversation history accumulates between requests. This maximizes the context available for rich system prompts while keeping responses fast.
```
┌─────────────────────────────────────────────────────────────────┐
│                   8K CONTEXT WINDOW (default)                   │
│            (model may support larger; configurable)             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 SYSTEM PROMPT (6K budget)                 │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │ Identity & Role (~50 tokens)                        │  │  │
│  │  │   "You are Heimdall, the AI assistant..."           │  │  │
│  │  ├─────────────────────────────────────────────────────┤  │  │
│  │  │ Available Actions (~200-500 tokens)                 │  │  │
│  │  │   - heimdall_watcher_status                         │  │  │
│  │  │   - heimdall_watcher_query                          │  │  │
│  │  │   - [plugin-registered actions]                     │  │  │
│  │  ├─────────────────────────────────────────────────────┤  │  │
│  │  │ Cypher Query Primer (~400 tokens)                   │  │  │
│  │  │   - Basic patterns, filtering, aggregations         │  │  │
│  │  │   - Path queries, modifications, subqueries         │  │  │
│  │  ├─────────────────────────────────────────────────────┤  │  │
│  │  │ Response Modes (~100 tokens)                        │  │  │
│  │  │   - ACTION MODE: JSON for operations                │  │  │
│  │  │   - HELP MODE: Conversational for questions         │  │  │
│  │  ├─────────────────────────────────────────────────────┤  │  │
│  │  │ Plugin Instructions (~variable)                     │  │  │
│  │  │   - AdditionalInstructions from plugins             │  │  │
│  │  ├─────────────────────────────────────────────────────┤  │  │
│  │  │ Examples (~500 tokens)                              │  │  │
│  │  │   - 20 built-in examples for common commands        │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                                                           │  │
│  │  Total base system: ~1,200 tokens                         │  │
│  │  Available for plugins: ~4,800 tokens                     │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  USER MESSAGE (2K budget)                 │  │
│  │  Single-shot command from Bifrost UI                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  RESPONSE (1K max tokens)                 │  │
│  │  JSON action OR conversational help                       │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
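In code terms, the single-shot design means there is no history slice to carry between calls. A minimal sketch (hypothetical names, not the actual implementation):

```go
// Request represents one single-shot command from the Bifrost UI.
// Hypothetical type for illustration.
type Request struct {
	Command string // one command; nothing carries over from prior requests
}

// assemblePrompt builds the full prompt from scratch on every call.
// A chat architecture would interleave accumulated conversation turns
// here; Heimdall deliberately does not, so the entire window is split
// between the system prompt, the current command, and the response.
func assemblePrompt(systemPrompt string, req Request) string {
	return systemPrompt + "\n\nUser: " + req.Command
}
```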
These allocations map to the following runtime settings:

| Setting | Value | Purpose |
|---|---|---|
| NORNICDB_HEIMDALL_CONTEXT_SIZE | 8192 | Model context window (tokens) |
| NORNICDB_HEIMDALL_BATCH_SIZE | 2048 | Batch size for prefill |
| NORNICDB_HEIMDALL_MAX_TOKENS | 1024 | 1K response limit |
Heimdall also enforces a prompt construction budget (so plugins can’t blow up the system prompt):
| Setting | Default | Description |
|---|---|---|
| NORNICDB_HEIMDALL_MAX_CONTEXT_TOKENS | 8192 | Total prompt budget (system + user) |
| NORNICDB_HEIMDALL_MAX_SYSTEM_TOKENS | 6000 | System prompt budget (base + plugins) |
| NORNICDB_HEIMDALL_MAX_USER_TOKENS | 2000 | User message budget |
The same settings apply to all providers (local GGUF, Ollama, OpenAI). Override via the env vars above or via YAML (heimdall.max_context_tokens, max_system_tokens, max_user_tokens).
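For YAML, the key names above suggest they nest under a heimdall section. A sketch, assuming that nesting (check your config reference for the exact file layout):

```yaml
heimdall:
  max_context_tokens: 8192   # total prompt budget (system + user)
  max_system_tokens: 6000    # system prompt budget (base + plugins)
  max_user_tokens: 2000      # user message budget
```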
For remote providers (Ollama, OpenAI), inference runs on the provider’s infrastructure, so there is no local GPU memory limit. You can safely increase the token budgets to match the model’s context window (e.g. 32K or 128K).
Example — 32K context (e.g. GPT-4, many Ollama models):
```bash
export NORNICDB_HEIMDALL_MAX_CONTEXT_TOKENS=32768
export NORNICDB_HEIMDALL_MAX_SYSTEM_TOKENS=24000
export NORNICDB_HEIMDALL_MAX_USER_TOKENS=8000
```

Example — 128K context (e.g. GPT-4o-mini, GPT-4 Turbo, Claude, or large Ollama models):

```bash
export NORNICDB_HEIMDALL_MAX_CONTEXT_TOKENS=131072
export NORNICDB_HEIMDALL_MAX_SYSTEM_TOKENS=100000
export NORNICDB_HEIMDALL_MAX_USER_TOKENS=30000
```

Keep max_system_tokens + max_user_tokens within your model's context size and leave headroom for the response. Startup logs show the active budget, e.g. Token budget: 32K context = 24K system + 8K user.
When the system prompt exceeds the batch size, Heimdall automatically splits it into multiple batches:
```
System Prompt (2K tokens) + User Message (500 tokens) = 2.5K total

Batch 1: [System prompt tokens 0-2047]      → KV cache stores
Batch 2: [Remaining tokens + user message]  → KV cache accumulates
                                            → Generation starts
```
The KV cache accumulates across batches, so the model "sees" the entire context when generating.
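A minimal sketch of this chunked prefill, with hypothetical names standing in for the actual inference binding:

```go
package inference

// decoder stands in for the backend's batch-decode call
// (hypothetical interface; the real binding's signature differs).
type decoder interface {
	// Decode processes one chunk of prompt tokens starting at
	// position pos, appending their keys/values to the KV cache.
	Decode(tokens []int32, pos int) error
}

// prefill feeds the prompt to the model in batchSize chunks. The KV
// cache persists across calls, so by the time generation starts the
// model attends over the entire prompt.
func prefill(d decoder, prompt []int32, batchSize int) error {
	for start := 0; start < len(prompt); start += batchSize {
		end := min(start+batchSize, len(prompt))
		if err := d.Decode(prompt[start:end], start); err != nil {
			return err
		}
	}
	return nil
}
```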
These defaults are defined in pkg/heimdall/types.go and can be overridden via environment variables:
```go
const (
	DefaultMaxContextTokens      = 8192 // 8K total context budget
	DefaultMaxSystemPromptTokens = 6000 // 6K for system + plugins
	DefaultMaxUserMessageTokens  = 2000 // 2K for user commands
	TokensPerChar                = 0.25 // ~4 chars per token estimate
)
```
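TokensPerChar drives a simple character-count heuristic. A sketch of the estimator it implies (hypothetical helper name, building on the constants above):

```go
// estimateTokens approximates the token count of s using the
// ~4 chars/token heuristic (TokensPerChar = 0.25 above).
func estimateTokens(s string) int {
	return int(float64(len(s)) * TokensPerChar)
}
```

Under this heuristic, a 24,000-character system prompt estimates to exactly the 6,000-token default system budget.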
| Component | Estimated Tokens | Notes |
|---|---|---|
| Base identity | ~50 | Fixed header |
| Available actions | 200-500 | Depends on plugin count |
| Cypher primer | ~400 | Reference guide |
| Response modes | ~100 | Action + Help modes |
| Built-in examples | ~500 | 20 comprehensive examples |
| Base total | ~1,200 | Before plugins |
| Plugin instructions | ~4,800 available | Plugins can add context |
If plugins add too many instructions and the system prompt exceeds the system budget, Heimdall automatically falls back to a minimal prompt:
```
// Minimal fallback prompt (~200 tokens)
"You are Heimdall, AI assistant for NornicDB graph database.
ACTIONS:
[plugin actions only]
For queries: {"action": "heimdall_watcher_query", "params": {"cypher": "..."}}
Respond with JSON only."
```
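The decision itself reduces to a size check. A hedged sketch (hypothetical names, reusing estimateTokens from the sketch above):

```go
// systemPrompt returns the full prompt (base + plugin instructions)
// when it fits the budget, and the minimal ~200-token fallback when
// it does not.
func systemPrompt(base, pluginInstr, minimal string, maxSystem int) string {
	full := base + "\n" + pluginInstr
	if estimateTokens(full) > maxSystem {
		return minimal
	}
	return full
}
```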
Factors affecting response latency:

| Factor | Impact | Notes |
|---|---|---|
| MaxTokens | High | Each output token takes ~same time |
| GPU vs CPU | Very High | GPU is 10-50x faster |
| Prompt size | Low | Only affects prefill, not generation |
| Context/Batch size | Minimal | Memory allocation only |
- KV Cache is lazy - Only allocates for actual tokens used
- Prefill is fast - Parallel processing of input tokens
- Generation dominates - 90% of time is in token generation
- Your prompts are small - typically ~1–3K tokens vs 8K default capacity (and you can increase the context window if needed)
Default model specifications (qwen3-0.6b-instruct):

| Spec | Value |
|---|---|
| Parameters | 0.6B |
| Context Length | 32,768 tokens |
| Quantization | Q4_K_M recommended |
| VRAM (GPU) | ~500MB |
| RAM (CPU) | ~1GB |
| License | Apache 2.0 |
Minimal configuration (defaults):

```bash
NORNICDB_HEIMDALL_ENABLED=true
# Uses defaults - 8K context, 2K batch, 1K output
```

Low-memory configuration:

```bash
NORNICDB_HEIMDALL_ENABLED=true
NORNICDB_HEIMDALL_CONTEXT_SIZE=4096   # Reduce if low RAM
NORNICDB_HEIMDALL_BATCH_SIZE=1024
NORNICDB_HEIMDALL_MAX_TOKENS=512      # Shorter responses
```

Longer responses:

```bash
NORNICDB_HEIMDALL_ENABLED=true
NORNICDB_HEIMDALL_MAX_TOKENS=2048     # Allow longer explanations
```

The handler logs token budget information:
```
[Bifrost] Token budget: system=1247, user=156, total=1403/8192
```
If you see truncation errors, check:
- Is MaxTokens high enough for the response?
- Are plugins adding too many instructions?
- Is the user message within budget?
- Heimdall AI Assistant - Overview and configuration
- Heimdall Plugins - Writing custom plugins
- Operations - Monitoring - Prometheus metrics