Skip to content

Gemma 4 26B agent definitions (battle-tested on local llama.cpp) #22

@interpreteragent

Description

@interpreteragent

Gemma 4 26B agent definitions (battle-tested on local llama.cpp)

I've been running pi-interactive-subagents with Gemma 4 26B-A4B on local llama.cpp (GTX 1080, ~16 t/s) for several days. The bundled agent personas are tuned for Claude, and Gemma hits a set of failure modes that Claude doesn't. I've iteratively fixed the personas to address these and added new agents to fill gaps in the pipeline.

Sharing in case this is useful for the project or other local-model users.

What's included

6 modified personas (from the bundled set):

  • planner.md -- added mandatory phase enforcement ("No Skipping"), premortem step, todo quality checklist with code examples requirement
  • reviewer.md -- added algorithmic complexity to the review rubric, "Before issuing APPROVED" evidence requirement
  • scout.md -- minor refinements
  • spec.md -- added "STOP AND WAIT" enforcement (Gemma steamrolls past questions), mandatory phase gates
  • visual-tester.md -- added system context reference, cleanup step
  • worker.md -- added pre-write checklist, promise honoring, "write scripts to files" rule, cleanup-before-commit step, automation verification, state file restoration

5 new personas:

  • orchestrator.md -- pipeline conductor that routes between agents, manages fix cycles (cap at 3), produces session summaries. Strictly tool-only (no code reading, no implementation).
  • tester.md -- functional tester that runs the program with real and adversarial inputs. Separate from reviewer (which reads code). Includes state backup/restore, mandatory cleanup, ISC verification.
  • debugger.md -- diagnostic specialist for root cause analysis. Hypothesis-driven, supports git bisect. Report-only, no fixes.
  • researcher.md -- web researcher with last30days skill integration and anti-hallucination rules (cite sources, report conflicts, never fill gaps from training data).
  • refactor.md -- systematic refactoring with exhaustive search (multiple naming conventions), verification that zero instances remain.

AGENTS.md additions (8 rules targeting Gemma 4 failure modes):

  1. <|think|> token at top of file -- forces Gemma 4 into thinking mode. Without it, thinking drops off after turn 1-2 (known Gemma 4 template behavior).

  2. "Core rule: no facts without a fresh source" -- before writing any name, path, version, flag, or URL, ask "did a tool call IN THIS CONVERSATION confirm this?" Gemma hallucinates package names and CLI flags confidently.

  3. "EXECUTE tools, don't describe them" -- Gemma frequently writes "I'll now spawn a scout" and then continues talking without making the tool call. Includes BAD/GOOD examples. This was the single most common failure mode.

  4. "Verify CLI tools, flags, and options" -- VERIFIED/RECALLED framework. Run command -v, check --help or man pages before using any flag. Gemma invents plausible-sounding flags that don't exist.

  5. "Do not mentally simulate code. Run it." -- Gemma gets stuck in reasoning loops tracing code in its head for 500+ tokens instead of writing a 3-line test and running it.

  6. "Never retry the same command twice" -- Gemma re-runs failed commands without reading the error.

  7. "Promise honoring" -- Gemma writes "I will verify X before doing Y" then does Y without verifying. This rule makes it explicit: restating the promise does not count as keeping it.

  8. "Write scripts to files instead of inline" -- python3 -c '...' breaks on shell escaping. Write to /tmp/script.py and run it.

What these fixed in practice

  • Worker leaving 13 scratch/debug files in project root (now has mandatory cleanup step)
  • Worker committing files with print(f"DEBUG: statements (now searches for debug artifacts)
  • Tester polluting state files with mock data like example.com (now backs up and restores)
  • Reviewer missing O(n^2) parser on multi-KB input (now has algorithmic complexity in rubric)
  • Worker creating systemd timer files but never enabling them (now verifies automation is active)
  • Orchestrator describing tool calls instead of making them (EXECUTE rule)
  • Spec agent answering its own questions and proceeding without user input (STOP AND WAIT)

Model and setup details

Model: Gemma 4 26B-A4B-it, Q5_K_M quantization (Unsloth GGUF, Apr 11 2026 upload with Google's updated chat template)

Runtime: llama.cpp (built from HEAD with CUDA SM 6.1)

llama-server flags:

llama-server \
  -m gemma-4-26B-A4B-it-UD-Q5_K_M.gguf \
  -mm mmproj-BF16.gguf --no-mmproj-offload \
  -ngl 31 -ot "blk\.(4|6|10|23)\.ffn_.*_exps=CUDA0,exps=CPU" \
  --no-mmap \
  --cache-type-k q4_0 --cache-type-v q4_0 -fa on \
  -cram 16384 \
  -c 131072 -np 1 -t 4 \
  --jinja \
  --reasoning auto \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8090

Key flags explained:

  • -ngl 31 -- all 26 transformer layers + embeddings offloaded to GPU
  • -ot "blk\.(4|6|10|23)\.ffn_.*_exps=CUDA0,exps=CPU" -- MoE expert offloading: pins 4 expert layers to GPU, rest on CPU. This is how a 26B MoE fits in 8GB VRAM.
  • --no-mmap -- required for MoE expert offloading to work correctly
  • --cache-type-k q4_0 --cache-type-v q4_0 -fa on -- quantized KV cache with flash attention, enables 128K context in limited VRAM
  • -cram 16384 -- VRAM budget for KV cache (MB)
  • -c 131072 -- 128K context window
  • --jinja -- uses the chat template baked into the GGUF metadata (critical for Gemma 4, the built-in llama.cpp fallback template causes hallucinations)
  • --reasoning auto -- enables thinking/reasoning tokens
  • --temp 1.0 --top-p 0.95 --top-k 64 -- Unsloth's recommended sampling (no DRY, no min-p, no repeat-penalty; those degrade URL/identifier verbatim output)

Pi models.json config:

{
  "llama-gemma4": {
    "baseUrl": "http://localhost:8090/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [
      {
        "id": "gemma-4-26B-A4B-it-UD-Q5_K_M.gguf",
        "name": "gemma4-26b-q5",
        "contextWindow": 131072,
        "maxTokens": 131072,
        "input": ["text", "image"],
        "reasoning": true
      }
    ],
    "compat": {
      "supportsDeveloperRole": false,
      "supportsReasoningEffort": false,
      "maxTokensField": "max_tokens",
      "thinkingFormat": "qwen-chat-template"
    }
  }
}

Note: "thinkingFormat": "qwen-chat-template" is required for Pi to correctly parse Gemma 4's reasoning tokens through llama.cpp's OpenAI-compatible API.

Hardware: GTX 1080 (8GB VRAM, Pascal SM 6.1), Debian testing. ~16 t/s generation, ~43 t/s prompt processing.

Performance: A full pipeline run (scout -> reviewer -> tester -> worker -> re-verify) on a 120-line Python script takes ~3.5 hours at this speed. This is the hardware constraint, not agent overhead. Subagent sessions range from 4-30 minutes each.

Testing: Used across multiple real projects with the full pipeline (scout -> spec -> plan -> worker -> reviewer -> tester -> fix cycles). The persona changes were iterative, each addressing a specific failure observed in a real session.

How to ship this

I don't know what structure makes sense for the repo. Some options:

  • A gemma4/ subdirectory under agents/ with a note in the README
  • A separate branch
  • Just documenting the AGENTS.md additions as "recommended for local models" in the README

Happy to submit a PR in whatever format works. The agent override mechanism (~/.pi/agent/agents/ > package-bundled) means users can adopt these without any repo changes, but having them discoverable in the project would help other local-model users.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions