Caveman Mode

Ankur Nair edited this page Apr 19, 2026 · 1 revision

Caveman Mode is TitanX's token-saving layer for long conversations. On conversations with 50+ turns, it typically reduces token usage by 30–75%. It is off by default and can be enabled globally, per team, or per conversation.

The name is a joke about thinking "like a caveman" — using fewer words, grunting out only the essentials.


The problem it solves

Every turn in a conversation sends the full prior history to the LLM. By turn 100, each new turn pays for the entire 100-turn context again, so the cumulative cost grows roughly quadratically with conversation length.
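To make the growth concrete, here is a small back-of-the-envelope sketch. It assumes every turn adds the same number of tokens (`tokens_per_turn` is a hypothetical simplification, not a TitanX setting) and that no compression is applied.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a conversation when every turn resends the full history."""
    # Turn k sends k turns' worth of history, so the total is
    # tokens_per_turn * (1 + 2 + ... + turns).
    return tokens_per_turn * turns * (turns + 1) // 2


print(cumulative_input_tokens(10, 200))   # 11000
print(cumulative_input_tokens(100, 200))  # 1010000
```

Ten times as many turns costs roughly ninety times as many input tokens in this model, which is the gap compression is meant to close.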

Naive solutions (just dropping old turns) lose context you might still need. Caveman Mode is smarter: it compresses old turns into summaries while keeping recent turns verbatim.


How it works

Trigger

Caveman Mode activates when either of the following is true:

  1. Total input token count exceeds a threshold (default 8000 tokens)
  2. The user manually invokes /compact
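The trigger condition can be sketched as a single predicate. The function and parameter names below are hypothetical and only illustrate the rule described above. They are not TitanX's actual API.

```python
DEFAULT_THRESHOLD = 8000  # default threshold stated on this page


def should_compress(input_tokens: int,
                    threshold: int = DEFAULT_THRESHOLD,
                    manual_compact: bool = False) -> bool:
    """Return True when a compression pass should run."""
    # A manual /compact forces a pass regardless of the token count;
    # otherwise the input must strictly exceed the threshold.
    return manual_compact or input_tokens > threshold
```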

Compression

  1. TitanX identifies "old" turns (everything except the last N, default N=5)
  2. Sends them to the same model with a summarization prompt
  3. Replaces the old turns in the conversation context with a single summary message

The original turns aren't deleted — they stay in SQLite for display. Only the in-context representation sent to the model changes.
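A minimal sketch of the compression step, assuming turns are plain strings and the model call is passed in as a `summarize` callable (both are illustrative assumptions, not how TitanX represents conversations internally). The input list is left untouched, mirroring the fact that the stored history is not deleted.

```python
from typing import Callable, List


def compress_context(turns: List[str],
                     summarize: Callable[[List[str]], str],
                     keep_last_n: int = 5) -> List[str]:
    """Return the in-context view: one summary message plus the last N turns verbatim."""
    split = len(turns) - keep_last_n
    if split <= 0:
        # Nothing is old enough to compress; send the history as-is.
        return list(turns)
    old, recent = turns[:split], turns[split:]
    # The old turns collapse into a single summary message.
    return [summarize(old)] + recent


history = [f"turn {i}" for i in range(1, 9)]
context = compress_context(history, lambda old: f"[summary of {len(old)} turns]")
print(context[0])    # [summary of 3 turns]
print(len(history))  # 8
```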

Recent-turn fidelity

The last N turns (default 5) are always kept verbatim, so the LLM has lossless access to what happened recently. Only older history is compressed.

Re-triggering

After compression, the conversation continues. When new turns push it back past the threshold, Caveman triggers again: it compresses the previous summary together with the turns that have since become old into a single consolidated summary. This repeats as needed.
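The re-trigger cycle can be simulated with a short loop. This is a sketch under stated assumptions: `count_tokens` and `summarize` are hypothetical stand-ins for a real tokenizer and the model call, and turns are plain strings.

```python
from typing import Callable, List, Tuple


def run_conversation(new_turns: List[str],
                     threshold: int,
                     keep_last_n: int,
                     count_tokens: Callable[[str], int],
                     summarize: Callable[[List[str]], str]) -> Tuple[List[str], int]:
    """Feed turns one by one, compressing whenever the context exceeds the threshold."""
    context: List[str] = []
    compressions = 0
    for turn in new_turns:
        context.append(turn)
        total = sum(count_tokens(message) for message in context)
        split = len(context) - keep_last_n
        if total > threshold and split > 0:
            # After the first pass, context[0] is the previous summary, so it is
            # folded into the new summary together with the newly old turns.
            context = [summarize(context[:split])] + context[split:]
            compressions += 1
    return context, compressions
```

With ten-word turns, a threshold of 60, and `keep_last_n=3`, ten turns trigger two passes in this model, and the context ends as one summary plus the three most recent turns.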


When to enable

✅ Good fits

  • Long multi-step projects that span dozens of turns
  • Support / customer service bots with continuous conversations
  • Daily-standup type heartbeat agents that build up context
  • Research agents pulling lots of context from different sources

❌ Poor fits

  • Short, focused conversations (Caveman has overhead; <20 turns isn't worth it)
  • High-fidelity work where every detail matters (legal review, exact code refactoring on a specific line — compression can drop specifics)
  • Multi-step debugging where you're re-referencing turn-1 details on turn-50 (Caveman can drop those specifics; keep it off for these)

Configuration

Global toggle

Settings → System → Enable Caveman Mode — applies to all new conversations. Off by default.

Per-team override

Team Settings → Caveman Mode — override global for this team's agents.

Per-conversation

Conversation settings gear → Caveman Mode → on/off/inherit.
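The three levels can be read as a simple fallback chain. The sketch below assumes the most specific setting wins (conversation, then team, then global) and models the team override as an optional boolean. Both the precedence order and the names are assumptions for illustration, not confirmed TitanX behavior.

```python
from typing import Optional


def resolve_caveman(global_enabled: bool,
                    team_override: Optional[bool] = None,
                    conversation_setting: str = "inherit") -> bool:
    """Resolve the effective Caveman Mode state for one conversation."""
    if conversation_setting == "on":
        return True
    if conversation_setting == "off":
        return False
    if conversation_setting != "inherit":
        raise ValueError(f"unknown setting: {conversation_setting!r}")
    # "inherit": fall back to the team override if one is set, else the global toggle.
    if team_override is not None:
        return team_override
    return global_enabled
```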

Threshold tuning

Settings → System → Caveman Threshold — token count at which compression triggers. Default 8000.

  • Lower (4000) → compresses more aggressively, more savings but more info loss
  • Higher (16000) → compresses less, less savings but higher fidelity

Sweet spot for most teams: 8000 (default). Enterprise teams with long support sessions often prefer 12000.

Keep-last-N tuning

Settings → System → Caveman Keep-Last-N Turns — how many recent turns stay verbatim. Default 5.

  • Lower (3) → aggressive compression; good for bot-style agents
  • Higher (10) → cautious; good for back-and-forth dialogue

What gets summarized

The summarization prompt asks the model for:

  1. Key facts — names, numbers, IDs, decisions
  2. State changes — "agent X completed task Y", "user clarified they want Z"
  3. Open questions — unresolved points that might matter later
  4. Pinned context — anything marked as pinned in the UI is always kept verbatim

The output is a structured markdown block that replaces the old turns, so the LLM reads one message instead of N.

Example summary block:

[Compressed summary of turns 1-45]

Context:
- User is building a Python FastAPI service for a todo app
- Chose Postgres, Redis caching
- Agent wrote 3 endpoints (list/create/delete), user approved all
- Wrote unit tests for list endpoint

Open questions:
- User hasn't decided on auth strategy yet (mentioned JWT + OAuth options in turn 30)
- Deployment target (mentioned Fly.io vs. Railway in turn 42)

Last 5 turns: [shown verbatim below]
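For illustration, a block in this shape could be assembled from structured fields as follows. The function and its parameters are hypothetical. In practice the model produces the summary text, and this sketch only mirrors the layout of the example.

```python
from typing import List


def render_summary(first_turn: int, last_turn: int,
                   context: List[str], open_questions: List[str],
                   keep_last_n: int = 5) -> str:
    """Format a summary block in the same layout as the example above."""
    lines = [f"[Compressed summary of turns {first_turn}-{last_turn}]", "", "Context:"]
    lines += [f"- {item}" for item in context]
    lines += ["", "Open questions:"]
    lines += [f"- {question}" for question in open_questions]
    lines += ["", f"Last {keep_last_n} turns: [shown verbatim below]"]
    return "\n".join(lines)
```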

Manual invocation

/compact in any conversation → forces immediate Caveman pass regardless of threshold.

/restore → un-applies the last Caveman compression, restoring full history for the next turn. (Useful if compression dropped something you need.)
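One way to picture the two commands is a conversation object that keeps the stored history separate from the context sent to the model. This is a sketch only: the class, its methods, and the single-level undo are assumptions for illustration, not TitanX's implementation.

```python
from typing import Callable, List, Optional


class Conversation:
    def __init__(self, turns: List[str]) -> None:
        self.turns = list(turns)      # stored history, never modified by compression
        self.context = list(turns)    # what is sent to the model on the next turn
        self._previous_context: Optional[List[str]] = None

    def compact(self, summarize: Callable[[List[str]], str], keep_last_n: int = 5) -> bool:
        """Force a compression pass; returns False if there is nothing old enough to compress."""
        split = len(self.context) - keep_last_n
        if split <= 0:
            return False
        self._previous_context = self.context
        self.context = [summarize(self.context[:split])] + self.context[split:]
        return True

    def restore(self) -> bool:
        """Undo the most recent compression pass, if there is one."""
        if self._previous_context is None:
            return False
        self.context = self._previous_context
        self._previous_context = None
        return True
```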


Cost of compression itself

Caveman's compression turn costs tokens too — it's one additional LLM call per trigger. Typical cost:

  • Compression call: ~2000-4000 tokens input, ~500 tokens output
  • Savings per subsequent turn: full history − compressed summary (total savings = that difference × remaining turns)

Break-even is usually around turn 15-20 for an 8000-threshold config. Beyond that, net savings grow.
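The break-even arithmetic can be sketched as follows. It counts turns after a compression pass rather than from the start of the conversation, and it assumes input and output tokens are priced the same, which is a simplification since providers usually price them differently. All names are hypothetical.

```python
import math
from typing import Optional


def turns_to_break_even(compression_input: int, compression_output: int,
                        old_history_tokens: int, summary_tokens: int) -> Optional[int]:
    """Turns after a compression pass needed for the savings to cover the pass itself."""
    cost = compression_input + compression_output
    saving_per_turn = old_history_tokens - summary_tokens
    if saving_per_turn <= 0:
        # The summary is no smaller than what it replaced, so the pass never pays off.
        return None
    return math.ceil(cost / saving_per_turn)


print(turns_to_break_even(4000, 500, 4000, 500))  # 2
```

In this example the pass costs 4500 tokens and saves 3500 tokens per later turn, so it pays for itself on the second turn after compression.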


Compression caveats

Information loss is real. The summary is lossy by design. If you need perfect recall of turn-2's exact wording on turn-100, don't use Caveman.

Mitigations:

  • Pin critical messages (pinned content survives compression)
  • Use /compact manually at logical checkpoints (control when it triggers)
  • Keep a higher keep-last-N
  • Turn it off for high-fidelity work

Observability

Observability → Caveman Panel shows:

  • Total compressions performed
  • Tokens saved (estimated)
  • Cost saved (estimated)
  • Per-conversation compression history

If Caveman is off, the panel shows what you'd save if you enabled it (based on retroactive analysis of your conversation lengths).


Implementation detail: conversation-type support

Caveman support currently varies by conversation type:

  • claude (API) ✅
  • gemini (API) ✅
  • acp with claude/gemini/codex backends ✅
  • openclaw-gateway
  • farm — 🚧 v2.6 (farm conversations use the slave's caveman config, not master's)
  • nanobot — 🚧 not yet

If you turn Caveman on for an unsupported conversation type, the setting is stored but has no effect.

