Models don't pay equal attention to all parts of the context:
┌─────────────────────────────────────┐
│ HIGH ATTENTION — Start of context   │ ← System prompt, role, critical rules
│                                     │
│ LOWER ATTENTION — Middle            │ ← Supporting context, examples
│                                     │
│ HIGH ATTENTION — End of context     │ ← Current request, recent messages
└─────────────────────────────────────┘
Implication: Place the most important information at the START and END. Put supporting details in the middle.
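The placement rule above can be sketched as a small assembly function. This is a minimal illustration, not a prescribed API: `assemble_context` and its parameter names are hypothetical, and it assumes the context is built as plain text sections joined by blank lines.

```python
def assemble_context(system_prompt, supporting, current_request):
    """Order context so critical material sits at the start and the end.

    Hypothetical helper for illustration: `supporting` is a list of strings
    such as examples or retrieved documents.
    """
    parts = [system_prompt]        # start: role and critical rules
    parts.extend(supporting)       # middle: supporting context, examples
    parts.append(current_request)  # end: the request the model must answer
    # Skip empty sections so they do not leave stray blank blocks.
    return "\n\n".join(p.strip() for p in parts if p and p.strip())


context = assemble_context(
    "You are a billing assistant. Never reveal card numbers.",
    ["Example 1: ...", "Refund policy excerpt: ..."],
    "Can I get a refund for last month's invoice?",
)
```

Keeping the ordering in one function makes it harder for a later change to push the current request into the middle by accident.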
Before building a workflow, calculate your context budget:
| Model | Window Size | Practical Budget |
|---|---|---|
| 8K models | 8,192 tokens | ~5,500 usable |
| 32K models | 32,768 tokens | ~22,000 usable |
| 128K models | 131,072 tokens | ~90,000 usable |
| 200K+ models | 200,000+ tokens | ~140,000 usable |
Why "practical budget"? Because you need to reserve:
- 20-30% for the model's response
- A safety margin for tokenization variance
- Space for tool call results (which are unpredictable in size)
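The reservations listed above can be turned into a rough calculation. The sketch below assumes a 25% response reserve (the midpoint of the 20-30% range) and a 5% safety margin; the 5% figure, the function name, and the `tool_reserve_tokens` parameter are illustrative assumptions, not values from the table.

```python
def practical_budget(window_tokens, response_pct=25, safety_pct=5,
                     tool_reserve_tokens=0):
    """Estimate how many tokens are left for input context.

    Percentages are integers to avoid floating-point rounding surprises.
    All defaults are assumptions chosen for illustration.
    """
    reserved = window_tokens * (response_pct + safety_pct) // 100
    reserved += tool_reserve_tokens  # flat reserve for tool call results
    return max(window_tokens - reserved, 0)


print(practical_budget(8_192))    # 5735
print(practical_budget(200_000))  # 140000
```

With these defaults the result lands close to the "usable" column above; a workflow that makes many tool calls would pass a larger `tool_reserve_tokens` and get a correspondingly smaller budget.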
Prioritize what goes into context:
- Always include: System prompt, output schema, current request
- Include if relevant: Retrieved documents (RAG), recent conversation turns
- Summarize: Long histories, large documents, previous tool results
- Never include: Entire codebases, full databases, raw log files
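One way to make these four tiers enforceable is a lookup table keyed by content kind. The sketch below is an assumption-laden illustration: the kind names, the `Policy` enum, `plan_context`, and the `is_relevant` callback are all hypothetical, and unknown kinds default to "include if relevant".

```python
from enum import Enum


class Policy(Enum):
    ALWAYS = "always include"
    IF_RELEVANT = "include if relevant"
    SUMMARIZE = "summarize"
    NEVER = "never include"


# Hypothetical kind names mirroring the four tiers in the list above.
POLICY_BY_KIND = {
    "system_prompt": Policy.ALWAYS,
    "output_schema": Policy.ALWAYS,
    "current_request": Policy.ALWAYS,
    "retrieved_document": Policy.IF_RELEVANT,
    "recent_turn": Policy.IF_RELEVANT,
    "long_history": Policy.SUMMARIZE,
    "large_document": Policy.SUMMARIZE,
    "tool_result": Policy.SUMMARIZE,
    "codebase": Policy.NEVER,
    "database": Policy.NEVER,
    "raw_log": Policy.NEVER,
}


def plan_context(items, is_relevant):
    """Return (action, kind) pairs for a list of (kind, text) items."""
    plan = []
    for kind, text in items:
        policy = POLICY_BY_KIND.get(kind, Policy.IF_RELEVANT)
        if policy is Policy.NEVER:
            continue  # never enters the context
        if policy is Policy.IF_RELEVANT and not is_relevant(text):
            continue  # relevance check failed
        action = "summarize" if policy is Policy.SUMMARIZE else "include"
        plan.append((action, kind))
    return plan
```

The function only decides what to do with each item; the actual summarization and relevance scoring are left to whatever the workflow already uses.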
For multi-turn conversations:
Turn 1: Full context
Turn 2: System prompt + Turn 1 summary + Turn 2 input
Turn 3: System prompt + Turns 1-2 summary + Turn 3 input
...
Summarize aggressively. Keep the system prompt and current turn at full fidelity. Everything else is summary.
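The per-turn layout above might look like the following in code. This is a sketch under stated assumptions: `summarize` is a placeholder that merely joins and clips text (a real workflow would presumably call a model), and both function names are hypothetical.

```python
def summarize(turns, max_chars=200):
    """Placeholder summarizer: joins turns and clips the result.

    Assumption for illustration only; replace with a model call.
    """
    joined = " | ".join(turns)
    if len(joined) <= max_chars:
        return joined
    return joined[:max_chars - 3] + "..."


def build_turn_context(system_prompt, previous_turns, current_input):
    """System prompt and current input at full fidelity, the rest summarized."""
    parts = [system_prompt]
    if previous_turns:
        parts.append("Summary of earlier turns: " + summarize(previous_turns))
    parts.append(current_input)
    return parts


# Turn 1 has no history, so it carries no summary section.
turn_1 = build_turn_context("SYS", [], "first question")
turn_3 = build_turn_context("SYS", ["first question", "second question"],
                            "third question")
```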
Explicit state: Pass workflow state as structured data:
{
"workflow_id": "abc123",
"current_step": 3,
"completed_steps": [1, 2],
"accumulated_results": { ... },
"remaining_budget": { "tokens": 50000, "api_calls": 10 }
}
Checkpoint state: Save state at decision points so workflows can resume after failure. Every tool call, external API call, or model call is a potential failure point. Save state BEFORE, not after.
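A minimal sketch of "save BEFORE, not after", assuming state is a JSON-serializable dict shaped like the example above and checkpoints are stored as local files. The function names and the file-based storage are assumptions for illustration; a production workflow might use a database instead.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to a temp file, then rename, so a crash cannot
    leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic swap on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run_step(state, step_fn, path):
    """Checkpoint first, then run the step that might fail."""
    save_checkpoint(state, path)  # saved BEFORE the risky call
    result = step_fn(state)       # tool call, API call, or model call
    state["accumulated_results"][str(state["current_step"])] = result
    state["completed_steps"].append(state["current_step"])
    state["current_step"] += 1
    return state
```

If `step_fn` raises, the file on disk still describes the step that was about to run, so a restart can call `load_checkpoint` and retry from there.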
| Pattern | Use When | Implementation |
|---|---|---|
| Full history | Short conversations (<10 turns) | Pass all turns |
| Sliding window | Medium conversations | Keep last N turns |
| Summary + recent | Long conversations | Summarize old, keep recent 3-5 |
| Episodic memory | Complex workflows | Key decisions + outcomes stored |
| Semantic memory | Knowledge-heavy tasks | Vector DB for retrieval |
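The two middle rows of the table, sliding window and summary + recent, are simple enough to sketch directly. The function names are hypothetical, and the default summarizer is a stub that only reports how many turns it replaced; a real one would presumably call a model.

```python
def sliding_window(turns, n):
    """Keep only the last n turns."""
    return list(turns[-n:]) if n > 0 else []


def summary_plus_recent(turns, keep_recent=3, summarize=None):
    """Replace older turns with one summary entry, keep recent turns verbatim."""
    if summarize is None:
        # Stub summarizer, an assumption for illustration.
        summarize = lambda old: f"[summary of {len(old)} earlier turns]"
    split = max(len(turns) - keep_recent, 0)
    old, recent = turns[:split], turns[split:]
    if not old:
        return list(recent)  # short conversation: nothing to summarize
    return [summarize(old)] + list(recent)


turns = [f"turn {i}" for i in range(1, 13)]
print(sliding_window(turns, 3))
# ['turn 10', 'turn 11', 'turn 12']
print(summary_plus_recent(turns, keep_recent=3))
# ['[summary of 9 earlier turns]', 'turn 10', 'turn 11', 'turn 12']
```

The sliding window silently forgets everything older than N turns, while summary + recent keeps a compressed trace of it, which is why the table reserves the latter for long conversations.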
- The data dump: Stuffing the entire codebase into context. Use targeted retrieval.
- The infinite history: Passing all 200 conversation turns. Summarize old turns.
- The hopeful truncation: Truncating context at a fixed token count without considering what gets cut. Truncation should be semantic, not positional.
- The empty middle: Putting everything at the start and end with nothing in between. The middle matters for supporting context.
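To illustrate the difference between semantic and positional truncation, the sketch below drops whole low-importance chunks instead of cutting at a fixed offset. It assumes each chunk already carries an importance score and uses a whitespace word count as a crude stand-in for a real tokenizer; the function name and the scoring scheme are hypothetical.

```python
def truncate_semantically(chunks, budget,
                          count_tokens=lambda text: len(text.split())):
    """Keep the most important chunks that fit, in their original order.

    chunks: list of (text, importance) pairs; higher importance wins.
    count_tokens: crude word count by default, an assumption; swap in
    the model's real tokenizer.
    """
    # Visit chunks from most to least important (ties keep original order).
    by_importance = sorted(range(len(chunks)), key=lambda i: -chunks[i][1])
    kept, used = set(), 0
    for i in by_importance:
        cost = count_tokens(chunks[i][0])
        if used + cost <= budget:
            kept.add(i)
            used += cost
    # Emit survivors in document order so the context still reads coherently.
    return [chunks[i][0] for i in sorted(kept)]


chunks = [
    ("rule one two", 3),     # critical rule at the start
    ("chatter a b c d", 1),  # low-value middle content
    ("final ask", 3),        # the current request at the end
]
print(truncate_semantically(chunks, budget=5))
# ['rule one two', 'final ask']
```

A positional cut at five tokens on the same input would keep the rule plus part of the chatter and lose the final request entirely.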