Add code compaction and session memory cookbooks #343

Merged
jsham042 merged 14 commits into main from feature/code-compaction-cookbook
Jan 30, 2026

@jsham042 jsham042 commented Jan 8, 2026

Contributor

Added a cookbook that demonstrates compacting long-context, multi-turn AI loops into session memory. It shows how to:

  • Write effective session memory prompts that preserve critical context across compaction events
  • Implement instant compaction using background threading to eliminate user wait time
  • Apply prompt caching to reduce the cost of background memory updates by ~80%
  • Choose appropriate compaction strategies (traditional vs. instant) based on your use case
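
For readers skimming the thread, the background-threading idea can be sketched roughly as follows. This is a minimal illustration and not the notebook's code: `BackgroundMemory`, `fake_summarize`, and the method names are hypothetical, and the summarizer is a local stub standing in for a real Claude call.

```python
import threading
import time


class BackgroundMemory:
    """Hypothetical helper: keeps a running summary refreshed by a worker thread."""

    def __init__(self, summarize_fn):
        # summarize_fn is an assumed callable: (new_messages, old_summary) -> str
        self._summarize_fn = summarize_fn
        self._lock = threading.Lock()
        self._worker = None
        self.summary = ""
        self.summarized_count = 0

    def update_async(self, messages):
        """Start a background refresh unless one is already running."""
        if self._worker is not None and self._worker.is_alive():
            return False
        snapshot = list(messages)  # copy so the worker sees a stable view
        self._worker = threading.Thread(target=self._run, args=(snapshot,), daemon=True)
        self._worker.start()
        return True

    def _run(self, snapshot):
        with self._lock:
            start, old_summary = self.summarized_count, self.summary
        new_summary = self._summarize_fn(snapshot[start:], old_summary)
        with self._lock:
            self.summary = new_summary
            self.summarized_count = len(snapshot)

    def wait(self):
        """Block until any in-flight refresh finishes (used here only for the demo)."""
        if self._worker is not None:
            self._worker.join()

    def compact(self, messages):
        """Instant swap: stored summary plus the messages not yet summarized."""
        with self._lock:
            summary, count = self.summary, self.summarized_count
        head = {"role": "user", "content": f"[Previous conversation summary]\n{summary}"}
        return [head] + list(messages[count:])


def fake_summarize(new_messages, old_summary):
    """Stub standing in for an API call; sleeps briefly to mimic latency."""
    time.sleep(0.05)
    return (old_summary + f" +{len(new_messages)} msgs").strip()


memory = BackgroundMemory(fake_summarize)
history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
]
memory.update_async(history)                       # summarization runs off the main path
history.append({"role": "user", "content": "q2"})  # the conversation keeps going
memory.wait()
compacted = memory.compact(history)                # no summarization call at this point
```

The point of the sketch is that `compact` only reads state that the worker already produced, so the user-facing path never waits on a summarization call. Messages that arrived after the snapshot (`q2` here) are carried over verbatim.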

github-actions Bot commented Jan 8, 2026

Notebook Changes

This PR modifies the following notebooks:

📓 misc/compaction_cookbook.ipynb

nbdiff /dev/null misc/compaction_cookbook.ipynb (0b8873ee05cd8fc821dca1947fce555d9e318210)
--- /dev/null  2026-01-08 06:16:25.096419
+++ misc/compaction_cookbook.ipynb (0b8873ee05cd8fc821dca1947fce555d9e318210)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Compaction Cookbook: Incremental Session Memory Strategy
+      
+      This notebook demonstrates an efficient compaction strategy that uses **incremental background summarization** instead of summarizing everything at compaction time.
+      
+      ## The Problem
+      
+      Traditional compaction summarizes the entire conversation when context gets full. This is:
+      - **Slow**: Requires a blocking API call at the moment the user is waiting
+      - **Disruptive**: The user experiences latency at the worst possible time
+      
+      ## The Solution: Session Memory
+      
+      Instead, we maintain a **running summary** that updates incrementally in the background:
+      1. Periodically summarize new messages into a "session memory"
+      2. Track which messages have been summarized
+      3. At compaction time, just use the pre-computed summary + unsummarized messages
+      
+      **Key benefit**: The compaction itself is instant - no API call needed when context is full.
+      
+      **Trade-off**: This adds overhead from periodic summarization calls, so it doesn't reduce total API cost. The value is in eliminating user-facing latency.
+      
+      ## Setup
+  code cell:
+    source:
+      import anthropic
+      from anthropic.types import MessageParam
+      
+      client = anthropic.Anthropic()
+      MODEL = "claude-sonnet-4-5-20250929"
+  markdown cell:
+    source:
+      ## Session Memory Manager
+      
+      This class manages the incremental summarization strategy:
+  code cell:
+    source:
+      from dataclasses import dataclass, field
+      
+      
+      @dataclass
+      class SessionMemory:
+          """Manages incremental session summarization for fast compaction."""
+          min_tokens_to_init: int = 1000  # Tokens before first summarization
+          min_tokens_between_updates: int = 500  # Tokens between updates
+          summary: str = ""
+          last_summarized_count: int = 0  # Message count at last summarization
+          tokens_at_last_update: int = 0
+          current_tokens: int = 0
+      
+          def update_tokens(self, tokens: int):
+              """Update current token count (call after each API response)."""
+              self.current_tokens = tokens
+      
+          def should_summarize(self) -> bool:
+              """Check if we should run background summarization."""
+              if self.current_tokens < self.min_tokens_to_init:
+                  return False
+      
+              tokens_since = self.current_tokens - self.tokens_at_last_update
+              return tokens_since >= self.min_tokens_between_updates
+       
+          def compact_conversation(self, messages: list[MessageParam], summarize_fn):
+              """Incrementally summarize new messages in the background."""
+              new_messages = messages[self.last_summarized_count :]
+              if not new_messages:
+                  return self.summary  # Nothing new to summarize: keep the existing summary
+      
+              self.summary = summarize_fn(new_messages, self.summary)
+              self.last_summarized_count = len(messages)
+              self.tokens_at_last_update = self.current_tokens
+      
+              print(f"  [Background] Summarized {len(new_messages)} messages at {self.current_tokens} tokens")
+      
+              return self.summary
+  markdown cell:
+    source:
+      ## Summarization Function
+      
+      The summarization function calls Claude to extract key information:
+  code cell:
+    source:
+      def summarize_messages(messages: list[MessageParam], existing_memory: str) -> str:
+          """Use Claude to incrementally summarize conversation messages."""
+          conversation_text = "\n".join(f"{msg['role'].upper()}: {msg['content']}" for msg in messages)
+      
+          if existing_memory:
+              prompt = f"""Update this session memory with new conversation turns.
+      
+      <existing_summary>
+      {existing_memory}
+      </existing_summary>
+      
+      <new_messages>
+      {conversation_text}
+      </new_messages>
+      
+      Return only the updated summary."""
+          else:
+              prompt = f"""Summarize this conversation concisely.
+      
+      <messages>
+      {conversation_text}
+      </messages>
+      
+      Capture: topics discussed, key decisions, important context. Return only the summary."""
+      
+          response = client.messages.create(
+              model=MODEL,
+              max_tokens=500,
+              messages=[{"role": "user", "content": prompt}],
+          )
+      
+          return response.content[0].text
+  markdown cell:
+    source:
+      ## Real Conversation Loop with Session Memory
+      
+      Now let's run a real conversation with Claude while demonstrating background summarization:
+  code cell:
+    source:
+      # Create session memory with low thresholds for demo
+      session_memory = SessionMemory(
+          min_tokens_to_init=200, 
+          min_tokens_between_updates=500
+      )
+      
+      SYSTEM_PROMPT = """You are a helpful coding assistant. Keep responses concise but informative."""
+      
+      user_questions = [
+          "What are Python decorators and why are they useful?",
+          "Show me a simple decorator example that logs function calls.",
+          "How do I create a decorator that accepts arguments?",
+          "Now explain Python's async/await syntax briefly.",
+          "What's the difference between asyncio.gather and asyncio.wait?",
+      ]
+      
+      print("=" * 60)
+      print("CONVERSATION WITH BACKGROUND SUMMARIZATION")
+      print("=" * 60)
+      
+      messages: list[MessageParam] = []
+      for i, question in enumerate(user_questions, 1):
+          print(f"\n[Turn {i}] USER: {question}")
+          print("-" * 40)
+      
+          messages.append({"role": "user", "content": question})
+      
+          response = client.messages.create(
+              model=MODEL,
+              max_tokens=1024,
+              system=SYSTEM_PROMPT,
+              messages=messages,
+          )
+      
+          assistant_msg: MessageParam = {"role": "assistant", "content": response.content[0].text}
+          messages.append(assistant_msg)
+          session_memory.update_tokens(response.usage.input_tokens + response.usage.output_tokens) # Update token count with each response
+      
+          print(f"ASSISTANT RESPONSE: {response.content[0].text}")
+         
+          if not session_memory.should_summarize():
+              print(f"\nConversation at {session_memory.current_tokens} tokens, no summarization needed yet.")
+              continue
+      
+          print(f"\nConversation at {session_memory.current_tokens} tokens, running background summarization...")
+          # Summarize the new messages once; this also updates session_memory.summary
+          summary = session_memory.compact_conversation(messages, summarize_messages)
+      
+          print("\n" + "=" * 60)
+          print("SESSION MEMORY SUMMARY")
+          print("=" * 60)
+          print(summary)
+      
+          # Keep only the summary for future turns, and reset the counter so the next
+          # summarization starts from the messages added after this point
+          messages = [{"role": "user", "content": f"You have been chatting with the user already. Here is the summary of the conversation so far:\n{summary}"}]
+          session_memory.last_summarized_count = len(messages)
+  code cell:
+    source:
+      print("=" * 60)
+      print("COMPACTION DEMONSTRATION")
+      print("=" * 60)
+      
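+      print(f"\nBefore compaction:")
+      print(f"  Total messages: {len(messages)}")
+      
+      # Compaction needs no API call: reuse the stored summary and keep only the
+      # messages added after the last summarization
+      summary = session_memory.summary
+      kept_messages = messages[session_memory.last_summarized_count :]
+      
+      print(f"\nAfter compaction:")
+      print(f"  Messages kept (unsummarized): {len(kept_messages)}")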
+      for msg in kept_messages:
+          content = msg["content"]
+          preview = content[:50] + "..." if len(content) > 50 else content
+          print(f"    - {msg['role']}: {preview}")
+      
+      print(f"\nPre-computed summary:")
+      print("-" * 40)
+      print(summary)
+      print("-" * 40)
+      print("\n(Compaction was instant - no API call needed!)")
+  markdown cell:
+    source:
+      ## Key Benefits
+      
+      1. **Instant compaction**: No API call needed at compaction time - summary already exists
+      2. **Non-blocking**: Background summarization doesn't interrupt the user
+      3. **No lost context**: Messages after the last summarization are preserved verbatim
+      4. **Configurable thresholds**: Control when summarization happens based on token count
+      
+      ## Production Considerations
+      
+      In a real implementation (like Claude Code's `sessionMemory.ts`):
+      
+      - **Background execution**: Summarization runs in a forked process to not block the main conversation
+      - **Tool call awareness**: Don't summarize mid-tool-use to avoid orphaned tool results  
+      - **File persistence**: Session memory is saved to disk (`.claude/session-memory.md`)
+      - **Threshold tuning**: Default is 10K tokens to init, 5K between updates
+  markdown cell:
+    source:
+      ## Continuing After Compaction
+      
+      After compaction, we can continue the conversation using the summary as context:
+  code cell:
+    source:
+      # Build the compacted message history
+      compacted_messages: list[MessageParam] = [
+          {"role": "user", "content": f"[Previous conversation summary]\n{summary}\n\n[Continuing conversation]"},
+          {"role": "assistant", "content": "I understand. I have context from our previous discussion. How can I help?"},
+      ]
+      compacted_messages.extend(kept_messages)
+      
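+      def chat(user_message: str, history: list[MessageParam]) -> tuple[MessageParam, int]:
+          """Append the user turn, call Claude, append the reply; return (reply, input tokens)."""
+          history.append({"role": "user", "content": user_message})
+          result = client.messages.create(
+              model=MODEL,
+              max_tokens=1024,
+              system=SYSTEM_PROMPT,
+              messages=history,
+          )
+          reply: MessageParam = {"role": "assistant", "content": result.content[0].text}
+          history.append(reply)
+          return reply, result.usage.input_tokens
+      
+      
+      # Continue the conversation
+      print("=" * 60)
+      print("CONTINUING CONVERSATION AFTER COMPACTION")
+      print("=" * 60)
+      
+      follow_up = "Based on what we discussed, how would I combine a decorator with an async function?"
+      print(f"\nUSER: {follow_up}")
+      print("-" * 40)
+      
+      response, tokens = chat(follow_up, compacted_messages)
+      print(f"ASSISTANT: {response['content']}")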
+      
+      print(f"\n[Context: {len(compacted_messages)} messages, {tokens} tokens instead of {session_memory.current_tokens}]")
+  markdown cell:
+    source:
+      ## Summary
+      
+      This notebook demonstrated the **incremental session memory** pattern for efficient context compaction:
+      
+      | Approach | At Compaction Time | Cost Distribution |
+      |----------|-------------------|-------------------|
+      | **Traditional** | Summarize all messages (slow) | All cost at once |
+      | **Session Memory** | Use pre-computed summary (instant) | Cost spread over time |
+      
+      The key insight is that **summarization work can be done incrementally in the background**, making the actual compaction operation nearly instant. This pattern is particularly valuable for long-running conversations where context management is critical.
+  markdown cell:
+    source:
+      ## Evaluation: Response Time with Full vs Compacted Context
+      
+      Let's measure how much faster follow-up responses are when using the compacted context vs the full conversation history.
+  code cell:
+    source:
+      import time
+      
+      
+      def timed_chat(user_message: str, messages: list[MessageParam]) -> tuple[str, float, int]:
+          """Send a message and return response, elapsed time, and input tokens."""
+          start_time = time.perf_counter()
+      
+          response = client.messages.create(
+              model=MODEL,
+              max_tokens=1024,
+              system=SYSTEM_PROMPT,
+              messages=messages + [{"role": "user", "content": user_message}],
+          )
+      
+          elapsed = time.perf_counter() - start_time
+          return response.content[0].text, elapsed, response.usage.input_tokens
+      
+      
+      # Build fresh message lists for fair comparison (before any follow-ups)
+      
+      # Full context: original conversation messages
+      full_context_messages = messages.copy()
+      
+      # Compacted context: summary + unsummarized messages
+      compacted_context_messages: list[MessageParam] = [
+          {"role": "user", "content": f"[Previous conversation summary]\n{summary}\n\n[Continuing conversation]"},
+          {"role": "assistant", "content": "I understand. I have context from our previous discussion. How can I help?"},
+      ]
+      compacted_context_messages.extend(kept_messages)
+      
+      # The follow-up question to test
+      follow_up_question = "Can you give me a quick example combining decorators with async?"
+      
+      print("=" * 60)
+      print("RESPONSE TIME COMPARISON: FULL vs COMPACTED CONTEXT")
+      print("=" * 60)
+      print(f"\nQuestion: {follow_up_question}")
+      
+      # Test 1: Full context
+      print("\n" + "-" * 60)
+      print("[1] FULL CONTEXT (original conversation)")
+      print("-" * 60)
+      print(f"Messages: {len(full_context_messages)} | ", end="")
+      full_response, full_time, full_tokens = timed_chat(follow_up_question, full_context_messages)
+      print(f"Input tokens: {full_tokens} | Response time: {full_time:.2f}s")
+      print(f"\nAnswer:\n{full_response}")
+      
+      # Test 2: Compacted context
+      print("\n" + "-" * 60)
+      print("[2] COMPACTED CONTEXT (session memory)")
+      print("-" * 60)
+      print(f"Messages: {len(compacted_context_messages)} | ", end="")
+      compact_response, compact_time, compact_tokens = timed_chat(
+          follow_up_question, compacted_context_messages
+      )
+      print(f"Input tokens: {compact_tokens} | Response time: {compact_time:.2f}s")
+      print(f"\nAnswer:\n{compact_response}")
+      
+      # Results
+      print("\n" + "=" * 60)
+      print("COMPARISON")
+      print("=" * 60)
+      
+      token_reduction = full_tokens - compact_tokens
+      token_reduction_pct = (token_reduction / full_tokens) * 100
+      time_saved = full_time - compact_time
+      time_saved_pct = (time_saved / full_time) * 100 if full_time > 0 else 0
+      
+      print(f"\nToken reduction: {token_reduction:,} tokens ({token_reduction_pct:.1f}% smaller)")
+      print(f"Time saved: {time_saved:.2f}s ({time_saved_pct:.1f}% faster)")
+      print(f"\nFull context:     {full_tokens:,} tokens → {full_time:.2f}s")
+      print(f"Compacted context: {compact_tokens:,} tokens → {compact_time:.2f}s")
+  markdown cell:
+    source:
+      ### Evaluation Results
+      
+      The comparison shows the response time benefit after compaction:
+      
+      | Metric | Full Context | Compacted Context |
+      |--------|--------------|-------------------|
+      | **Input Tokens** | All messages | Summary + recent |
+      | **Response Time** | Baseline | Faster |
+      
+      **Why compaction speeds up responses:**
+      - Fewer input tokens = faster time-to-first-token
+      - Smaller context = lower cost *per subsequent turn*
+      
+      **Important trade-off**: Session memory does NOT reduce total API cost. It adds overhead from periodic background summarization calls. The benefits are:
+      
+      1. **No compaction latency** - user never waits for a summarization call
+      2. **Faster subsequent responses** - smaller context after compaction
+      3. **Lower cost per turn** - after compaction, each API call is cheaper
+      
+      The value is in **user experience** (no blocking) and **per-turn efficiency** after compaction, not total cost reduction.
+  code cell:
+    source:
+      # Cost evaluation
+      
+      # Pricing (per million tokens) - Sonnet 4.5
+      INPUT_COST_PER_M = 3.00
+      OUTPUT_COST_PER_M = 15.00
+      
+      
+      def estimate_cost(input_tokens: int, output_tokens: int) -> float:
+          """Estimate cost in dollars."""
+          return (input_tokens * INPUT_COST_PER_M + output_tokens * OUTPUT_COST_PER_M) / 1_000_000
+      
+      
+      print("=" * 60)
+      print("COST COMPARISON: SESSION MEMORY vs TRADITIONAL")
+      print("=" * 60)
+      
+      # Session Memory: background summarization overhead + compacted follow-up
+      print("\n[Session Memory]")
+      bg_summarize_calls = 2
+      bg_input = bg_summarize_calls * 800
+      bg_output = bg_summarize_calls * 200
+      print(f"  Background summaries: ~{bg_input:,} input, ~{bg_output:,} output")
+      print(f"  Follow-up:            {compact_tokens:,} input")
+      
+      sm_input = bg_input + compact_tokens
+      sm_output = bg_output + len(compact_response) // 4
+      sm_cost = estimate_cost(sm_input, sm_output)
+      print(f"  Total: ${sm_cost:.4f}")
+      
+      # Traditional: compaction + full-context follow-up
+      print("\n[Traditional]")
+      print(f"  Compaction:  {full_tokens:,} input")
+      print(f"  Follow-up:   {full_tokens:,} input")
+      
+      trad_input = full_tokens * 2
+      trad_output = 300 + len(full_response) // 4
+      trad_cost = estimate_cost(trad_input, trad_output)
+      print(f"  Total: ${trad_cost:.4f}")
+      
+      # Result
+      print("\n" + "-" * 60)
+      cost_diff = sm_cost - trad_cost
+      print(f"Difference: ${cost_diff:+.4f}")
+      if cost_diff > 0:
+          print("→ Session memory costs more, but eliminates compaction latency")
+      else:
+          print("→ Session memory costs less due to smaller follow-up context")
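
The PR description mentions that prompt caching can cut the cost of background memory updates by roughly 80%. The arithmetic behind a figure like that can be sketched as below. The base price comes from the cost cell above; the cache multipliers, the token counts, and the helper name `update_input_cost` are assumptions for illustration and should be checked against current pricing.

```python
# Base input price per million tokens, taken from the cost cell above (Sonnet 4.5)
INPUT_COST_PER_M = 3.00

# Assumed multipliers relative to the base input price; verify against current pricing
CACHE_WRITE_MULTIPLIER = 1.25  # tokens written to the cache
CACHE_READ_MULTIPLIER = 0.10   # tokens read back from the cache


def update_input_cost(prefix_tokens: int, new_tokens: int, cached: bool) -> float:
    """Estimate the input cost (in dollars) of one background memory update.

    prefix_tokens: conversation tokens already seen by a previous call
    new_tokens:    tokens added since the previous call
    """
    if not cached:
        billable = (prefix_tokens + new_tokens) * INPUT_COST_PER_M
    else:
        # The shared prefix is read from the cache; only the new suffix is written
        billable = (
            prefix_tokens * INPUT_COST_PER_M * CACHE_READ_MULTIPLIER
            + new_tokens * INPUT_COST_PER_M * CACHE_WRITE_MULTIPLIER
        )
    return billable / 1_000_000


# Hypothetical update: 10,000 tokens of shared prefix plus 1,000 new tokens
uncached = update_input_cost(10_000, 1_000, cached=False)
cached = update_input_cost(10_000, 1_000, cached=True)
savings = 1 - cached / uncached

print(f"Uncached: ${uncached:.5f} | Cached: ${cached:.5f} | Savings: {savings:.1%}")
```

Under these assumptions the saving depends mostly on how much of each update's input is a prefix shared with the previous call; a short prefix or a cold cache would give a much smaller reduction.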

📓 misc/session_memory_compaction.ipynb

nbdiff /dev/null misc/session_memory_compaction.ipynb (0b8873ee05cd8fc821dca1947fce555d9e318210)
--- /dev/null  2026-01-08 06:16:25.096419
+++ misc/session_memory_compaction.ipynb (0b8873ee05cd8fc821dca1947fce555d9e318210)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Instant Compaction with Session Memory
+      
+      Traditional compaction is slow: when you hit the context limit, you wait for a summary.
+      
+      With **instant compaction**, the session memory is generated proactively once a soft token threshold is reached. When the user triggers a compaction or a hard limit is reached, the summary is already available, so the user doesn't need to wait.
+      
+      Result: Instant compaction, no waiting.
+  markdown cell:
+    source:
+      
+      ```
+      TRADITIONAL COMPACTION (slow)
+      ─────────────────────────────
+      Turn 1 → Turn 2 → Turn 3 → ... → Turn N → CONTEXT FULL!
+                                                   │
+                                                   ▼
+                                          ┌─────────────────┐
+                                          │ Generate summary│
+                                          │ ( USER WAITS !) │
+                                          └─────────────────┘
+                                                   │
+                                                   ▼
+                                               Continue
+      
+      
+      SESSION MEMORY COMPACTION (instant)
+      ────────────────────────────────────
+      Turn 1 → Turn 2 → ... → Turn K → Turn K+1 → ... → Turn N → ..  → CONTEXT FULL!
+                                  │                         │            │
+                      (soft threshold met:              (update          │
+                         10k tokens init)                trigger)        │
+                                  │                                      │
+                                  │                         │            │
+                                  ▼                         ▼            │
+                             ┌────────┐                ┌────────┐        │
+                             │ Update │                │ Update │        │
+                             │ memory │ (background)   │ memory │        │
+                             └────────┘                └────────┘        │
+                                  │                         │            │
+                                  ▼                         ▼            ▼
+                           📝 session-memory.md ──────────────────► INSTANT SWAP!
+                             (continuously updated)
+      ```
+      
+      **Update triggers:** The first summary is generated after the initial 10k tokens. Updates can be triggered after every subsequent turn, or periodically at natural breakpoints (e.g. every ~5k tokens or 3+ tool calls).
+  markdown cell:
+    source:
+      ## Fundamentals: writing a compaction prompt
+  markdown cell:
+    source:
+      Make sure you have a well-structured session memory prompt.
+      
+      Some best practices include:
+      - Use chain-of-thought before summarizing — analyze first, then output                                                                                         
+      - Enumerate exactly what to preserve: file paths, code snippets, errors, user corrections                                                                      
+      - Weight recency heavily — the end of the conversation is the active context                                                                                   
+      - Require verbatim quotes for next steps to prevent task drift                                                                                                 
+      - Use structured sections with token budgets per section                                                                                                       
+      - Include a "Current State" section that always reflects the moment of compaction
+      
+      Some pitfalls include:
+      - Vague prompts like "summarize this conversation" produce lossy output                                                                                        
+      - Treating all messages equally loses the active working context                                                                                               
+      - Paraphrasing next steps introduces subtle drift that compounds                                                                                               
+      - Omitting error history causes the model to retry failed approaches                                                                                           
+      - Dropping user corrections makes the model revert to old behaviors                                                                                            
+      - No token limits lets one section consume the entire summary                                                                                                  
+      - Summarizing for human readability instead of model continuity
+      - Having the agent try to compress tool call results here; these can be retrieved later if the agent needs them
+  code cell:
+    source:
+      SESSION_CREATION_PROMPT = """
+      <analysis-instructions>
+      Before generating your summary, analyze the transcript in <think>...</think> tags:
+      1. What did the user originally request? (Exact phrasing)
+      2. What actions succeeded? What failed and why?
+      3. Did the user correct or redirect the assistant at any point?
+      4. What was actively being worked on at the end?
+      5. What tasks remain incomplete or pending?
+      6. What specific details (IDs, paths, values, names) must survive compression?
+      </analysis-instructions>
+      
+      <summary-format>
+      ## User Intent
+      The user's original request and any refinements. Use direct quotes for key requirements.
+      If the user's goal evolved during the conversation, capture that progression.
+      
+      ## Completed Work
+      Actions successfully performed. Be specific:
+      - What was created, modified, or deleted
+      - Exact identifiers (file paths, record IDs, URLs, names)
+      - Specific values, configurations, or settings applied
+      
+      ## Errors & Corrections
+      - Problems encountered and how they were resolved
+      - Approaches that failed (so they aren't retried)
+      - User corrections: "don't do X", "actually I meant Y", "that's wrong because..."
+      Capture corrections verbatim—these represent learned preferences.
+      
+      ## Active Work
+      What was in progress when the session ended. Include:
+      - The specific task being performed
+      - Direct quotes showing exactly where work left off
+      - Any partial results or intermediate state
+      
+      ## Pending Tasks
+      Remaining items the user requested that haven't been started.
+      Distinguish between "explicitly requested" and "implied/assumed."
+      
+      ## Key References
+      Important details needed to continue:
+      - Identifiers: IDs, paths, URLs, names, keys
+      - Values: numbers, dates, configurations, credentials (redacted)
+      - Context: relevant background information, constraints, preferences
+      - Citations: sources referenced during the conversation
+      </summary-format>
+      
+      <preserve-rules>
+      Always preserve when present:
+      - Exact identifiers (IDs, paths, URLs, keys, names)
+      - Error messages verbatim
+      - User corrections and negative feedback
+      - Specific values, formulas, or configurations
+      - Technical constraints or requirements discovered
+      - The precise state of any in-progress work
+      </preserve-rules>
+      
+      <compression-rules>
+      - Weight recent messages more heavily—the end of the transcript is the active context
+      - Omit pleasantries, acknowledgments, and filler ("Sure!", "Great question")
+      - Omit system context that will be re-injected separately
+      - Keep each section under 500 words; condense older content to make room for recent
+      - If you must cut details, preserve: user corrections > errors > active work > completed work
+      </compression-rules>
+      """
+  markdown cell:
+    source:
+      ## Traditional compacting example
+      In traditional compaction, you generate one summary once the token threshold is reached.
+  code cell:
+    source:
+      # setup, we are using haiku for demo purposes
+      import anthropic
+      from dataclasses import dataclass, field
+      
+      client = anthropic.Anthropic()
+      MODEL = "claude-haiku-4-5-20251001"
+  code cell:
+    source:
+      import time
+      
+      class TraditionalCompactingChatSession:
+          """Traditional chat session with compaction after the fact."""
+          def __init__(self, context_limit: int = 700):
+              self.context_limit = context_limit
+              self.messages = []
+              self.current_tokens = 0
+              self.tokens_before_compaction = None  # Track for showing reduction
+              self.summary = None
+          
+          def compact(self):
+              prev_msg_count = len(self.messages)
+              self.tokens_before_compaction = self.current_tokens
+             
+              compaction_prompt = SESSION_CREATION_PROMPT + "\n\nTranscript:\n"
+              for msg in self.messages:
+                  role = "User" if msg["role"] == "user" else "Assistant"
+                  compaction_prompt += f"{role}: {msg['content']}\n"
+              
+              start_time = time.perf_counter()
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="You are a helpful assistant that summarizes conversations.",
+                  messages=[{"role": "user", "content": compaction_prompt}]
+              )
+              elapsed = time.perf_counter() - start_time
+              
+              # Generate new summary message
+              self.summary = response.content[0].text
+              self.messages = [{
+                  "role": "user",
+                  "content": f"""This session is being continued from a previous conversation. Here is the session memory: {self.summary}. Continue from where we left off."""
+              }]
+              print(f"\n{'=' * 60}")
+              curr_msg_count = len(self.messages)
+              print(f"🔄 Compaction messages: {prev_msg_count} → {curr_msg_count}")
+              print(f"⏱️  Compaction time: {elapsed:.2f}s (user waiting...)")
+          
+          def chat(self, user_message: str):
+              if self.current_tokens >= self.context_limit:
+                  print("\n🧹 Context limit exceeded, compacting session memory...")
+                  self.compact()
+              
+              self.messages.append({"role": "user", "content": user_message})
+              
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="You are a helpful coding assistant. Be concise but thorough.",
+                  messages=self.messages
+              )
+              
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+              
+              self.current_tokens = response.usage.input_tokens
+              
+              # Show token reduction if we just compacted
+              if self.tokens_before_compaction is not None:
+                  reduction = self.tokens_before_compaction - self.current_tokens
+                  pct = (reduction / self.tokens_before_compaction) * 100
+                  print(f"✅ Tokens reduced: {self.tokens_before_compaction:,} → {self.current_tokens:,} ({reduction:,} tokens saved, {pct:.0f}% reduction)")
+                  print(f"{'=' * 60}")
+                  self.tokens_before_compaction = None
+            
+              return assistant_message, response.usage
+    outputs:
+      output 0:
+        output_type: stream
+        name: stderr
+        text:
+          /root/.pyenv/versions/3.13.11/lib/python3.13/site-packages/coconut/compiler/util.py:457: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in staticmethod() if you want to preserve the old behavior
+            result = add_action(grammar, unpack).parseWithTabs().transformString(text)
+  markdown cell:
+    source:
+      ### Example use of traditional compaction
+  code cell:
+    source:
+      session = TraditionalCompactingChatSession()
+      
+      messages = [
+          "Explain Python decorators with a simple example.",
+          "Now show me a decorator that logs function arguments.",
+          "How do I make a decorator that accepts parameters?",
+      ]
+      
+      print("Starting conversation with traditional compacting chat session...\n")
+      
+      turn_count = 0
+       
+      for i, message in enumerate(messages, 1):
+          response, usage = session.chat(message)
+          turn_count += 1
+          print(
+              f"\n{'=' * 60}\n"
+              f"Turn {turn_count:2d}: Input={usage.input_tokens:7,} tokens | "
+              f"Output={usage.output_tokens:5,} tokens | "
+              f"Messages={len(session.messages):2d}"
+          )
+          print(f"\nUser: {message}\nAssistant: {response}\n{'-'*40}\n")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with traditional compacting chat session...
+          
+          
+          ============================================================
+          Turn  1: Input=     48 tokens | Output=  418 tokens | Messages= 2
+          
+          User: Explain Python decorators with a simple example.
+          Assistant: # Python Decorators Explained
+          
+          A **decorator** is a function that modifies or enhances another function or class without changing its source code. It wraps a function and executes code before and/or after the wrapped function runs.
+          
+          ## Simple Example
+          
+          ```python
+          def my_decorator(func):
+              def wrapper(*args, **kwargs):
+                  print("Something before the function")
+                  result = func(*args, **kwargs)
+                  print("Something after the function")
+                  return result
+              return wrapper
+          
+          @my_decorator
+          def say_hello(name):
+              print(f"Hello, {name}!")
+          
+          say_hello("Alice")
+          ```
+          
+          **Output:**
+          ```
+          Something before the function
+          Hello, Alice!
+          Something after the function
+          ```
+          
+          ## How It Works
+          
+          1. `my_decorator` takes a function as input
+          2. `wrapper` is a new function that:
+             - Runs code **before** calling the original function
+             - Calls the original function with `func(*args, **kwargs)`
+             - Runs code **after** the function completes
+          3. The `@my_decorator` syntax is shorthand for: `say_hello = my_decorator(say_hello)`
+          
+          ## Practical Use Case: Timing a Function
+          
+          ```python
+          import time
+          
+          def timer(func):
+              def wrapper(*args, **kwargs):
+                  start = time.time()
+                  result = func(*args, **kwargs)
+                  end = time.time()
+                  print(f"Executed in {end - start:.2f} seconds")
+                  return result
+              return wrapper
+          
+          @timer
+          def slow_function():
+              time.sleep(1)
+              print("Done!")
+          
+          slow_function()
+          ```
+          
+          Decorators are powerful for logging, authentication, timing, validation, and more—keeping your code DRY and organized.
+          ----------------------------------------
+          
+          
+          ============================================================
+          Turn  2: Input=    479 tokens | Output=  471 tokens | Messages= 4
+          
+          User: Now show me a decorator that logs function arguments.
+          Assistant: # Logging Arguments Decorator
+          
+          ```python
+          def log_arguments(func):
+              def wrapper(*args, **kwargs):
+                  print(f"Function: {func.__name__}")
+                  print(f"Args: {args}")
+                  print(f"Kwargs: {kwargs}")
+                  result = func(*args, **kwargs)
+                  print(f"Returned: {result}\n")
+                  return result
+              return wrapper
+          
+          @log_arguments
+          def add(a, b):
+              return a + b
+          
+          @log_arguments
+          def greet(name, greeting="Hello"):
+              return f"{greeting}, {name}!"
+          
+          add(5, 3)
+          greet("Bob")
+          greet("Alice", greeting="Hi")
+          ```
+          
+          **Output:**
+          ```
+          Function: add
+          Args: (5, 3)
+          Kwargs: {}
+          Returned: 8
+          
+          Function: greet
+          Args: ('Bob',)
+          Kwargs: {'greeting': 'Hello'}
+          Returned: Hello, Bob!
+          
+          Function: greet
+          Args: ('Alice',)
+          Kwargs: {'greeting': 'Hi'}
+          Returned: Hi, Alice!
+          ```
+          
+          ## Enhanced Version with functools
+          
+          ```python
+          from functools import wraps
+          
+          def log_arguments(func):
+              @wraps(func)  # Preserves original function metadata
+              def wrapper(*args, **kwargs):
+                  print(f"Calling {func.__name__}({args}, {kwargs})")
+                  result = func(*args, **kwargs)
+                  print(f"Result: {result}")
+                  return result
+              return wrapper
+          
+          @log_arguments
+          def multiply(x, y):
+              """Multiplies two numbers."""
+              return x * y
+          
+          multiply(4, 5)
+          print(multiply.__name__)  # Prints "multiply" (not "wrapper")
+          ```
+          
+          **Key Benefits:**
+          - `@wraps(func)` preserves the original function's name, docstring, and metadata
+          - Useful for debugging and understanding function calls
+          - Easy to apply to multiple functions
+          ----------------------------------------
+          
+          
+          ============================================================
+          Turn  3: Input=    963 tokens | Output=  549 tokens | Messages= 6
+          
+          User: How do I make a decorator that accepts parameters?
+          Assistant: # Parameterized Decorators
+          
+          To make a decorator that accepts parameters, you need **three levels of functions**: outer function (parameters), middle function (decorator), and inner function (wrapper).
+          
+          ## Simple Example
+          
+          ```python
+          def repeat(times):
+              def decorator(func):
+                  def wrapper(*args, **kwargs):
+                      results = []
+                      for _ in range(times):
+                          result = func(*args, **kwargs)
+                          results.append(result)
+                      return results
+                  return wrapper
+              return decorator
+          
+          @repeat(times=3)
+          def greet(name):
+              return f"Hello, {name}!"
+          
+          print(greet("Alice"))
+          ```
+          
+          **Output:**
+          ```
+          ['Hello, Alice!', 'Hello, Alice!', 'Hello, Alice!']
+          ```
+          
+          ## How It Works
+          
+          1. `repeat(times=3)` is called first → returns the `decorator` function
+          2. `decorator` is applied to `greet` → returns the `wrapper` function
+          3. When `greet("Alice")` is called → `wrapper` executes
+          
+          It's equivalent to:
+          ```python
+          greet = repeat(times=3)(greet)
+          ```
+          
+          ## More Practical Example: Rate Limiter
+          
+          ```python
+          from functools import wraps
+          import time
+          
+          def rate_limit(max_calls, time_window):
+              def decorator(func):
+                  last_called = [0]
+                  calls = [0]
+                  
+                  @wraps(func)
+                  def wrapper(*args, **kwargs):
+                      now = time.time()
+                      if now - last_called[0] > time_window:
+                          calls[0] = 0
+                          last_called[0] = now
+                      
+                      if calls[0] >= max_calls:
+                          raise Exception(f"Rate limit exceeded: {max_calls} calls per {time_window}s")
+                      
+                      calls[0] += 1
+                      return func(*args, **kwargs)
+                  return wrapper
+              return decorator
+          
+          @rate_limit(max_calls=3, time_window=10)
+          def api_call():
+              print("API called!")
+          
+          api_call()
+          api_call()
+          api_call()
+          # api_call()  # Would raise an exception
+          ```
+          
+          **Key Point:** Always use `@wraps(func)` from `functools` to preserve metadata!
+          ----------------------------------------
+          
+  code cell:
+    source:
+      response, _ = session.chat("What did we just talk about?")
+      print("\nFinal assistant response:")
+      print(response)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          🧹 Context limit exceeded, compacting session memory...
+          
+          ============================================================
+          🔄 Compaction messages: 6 → 1
+          ⏱️  Compaction time: 5.97s (user waiting...)
+          ✅ Tokens reduced: 963 → 721 (242 tokens saved, 25% reduction)
+          ============================================================
+          
+          Final assistant response:
+          We just finished a **three-part tutorial on Python decorators**, progressing from basics to advanced:
+          
+          1. **Basic decorators** – Simple wrapper functions using `def decorator(func)` pattern with a nested `wrapper` function
+          2. **Logging decorators** – Capturing function arguments and return values, introducing `@wraps(func)` from `functools` to preserve metadata
+          3. **Parameterized decorators** – The "three-level nesting" pattern where decorators themselves accept arguments:
+             - `@repeat(times=3)` – calls a function multiple times
+             - `@rate_limit(max_calls, time_window)` – throttles function calls
+          
+          The key insight was that **parameterized decorators have three nested functions**: outer (parameters) → middle (decorator) → inner (wrapper).
+          
+          ---
+          
+          **Where would you like to go from here?** For example:
+          - Practical examples of using these decorators?
+          - Class-based decorators (using `__call__`)?
+          - Decorators with multiple stacked decorators?
+          - Real-world use cases (caching, authentication, validation)?
+          - Something else?
+  markdown cell:
+    source:
+      As a result, the user experiences a wait whenever compaction occurs. It is only a few seconds in this example, but compacting a long context can take much longer.
+  markdown cell:
+    source:
+      ## Instant Compaction
+  markdown cell:
+    source:
+      
+      The key insight: **build the session memory in the background** so it's ready when you need it.
+      
+      ```
+      Turn 1 → Turn 2 → ... → Turn K  → Turn K+1 → ... → CONTEXT FULL!
+                                 │           │                 │
+                           (threshold)  (update)          INSTANT!
+                                 ↓           ↓                 ↓
+                          [Background]  [Background]    [Just swap in
+                           memory init   memory update   pre-built memory]
+      ```
+      
+      The `InstantCompactingChatSession` class uses **threading** for background execution:
+      1. **`threading.Thread`** - runs memory updates in background without blocking
+      2. **Thread-safe state** - uses `threading.Lock` to safely update shared memory
+      3. **Daemon threads** - background work doesn't prevent program exit
+      4. **Instant compaction** - when context is full, just swap in the pre-built memory
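The four mechanics listed above can be tried without any API calls. What follows is a minimal sketch, not the notebook's implementation: `BackgroundSummary`, `slow_summarize`, and `summarized_upto` are hypothetical names, and the summarizer is a stand-in that just sleeps to simulate a slow request.

```python
import threading
import time


class BackgroundSummary:
    """Toy version of the session-memory pattern: threads only, no API calls."""

    def __init__(self, summarize):
        self._summarize = summarize      # hypothetical callable: list[str] -> str
        self._lock = threading.Lock()    # guards summary and summarized_upto
        self._thread = None
        self.summary = None
        self.summarized_upto = 0         # how many messages the summary covers

    def update_in_background(self, messages):
        """Start a summary update unless one is already running."""
        if self._thread is not None and self._thread.is_alive():
            return False
        snapshot = list(messages)        # copy so later appends cannot race the worker
        self._thread = threading.Thread(
            target=self._worker, args=(snapshot,), daemon=True
        )
        self._thread.start()
        return True

    def _worker(self, snapshot):
        result = self._summarize(snapshot)   # slow step runs off the main thread
        with self._lock:
            self.summary = result
            self.summarized_upto = len(snapshot)

    def wait(self, timeout=5.0):
        """Block until the pending update (if any) finishes or the timeout expires."""
        if self._thread is not None:
            self._thread.join(timeout=timeout)

    def compact(self, messages):
        """Swap in the pre-built summary and keep only the unsummarized tail."""
        with self._lock:
            if self.summary is None:
                return None              # nothing pre-built yet
            return [f"[memory] {self.summary}"] + messages[self.summarized_upto:]


def slow_summarize(messages):
    time.sleep(0.2)                      # simulates a summarization request
    return f"{len(messages)} messages summarized"


history = ["question 1", "answer 1"]
memory = BackgroundSummary(slow_summarize)
started = memory.update_in_background(history)
history += ["question 2", "answer 2"]    # conversation continues while the worker runs
memory.wait()
compacted = memory.compact(history)
print(compacted)
```

The snapshot copy is the design choice that matters here: the worker summarizes a fixed view of the history, so messages appended while it runs stay outside the summary and are carried over verbatim at compaction time.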
+  code cell:
+    source:
+      import threading
+      import time
+      
+      
+      class InstantCompactingChatSession:
+          """
+          Maintains session memory via incremental background updates.
+          
+          Key insight: By updating memory in the background after each turn,
+          the summary is already ready when compaction is needed - instant swap!
+          """
+      
+          def __init__(
+              self,
+              context_limit: int = 2000,
+              min_tokens_to_init: int = 500,
+              min_tokens_between_updates: int = 300,
+          ):
+              # Thresholds
+              self.context_limit = context_limit
+              self.min_tokens_to_init = min_tokens_to_init
+              self.min_tokens_between_updates = min_tokens_between_updates
+      
+              # Conversation state
+              self.messages = []
+              self.current_tokens = 0
+      
+              # Session memory state
+              self.session_memory = None
+              self.last_summarized_index = 0
+              self.tokens_at_last_update = 0
+      
+              # Background update tracking
+              self._update_thread: threading.Thread | None = None
+              self.last_update_time = None
+              self._lock = threading.Lock()
+      
+          def chat(self, user_message: str):
+              """Process a chat turn with background session memory updates."""
+              if self.current_tokens >= self.context_limit:
+                  self.compact()
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="You are a helpful coding assistant. Be concise but thorough.",
+                  messages=self.messages,
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              self.current_tokens = response.usage.input_tokens
+      
+              # KEY DIFFERENCE: Trigger background memory update if needed
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  status = "initializing" if self.session_memory is None else "updating"
+                  print(f"   [Background] Session memory {status}...")
+      
+              return assistant_message, response.usage
+          
+          # Helper methods to determine when to init/update/compact
+          def _should_init_memory(self) -> bool:
+              return (
+                  self.session_memory is None
+                  and self.current_tokens >= self.min_tokens_to_init
+              )
+      
+          # Helper method to determine if memory should be updated
+          def _should_update_memory(self) -> bool:
+              if self.session_memory is None:
+                  return False
+              tokens_since = self.current_tokens - self.tokens_at_last_update
+              return tokens_since >= self.min_tokens_between_updates
+      
+          def _build_transcript(self, messages: list[dict]) -> str:
+              lines = []
+              for msg in messages:
+                  role = "User" if msg["role"] == "user" else "Assistant"
+                  lines.append(f"{role}: {msg['content']}")
+              return "\n\n".join(lines)
+      
+          def _create_session_memory(self, messages: list[dict]) -> str:
+              """Generate initial session memory from messages."""
+              transcript = self._build_transcript(messages)
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="""You are a session memory agent. Compress the conversation into a structured summary 
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's 
+      ability to continue working, not human readability.""",
+                  messages=[
+                      {
+                          "role": "user",
+                          "content": f"""Conversation transcript:
+      {transcript}
+      
+      Create session memory using these instructions:
+      {SESSION_CREATION_PROMPT}
+      
+      First analyze in <think>...</think> tags, then output the structured summary.""",
+                      }
+                  ],
+              )
+              return response.content[0].text
+      
+          def _update_session_memory(self, new_messages: list[dict]) -> str:
+              """Update existing session memory with new messages."""
+              transcript = self._build_transcript(new_messages)
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="""You are a session memory agent. Update the existing session memory with new information 
+      from the recent conversation. Preserve important existing details while integrating new content.""",
+                  messages=[
+                      {
+                          "role": "user",
+                          "content": f"""Current session memory:
+      {self.session_memory}
+      
+      New messages to integrate:
+      {transcript}
+      
+      Update the session memory following these guidelines:
+      {SESSION_CREATION_PROMPT}
+      
+      Output only the updated session memory (no analysis tags needed for updates).""",
+                      }
+                  ],
+              )
+              return response.content[0].text
+      
+          def _background_memory_update(
+              self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
+          ):
+              """Run session memory update in a background thread."""
+              try:
+                  if self.session_memory is None:
+                      new_memory = self._create_session_memory(messages_snapshot)
+                  else:
+                      new_messages = messages_snapshot[self.last_summarized_index :]
+                      if not new_messages:
+                          return
+                      new_memory = self._update_session_memory(new_messages)
+      
+                  # Update state (thread-safe)
+                  with self._lock:
+                      self.session_memory = new_memory
+                      self.last_summarized_index = snapshot_index
+                      self.tokens_at_last_update = current_tokens
+                      self.last_update_time = time.time()
+      
+              except Exception as e:
+                  print(f"   [Background] Error updating memory: {e}")
+      
+          def _trigger_background_update(self):
+              """Trigger a background session memory update."""
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  return
+      
+              messages_snapshot = self.messages.copy()
+              snapshot_index = len(messages_snapshot)
+              current_tokens = self.current_tokens
+      
+              self._update_thread = threading.Thread(
+                  target=self._background_memory_update,
+                  args=(messages_snapshot, snapshot_index, current_tokens),
+                  daemon=True,
+              )
+              self._update_thread.start()
+      
+          def wait_for_memory(self, timeout: float = 30.0):
+              """Wait for any pending background update to complete."""
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  self._update_thread.join(timeout=timeout)
+      
+          def compact(self):
+              """INSTANT compaction using pre-built session memory."""
+              prev_msg_count = len(self.messages)
+      
+              if self.session_memory is None:
+                  if self._update_thread is not None and self._update_thread.is_alive():
+                      print("   ⏳ Waiting for background memory update...")
+                      self._update_thread.join(timeout=30.0)
+      
+                  if self.session_memory is None:
+                      print("   ⚠️  No pre-built memory, creating synchronously...")
+                      start = time.perf_counter()
+                      self.session_memory = self._create_session_memory(self.messages)
+                      elapsed = time.perf_counter() - start
+                      print(f"   ⏱️  Took {elapsed:.2f}s (but should be instant normally!)")
+                      self.last_summarized_index = len(self.messages)
+      
+              unsummarized = self.messages[self.last_summarized_index :]
+      
+              summary_message = {
+                  "role": "user",
+                  "content": f"""This session is being continued from a previous conversation.
+      
+      Here is the session memory:
+      {self.session_memory}
+      
+      Continue from where we left off.""",
+              }
+      
+              self.messages = [summary_message] + unsummarized
+              self.last_summarized_index = 1
+      
+              print(f"\n{'=' * 60}")
+              print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
+              print(f"   Kept {len(unsummarized)} unsummarized messages")
+              print(f"   Session memory was pre-built (no wait time!)")
+              print(f"{'=' * 60}")
+      
+          
+  markdown cell:
+    source:
+      ### Example use of Instant Compaction
+  code cell:
+    source:
+      # Low thresholds for demo - in production you'd use higher values
+      session = InstantCompactingChatSession(
+          context_limit=700,
+          min_tokens_to_init=200,
+          min_tokens_between_updates=150,
+      )
+      
+      messages = [
+          "Explain Python decorators with a simple example.",
+          "Now show me a decorator that logs function arguments.",
+          "How do I make a decorator that accepts parameters?",
+      ]
+      
+      print("=" * 60)
+      print("INSTANT COMPACTING SESSION")
+      print("=" * 60)
+      print("Session memory builds in background, so compaction is instant!\n")
+      
+      turn_count = 0
+      for i, message in enumerate(messages, 1):
+          turn_count += 1
+          response, usage = session.chat(message)
+          
+          memory_status = "ready" if session.session_memory else "not yet"
+          print(
+              f"\n{'=' * 60}\n"
+              f"Turn {turn_count:2d}: Input={usage.input_tokens:7,} tokens | "
+              f"Output={usage.output_tokens:5,} tokens | "
+              f"Messages={len(session.messages):2d}"
+          )
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ============================================================
+          INSTANT COMPACTING SESSION
+          ============================================================
+          Session memory builds in background, so compaction is instant!
+          
+          
+          ============================================================
+          Turn  1: Input=     48 tokens | Output=  393 tokens | Messages= 2
+             [Background] Session memory initializing...
+          
+          ============================================================
+          Turn  2: Input=    454 tokens | Output=  476 tokens | Messages= 4
+             [Background] Session memory initializing...
+          
+          ============================================================
+          Turn  3: Input=    943 tokens | Output=  564 tokens | Messages= 6
+  code cell:
+    source:
+      response, _ = session.chat("What did we just talk about?")
+      print("\nFinal assistant response:")
+      print(response)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 6 → 3
+             Kept 2 unsummarized messages
+             Session memory was pre-built (no wait time!)
+          ============================================================
+             [Background] Session memory updating...
+          
+          Final assistant response:
+          We just covered **parameterized decorators** — how to create decorators that accept their own parameters.
+          
+          The key concept: you need **three levels of nesting** instead of two:
+          
+          1. **Outer function** — accepts decorator parameters (e.g., `repeat(times=3)`)
+          2. **Middle function** — the decorator itself (takes the function to decorate)
+          3. **Inner function** — the wrapper (executes the actual behavior)
+          
+          I showed two examples:
+          - **`@repeat(times=3)`** — runs a function multiple times
+          - **`@rate_limit(max_calls=3, time_window=10)`** — prevents function calls exceeding a rate limit
+          
+          This was a follow-up to our earlier conversation about Python decorators and logging decorators.
+  code cell:
+    source:
+      # Side-by-side comparison: Traditional vs Instant compaction
+      
+      print("=" * 70)
+      print("COMPARISON: Traditional vs Instant Compaction")
+      print("=" * 70)
+      
+      messages = [
+          "Explain Python decorators with a simple example.",
+          "Now show me a decorator that logs function arguments.",
+          "How do I make a decorator that accepts parameters?",
+      ]
+      
+      # Traditional approach
+      print("\n📊 TRADITIONAL COMPACTION:")
+      print("-" * 40)
+      traditional = TraditionalCompactingChatSession(context_limit=500)
+      
+      for i, msg in enumerate(messages, 1):
+          response, usage = traditional.chat(msg)
+          print(f"  Turn {i}: {usage.input_tokens:,} tokens")
+      
+      # Force a compaction to measure time
+      start = time.perf_counter()
+      traditional.compact()
+      traditional_compaction_time = time.perf_counter() - start
+      
+      # Instant approach  
+      print("\n⚡ INSTANT COMPACTION:")
+      print("-" * 40)
+      instant = InstantCompactingChatSession(
+          context_limit=500,
+          min_tokens_to_init=100,
+          min_tokens_between_updates=100,
+      )
+      
+      for i, msg in enumerate(messages, 1):
+          response, usage = instant.chat(msg)
+          print(f"  Turn {i}: {usage.input_tokens:,} tokens | Memory: {'ready' if instant.session_memory else 'building...'}")
+      
+      # Wait for background to finish
+      instant.wait_for_memory()
+      
+      # Measure instant compaction time
+      start = time.perf_counter()
+      instant.compact()
+      instant_compaction_time = time.perf_counter() - start
+      
+      print("\n" + "=" * 70)
+      print("RESULTS:")
+      print(f"  Traditional compaction time: {traditional_compaction_time:.2f}s (user waiting)")
+      print(f"  Instant compaction time:     {instant_compaction_time:.4f}s (just a swap!)")
+      print(f"  Speedup: {traditional_compaction_time/max(instant_compaction_time, 0.001):.0f}x faster")
+      print("=" * 70)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ======================================================================
+          COMPARISON: Traditional vs Instant Compaction
+          ======================================================================
+          
+          📊 TRADITIONAL COMPACTION:
+          ----------------------------------------
+            Turn 1: 48 tokens
+            Turn 2: 444 tokens
+            Turn 3: 915 tokens
+          
+          ============================================================
+          🔄 Compaction triggered! Messages: 6 → 1
+          ⏱️  Compaction time: 5.69s (user waiting...)
+          
+          ⚡ INSTANT COMPACTION:
+          ----------------------------------------
+            Turn 1: 48 tokens | Memory: building...
+             [Background] Session memory initializing...
+            Turn 2: 452 tokens | Memory: building...
+             [Background] Session memory initializing...
+            Turn 3: 1,024 tokens | Memory: building...
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 6 → 3
+             Kept 2 unsummarized messages
+             Session memory was pre-built (no wait time!)
+          ============================================================
+          
+          ======================================================================
+          RESULTS:
+            Traditional compaction time: 5.69s (user waiting)
+            Instant compaction time:     0.0002s (just a swap!)
+            Speedup: 5692x faster
+          ======================================================================
+  markdown cell:
+    source:
+      ## Advanced: Adding Prompt Caching
+  markdown cell:
+    source:
+      
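+      The background updates can be made **much cheaper** (about 80% in the cost math below) by using prompt caching, because cached input tokens are billed at roughly a tenth of the base input price. The trick: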
+      1. Pass the **full conversation** to the background summarizer
+      2. Add `cache_control` markers so subsequent requests hit the cache
+      3. Only the new "summarize this" instruction is billed at full price
+      
+      ```
+      Main chat:         [System + Turn 1 + Turn 2 + ... + Turn N]
+
+                                    (cached automatically)
+                                    
+      Background update: [System + Turn 1 + Turn 2 + ... + Turn N] + [Summarize instruction]
+                                    ↑                                        ↑
+                               CACHE HIT! (10x cheaper)              Only this is billed
+      ```
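Step 2 above (adding `cache_control` markers) amounts to a small transformation of the message list. The sketch below is one possible shape for it, under assumptions: `add_cache_breakpoint` is a hypothetical helper name, not the notebook's own `_add_cache_control()`, and it assumes each message's content is either a plain string or a list of text-block dicts.

```python
import copy


def add_cache_breakpoint(messages):
    """Return a copy of `messages` whose last message ends in a cache breakpoint.

    Hypothetical helper for illustration; the input list is left untouched.
    """
    if not messages:
        return []
    marked = copy.deepcopy(messages)
    last = marked[-1]
    content = last["content"]
    if isinstance(content, str):
        # Plain-string content must become a content block to carry the marker.
        content = [{"type": "text", "text": content}]
    content[-1]["cache_control"] = {"type": "ephemeral"}
    last["content"] = content
    return marked


conversation = [
    {"role": "user", "content": "Explain Python decorators."},
    {"role": "assistant", "content": "A decorator wraps a function."},
]
marked = add_cache_breakpoint(conversation)
print(marked[-1]["content"])
```

Returning a copy rather than mutating in place keeps the main chat's message list unchanged, so the background request and the foreground request can be built from the same history.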
+  markdown cell:
+    source:
+      ### How the Caching Works
+      
+      The key is in `_add_cache_control()` and `_create_session_memory_cached()`:
+      
+      ```python
+      # 1. Mark the last conversation message with cache_control
+      {
+          "role": "user",
+          "content": [{
+              "type": "text",
+              "text": msg["content"],
+              "cache_control": {"type": "ephemeral"}  # <-- This creates a cache breakpoint
+          }]
+      }
+      
+      # 2. Also mark the system prompt
+      system=[{
+          "type": "text",
+          "text": "You are a session memory agent...",
+          "cache_control": {"type": "ephemeral"}
+      }]
+      ```
+      
+      **Why this works:**
+      - The first background update creates a cache entry for `[System + Messages]`
+      - Subsequent updates with the same message prefix get **cache hits**
+      - Only the new summarization instruction is billed at full price
+      - Cache entries have a 5-minute TTL, so rapid updates benefit most
+      
+      **Cost math:**
+      - Without caching: 5,000 tokens × $3.00/1M = $0.015 per update
+      - With caching: 500 new tokens × $3.00/1M + 4,500 cached × $0.30/1M = $0.00285
+      - **Savings: ~80%** on background summarization costs
+  code cell:
+    source:
+      class CachedInstantCompactingSession:
+          """
+          Session memory with prompt caching for cheaper background updates.
+          
+          Key optimization: By passing the full conversation with cache_control markers,
+          background summarization requests get cache hits on 90%+ of input tokens.
+          """
+      
+          def __init__(
+              self,
+              context_limit: int = 2000,
+              min_tokens_to_init: int = 500,
+              min_tokens_between_updates: int = 300,
+              system_prompt: str = "You are a helpful coding assistant. Be concise but thorough.",
+          ):
+              self.context_limit = context_limit
+              self.min_tokens_to_init = min_tokens_to_init
+              self.min_tokens_between_updates = min_tokens_between_updates
+              self.system_prompt = system_prompt
+      
+              self.messages = []
+              self.current_tokens = 0
+      
+              self.session_memory = None
+              self.last_summarized_index = 0
+              self.tokens_at_last_update = 0
+      
+              self._update_thread = None
+              self._lock = threading.Lock()
+      
+              # Track cache stats
+              self.total_cache_read = 0
+              self.total_cache_created = 0
+              self.total_input_tokens = 0
+      
+          def _should_init_memory(self) -> bool:
+              return self.session_memory is None and self.current_tokens >= self.min_tokens_to_init
+      
+          def _should_update_memory(self) -> bool:
+              if self.session_memory is None:
+                  return False
+              return (self.current_tokens - self.tokens_at_last_update) >= self.min_tokens_between_updates
+      
+          def _should_compact(self) -> bool:
+              return self.current_tokens >= self.context_limit
+      
+          def _add_cache_control(self, messages: list[dict]) -> list[dict]:
+              """
+              Add cache_control markers to messages for prompt caching.
+              
+              Strategy: Mark the last message with cache_control so the entire
+              conversation prefix gets cached for subsequent requests.
+              """
+              if not messages:
+                  return messages
+      
+              cached_messages = []
+              for i, msg in enumerate(messages):
+                  if i == len(messages) - 1:
+                      # Last message: add cache_control marker
+                      cached_messages.append({
+                          "role": msg["role"],
+                          "content": [
+                              {
+                                  "type": "text",
+                                  "text": msg["content"],
+                                  "cache_control": {"type": "ephemeral"},
+                              }
+                          ],
+                      })
+                  else:
+                      cached_messages.append(msg)
+      
+              return cached_messages
+      
+          def _create_session_memory_cached(self, messages: list[dict]) -> tuple[str, dict]:
+              """
+              Generate session memory using the FULL conversation with caching.
+              
+              This passes the entire conversation + summarize instruction, so subsequent
+              calls with the same conversation prefix will hit the cache.
+              """
+              # Build conversation with cache marker on last message
+              cached_messages = self._add_cache_control(messages)
+      
+              # Add the summarization instruction as a new user message
+              cached_messages.append({
+                  "role": "user",
+                  "content": f"""Based on our conversation above, create a session memory summary.
+      
+      {SESSION_CREATION_PROMPT}
+      
+      Output the structured summary directly.""",
+              })
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system=[
+                      {
+                          "type": "text",
+                          "text": """You are a session memory agent. When asked, compress the conversation 
+      into a structured summary that preserves all information needed to continue work seamlessly.""",
+                          "cache_control": {"type": "ephemeral"},
+                      }
+                  ],
+                  messages=cached_messages,
+              )
+      
+              # Extract cache stats. usage.input_tokens counts only uncached tokens;
+              # cache reads and writes are reported in their own fields.
+              cache_stats = {
+                  "cache_read": getattr(response.usage, "cache_read_input_tokens", 0) or 0,
+                  "cache_created": getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
+                  "input_tokens": response.usage.input_tokens,
+              }
+      
+              return response.content[0].text, cache_stats
+      
+          def _background_memory_update(
+              self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
+          ):
+              """Run cached session memory update in background thread."""
+              try:
+                  new_memory, cache_stats = self._create_session_memory_cached(messages_snapshot)
+      
+                  with self._lock:
+                      self.session_memory = new_memory
+                      self.last_summarized_index = snapshot_index
+                      self.tokens_at_last_update = current_tokens
+                      self.total_cache_read += cache_stats["cache_read"]
+                      self.total_cache_created += cache_stats["cache_created"]
+                      self.total_input_tokens += cache_stats["input_tokens"]
+      
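+                  # Show cache performance. input_tokens excludes cached tokens, so the
+                  # hit rate is measured against the sum of all three usage fields.
+                  total_input = (
+                      cache_stats["cache_read"] + cache_stats["cache_created"] + cache_stats["input_tokens"]
+                  )
+                  if cache_stats["cache_read"] > 0:
+                      pct = (cache_stats["cache_read"] / total_input) * 100
+                      print(f"   [Cache] {cache_stats['cache_read']:,} read ({pct:.0f}% hit rate)")
+                  else:
+                      print(f"   [Cache] {cache_stats['cache_created']:,} created (first request)")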
+      
+              except Exception as e:
+                  print(f"   [Background] Error: {e}")
+      
+          def _trigger_background_update(self):
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  return
+      
+              self._update_thread = threading.Thread(
+                  target=self._background_memory_update,
+                  args=(self.messages.copy(), len(self.messages), self.current_tokens),
+                  daemon=True,
+              )
+              self._update_thread.start()
+      
+          def wait_for_memory(self, timeout: float = 30.0):
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  self._update_thread.join(timeout=timeout)
+      
+          def compact(self):
+              prev_msg_count = len(self.messages)
+      
+              if self.session_memory is None:
+                  if self._update_thread is not None and self._update_thread.is_alive():
+                      print("   ⏳ Waiting for background update...")
+                      self._update_thread.join(timeout=30.0)
+      
+                  if self.session_memory is None:
+                      print("   ⚠️  Creating memory synchronously...")
+                      self.session_memory, _ = self._create_session_memory_cached(self.messages)
+                      self.last_summarized_index = len(self.messages)
+      
+              unsummarized = self.messages[self.last_summarized_index :]
+      
+              self.messages = [
+                  {
+                      "role": "user",
+                      "content": f"Session memory:\n{self.session_memory}\n\nContinue from where we left off.",
+                  }
+              ] + unsummarized
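+              self.last_summarized_index = 1
+              # Reset the update baseline so background updates resume after compaction
+              self.tokens_at_last_update = 0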
+      
+              print(f"\n{'=' * 60}")
+              print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
+              print(f"{'=' * 60}")
+      
+          def chat(self, user_message: str):
+              if self._should_compact():
+                  self.compact()
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system=self.system_prompt,
+                  messages=self.messages,
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+              self.current_tokens = response.usage.input_tokens
+      
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  print("   [Background] Updating session memory...")
+      
+              return assistant_message, response.usage
+      
+          def get_cache_savings(self) -> dict:
+              """Estimate input-cost savings from caching, in base-price token units."""
+              # usage.input_tokens excludes cached tokens, so add the cache fields back in
+              total_input = self.total_input_tokens + self.total_cache_read + self.total_cache_created
+              if total_input == 0:
+                  return {"total_input": 0, "cache_read": 0, "cache_created": 0, "savings_pct": 0.0}
+      
+              # Cache reads are 10x cheaper than regular input; cache writes are
+              # counted at the base price here, which slightly overstates savings
+              regular_cost = total_input
+              actual_cost = (total_input - self.total_cache_read) + (self.total_cache_read * 0.1)
+              savings_pct = ((regular_cost - actual_cost) / regular_cost) * 100
+      
+              return {
+                  "total_input": total_input,
+                  "cache_read": self.total_cache_read,
+                  "cache_created": self.total_cache_created,
+                  "savings_pct": savings_pct,
+              }

Generated by nbdime

@github-actions github-actions Bot left a comment


PR Review

Recommendation: REQUEST_CHANGES

Summary

This PR adds two valuable cookbooks demonstrating session memory and incremental compaction techniques for efficient context management. The concepts and overall approach are excellent, but there are critical execution issues that prevent the notebooks from running top-to-bottom successfully.

Actionable Feedback (12 items)

Critical Issues (Must Fix):

  • misc/compaction_cookbook.ipynb (Cell 8) - References undefined variable kept_messages. This variable is never created in Cell 7, causing a NameError when executing the notebook
  • misc/compaction_cookbook.ipynb (Cell 11) - Calls undefined function chat(). The notebook uses client.messages.create() elsewhere but this cell expects a chat() helper that doesn't exist
  • misc/compaction_cookbook.ipynb (Cell 14) - References kept_messages again without definition, and messages may be empty after Cell 7 resets it
  • Both notebooks - Missing API key setup. Add import os; from dotenv import load_dotenv; load_dotenv() before client = anthropic.Anthropic() per project guidelines
  • Both notebooks - Missing prerequisites section. Add cell with %pip install -qU anthropic python-dotenv at the beginning
  • Both notebooks - Not registered in registry.yaml. Need to add entries with title, description, path, authors, and categories per CLAUDE.md
  • misc/session_memory_compaction.ipynb - Multiple cells (4, 6, 7, etc.) have incorrect "language": "coconut" metadata. These should be standard Python cells with "language": "python"
  • misc/session_memory.md - File contains only placeholder text. Either complete the documentation, remove the file, or clarify if it's meant to be generated by the notebook code

Important Issues (Should Fix):

  • Both notebooks - Missing Terminal Learning Objectives (TLOs) section. Add bullet points explaining what readers will learn
  • misc/session_memory_compaction.ipynb (in compact_conversation method) - Missing return type annotation. Should be -> str | None
  • misc/session_memory_compaction.ipynb (in _background_memory_update method) - Overly broad exception handling. Use more specific anthropic.APIError instead of bare Exception
  • Both notebooks - Missing conclusion sections that map back to learning objectives and provide next steps
Detailed Review

Code Quality

Strengths:

  • Well-structured classes with clear separation of concerns (SessionMemory dataclass)
  • Good use of type annotations for most methods
  • Proper threading implementation with Lock for thread safety
  • Clean demonstration of the incremental summarization pattern
  • Excellent ASCII diagrams showing traditional vs session memory flow

Issues:

  • Broken code execution flow in compaction_cookbook.ipynb where cells reference undefined variables
  • Inconsistent variable naming (e.g., bg_input vs more descriptive names)
  • Some methods missing return type annotations
  • Generic exception handling that should be more specific

Notebook Structure

Strengths:

  • Clear problem statement and solution framing
  • Good progression from concept to implementation
  • Practical examples with real conversation demonstrations
  • Performance and cost comparisons provide valuable context

Issues:

  • Missing prerequisites and setup sections (API keys, pip installs)
  • No Terminal Learning Objectives to set expectations
  • Some code blocks lack post-execution explanation
  • Missing conclusion sections

Security

Issue: Neither notebook demonstrates proper API key management. The project standard (per CLAUDE.md) requires using python-dotenv and os.environ.get("ANTHROPIC_API_KEY"), which is not shown in the setup cells.

Fix:

import os
from dotenv import load_dotenv
import anthropic

load_dotenv()  # Loads ANTHROPIC_API_KEY from .env into the environment
client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
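
A setup cell could also fail fast when the key is missing, rather than erroring on the first API call. The following is a minimal sketch using only the standard library; `require_api_key` is a hypothetical helper, not part of the notebooks or the SDK:

```python
import os
from collections.abc import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return ANTHROPIC_API_KEY from `env`, or raise with a setup hint."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; add it to .env or export it before running."
        )
    return key


# Demonstration with an explicit mapping instead of the real environment
print(require_api_key({"ANTHROPIC_API_KEY": "sk-example"}))
```

Passing the environment as a parameter keeps the check testable without touching real credentials.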

Documentation Quality

Strengths:

  • Excellent problem framing with clear explanation of latency issues
  • Honest discussion of trade-offs (cost vs latency)
  • Good inline comments explaining key concepts
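
The cost side of that trade-off can be sanity-checked with the figures from the notebook's cost-math cell. This is a sketch only: it assumes the prices quoted there ($3.00 per million base input tokens, $0.30 per million cached reads) and, like the notebook, ignores any cache-write surcharge. `update_cost` is a hypothetical helper.

```python
BASE_PRICE_PER_MTOK = 3.00    # USD per 1M uncached input tokens (figure quoted in the notebook)
CACHED_PRICE_PER_MTOK = 0.30  # USD per 1M cache-read tokens (figure quoted in the notebook)


def update_cost(new_tokens: int, cached_tokens: int = 0) -> float:
    """Input cost in USD of one background update."""
    total = new_tokens * BASE_PRICE_PER_MTOK + cached_tokens * CACHED_PRICE_PER_MTOK
    return total / 1_000_000


uncached = update_cost(new_tokens=5_000)
cached = update_cost(new_tokens=500, cached_tokens=4_500)
savings_pct = (uncached - cached) / uncached * 100

print(f"uncached=${uncached:.5f} cached=${cached:.5f} savings={savings_pct:.0f}%")
```

With these inputs the saving comes out at about 81%, in line with the ~80% figure in the cookbook.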

Issues:

  • Session memory prompt is very long (~500 lines) without structure explanation
  • Some technical decisions lack justification (e.g., why sum input+output tokens?)
  • Missing links to related cookbooks or documentation

Model & API Usage

Correct model version: ✅ Uses claude-sonnet-4-5-20250929 which is current

Proper API usage patterns: ✅ Correctly uses client.messages.create() with proper parameters

Token counting: ⚠️ Unusual pattern of adding input_tokens + output_tokens for threshold checking (typically only input tokens matter for context limits)
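
One possible reading of the summed count: the assistant's reply becomes part of the next request's input, so input plus output tokens from one turn approximates the context size at the start of the next. The sketch below checks that against the turn statistics printed in the traditional-compaction demo; the helper name is hypothetical and the 1,500-token limit is that demo's default.

```python
CONTEXT_LIMIT = 1500  # default context_limit in the traditional-compaction demo


def estimate_next_context_tokens(input_tokens: int, output_tokens: int) -> int:
    """Approximate the context size the next request will start from."""
    return input_tokens + output_tokens


# (input_tokens, output_tokens) per turn, taken from the notebook output
turns = [(41, 450), (504, 871), (1388, 1024)]

for turn, (input_tokens, output_tokens) in enumerate(turns, 1):
    estimate = estimate_next_context_tokens(input_tokens, output_tokens)
    print(f"Turn {turn}: estimate={estimate:,} compact_next={estimate >= CONTEXT_LIMIT}")
```

The estimates for turns 1 and 2 (491 and 1,375) land within a few tokens of the next turn's reported input (504 and 1,388), the gap being the new user message. Only the turn 3 estimate crosses the limit.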

Testing Concerns

Before merge, these notebooks must be:

  1. Executed top-to-bottom with a fresh kernel to verify they run without errors
  2. Tested with real API calls to ensure examples work as documented
  3. Linted with uv run ruff check per project requirements
  4. Verified that threading behavior is correct and doesn't cause race conditions
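
The threading check in item 4 can be exercised without API calls. The sketch below is an assumption-laden stand-in, not the notebook's class: `StubSession` mirrors only the trigger-and-lock pattern, and the summarizer is faked with an event so the timing is deterministic.

```python
import threading


class StubSession:
    """Stand-in for the notebook's session class; summarization is faked."""

    def __init__(self) -> None:
        self.session_memory: str | None = None
        self.updates_started = 0
        self._update_thread: threading.Thread | None = None
        self._lock = threading.Lock()
        self._release = threading.Event()  # lets the test control when the update finishes

    def _background_memory_update(self, snapshot: list[str]) -> None:
        self._release.wait(timeout=5.0)  # simulate a slow summarization call
        with self._lock:
            self.session_memory = f"summary of {len(snapshot)} messages"

    def trigger_background_update(self, messages: list[str]) -> bool:
        """Start an update unless one is already in flight; return whether it started."""
        if self._update_thread is not None and self._update_thread.is_alive():
            return False
        self.updates_started += 1
        self._update_thread = threading.Thread(
            target=self._background_memory_update,
            args=(messages.copy(),),  # snapshot, so later appends are not seen
            daemon=True,
        )
        self._update_thread.start()
        return True

    def wait_for_memory(self, timeout: float = 5.0) -> None:
        self._release.set()
        if self._update_thread is not None:
            self._update_thread.join(timeout=timeout)


session = StubSession()
messages = ["turn 1", "turn 2"]
first = session.trigger_background_update(messages)
messages.append("turn 3")  # mutate the live list after the snapshot was taken
second = session.trigger_background_update(messages)  # skipped: update in flight
session.wait_for_memory()

print(first, second, session.session_memory)
```

The second trigger is skipped while the first update is still running, and the summary reflects the two-message snapshot rather than the mutated list.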

Suggestions

  1. Add visualization: Consider adding a simple matplotlib chart showing token usage over time or latency comparison
  2. Extract helper functions: The conversation loop in Cell 7 could be cleaner with extracted print helpers
  3. Improve SESSION_CREATION_PROMPT organization: Consider breaking the large prompt into sections with explanations
  4. Add "Next Steps" section: Guide readers on how to apply these patterns to their own use cases
  5. Clarify session_memory.md purpose: If this file is meant to be generated by the code, document that clearly

Next Steps: Please address the critical issues (especially the undefined variables and missing prerequisites) and ensure both notebooks execute successfully from top to bottom. Once these are fixed and the notebooks are added to registry.yaml, this will be a valuable addition to the cookbook collection.

@PedramNavid PedramNavid self-assigned this Jan 8, 2026
@github-actions

github-actions Bot commented Jan 9, 2026


Notebook Changes

This PR modifies the following notebooks:

📓 misc/session_memory_compaction.ipynb

View diff
nbdiff /dev/null misc/session_memory_compaction.ipynb (e00b3c71d6fe5556024bbc8370d9663bde12813e)
--- /dev/null  2026-01-09 04:55:24.711597
+++ misc/session_memory_compaction.ipynb (e00b3c71d6fe5556024bbc8370d9663bde12813e)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Instant Compaction with Session Memory
+      
+      Traditional compaction is slow: when you hit the context limit, you wait for a summary.
+      
+      With **Instant compaction** the session memory is proactively generated once a soft token threshold is reached. Once the user triggers a compaction or a hard limit is reached, the summary is already available, so the user doesn't need to wait.
+      
+      Result: Instant compaction, no waiting.
+  markdown cell:
+    source:
+      
+      ```
+      TRADITIONAL COMPACTION (slow)
+      ─────────────────────────────
+      Turn 1 → Turn 2 → Turn 3 → ... → Turn N → CONTEXT FULL!
+
+
+                                          ┌─────────────────┐
+                                          │ Generate summary│
+                                          │ ( USER WAITS !) │
+                                          └─────────────────┘
+
+
+                                               Continue
+      
+      
+      SESSION MEMORY COMPACTION (instant)
+      ────────────────────────────────────
+      Turn 1 → Turn 2 → ... → Turn K → Turn K+1 → ... → Turn N → ..  → CONTEXT FULL!
+                                  │                         │            │
+                      (soft threshold met:              (update          │
+                         10k tokens init)                trigger)        │
+                                  │                                      │
+                                  │                         │            │
+                                  ▼                         ▼            │
+                             ┌────────┐                ┌────────┐        │
+                             │ Update │                │ Update │        │
+                             │ memory │ (background)   │ memory │        │
+                             └────────┘                └────────┘        │
+                                  │                         │            │
+                                  ▼                         ▼            ▼
+                           📝 session-memory.md ──────────────────► INSTANT SWAP!
+                             (continuously updated)
+      ```
+      
+      **Update triggers:** The first summary is generated after the initial 10k tokens. Updates can then be triggered after every subsequent turn, or periodically at natural breakpoints (e.g. every ~5k tokens or after 3+ tool calls).
+  markdown cell:
+    source:
+      ## Fundamentals: writing a compaction prompt
+  markdown cell:
+    source:
+      Make sure you have a well-structured session memory prompt.
+      
+      Some best practices include:
+      - Use chain-of-thought before summarizing: analyze first, then output
+      - Enumerate exactly what to preserve: file paths, code snippets, errors, user corrections
+      - Weight recency heavily, since the end of the conversation is the active context
+      - Require verbatim quotes for next steps to prevent task drift
+      - Use structured sections with token budgets per section
+      - Include a "Current State" section that always reflects the moment of compaction
+      
+      Some pitfalls include:
+      - Vague prompts like "summarize this conversation" produce lossy output
+      - Treating all messages equally loses the active working context
+      - Paraphrasing next steps introduces subtle drift that compounds
+      - Omitting error history causes the model to retry failed approaches
+      - Dropping user corrections makes the model revert to old behaviors
+      - Omitting token limits lets one section consume the entire summary
+      - Summarizing for human readability instead of model continuity
+      - Asking the agent to compress tool-call results into the summary; these can be retrieved again later if the agent needs them
+  code cell:
+    source:
+      SESSION_CREATION_PROMPT = """
+      <analysis-instructions>
+      Before generating your summary, analyze the transcript in <think>...</think> tags:
+      1. What did the user originally request? (Exact phrasing)
+      2. What actions succeeded? What failed and why?
+      3. Did the user correct or redirect the assistant at any point?
+      4. What was actively being worked on at the end?
+      5. What tasks remain incomplete or pending?
+      6. What specific details (IDs, paths, values, names) must survive compression?
+      </analysis-instructions>
+      
+      <summary-format>
+      ## User Intent
+      The user's original request and any refinements. Use direct quotes for key requirements.
+      If the user's goal evolved during the conversation, capture that progression.
+      
+      ## Completed Work
+      Actions successfully performed. Be specific:
+      - What was created, modified, or deleted
+      - Exact identifiers (file paths, record IDs, URLs, names)
+      - Specific values, configurations, or settings applied
+      
+      ## Errors & Corrections
+      - Problems encountered and how they were resolved
+      - Approaches that failed (so they aren't retried)
+      - User corrections: "don't do X", "actually I meant Y", "that's wrong because..."
+      Capture corrections verbatim—these represent learned preferences.
+      
+      ## Active Work
+      What was in progress when the session ended. Include:
+      - The specific task being performed
+      - Direct quotes showing exactly where work left off
+      - Any partial results or intermediate state
+      
+      ## Pending Tasks
+      Remaining items the user requested that haven't been started.
+      Distinguish between "explicitly requested" and "implied/assumed."
+      
+      ## Key References
+      Important details needed to continue:
+      - Identifiers: IDs, paths, URLs, names, keys
+      - Values: numbers, dates, configurations, credentials (redacted)
+      - Context: relevant background information, constraints, preferences
+      - Citations: sources referenced during the conversation
+      </summary-format>
+      
+      <preserve-rules>
+      Always preserve when present:
+      - Exact identifiers (IDs, paths, URLs, keys, names)
+      - Error messages verbatim
+      - User corrections and negative feedback
+      - Specific values, formulas, or configurations
+      - Technical constraints or requirements discovered
+      - The precise state of any in-progress work
+      </preserve-rules>
+      
+      <compression-rules>
+      - Weight recent messages more heavily—the end of the transcript is the active context
+      - Omit pleasantries, acknowledgments, and filler ("Sure!", "Great question")
+      - Omit system context that will be re-injected separately
+      - Keep each section under 500 words; condense older content to make room for recent
+      - If you must cut details, preserve: user corrections > errors > active work > completed work
+      </compression-rules>
+      """
+  markdown cell:
+    source:
+      ## Traditional compacting example
+      In traditional compaction, you generate one summary once the token threshold is reached.
+  code cell:
+    source:
+      # setup, we are using haiku for demo purposes
+      import anthropic
+      import warnings
+      
+      # Suppress noisy FutureWarning from coconut compiler
+      warnings.filterwarnings("ignore", category=FutureWarning, module="coconut")
+      
+      client = anthropic.Anthropic()
+      MODEL = "claude-haiku-4-5-20251001"
+      
+      # helper functions:
+      
+      def truncate_response(text: str, max_lines: int = 8) -> str:
+          """Truncate long responses for cleaner output display."""
+          lines = text.strip().split("\n")
+          if len(lines) <= max_lines:
+              return text
+          return "\n".join(lines[:max_lines]) + f"\n... ({len(lines) - max_lines} more lines)"
+      
+      def build_transcript(messages: list[dict]) -> str:
+          lines = []
+          for msg in messages:
+              role = "User" if msg["role"] == "user" else "Assistant"
+              lines.append(f"{role}: {msg['content']}")
+          return "\n\n".join(lines)
+      
+      def remove_thinking_blocks(text: str) -> tuple[str, str]:
+          """Remove <think>...</think> blocks; return (cleaned_text, removed_text)."""
+          import re
+      
+          matches = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
+          cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+          return cleaned, "".join(matches)
+  code cell:
+    source:
+      import time
+      
+      class TraditionalCompactingChatSession:
+          """Traditional chat session with compaction after the fact."""
+          def __init__(self, context_limit: int = 1500):
+              # Token count at which the conversation is compacted so it does not exceed
+              # model limits. Set this from your model's context window size, leaving a
+              # buffer for response tokens.
+              self.context_limit = context_limit
+              self.messages = []
+              self.current_context_window_tokens = 0
+              self.summary = None
+          
+          def chat(self, user_message: str):
+              # In traditional compaction, the check happens when the user sends a message,
+              # so the user waits while the summary is generated.
+              if self.current_context_window_tokens >= self.context_limit:
+                  print(f"\n🧹 Context window at {self.current_context_window_tokens} tokens. Limit exceeded, compacting session memory...")
+                  self.compact() # compacts everything before the new user message
+              
+              self.messages.append({"role": "user", "content": user_message})
+              
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="You are a helpful coding assistant.",
+                  messages=self.messages
+              )
+              
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+              
+              # approximate current token count in the conversation before the next user message
+              self.current_context_window_tokens = response.usage.input_tokens + response.usage.output_tokens
+      
+              return assistant_message, response.usage
+          
+          def compact(self):
+              prev_msg_count = len(self.messages)
+              
+              compaction_prompt = SESSION_CREATION_PROMPT + "\n\nTranscript:\n" + build_transcript(self.messages)
+              
+              start_time = time.perf_counter()
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=800, # note that some of this will be eaten up by the thinking blocks we remove later
+                  system="""You are a session memory agent. Compress the conversation into a structured summary 
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's 
+      ability to continue working, not human readability""",
+                  messages=[{"role": "user", "content": compaction_prompt}]
+              )
+              elapsed = time.perf_counter() - start_time
+              
+              # Generate new summary message
+              self.summary, removed_text = remove_thinking_blocks(response.content[0].text) # clean up any <think> blocks because they are not needed in the session memory
+              approximate_summary_tokens = response.usage.output_tokens - round(len(removed_text) / 4)  # rough estimate of tokens removed from summary
+             
+              # Replace prior messages with new summary message
+              self.messages = [{
+                  "role": "user",
+                  "content": f"""This session is being continued from a previous conversation. Here is the session memory:\n{self.summary}\n\nContinue from where we left off."""
+              }]
+      
+              # Show stats on compaction
+              curr_msg_count = len(self.messages)
+              
+              # Show token reduction if we just compacted
+              reduction = self.current_context_window_tokens - approximate_summary_tokens
+              pct = (reduction / self.current_context_window_tokens) * 100
+              
+              print(f"\n{'-' * 60}")
+              print(f"✅ Tokens reduced: {self.current_context_window_tokens:,} → {approximate_summary_tokens:.0f} ({reduction:,} tokens saved, {pct:.0f}% reduction)")
+              print("📝 New session memory created.")
+              print(f"🔄 Compaction messages: {prev_msg_count} → {curr_msg_count}")
+              print(f"⏱️ Compaction time: {elapsed:.2f}s (user waiting...)")
+              print(f"{'-' * 60}")
+              
+              # Update token count to reflect compacted state
+              self.current_context_window_tokens = approximate_summary_tokens
+  markdown cell:
+    source:
+      ### Example use of traditional compaction
+  code cell:
+    source:
+      session = TraditionalCompactingChatSession()
+      
+      messages = [
+          "Explain Python decorators with a simple example.",
+          "Now show me a decorator that logs function arguments.",
+          "How do I make a decorator that accepts parameters?",
+      ]
+      
+      print("Starting conversation with traditional compacting chat session...\n")
+      
+      turn_count = 0
+      
+      for i, message in enumerate(messages, 1):
+          response, usage = session.chat(message)
+          turn_count += 1
+          print(
+              f"==============================================\n"
+              f"Turn {turn_count}: Input={usage.input_tokens:,} | "
+              f"Output={usage.output_tokens:,} | "
+              f"Messages={len(session.messages)}"
+          )
+          print(f"\nUser: {message}")
+          print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+          print()
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with traditional compacting chat session...
+          
+          ==============================================
+          Turn 1: Input=41 | Output=450 | Messages=2
+          
+          User: Explain Python decorators with a simple example.
+          
+          Assistant: 
+          # Python Decorators Explained
+          
+          A **decorator** is a function that modifies or enhances another function or class without permanently changing its source code.
+          ... (69 more lines)
+          
+          ==============================================
+          Turn 2: Input=504 | Output=871 | Messages=4
+          
+          User: Now show me a decorator that logs function arguments.
+          
+          Assistant: 
+          # Decorator that Logs Function Arguments
+          
+          Here's a practical logging decorator:
+          ... (122 more lines)
+          
+          ==============================================
+          Turn 3: Input=1,388 | Output=1,024 | Messages=6
+          
+          User: How do I make a decorator that accepts parameters?
+          
+          Assistant: 
+          # Decorators with Parameters
+          
+          To make a decorator that accepts parameters, you need an **extra layer of nesting**.
+          ... (144 more lines)
+          
+  code cell:
+    source:
+      response, _ = session.chat("What other related topics should we cover?")
+      print("Final response after compaction:")
+      print(response)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          🧹 Context window at 2412 tokens. Limit exceeded, compacting session memory...
+          
+          ------------------------------------------------------------
+          ✅ Tokens reduced: 2,412 → 391 (2,021 tokens saved, 84% reduction)
+          📝 New session memory created.
+          🔄 Compaction messages: 6 → 1
+          ⏱️ Compaction time: 6.93s (user waiting...)
+          ------------------------------------------------------------
+          Final response after compaction:
+          # Continuing: Complete the `validate_types` Decorator
+          
+          Let me finish that incomplete example from Response 3:
+          
+          ```python
+          from functools import wraps
+          
+          def validate_types(**type_checks):
+              """Decorator that validates argument types before execution"""
+              def decorator(func):
+                  @wraps(func)
+                  def wrapper(*args, **kwargs):
+                      # Get function signature to match args to parameter names
+                      import inspect
+                      sig = inspect.signature(func)
+                      bound_args = sig.bind(*args, **kwargs)
+                      bound_args.apply_defaults()
+                      
+                      # Validate types
+                      for param_name, expected_type in type_checks.items():
+                          if param_name in bound_args.arguments:
+                              actual_value = bound_args.arguments[param_name]
+                              if not isinstance(actual_value, expected_type):
+                                  raise TypeError(
+                                      f"Parameter '{param_name}' expected {expected_type.__name__}, "
+                                      f"got {type(actual_value).__name__}"
+                                  )
+                      
+                      return func(*args, **kwargs)
+                  return wrapper
+              return decorator
+          
+          # Usage example:
+          @validate_types(name=str, age=int, email=str)
+          def create_user(name, age, email):
+              return f"User {name} ({age}) created with email {email}"
+          
+          # ✅ Success case
+          print(create_user("Alice", 30, "alice@example.com"))
+          # Output: User Alice (30) created with email alice@example.com
+          
+          # ❌ Error case - wrong type for 'age'
+          try:
+              print(create_user("Bob", "twenty-five", "bob@example.com"))
+          except TypeError as e:
+              print(f"❌ Error: {e}")
+              # Output: ❌ Error: Parameter 'age' expected int, got str
+          ```
+          
+          ---
+          
+          ## Related Advanced Topics to Cover Next
+          
+          Based on your progression, here are natural next steps:
+          
+          | Topic | Difficulty | Why It Matters |
+          |-------|-----------|----------------|
+          | **Class-based Decorators** | 🟡 Intermediate | Stateful decorators using `__call__`, persisting data between calls |
+          | **Stacking Decorators** | 🟡 Intermediate | Order matters! `@decorator1 @decorator2 def func()` |
+          | **Async Decorators** | 🔴 Advanced | `async def` decorators for async/await functions |
+          | **Decorator Composition** | 🔴 Advanced | Building decorators from other decorators (meta-decorating) |
+          | **Context Managers vs Decorators** | 🟡 Intermediate | When to use `with` statements instead |
+          | **Real-world patterns** | 🟢 Beginner-friendly | Flask/Django route decorators, authentication, timing |
+          
+          **My recommendation**: Pick one:
+          - 🟢 **"Show me real-world decorator patterns from Flask/Django"** — immediately practical
+          - 🟡 **"Explain class-based decorators"** — bridges to OOP patterns
+          - 🟡 **"What happens when I stack multiple decorators?"** — common mistake area
+          
+          Which interests you? 🎯
+  code cell:
+    source:
+      print(session.summary)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ## User Intent
+          User requested progressive education on Python decorators:
+          1. "Explain Python decorators with a simple example"
+          2. "Now show me a decorator that logs function arguments"
+          3. "How do I make a decorator that accepts parameters?"
+          
+          User is learning decorator progression from basic to advanced (parameterized).
+          
+          ## Completed Work
+          - **Response 1**: Basic decorator explanation with gift-wrapping analogy, simple example (`my_decorator`), example with arguments (`*args, **kwargs`), and real-world use cases
+          - **Response 2**: Logging decorator (`log_arguments`) in three versions:
+            - Basic version logging args, kwargs, and result
+            - Enhanced version with formatted output and emoji
+            - Professional version using `functools.wraps` to preserve function metadata
+            - Applied to example functions: `greet()`, `multiply()`, `calculate()`, `divide()`
+          - **Response 3**: Parameterized decorators with multiple practical examples:
+            - `repeat(times)` - repeats function execution N times
+            - `rate_limit(max_calls, time_window)` - rate limiting with time window tracking
+            - `log_with_prefix(prefix)` - logging with custom prefix parameter
+            - `validate_types(**type_checks)` - type validation decorator (incomplete—cut off mid-example)
+          
+          ## Errors & Corrections
+          None identified. User progression was linear and organic.
+          
+          ## Active Work
+          **Incomplete**: Third response cut off mid-execution. Last example was `validate_types(**type_checks)` decorator demonstrating keyword argument type validation. The final test case `print(create_user` was incomplete (no closing parenthesis). This should be resumed with the complete example showing both successful call and error case.
+          
+          ## Pending Tasks
+          - Complete/deliver the final `
+  markdown cell:
+    source:
+      As a result, the user experiences a wait whenever compaction occurs. It is only a few seconds in this example, but when compacting a long context the wait can be much longer.
+  markdown cell:
+    source:
+      ## Instant Compaction
+  markdown cell:
+    source:
+      
+      The key insight: **build the session memory in the background** so it's ready when you need it.
+      
+      ```
+      Turn 1 → Turn 2 → ... → Turn K  → Turn K+1 → ... → CONTEXT FULL!
+                                 │           │                 │
+                           (threshold)  (update)          INSTANT!
+                                 ↓           ↓                 ↓
+                          [Background]  [Background]    [Just swap in
+                           memory init   memory update   pre-built memory]
+      ```
+      
+      This `InstantCompactingChatSession` class uses **threading** for background execution:
+      1. **`threading.Thread`** - runs memory updates in background without blocking
+      2. **Thread-safe state** - uses `threading.Lock` to safely update shared memory
+      3. **Daemon threads** - background work doesn't prevent program exit
+      4. **Instant compaction** - when context is full, just swap in the pre-built memory
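The four points above can be illustrated in isolation, without any API calls. The following is a minimal sketch, not the notebook's implementation: `BackgroundSummary` and `slow_summarize` are hypothetical names, and the `time.sleep` call stands in for the summarization request.

```python
import threading
import time


class BackgroundSummary:
    """Hypothetical helper: keeps a summary refreshed by a background thread."""

    def __init__(self, summarize):
        self._summarize = summarize      # any callable: list[str] -> str
        self._lock = threading.Lock()    # guards self.summary
        self._thread = None
        self.summary = None

    def trigger(self, messages):
        """Start a refresh unless one is already running. Returns True if started."""
        if self._thread is not None and self._thread.is_alive():
            return False
        snapshot = list(messages)        # copy so later appends do not race
        self._thread = threading.Thread(
            target=self._run, args=(snapshot,), daemon=True
        )
        self._thread.start()
        return True

    def _run(self, snapshot):
        result = self._summarize(snapshot)   # slow work happens outside the lock
        with self._lock:
            self.summary = result

    def wait(self, timeout=None):
        """Block until the current refresh finishes (or the timeout expires)."""
        if self._thread is not None:
            self._thread.join(timeout)


def slow_summarize(messages):
    time.sleep(0.2)                      # stand-in for a model call
    return f"{len(messages)} messages summarized"


bg = BackgroundSummary(slow_summarize)
started = bg.trigger(["hi", "hello"])    # returns immediately
second = bg.trigger(["ignored"])         # skipped: a refresh is still running
bg.wait(timeout=5)
print(started, second, bg.summary)
```

Holding the lock only while assigning the result, rather than for the whole summarization call, keeps the foreground thread from ever waiting on the slow work.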
+  code cell:
+    source:
+      import threading
+      import time
+      
+      
+      class InstantCompactingChatSession:
+          """
+          Maintains session memory via incremental background updates.
+          
+          Key insight: By updating memory in the background after each turn,
+          the summary is already ready when compaction is needed - instant swap!
+          """
+      
+          def __init__(
+              self,
+              context_limit: int = 1500,
+              min_tokens_to_init: int = 700,
+              min_tokens_between_updates: int = 300,
+          ):
+              # Thresholds
+              self.context_limit = context_limit  # token count at which the conversation is compacted so it stays within model limits
+              self.min_tokens_to_init = min_tokens_to_init  # token count that triggers initial memory creation; unlike traditional compaction, this runs PROACTIVELY in the background
+              self.min_tokens_between_updates = min_tokens_between_updates  # new tokens required since the last update before another background memory update is triggered; only applies once the initial memory exists
+      
+              # Conversation state
+              self.messages = []
+              self.current_context_window_tokens = 0
+      
+              # Session memory state
+              self.session_memory = None # this is the compacted conversation in session memory; for the demo we are storing this in memory, but in production you would write to session_memory.md file
+              self.last_summarized_index = 0 # The index of the last message included in the session memory
+              self.tokens_at_last_update = 0 # context token count at the last memory update; used by _should_update_memory()
+      
+              # Background update tracking
+              self._update_thread: threading.Thread | None = None
+              self.last_update_time = None
+              self._lock = threading.Lock()
+      
+          def chat(self, user_message: str):
+              """Process a chat turn with background session memory updates."""
+              if self.current_context_window_tokens >= self.context_limit:
+                  self.compact() # note that when this is triggered, the compaction has already been created and is just swapped in instantly
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="You are a helpful coding assistant.",
+                  messages=self.messages,
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              # approximate current token count in the conversation before the next user message
+              self.current_context_window_tokens = response.usage.input_tokens + response.usage.output_tokens
+      
+              # KEY DIFFERENCE: Trigger background memory update if needed proactively, before compaction is needed
+              background_status = None
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  background_status = "initializing" if self.session_memory is None else "updating"
+      
+              return assistant_message, response.usage, background_status
+          
+          # Helper methods to determine when to init/update/compact
+          def _should_init_memory(self) -> bool:
+              return (
+                  self.session_memory is None
+                  and self.current_context_window_tokens >= self.min_tokens_to_init
+              )
+      
+          # Helper method to determine if memory should be updated
+          def _should_update_memory(self) -> bool:
+              if self.session_memory is None:
+                  return False
+              tokens_since = self.current_context_window_tokens - self.tokens_at_last_update
+              return tokens_since >= self.min_tokens_between_updates
+      
+          # Methods to create initial session memory
+          def _create_session_memory(self, messages: list[dict]) -> str:
+              """Generate initial session memory from messages."""
+              transcript = build_transcript(messages)
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="""You are a session memory agent. Compress the conversation into a structured summary 
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's 
+      ability to continue working, not human readability.""",
+                  messages=[
+                      {
+                          "role": "user",
+                          "content": f"""Conversation transcript:
+      {transcript}
+      
+      Create session memory using these instructions:
+      {SESSION_CREATION_PROMPT}
+      
+      First analyze in <think>...</think> tags, then output the structured summary.""",
+                      }
+                  ],
+              )
+              summary, _ = remove_thinking_blocks(response.content[0].text)  # clean up any <think> blocks because they are not needed in the session memory
+              return summary
+      
+          def _update_session_memory(self, new_messages: list[dict]) -> str:
+              """Update existing session memory with new messages. In practice, you may want to do this via file edit rather than full re-generation. But for demo purposes we do full regeneration here."""
+              transcript = build_transcript(new_messages)
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=1024,
+                  system="""You are a session memory agent. Update the existing session memory with new information 
+      from the recent conversation. Preserve important existing details while integrating new content.""",
+                  messages=[
+                      {
+                          "role": "user",
+                          "content": f"""Current session memory:
+      {self.session_memory}
+      
+      New messages to integrate:
+      {transcript}
+      
+      Update the session memory following these guidelines:
+      {SESSION_CREATION_PROMPT}
+      
+      First analyze in <think>...</think> tags, then output only the updated structured summary.""",
+                      }
+                  ],
+              )
+              updated_summary, _ = remove_thinking_blocks(response.content[0].text)  # clean up any <think> blocks because they are not needed in the session memory
+              return updated_summary
+      
+          # Background memory update methods
+          def _background_memory_update(
+              self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
+          ):
+              """Run session memory update in a background thread."""
+              try:
+                  with self._lock:
+                      current_session_memory = self.session_memory
+                      last_index = self.last_summarized_index
+      
+                  if current_session_memory is None:
+                      new_memory = self._create_session_memory(messages_snapshot)
+                  else:
+                      # Get new messages since last summary
+                      new_messages = messages_snapshot[last_index :]
+                      if not new_messages:
+                          return
+                      new_memory = self._update_session_memory(new_messages)
+      
+                  # Update state (thread-safe)
+                  with self._lock:
+                      self.session_memory = new_memory
+                      self.last_summarized_index = snapshot_index
+                      self.tokens_at_last_update = current_tokens
+                      self.last_update_time = time.time()
+      
+              except Exception as e:
+                  print(f"   [Background] Error updating memory: {e}")
+      
+          # This makes sure only one background update runs at a time. If one is already running, we skip starting another. If not, we start a new thread to do the update.
+          def _trigger_background_update(self):
+              """Trigger a background session memory update."""
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  return
+      
+              messages_snapshot = self.messages.copy()
+              snapshot_index = len(messages_snapshot)
+              current_tokens = self.current_context_window_tokens
+      
+              self._update_thread = threading.Thread(
+                  target=self._background_memory_update,
+                  args=(messages_snapshot, snapshot_index, current_tokens),
+                  daemon=True,
+              )
+              self._update_thread.start()
+      
+          # Function to compact
+          def compact(self):
+              """INSTANT compaction using pre-built session memory."""
+              prev_msg_count = len(self.messages)
+      
+              # Ensure session memory is ready. Shouldn't be an issue normally, but here for safety.
+              if self.session_memory is None:
+                  if self._update_thread is not None and self._update_thread.is_alive():
+                      print("   ⏳ Waiting for background memory update...")
+                      self._update_thread.join(timeout=30.0)
+      
+                  if self.session_memory is None:
+                      print("   ⚠️  No pre-built memory, creating synchronously...")
+                      start = time.perf_counter()
+                      self.session_memory = self._create_session_memory(self.messages)
+                      elapsed = time.perf_counter() - start
+                      print(f"   ⏱️  Took {elapsed:.2f}s (but should be instant normally!)")
+                      self.last_summarized_index = len(self.messages)
+      
+              with self._lock:
+                  unsummarized = self.messages[self.last_summarized_index :]
+      
+                  summary_message = {
+                      "role": "user",
+                      "content": f"""This session is being continued from a previous conversation.
+      
+          Here is the session memory:
+          {self.session_memory}
+      
+          Continue from where we left off.""",
+                  }
+      
+                  self.messages = [summary_message] + unsummarized
+                  self.last_summarized_index = 1
+      
+                  print(f"\n{'=' * 60}")
+                  print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
+                  print(f"   Session memory was pre-built (no wait time!)")
+                  print(f"{'=' * 60}")
+      
+          
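The swap performed in `compact()` can be checked on its own with plain lists. This is a sketch under the assumption that the summary covers every message before `last_summarized_index`; `compact_messages` is a hypothetical standalone function, not part of the class above.

```python
def compact_messages(messages, session_memory, last_summarized_index):
    """Return (new_messages, new_index) after swapping in the summary.

    Messages before last_summarized_index are assumed to be covered by
    session_memory; anything after it is carried over verbatim.
    """
    unsummarized = messages[last_summarized_index:]
    summary_message = {
        "role": "user",
        "content": (
            "This session is being continued from a previous conversation.\n\n"
            f"Here is the session memory:\n{session_memory}\n\n"
            "Continue from where we left off."
        ),
    }
    # The summary now occupies index 0, so the next unsummarized message is at 1.
    return [summary_message] + unsummarized, 1


history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},
    {"role": "assistant", "content": "a3"},
]

# The background summary only covered the first four messages.
compacted, new_index = compact_messages(history, "summary of q1-a2", 4)
print(len(history), "->", len(compacted))
```

Carrying the unsummarized tail over is what makes it safe to compact even when the latest background update has not caught up with the most recent turn.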
+  markdown cell:
+    source:
+      ### Example use of Instant Compaction
+  code cell:
+    source:
+      # Low thresholds for demo - in production you'd use higher values
+      session = InstantCompactingChatSession(
+          context_limit=1500,
+          min_tokens_to_init=700,
+          min_tokens_between_updates=300,
+      )
+      
+      messages = [
+          "Explain Python decorators with a simple example.",
+          "Now show me a decorator that logs function arguments.",
+          "How do I make a decorator that accepts parameters?",
+      ]
+      print("Starting conversation with instant compacting chat session...\n")
+      
+      turn_count = 0
+      for i, message in enumerate(messages, 1):
+          response, usage, background_status = session.chat(message)
+          turn_count += 1
+          print(
+              f"==============================================\n"
+              f"Turn {turn_count}: "
+          )
+          print(f"\nUser: {message}")
+          print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+          print(f"\n \nTurn end state: "
+              f"\nInput={usage.input_tokens:,} |"
+              f"Output={usage.output_tokens:,} | "
+              f"Messages={len(session.messages)} | "
+              f"Memory: {'ready' if session.session_memory else 'no memory created yet'}\n"
+          )
+          if background_status:
+              print("On the next response, the current conversation tokens will be:", session.current_context_window_tokens)
+              print(f"   [Background] Proactively {background_status} session memory...")
+          # Sleep to allow background updates to complete for demo purposes
+          if i < len(messages):
+              time.sleep(5)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with instant compacting chat session...
+          
+          ==============================================
+          Turn 1: 
+          
+          User: Explain Python decorators with a simple example.
+          
+          Assistant: 
+          # Python Decorators Explained
+          
+          A **decorator** is a function that modifies or enhances another function or class without permanently changing its source code. It wraps a function with additional functionality.
+          ... (66 more lines)
+          
+           
+          Turn end state: 
+          Input=41 |Output=418 | Messages=2 | Memory: no memory created yet
+          
+          ==============================================
+          Turn 2: 
+          
+          User: Now show me a decorator that logs function arguments.
+          
+          Assistant: 
+          # Function Argument Logging Decorator
+          
+          Here are several approaches, from simple to advanced:
+          ... (104 more lines)
+          
+           
+          Turn end state: 
+          Input=472 |Output=750 | Messages=4 | Memory: no memory created yet
+          
+          On the next response, the current conversation tokens will be: 1222
+             [Background] Proactively initializing session memory...
+          ==============================================
+          Turn 3: 
+          
+          User: How do I make a decorator that accepts parameters?
+          
+          Assistant: 
+          # Decorators with Parameters
+          
+          To create a decorator that accepts parameters, you need an extra layer of nesting. Here's how:
+          ... (143 more lines)
+          
+           
+          Turn end state: 
+          Input=1,235 |Output=1,024 | Messages=6 | Memory: ready
+          
+          On the next response, the current conversation tokens will be: 2259
+             [Background] Proactively updating session memory...
+  code cell:
+    source:
+      message = "What did we just talk about?"
+      response, usage, background_status = session.chat(message)
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print(f"\n \nTurn end state: "
+          f"\nInput={usage.input_tokens:,} |"
+          f"Output={usage.output_tokens:,} | "
+          f"Messages={len(session.messages)} | "
+          f"Memory: {'ready' if session.session_memory else 'no memory created yet'}\n"
+      )
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 6 → 1
+             Kept 0 unsummarized messages
+             Session memory was pre-built (no wait time!)
+          ============================================================
+          
+          User: What did we just talk about?
+          
+          Assistant: 
+          # Session Summary
+          
+          We just completed a comprehensive discussion on **parameterized decorators** (your third request in this session).
+          ... (33 more lines)
+          
+           
+          Turn end state: 
+          Input=778 |Output=386 | Messages=3 | Memory: ready
+          
+  code cell:
+    source:
+      message = "What are some good follow up topics we should cover?"
+      response, usage, background_status = session.chat(message)
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print(f"\n \nTurn end state: "
+          f"\nInput={usage.input_tokens:,} |"
+          f"Output={usage.output_tokens:,} | "
+          f"Messages={len(session.messages)} | "
+          f"Memory: {'ready' if session.session_memory else 'no memory created yet'}\n"
+      )
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          User: What are some good follow up topics we should cover?
+          
+          Assistant: 
+          # Recommended Follow-Up Topics
+          
+          Based on your progression through decorator fundamentals → logging → parameterized decorators, here are the most logical next steps:
+          ... (63 more lines)
+          
+           
+          Turn end state: 
+          Input=1,178 |Output=723 | Messages=5 | Memory: ready
+          
+  markdown cell:
+    source:
+      ## Advanced: Adding Prompt Caching
+  markdown cell:
+    source:
+      
+      The background updates can be made **~80% cheaper** by using prompt caching, because cached input tokens are billed at roughly a tenth of the normal input price. The trick:
+      1. Pass the **full conversation** to the background summarizer
+      2. Add `cache_control` markers so subsequent requests hit the cache
+      3. Only the new "summarize this" instruction is billed at full price
+      
+      ```
+      ┌─────────────────────────────────────────────────────────────────────────────────┐
+      │                    PROMPT CACHING FOR LONG CONVERSATIONS                        │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  WITHOUT CACHING: Pay full price for entire context every turn                 │
+      │  ════════════════════════════════════════════════════════════                   │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]                         →  500 tokens  @ $3/M │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]           → 1500 tokens  @ $3/M │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3]... → 3000 tokens  @ $3/M │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3]... → 5000 tokens  @ $3/M │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: 10,000 tokens = $0.030      │
+      │                                                                                 │
+      │                                                                                 │
+      │  WITH CACHING: Pay full price once, then 90% discount on prefix                │
+      │  ═══════════════════════════════════════════════════════════════                │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]◆                        →  500 tokens  @ $3/M │
+      │                                ▲                            (cache created)    │
+      │                          cache breakpoint                                       │
+      │                                                                                 │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]◆                                │
+      │           ╰─────── cached ──────╯                                              │
+      │                500 @ $0.30/M + 1000 new @ $3/M  =  $0.0032                     │
+      │                                                                                 │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3][Asst3]◆                  │
+      │           ╰──────────── cached ─────────────╯                                  │
+      │               1500 @ $0.30/M + 1500 new @ $3/M  =  $0.0050                     │
+      │                                                                                 │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3][Asst3][User4][Asst4]◆    │
+      │           ╰───────────────────── cached ─────────────────────╯                 │
+      │                     3000 @ $0.30/M + 2000 new @ $3/M  =  $0.0069               │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: $0.0166  (45% savings)     │
+      │                                                                                 │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  COMPACTION + CACHING: Double benefit                                           │
+      │  ════════════════════════════════════                                           │
+      │                                                                                 │
+      │    Main Chat                      Background Summarizer                         │
+      │    ─────────                      ─────────────────────                         │
+      │                                                                                 │
+      │  [Conversation grows...]          [Same conversation prefix]◆ + [Summarize!]   │
+      │         │                                    │                                  │
+      │         │                         Cache hit! Only pays for                      │
+      │         │                         the summarization prompt                      │
+      │         │                                    │                                  │
+      │         ▼                                    ▼                                  │
+      │  Context limit reached  ──────►  Session memory ready instantly                │
+      │                                  (built cheaply in background)                  │
+      │                                                                                 │
+      │  ┌──────────────────────────────────────────────────────────────────────────┐  │
+      │  │  Key insight: The background summarizer reuses the same conversation     │  │
+      │  │  prefix that was just sent to the main chat - automatic cache hit!       │  │
+      │  └──────────────────────────────────────────────────────────────────────────┘  │
+      │                                                                                 │
+      └─────────────────────────────────────────────────────────────────────────────────┘
+      
+      ◆ = cache_control breakpoint (cache everything before this point)
+      ```
+      
+      ### Why this matters for compaction
+      
+      | Scenario | Cost per background update | Notes |
+      |----------|---------------------------|-------|
+      | No caching | Full input cost | 5,000 tokens × $3/M = $0.015 |
+      | With caching | ~20% of full input cost | 500 new × $3/M + 4,500 cached × $0.30/M ≈ $0.003 |
+      | **Savings** | **~80%** | Compounds over many updates |
+      
+      The longer the conversation, the bigger the savings—exactly when you need compaction most!
+  markdown cell:
+    source:
+      ### How the Caching Works
+      
+      The key is in `_add_cache_control()` and `_create_session_memory()`:
+      
+      ```python
+      # 1. Mark the last conversation message with cache_control
+      {
+          "role": "user",
+          "content": [{
+              "type": "text",
+              "text": msg["content"],
+              "cache_control": {"type": "ephemeral"}  # <-- This creates a cache breakpoint
+          }]
+      }
+      
+      # 2. Also mark the system prompt
+      system=[{
+          "type": "text",
+          "text": "You are a session memory agent...",
+          "cache_control": {"type": "ephemeral"}
+      }]
+      ```
+      
+      **Why this works:**
+      - The first background update creates a cache entry for `[System + Messages]`
+      - Subsequent updates with the same message prefix get **cache hits**
+      - Only the new summarization instruction is billed at full price
+      - Cache entries have a 5-minute TTL, so rapid updates benefit most
+      
+      **Cost math:**
+      - Without caching: 5,000 tokens × $3.00/1M = $0.015 per update
+      - With caching: 500 new tokens × $3.00/1M + 4,500 cached × $0.30/1M = $0.00285
+      - **Savings: ~80%** on background summarization costs
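The cost math above can be checked with a short script. This is only a sketch under the rates quoted in this cell ($3.00 per million input tokens, $0.30 per million cached reads); `update_cost` is a hypothetical helper, and it ignores output tokens and any surcharge for writing the cache.

```python
INPUT_PRICE_PER_MTOK = 3.00        # base input price assumed in the cost math above
CACHE_READ_PRICE_PER_MTOK = 0.30   # cached-read price assumed above (10% of base)


def update_cost(total_tokens: int, cached_tokens: int = 0) -> float:
    """Input cost in dollars for one background update (hypothetical helper)."""
    new_tokens = total_tokens - cached_tokens
    return (
        new_tokens * INPUT_PRICE_PER_MTOK
        + cached_tokens * CACHE_READ_PRICE_PER_MTOK
    ) / 1_000_000


uncached = update_cost(5_000)
cached = update_cost(5_000, cached_tokens=4_500)
savings = 1 - cached / uncached

print(f"uncached=${uncached:.5f} cached=${cached:.5f} savings={savings:.0%}")
```

Because only the uncached tail is billed at the full rate, the saving grows with the share of the prompt that is already cached, which is why longer conversations benefit more.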
+  code cell:
+    source:
+      SMARTER_MODEL = "claude-sonnet-4-5-20250929"
+      
+      
+      class CachedInstantCompactingChatSession(InstantCompactingChatSession):
+          """Instant compacting session with prompt caching enabled."""
+          
+          def _add_cache_control(self, messages: list[dict]) -> list[dict]:
+              """Convert all messages to list format for consistent structure, with cache_control on the last message.
+      
+              For prompt caching to work, the message prefix structure must be identical between requests.
+              If we only convert the last message to list format, previous messages change from list→string
+              on the next turn, breaking the cache match.
+              """
+              if not messages:
+                  return messages
+      
+              cached_messages = []
+              for i, msg in enumerate(messages):
+                  is_last = (i == len(messages) - 1)
+                  content_block = {
+                      "type": "text",
+                      "text": msg["content"],
+                  }
+                  if is_last:
+                      content_block["cache_control"] = {"type": "ephemeral"}
+      
+                  cached_messages.append({
+                      "role": msg["role"],
+                      "content": [content_block],
+                  })
+      
+              return cached_messages
+      
+          def chat(self, user_message: str):
+              if self.current_context_window_tokens >= self.context_limit:
+                  self.compact()
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=SMARTER_MODEL,
+                  max_tokens=1024,
+                  system="You are a helpful coding assistant.",
+                  messages=self._add_cache_control(self.messages),
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
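+              # input_tokens excludes cached tokens, so add cache reads and writes back in
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              cache_created = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
+              self.current_context_window_tokens = (
+                  response.usage.input_tokens + cache_read + cache_created + response.usage.output_tokens
+              )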
+      
+              background_status = None
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  background_status = "initializing" if self.session_memory is None else "updating"
+      
+              return assistant_message, response.usage, background_status
+      
+          def _create_session_memory(self, messages: list[dict]) -> str:
+              transcript = build_transcript(messages)
+      
+              prompt_messages = [
+                  {
+                      "role": "user",
+                      "content": [
+                          {
+                              "type": "text",
+                              "text": f"""Conversation transcript:
+      {transcript}
+      
+      Create session memory using these instructions:
+      {SESSION_CREATION_PROMPT}
+      
+      First analyze in <think>...</think> tags, then output the structured summary.""",
+                              "cache_control": {"type": "ephemeral"},
+                          }
+                      ],
+                  }
+              ]
+      
+              response = client.messages.create(
+                  model=SMARTER_MODEL,
+                  max_tokens=1024,
+                  system="""You are a session memory agent. Compress the conversation into a structured summary 
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's 
+      ability to continue working, not human readability.""",
+                  messages=prompt_messages,
+              )
+              summary, _ = remove_thinking_blocks(response.content[0].text)
+              return summary
+      
+          def _update_session_memory(self, new_messages: list[dict]) -> str:
+              transcript = build_transcript(new_messages)
+      
+              prompt_messages = [
+                  {
+                      "role": "user",
+                      "content": [
+                          {
+                              "type": "text",
+                              "text": f"""Current session memory:
+      {self.session_memory}
+      
+      New messages to integrate:
+      {transcript}
+      
+      Update the session memory following these guidelines:
+      {SESSION_CREATION_PROMPT}
+      
+      First analyze in <think>...</think> tags, then output the updated structured summary.""",
+                              "cache_control": {"type": "ephemeral"},
+                          }
+                      ],
+                  }
+              ]
+      
+              response = client.messages.create(
+                  model=SMARTER_MODEL,
+                  max_tokens=1024,
+                  system="""You are a session memory agent. Update the existing session memory with new information 
+      from the recent conversation. Preserve important existing details while integrating new content.""",
+                  messages=prompt_messages,
+              )
+              updated_summary, _ = remove_thinking_blocks(response.content[0].text)
+              return updated_summary
+  code cell:
+    source:
+      # Low thresholds for demo - in production you'd use higher values
+      session = CachedInstantCompactingChatSession(
+          context_limit=2500,
+          min_tokens_to_init=1000,
+          min_tokens_between_updates=500,
+      )
+      
+      messages = [
+          "Explain Python decorators with a simple example.",
+          "Now show me a decorator that logs function arguments.",
+          "How do I make a decorator that accepts parameters?",
+      ]
+      print("Starting conversation with CACHED instant compacting chat session...\n")
+      
+      turn_count = 0
+      for i, message in enumerate(messages, 1):
+          response, usage, bg_status = session.chat(message)
+          turn_count += 1
+          
+          # Cache stats
+          cache_created = getattr(usage, 'cache_creation_input_tokens', 0) or 0
+          cache_read = getattr(usage, 'cache_read_input_tokens', 0) or 0
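+          # input_tokens counts only uncached tokens, so compute the rate over all input tokens
+          total_input = usage.input_tokens + cache_read + cache_created
+          cache_hit_pct = (cache_read / total_input * 100) if total_input > 0 else 0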
+          
+          print(
+              f"==============================================\n"
+              f"Turn {turn_count}:"
+          )
+          print(f"\nUser: {message}")
+          print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+          print(
+              f"\nTurn end state:"
+              f"\n  Input={usage.input_tokens:,} | Output={usage.output_tokens:,}"
+              f"\n  Cache: {cache_read:,} read, {cache_created:,} created ({cache_hit_pct:.0f}% hit rate)"
+              f"\n  Messages={len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}"
+          )
+          
+          if bg_status:
+              print(f"\n  [Background] Proactively {bg_status} session memory...")
+          
+          if i < len(messages):
+              time.sleep(5)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with CACHED instant compacting chat session...
+          
+          ==============================================
+          Turn 1:
+          
+          User: Explain Python decorators with a simple example.
+          
+          Assistant: 
+          # Python Decorators Explained
+          
+          ## What is a Decorator?
+          ... (66 more lines)
+          
+          Turn end state:
+            Input=41 | Output=422
+            Cache: 0 read, 0 created (0% hit rate)
+            Messages=2 | Memory: not yet
+          ==============================================
+          Turn 2:
+          
+          User: Now show me a decorator that logs function arguments.
+          
+          Assistant: 
+          # Decorator that Logs Function Arguments
+          
+          ## Basic Logging Decorator
+          ... (124 more lines)
+          
+          Turn end state:
+            Input=476 | Output=1,024
+            Cache: 0 read, 0 created (0% hit rate)
+            Messages=4 | Memory: not yet
+          
+            [Background] Proactively initializing session memory...
+          ==============================================
+          Turn 3:
+          
+          User: How do I make a decorator that accepts parameters?
+          
+          Assistant: 
+          # Decorators with Parameters
+          
+          ## The Pattern: Three Levels of Functions
+          ... (144 more lines)
+          
+          Turn end state:
+            Input=1,516 | Output=1,024
+            Cache: 0 read, 0 created (0% hit rate)
+            Messages=6 | Memory: not yet
+          
+            [Background] Proactively initializing session memory...
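A hit rate is most meaningful when measured against all input tokens, since the API reports cache reads and cache writes separately from `input_tokens`. The sketch below shows one way to compute it; `cache_hit_rate` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in for `response.usage` with made-up numbers.

```python
from types import SimpleNamespace


def cache_hit_rate(usage) -> float:
    """Percentage of all input tokens that were served from the prompt cache."""
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total_input = usage.input_tokens + cache_read + cache_created
    return cache_read / total_input * 100 if total_input else 0.0


# Stand-in for response.usage; the numbers are invented for illustration.
example_usage = SimpleNamespace(
    input_tokens=500,
    cache_read_input_tokens=4_500,
    cache_creation_input_tokens=0,
    output_tokens=300,
)
print(f"{cache_hit_rate(example_usage):.0f}% hit rate")
```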
+  code cell:
+    source:
+      message = "What did we just talk about?"
+      response, usage, background_status = session.chat(message)
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print(f"\n \nTurn end state: "
+          f"\nInput={usage.input_tokens:,} |"
+          f"Output={usage.output_tokens:,} | "
+          f"Messages={len(session.messages)} | "
+          f"Memory: {'ready' if session.session_memory else 'no memory created yet'}\n"
+      )
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 6 → 3
+             Session memory was pre-built (no wait time!)
+          ============================================================
+          
+          User: What did we just talk about?
+          
+          Assistant: 
+          We just talked about **decorators with parameters** in Python!
+          
+          ## Quick Summary:
+          ... (28 more lines)
+          
+           
+          Turn end state: 
+          Input=1,780 |Output=347 | Messages=5 | Memory: ready
+          
+  code cell:
+    source:
+      # Debug: Print the structure of _add_cache_control output
+      import json
+      
+      cached_messages = session._add_cache_control(session.messages)
+      print("Number of messages:", len(cached_messages))
+      print("\nStructure of cached messages:")
+      for i, msg in enumerate(cached_messages):
+          print(f"\n--- Message {i} (role: {msg['role']}) ---")
+          if isinstance(msg.get('content'), list):
+              print(f"Content is a list with {len(msg['content'])} item(s)")
+              for j, block in enumerate(msg['content']):
+                  print(f"  Block {j}: type={block.get('type')}, has cache_control={('cache_control' in block)}")
+                  if 'cache_control' in block:
+                      print(f"    cache_control: {block['cache_control']}")
+          else:
+              content_preview = str(msg.get('content', ''))[:100]
+              print(f"Content is string: {content_preview}...")

Generated by nbdime

@github-actions github-actions Bot left a comment

PR Review

Recommendation: REQUEST_CHANGES

Summary

This PR adds a comprehensive notebook demonstrating instant compaction using session memory with background threading patterns. The code is functional and demonstrates valuable production patterns, but requires structural improvements to meet cookbook standards before merging.

Actionable Feedback (8 items)

Critical Issues

  • Add notebook entry to registry.yaml with title, description, path, authors, and categories (see CLAUDE.md Section "Adding a New Cookbook")
  • Add author jsham042 to authors.yaml if this is a new contributor
  • misc/session_memory.md:1 - Either delete this placeholder file or implement it with actual documentation content. Single-line placeholders provide no value.
  • Add Prerequisites & Setup section after the introduction with required knowledge, tools, pip install commands, dotenv setup, and MODEL constant
  • Rewrite introduction (cells 1-3) to follow problem-focused learning objective pattern per style guide Section 1

Important Issues

  • Add conclusion section that maps back to learning objectives, suggests next steps, and links to related resources (style guide Section 4)
  • Add explanatory text after major code blocks (cells 15, 23) explaining key implementation details and what was learned (style guide Section 3)
  • Add return type annotations to functions: remove_thinking_blocks() in cell 6 should return -> tuple[str, str]

Detailed Review

Code Quality

Strengths:

  • Working implementation with proper thread safety using threading.Lock() context managers
  • Good use of daemon threads for non-blocking background updates
  • Clear separation between initialization and update thresholds
  • Helpful utility functions (truncate_response, build_transcript, remove_thinking_blocks)
  • Uses current Claude models per project standards (claude-haiku-4-5-20251001)
  • Proper cost optimization section demonstrating prompt caching (~80% savings)
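
For readers without the notebook open, the lock-plus-daemon-thread pattern noted above might look roughly like the following. This is a minimal sketch, not the notebook's code: `MemoryHolder` and `fake_summarize` are hypothetical names, and the summarizer is a local stub rather than an API call.

```python
import threading
from typing import Optional


def fake_summarize(messages: list) -> str:
    """Stub standing in for the model call that would build session memory."""
    return f"summary of {len(messages)} messages"


class MemoryHolder:
    """Holds session memory that a daemon thread refreshes in the background."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.session_memory: Optional[str] = None
        self._worker: Optional[threading.Thread] = None

    def trigger_background_update(self, messages: list) -> None:
        snapshot = list(messages)  # copy so later appends don't race the worker

        def work() -> None:
            summary = fake_summarize(snapshot)  # slow call runs outside the lock
            with self._lock:
                self.session_memory = summary

        self._worker = threading.Thread(target=work, daemon=True)
        self._worker.start()

    def read_memory(self) -> Optional[str]:
        with self._lock:
            return self.session_memory

    def wait(self) -> None:
        """Block until the current background update finishes, if any."""
        if self._worker is not None:
            self._worker.join()


holder = MemoryHolder()
holder.trigger_background_update(["hi", "hello", "explain decorators"])
holder.wait()
print(holder.read_memory())
```

Holding the lock only for the final assignment keeps the chat thread from blocking while the summary is being generated.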

Issues:

  • Missing return type annotations on several functions
  • Import organization could be improved (standard library before third-party within cells)
  • Variable naming: SMARTER_MODEL is vague, consider SONNET_MODEL or CACHED_MODEL

Security

Good practices:

  • No hardcoded API keys
  • Uses client = anthropic.Anthropic() which reads from environment

Missing:

  • No python-dotenv setup in Prerequisites section
  • Should demonstrate load_dotenv() pattern explicitly per CLAUDE.md guidelines

Notebook Structure

Missing Required Sections:

  1. Prerequisites & Setup - Notebook jumps directly into content without setup instructions
  2. Conclusion - Notebook ends abruptly without mapping back to learning objectives or suggesting next steps

Introduction Issues:

  • Current intro explains the solution before establishing why the problem matters
  • Missing clear learning objectives
  • Doesn't follow the problem-focused pattern per style guide Section 1

Educational Value

Strengths:

  • Excellent teaching progression: problem (traditional) → solution (instant) → optimization (caching)
  • ASCII diagrams effectively illustrate timing differences
  • Practical demo with low token thresholds makes mechanics clear

Improvements Needed:

  • Add metrics/observability section showing how to track compaction effectiveness
  • Enhance error handling examples to show production-ready patterns
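
One possible shape for the metrics item is to record each compaction event and derive a few aggregates from the log. The sketch below is illustrative only; `CompactionEvent`, `CompactionMetrics`, and their fields are hypothetical names, not part of the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class CompactionEvent:
    tokens_before: int
    tokens_after: int
    wait_seconds: float  # time the user was blocked; near 0 for instant compaction


@dataclass
class CompactionMetrics:
    events: list = field(default_factory=list)

    def record(self, tokens_before: int, tokens_after: int, wait_seconds: float) -> None:
        self.events.append(CompactionEvent(tokens_before, tokens_after, wait_seconds))

    def average_compression_ratio(self) -> float:
        """Mean of tokens_after / tokens_before; lower means more compression."""
        if not self.events:
            return 0.0
        ratios = [e.tokens_after / e.tokens_before for e in self.events]
        return sum(ratios) / len(ratios)

    def total_wait_seconds(self) -> float:
        return sum(e.wait_seconds for e in self.events)


metrics = CompactionMetrics()
metrics.record(tokens_before=10_000, tokens_after=2_000, wait_seconds=0.0)
metrics.record(tokens_before=8_000, tokens_after=2_000, wait_seconds=4.0)
print(metrics.average_compression_ratio(), metrics.total_wait_seconds())
```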

Positive Notes

This notebook demonstrates sophisticated patterns that will be valuable to the community:

  • Real-world production patterns for long-context applications
  • Proper threading with thread safety considerations
  • Cost optimization with prompt caching
  • Clear comparison between traditional and instant approaches
  • The code actually works and runs top-to-bottom successfully

Registry & Discovery

Blocker: The notebook is not listed in registry.yaml, making it undiscoverable in the cookbook index. This is a required step per CLAUDE.md.

Suggested entry:

- title: Instant compaction with session memory
  description: Implement proactive background memory updates for instant context compaction without user wait time.
  path: misc/session_memory_compaction.ipynb
  authors:
  - jsham042
  date: '2026-01-09'
  categories:
  - Agent Patterns
  - Context Management

Also verify that jsham042 exists in authors.yaml or add author information if this is a new contributor.

@github-actions

Notebook Changes

This PR modifies the following notebooks:

📓 misc/session_memory_compaction.ipynb

View diff
nbdiff /dev/null misc/session_memory_compaction.ipynb (5f52ba8e0641aa90a12a0eb2e43c084d21653968)
--- /dev/null  2026-01-10 03:19:34.523816
+++ misc/session_memory_compaction.ipynb (5f52ba8e0641aa90a12a0eb2e43c084d21653968)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Session memory compaction
+  markdown cell:
+    source:
+      This cookbook covers three main topics:
+      1. Writing a quality prompt to compact session chat history
+      2. Utilizing instant compaction to improve the user's chat experience
+      3. Managing costs and latency for long context conversations using prompt caching
+  markdown cell:
+    source:
+      ## Fundamentals: writing a compaction prompt
+  markdown cell:
+    source:
+      Make sure you have a well-structured session memory prompt.
+      
+      Some best practices include:
+      - Use chain-of-thought before summarizing — analyze first, then output                                                                                         
+      - Enumerate exactly what to preserve: file paths, code snippets, errors, user corrections                                                                      
+      - Weight recency heavily — the end of the conversation is the active context                                                                                   
+      - Require verbatim quotes for next steps to prevent task drift                                                                                                 
+      - Use structured sections with token budgets per section                                                                                                       
+      - Include a "Current State" section that always reflects the moment of compaction
+      
+      Some pitfalls include:
+      - Vague prompts like "summarize this conversation" produce lossy output                                                                                        
+      - Treating all messages equally loses the active working context                                                                                               
+      - Paraphrasing next steps introduces subtle drift that compounds                                                                                               
+      - Omitting error history causes the model to retry failed approaches                                                                                           
+      - Dropping user corrections makes the model revert to old behaviors                                                                                            
+      - No token limits lets one section consume the entire summary                                                                                                  
+      - Summarizing for human readability instead of model continuity
+      - Having the agent compress tool call results into the summary; these can be retrieved again later if the agent needs them
+  code cell:
+    source:
+      SESSION_CREATION_PROMPT = """
+      You are a session memory agent. Compress the conversation into a structured summary 
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's 
+      ability to continue working, not human readability.
+      
+      <analysis-instructions>
+      Before generating your summary, analyze the transcript in <think>...</think> tags:
+      1. What did the user originally request? (Exact phrasing)
+      2. What actions succeeded? What failed and why?
+      3. Did the user correct or redirect the assistant at any point?
+      4. What was actively being worked on at the end?
+      5. What tasks remain incomplete or pending?
+      6. What specific details (IDs, paths, values, names) must survive compression?
+      </analysis-instructions>
+      
+      <summary-format>
+      ## User Intent
+      The user's original request and any refinements. Use direct quotes for key requirements.
+      If the user's goal evolved during the conversation, capture that progression.
+      
+      ## Completed Work
+      Actions successfully performed. Be specific:
+      - What was created, modified, or deleted
+      - Exact identifiers (file paths, record IDs, URLs, names)
+      - Specific values, configurations, or settings applied
+      
+      ## Errors & Corrections
+      - Problems encountered and how they were resolved
+      - Approaches that failed (so they aren't retried)
+      - User corrections: "don't do X", "actually I meant Y", "that's wrong because..."
+      Capture corrections verbatim—these represent learned preferences.
+      
+      ## Active Work
+      What was in progress when the session ended. Include:
+      - The specific task being performed
+      - Direct quotes showing exactly where work left off
+      - Any partial results or intermediate state
+      
+      ## Pending Tasks
+      Remaining items the user requested that haven't been started.
+      Distinguish between "explicitly requested" and "implied/assumed."
+      
+      ## Key References
+      Important details needed to continue:
+      - Identifiers: IDs, paths, URLs, names, keys
+      - Values: numbers, dates, configurations, credentials (redacted)
+      - Context: relevant background information, constraints, preferences
+      - Citations: sources referenced during the conversation
+      </summary-format>
+      
+      <preserve-rules>
+      Always preserve when present:
+      - Exact identifiers (IDs, paths, URLs, keys, names)
+      - Error messages verbatim
+      - User corrections and negative feedback
+      - Specific values, formulas, or configurations
+      - Technical constraints or requirements discovered
+      - The precise state of any in-progress work
+      </preserve-rules>
+      
+      <compression-rules>
+      - Weight recent messages more heavily—the end of the transcript is the active context
+      - Omit pleasantries, acknowledgments, and filler ("Sure!", "Great question")
+      - Omit system context that will be re-injected separately
+      - Keep each section under 500 words; condense older content to make room for recent
+      - If you must cut details, preserve: user corrections > errors > active work > completed work
+      </compression-rules>
+      """
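The compression rules above cap each section at 500 words, but the prompt alone cannot enforce that. A quick check on the model's output could flag oversized sections. The sketch assumes the summary uses the `## ` headings from the format above; `section_word_counts` and `over_budget` are hypothetical helpers, not part of the notebook.

```python
import re


def section_word_counts(summary: str) -> dict:
    """Map each '## Heading' in the summary to the word count of its body."""
    counts = {}
    # re.split with one capture group yields [preamble, title1, body1, title2, body2, ...]
    parts = re.split(r"^## +(.+)$", summary, flags=re.MULTILINE)
    for title, body in zip(parts[1::2], parts[2::2]):
        counts[title.strip()] = len(body.split())
    return counts


def over_budget(summary: str, max_words: int = 500) -> list:
    """Return the headings whose bodies exceed max_words."""
    return [t for t, n in section_word_counts(summary).items() if n > max_words]


example = """## User Intent
Build a logging decorator.

## Active Work
""" + "word " * 600

print(section_word_counts(example))
print(over_budget(example))
```

A section that comes back over budget could be re-requested with a stricter instruction or trimmed from its oldest content first, in line with the recency weighting in the prompt.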
+  markdown cell:
+    source:
+      ## Traditional compacting
+      In traditional compaction, a single summary is generated only once the token threshold is reached.
+      This is slow: the summary call blocks the conversation, so the user waits at the moment the context limit is hit.
+  markdown cell:
+    source:
+      
+      ```
+      TRADITIONAL COMPACTION (slow)
+      ─────────────────────────────
+      Turn 1 → Turn 2 → Turn 3 → ... → Turn N → CONTEXT FULL!
+                                                   │
+                                                   ▼
+                                          ┌─────────────────┐
+                                          │ Generate summary│
+                                          │ ( USER WAITS !) │
+                                          └─────────────────┘
+                                                   │
+                                                   ▼
+                                               Continue
+      
+      ```
+  markdown cell:
+    source:
+      #### Helper functions:
+  code cell:
+    source:
+      # Setup: this demo uses Claude Sonnet 4.5
+      import os
+      import warnings
+      
+      import anthropic
+      import pandas as pd
+      from anthropic.types import MessageParam, TextBlockParam
+      from IPython.display import HTML, display
+      
+      # Suppress noisy FutureWarning from coconut compiler
+      warnings.filterwarnings("ignore", category=FutureWarning, module="coconut")
+      
+      client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
+      MODEL = "claude-sonnet-4-5"
+      
+      # Show full DataFrames and un-scrolled cell output
+      pd.set_option("display.max_rows", None)
+      pd.set_option("display.max_columns", None)
+      display(HTML("<style>div.output_scroll { height: unset; }</style>"))
+      
+      # Helper functions
+      def truncate_response(text: str, max_lines: int = 15) -> str:
+          """Truncate long responses for cleaner output display."""
+          lines = text.strip().split("\n")
+          if len(lines) <= max_lines:
+              return text
+          return "\n".join(lines[:max_lines]) + f"\n... ({len(lines) - max_lines} more lines)"
+      
+      def build_transcript(messages: list[dict]) -> str:
+          lines = []
+          for msg in messages:
+              role = "User" if msg["role"] == "user" else "Assistant"
+              lines.append(f"{role}: {msg['content']}")
+          return "\n\n".join(lines)
+      
+      def remove_thinking_blocks(text: str) -> tuple[str, str]:
+          """Remove <think>...</think> blocks; return (cleaned_text, removed_blocks)."""
+          import re
+      
+          matches = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
+          cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+          return cleaned, "".join(matches)
+      
+      def add_cache_control(messages: list[dict]) -> list[MessageParam]:
+          """Add cache_control to the last user message for prompt caching.
+      
+          For prompt caching to work, the message prefix structure must be identical between requests.
+          All messages are converted to list format for consistency, and cache_control is placed on
+          the last user message to match the standard API call pattern.
+          """
+          cached_messages: list[MessageParam] = []
+          last_user_idx = None
+      
+          # Find last user message index
+          for i, msg in enumerate(messages):
+              if msg["role"] == "user":
+                  last_user_idx = i
+      
+          for i, msg in enumerate(messages):
+              content = msg["content"]
+              text = content if isinstance(content, str) else content[0]["text"]
+      
+              content_block: TextBlockParam = {"type": "text", "text": text}
+              if i == last_user_idx:
+                  content_block["cache_control"] = {"type": "ephemeral"}
+      
+              cached_messages.append({"role": msg["role"], "content": [content_block]})
+      
+          return cached_messages
+      
+      def add_cache_control_to_system_prompt(system_text: str) -> list[TextBlockParam]:
+          """Format system prompt with cache_control."""
+          block: TextBlockParam = {
+              "type": "text",
+              "text": system_text,
+              "cache_control": {"type": "ephemeral"},
+          }
+          return [block]
+               
+    outputs:
+      output 0:
+        output_type: display_data
+        data:
+          text/html: <style>div.output_scroll { height: unset; }</style>
+          text/plain: <IPython.core.display.HTML object>
+  markdown cell:
+    source:
+      #### Example use of traditional compaction
+  code cell:
+    source:
+      import time
+      
+      class TraditionalCompactingChatSession:
+          """Traditional chat session with compaction after the fact."""
+          def __init__(self, system_message="You are a helpful assistant", context_limit: int = 10000):
+              self.system_message = system_message
+              self.context_limit = context_limit # the point at which the conversation is compacted so it does not exceed model limits.
+              self.messages = []
+              self.current_context_window_tokens = 0
+              self.summary = None
+          
+          def chat(self, user_message: str):
+              # In traditional compaction, we check if we need to compact when the user sends a message. NOT IDEAL!
+              if self.current_context_window_tokens >= self.context_limit:
+                  print(f"\n🧹 Context window at {self.current_context_window_tokens} tokens. Limit exceeded, compacting session memory...")
+                  self.compact() # compacts everything before the new user message
+              
+              self.messages.append({"role": "user", "content": user_message})      
+              print(f"\nUser: {user_message}")                                                                                        
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=add_cache_control_to_system_prompt(self.system_message),
+                  messages=add_cache_control(self.messages)
+              )
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+             
+              print(f"\nAssistant: \n{truncate_response(assistant_message, max_lines=15)}")
+              
+              # approximate current token count in the conversation before the next user message
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0                                                                                                                          
+              total_input = response.usage.input_tokens + cache_read                                                                                                                                           
+              self.current_context_window_tokens = total_input + response.usage.output_tokens                                                     
+             
+              print(
+                  f"Input={total_input:,}, Prompt cached used= {cache_read > 0} | "
+                  f"Output={response.usage.output_tokens:,} | "
+                  f"Messages={len(self.messages)}"
+              )
+              return assistant_message, response.usage
+          
+          def compact(self): 
+              start_time = time.perf_counter()
+              self.messages.append({"role": "user", "content": "\n\nThe full transcript of our conversation so far is included."})
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=SESSION_CREATION_PROMPT,
+                  messages=add_cache_control(self.messages) # use exact same messages for prompt caching purposes
+              )
+              print(response)
+              elapsed = time.perf_counter() - start_time
+              
+              # Generate new summary message
+              self.summary, removed_text = remove_thinking_blocks(response.content[0].text) # clean up any <think> blocks because they are not needed in the session memory
+              approximate_summary_tokens = response.usage.output_tokens - round(len(removed_text) / 4)  # rough estimate of tokens removed from summary
+             
+              # Replace prior messages with new summary message
+              self.messages = [{
+                  "role": "user",
+                  "content": f"""This session is being continued from a previous conversation. Here is the session memory: {self.summary}. Continue from where we left off."""
+              }]
+              
+              # Show token reduction if we just compacted
+              reduction = self.current_context_window_tokens - approximate_summary_tokens
+              pct = (reduction / self.current_context_window_tokens) * 100
+              
+              print(f"\n{'-' * 60}")
+              print("📝 New session memory created.")
+              print(f"✅ Tokens reduced: {self.current_context_window_tokens:,} → {approximate_summary_tokens:.0f} ({reduction:,} tokens saved, {pct:.0f}% reduction)")
+              print(f"⏱️ Compaction time: {elapsed:.2f}s (user waiting...)")
+              print(f"{'-' * 60}")
+              
+              # Update token count to reflect compacted state
+              self.current_context_window_tokens = approximate_summary_tokens
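The summary-size estimate in `compact()` relies on `remove_thinking_blocks`, which is defined outside this excerpt. As a rough illustration of that step only, here is a minimal sketch under stated assumptions: `strip_think_blocks` and `estimate_summary_tokens` are hypothetical stand-ins, not the notebook's helper, and the 4-characters-per-token ratio is the same rough heuristic the class uses.

```python
import re

# Matches a <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think_blocks(text: str) -> tuple[str, str]:
    """Return (text without <think> blocks, concatenation of the removed blocks).

    Hypothetical stand-in for the notebook's remove_thinking_blocks helper.
    """
    removed = "".join(THINK_BLOCK.findall(text))
    cleaned = THINK_BLOCK.sub("", text).strip()
    return cleaned, removed


def estimate_summary_tokens(output_tokens: int, removed_text: str) -> int:
    """Subtract a rough token count (about 4 characters per token) for removed text."""
    return max(output_tokens - round(len(removed_text) / 4), 0)


raw = "<think>\nplan\n</think>\n\n## User Intent\nWrite a story."
summary, removed = strip_think_blocks(raw)
print(summary)
print(estimate_summary_tokens(20, removed))
```

Because the estimate is character-based, it can drift from the real count; counting the summary's tokens directly would be more precise if that matters for your threshold.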
+  code cell:
+    source:
+      SYSTEM_PROMPT = """
+      You are a short story writer who helps authors develop their ideas into compelling narratives.
+      
+      ## What You Do
+      
+      **Plot Development**
+      - Help authors work through story structure, pacing, and narrative arc
+      - Identify plot holes, inconsistencies, or missed opportunities
+      - Suggest ways to raise stakes, add tension, or deepen conflict
+      - Brainstorm twists, resolutions, and scene transitions
+      
+      **Character Development**
+      - Develop backstories, motivations, and internal conflicts
+      - Ensure characters have distinct voices and consistent behavior
+      - Explore character relationships and how they drive the plot
+      - Help authors understand what their characters want vs. what they need
+      
+      **Drafting**
+      - Write short stories or scenes based on the author's ideas and direction
+      - Match tone, genre conventions, and stylistic preferences
+      - Show rather than tell when bringing scenes to life
+      - Craft dialogue that reveals character and advances plot
+      
+      ## How You Work
+      - You are the lead writer. When you disagree with a creative choice, say so respectfully, but ultimately defer to what the author wants.
+      - DO NOT ask the user to provide more context or clarify their request. Assume you have enough information to proceed.
+      """
+  code cell:
+    source:
+      session = TraditionalCompactingChatSession(system_message=SYSTEM_PROMPT)
+      
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.", 
+          "Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?"                   
+      ]
+      
+      print("Starting conversation...\n")
+      
+      for turn_count, message in enumerate(messages, 1):
+          print(
+              "==============================================\n"
+              f"Turn {turn_count}:\n"
+          )
+          response, usage = session.chat(message)
+          
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation...
+          
+          ==============================================
+          Turn 1:
+          
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Mystery Plot Ideas
+          
+          ## 1. **The Silence Pact**
+          
+          **Setup:** A beloved high school teacher dies in what appears to be a tragic accident during a school field trip to nearby caves. Our detective, fresh out of the academy and assigned to this sleepy town as a "soft landing," is told it's an open-and-shut case. But something feels wrong—the students who were there have all gone eerily quiet, even with each other.
+          
+          **The Mystery:** As our detective digs deeper, they discover the students aren't protecting a person—they're protecting a secret about the town itself. Twenty years ago, another teacher disappeared during a similar field trip, and the current town leadership (mayor, police chief, school board) were all students on *that* trip. The dead teacher had discovered what really happened and was killed to keep the past buried.
+          
+          **The Complication:** The detective's own mentor—the retiring detective they're replacing—was the lead investigator who closed the original case as a "runaway." Solving this means destroying the man who gave them their career, and possibly exposing that he was complicit in the cover-up all along.
+          
+          **Emotional Core:** Justice vs. loyalty; the cost of truth in a community built on comfortable lies.
+          
+          ---
+          
+          ## 2. **The Inheritance Game**
+          ... (24 more lines)
+          Input=318, Prompt cached used= False | Output=1,004 | Messages=2
+          ==============================================
+          Turn 2:
+          
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # **The Cartographer's Daughter**
+          
+          **Setup:** Our young detective arrives in a mountain town of 3,000 people to investigate a truly bizarre crime: someone has been systematically stealing *maps*. Not valuable antiques—just maps. From the library, the historical society, the ranger station, even GPS units from parked cars. The thefts escalate to break-ins, and finally, the town's oldest resident—a retired cartographer—is murdered in her home. All her maps, a lifetime of work documenting the region, are gone.
+          
+          **The Mystery:** The detective discovers the cartographer had been deliberately mapping things *wrong* for sixty years. Roads that don't quite connect. Trails that dead-end where her maps say they continue. Property lines slightly off. She'd been systematically introducing tiny errors into every official map of the region, and her personal maps—the stolen ones—show what's *really* there.
+          
+          What's really there: an entire ghost neighborhood. Eight families have been living on what they believed was public land for three generations, building homes, raising children, completely off-grid. They thought they were just deeply rural. But the cartographer's accurate maps prove they're squatting on land owned by a development company that's just been sold. The new owners want to build a resort. The thefts weren't about the maps—they were about erasing proof that this land is occupied, so the families can be removed as "trespassers" rather than residents with potential legal claims.
+          
+          **The Complication:** The detective finds the hidden community and realizes the murderer is one of the eight families—someone who killed the cartographer to *protect* everyone else, believing her death would end the investigation and the maps would never surface. But now the detective's investigation is doing exactly what the murder was meant to prevent: shining a spotlight on this vulnerable community.
+          
+          The second complication: The detective discovers they're not the only one who knows. A surveyor has been hired by the development company and is systematically re-mapping the area. They'll "discover" the squatters any day now, and without the cartographer's maps showing 60 years of continuous occupation, the families have no legal standing.
+          
+          **The Choice:** The detective can solve the murder—which means exposing the community and likely forcing eight families from their homes. Or they can bury the case, let a murderer go free, and give these families time to establish legal residency claims. Justice for one person, or protection for dozens.
+          
+          **The Twist:** The cartographer's granddaughter, who hired the detective and has been helping the investigation, already knows about the hidden community—she grew up visiting them. Her grandmother was protecting them her whole career. The granddaughter wants the maps found not to solve the murder, but to establish the families' legal claims before the developers strike. She's been manipulating the investigation from the start, using the detective as an unwitting advocate.
+          ... (8 more lines)
+          Input=1,344, Prompt cached used= False | Output=764 | Messages=4
+          ==============================================
+          Turn 3:
+          
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # The Detective: Building from the Ground Up
+          
+          Let me develop a character whose personal history makes this particular case land like a gut punch.
+          
+          ---
+          
+          ## **MAYA KOVAČ** (27)
+          
+          **The Surface:**
+          Maya is a newly minted detective who specifically requested assignment to a small, rural department. On paper, this looks like someone who couldn't handle the pressure of city police work. She's meticulous to the point of rigidity, keeps people at arm's length, and has a reputation for being "by the book" in a way that even her academy instructors found excessive.
+          
+          **The Backstory:**
+          Maya grew up in a multi-generational household in a tight-knit immigrant community—her grandmother escaped war in the Balkans with false papers, her parents were undocumented for the first decade in America. She was eight years old when ICE raided their neighborhood. Her family wasn't taken (her mother had just gotten her green card), but her best friend's entire family was deported overnight. The girl was at Maya's house for a sleepover. In the morning, she had no home to return to.
+          
+          Maya became *obsessed* with rules after that. If her family followed every rule perfectly, kept every paper in order, never gave anyone a reason to look twice—they'd be safe. She color-coded her parents' documents. She learned immigration law as a teenager. She became a cop because she believed that being on the "inside" of the system meant understanding how to protect people from it.
+          ... (41 more lines)
+          Input=2,131, Prompt cached used= False | Output=1,323 | Messages=6
+          ==============================================
+          Turn 4:
+          
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # **THE CARTOGRAPHER'S DAUGHTER**
+          ## Detailed Story Outline
+          
+          ---
+          
+          ## **ACT ONE: THE WRONG KIND OF CRIME**
+          
+          ### **Chapter 1: Arrival**
+          Maya drives into Millbrook (pop. 2,847) on a gray October morning. She's been a detective for exactly four months. The police station is attached to the town hall—three officers total. She meets CHIEF WARREN HAYES (58), who's retiring in six weeks. He's skeptical of her city credentials and thinks the map thefts are "kids being weird."
+          
+          **Key beats:**
+          - Establish Maya's rigid professionalism—she's already read every case file from the past five years
+          - Warren is warm but dismissive: "You'll find small-town crime is mostly boredom punctuated by stupidity"
+          - Maya tours the town: tourist-dependent, surrounded by national forest, economically struggling
+          - She notices something: the town feels *smaller* than the maps suggest, like there's negative space at the edges
+          ... (250 more lines)
+          Input=3,476, Prompt cached used= False | Output=3,500 | Messages=8
+          ==============================================
+          Turn 5:
+          
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # THE CARTOGRAPHER'S DAUGHTER
+          ## Chapter One: Arrival
+          
+          The Welcome to Millbrook sign had been hand-painted, Maya noticed. Population 2,847 in careful serif letters, with a little pine tree border that someone had taken time to get right. She'd passed a dozen mountain towns on the drive from Denver, each one bleeding into the next—same gas stations, same chain restaurants at the highway exits—but Millbrook had stayed small enough, or stubborn enough, to remain itself.
+          
+          Her GPS had lost signal twenty minutes ago. Maya was navigating by the printed directions Chief Hayes had emailed her, which felt oddly appropriate. A return to analog methods for analog policing.
+          
+          She'd been a detective for four months and six days.
+          
+          The town proper appeared around a curve in the two-lane highway: a main street of old brick buildings, half of them with "For Lease" signs in dark windows, the other half trying hard with flower boxes and fresh paint. The police station was attached to the town hall, a squat concrete building with an American flag snapping in the October wind. Maya parked her Civic between two pickup trucks that made her sedan look like a child's toy.
+          
+          She checked her reflection in the rearview mirror. Professional. Serious. The blazer she'd ironed this morning was still crisp despite four hours in the car. Her dark hair was pulled into a bun so tight it gave her a headache, but that was intentional—she'd learned early that looking severe made people take her more seriously. At twenty-seven, with her small frame and what her academy instructors had diplomatically called "youthful features," she needed every advantage.
+          
+          Maya pulled out her phone and opened the document where she'd compiled her notes on Millbrook: population demographics, crime statistics for the past five years, economic indicators, even a topographical analysis of the surrounding terrain. She'd color-coded everything. Green for low-risk factors, yellow for moderate concern, red for—
+          
+          ... (121 more lines)
+          Input=7,012, Prompt cached used= False | Output=2,778 | Messages=10
+          ==============================================
+          Turn 6:
+          
+          
+          User: Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?
+          
+          Assistant: 
+          # THE CARTOGRAPHER'S DAUGHTER
+          ## Chapter Two: The Crime Scene
+          
+          Maya woke at 5:47 AM, thirteen minutes before her first alarm. She'd slept in twenty-three minute intervals, jerking awake each time the radiator clanked or a car passed on the street below. Her body still expected city noise—sirens, garbage trucks, the couple in the apartment next door who fought at 3 AM. This rural silence kept tricking her brain into thinking something was wrong.
+          
+          She was showered, dressed, and reviewing her notes by 6:15. The coffee maker she'd brought from Denver burbled on the counter, filling the small apartment with the smell of something that wasn't burned sludge. Small victories.
+          
+          At 6:45, her phone rang. The screen showed: CHIEF HAYES.
+          
+          "Kovač," she answered, already reaching for her blazer.
+          
+          "Maya. Sorry to call so early, but we've got a situation. Elena Farkas—she's a local woman, lives out on Ridgeline Road—her neighbor called it in about twenty minutes ago. She's dead, looks like a home invasion."
+          
+          Maya's pulse quickened. She pulled her notebook toward her, pen already moving. "I'm on my way. Have you secured the scene?"
+          
+          ... (171 more lines)
+          Input=9,814, Prompt cached used= True | Output=3,500 | Messages=12
+  markdown cell:
+    source:
+      This is a long conversation with several turns. One thing to notice:
+      
+      Prompt caching: the input eventually grew large enough for prompt caching to take effect (turn 6). Cache reads lower both cost and latency as the conversation grows.
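The `Input=` figure printed each turn is `input_tokens` plus `cache_read_input_tokens`. The usage object also exposes `cache_creation_input_tokens` (visible in the printed `Message` further down), which the class does not add. The following is a minimal sketch of a context-size estimate that includes all three, assuming cache writes are reported separately from `input_tokens`. `UsageStandIn` and `context_tokens` are hypothetical names for illustration, and the cached/uncached split in the example is made up; only the totals come from the run above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageStandIn:
    """Hypothetical stand-in using the same field names as the API usage object."""
    input_tokens: int
    output_tokens: int
    cache_read_input_tokens: Optional[int] = 0
    cache_creation_input_tokens: Optional[int] = 0


def context_tokens(usage) -> int:
    """Estimate context size after a turn: all input tokens plus the reply."""
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return usage.input_tokens + cache_read + cache_write + usage.output_tokens


# Turn 6 reported Input=9,814 and Output=3,500; the 2,802 / 7,012 split is illustrative.
turn_6 = UsageStandIn(input_tokens=2_802, output_tokens=3_500, cache_read_input_tokens=7_012)
print(context_tokens(turn_6))  # 13314
```

The `or 0` guards mirror the class above and keep the sum safe when a cache field is missing or `None`.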
+  markdown cell:
+    source:
+      After turn 6 the context stands at 13,314 tokens, past our 10K `context_limit`, so the next call to `chat()` compacts the conversation before sending the new message:
+  code cell:
+    source:
+      response, usage = session.chat("Propose a title for the book")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          🧹 Context window at 13314 tokens. Limit exceeded, compacting session memory...
+          Message(id='msg_01CesB6prqUDQyH556365iJw', content=[TextBlock(citations=None, text='<think>\nLet me analyze this conversation systematically:\n\n1. **User\'s original request**: Generate 3 plot ideas for a story about a young detective solving a mysterious case in a small town.\n\n2. **What succeeded**: \n   - I provided 3 initial plot ideas\n   - User rejected them as not unique enough\n   - I created "The Cartographer\'s Daughter" plot, which user liked\n   - I developed Maya Kovač\'s character backstory successfully\n   - I created a detailed chapter-by-chapter outline\n   - I drafted Chapter 1 (arrival scene)\n   - I drafted Chapter 2 (crime scene discovery)\n\n3. **User corrections/redirections**: \n   - "I don\'t like those ideas, can you think of one plot something more unique and unexpected?" - User wanted something less conventional\n   - User accepted the cartographer concept and asked to continue developing it\n\n4. **Active work at end**: Just completed drafting Chapter 2, which introduces Elena\'s murder and the discrepancies in her maps. The chapter ends with Iris Farkas arriving at the station.\n\n5. **Incomplete/pending tasks**: None explicitly stated. The user appears satisfied with the progression and hasn\'t requested additional chapters or edits.\n\n6. **Critical details to preserve**:\n   - Story title: "The Cartographer\'s Daughter"\n   - Main character: Maya Kovač, 27, new detective, 4 months experience\n   - Setting: Millbrook (pop. 
2,847), mountain town\n   - Mystery: Maps being stolen, cartographer Elena Farkas murdered, she deliberately falsified maps for 60 years to hide a community of 8 families living on private land\n   - Key characters: Chief Warren Hayes (retiring), Iris Farkas (granddaughter), Daniel Reeves (surveyor), Scott Chen (ranger)\n   - Maya\'s backstory: immigrant family background, childhood friend\'s family deported, became cop to work within the system\n   - Central conflict: Maya must choose between solving murder (exposing hidden community) or protecting vulnerable people\n</think>\n\n## User Intent\nCreate a mystery story about a young detective in a small town. User rejected initial conventional plots as not unique enough, specifically requesting "something more unique and unexpected." Accepted "The Cartographer\'s Daughter" concept and proceeded to develop it fully.\n\n## Completed Work\n**Story concept created**: "The Cartographer\'s Daughter" - A young detective investigates map thefts and a murder in a mountain town, discovering the victim falsified maps for 60 years to hide a community of families living on privately-owned land they believed was public forest.\n\n**Main character developed**: Maya Kovač, 27, new detective (4 months experience). Immigrant family background—grandmother escaped with false papers, childhood friend\'s family deported by ICE when Maya was 8. Became obsessive about rules/systems as protection mechanism. Requested small-town assignment seeking "simpler" justice, avoiding moral complexity. 
Character arc: must learn that protecting people sometimes requires working against the system, not within it.\n\n**Supporting cast**: Chief Warren Hayes (58, retiring in 6 weeks, will recommend Maya as chief); Iris Farkas (32, art teacher, Elena\'s granddaughter, secretly manipulating investigation); Elena Farkas (89, murdered cartographer); Thomas Wade (52, hidden community leader, will be revealed as killer); Daniel Reeves (41, corporate surveyor); Scott Chen (35, ranger); 8 families in hidden settlement "Millbrook Heights."\n\n**Detailed outline created**: 13-chapter structure spanning 3 acts:\n- Act 1: Map thefts lead to Elena\'s murder, Maya discovers discrepancies in maps\n- Act 2: Maya finds hidden community, learns Iris has been manipulating her, discovers Elena\'s accurate maps in storage unit\n- Act 3: Thomas confesses to accidental killing, Maya must choose between justice and protection\n\n**Two complete chapters drafted**:\n- Chapter 1 (~2,000 words): Maya arrives in Millbrook, meets Hayes, learns about map thefts, given case as "orientation," promoted to future chief (unexpected), notices town feels "smaller than maps suggest"\n- Chapter 2 (~2,500 words): Elena\'s murder discovered, crime scene investigation, learns surveyor Reeves found GPS discrepancies with Elena\'s maps, Margaret Yates (neighbor) mentions development company visit, chapter ends with Iris arriving at station\n\n## Errors & Corrections\nUser\'s only correction: rejected first three plot ideas as not unique enough. Required more unexpected/unconventional mystery concept.\n\n## Active Work\nJust completed Chapter 2. The chapter ends mid-scene with Iris Farkas entering the police station to meet Maya for the first time. This is the moment where Maya begins interacting with the character who will manipulate the entire investigation. The scene was left at: "Detective Kovač? I\'m Iris Farkas. Please, sit down."\n\n## Pending Tasks\nNone explicitly requested. 
User has not asked for Chapter 3 or further development, but the natural continuation would be drafting subsequent chapters.\n\n## Key References\n\n**Story title**: "The Cartographer\'s Daughter"\n\n**Setting**: Millbrook, population 2,847, mountain town in Colorado (near Denver, ~4 hour drive), surrounded by national forest, economically struggling, tourist-dependent\n\n**Central mystery elements**:\n- 6 weeks of map thefts before murder (library, historical society, ranger station, 4 GPS units from cars)\n- Elena Farkas: 89-year-old cartographer, murdered in her home studio, all maps stolen, she was burning maps before death\n- Hidden community: 8 families, settlement called "Millbrook Heights" (ironic name), living 3 miles into forest on unmapped trails, on land owned by Consolidated Mountain Properties\n- Elena falsified maps for 60 years, showing empty forest where community exists\n- Thomas Wade will be revealed as accidental killer (pushed Elena during argument, she hit drafting table)\n- Consolidated Mountain Properties: bought 400 acres, hired Daniel Reeves to survey for resort development\n- Elena\'s accurate maps hidden in storage unit by Iris, showing continuous occupation since 1962\n\n**Character details**:\n- Maya Kovač: 27, detective for "four months and six days" at start, from Denver PD, requested Millbrook assignment, drives Honda Civic, color-codes everything, obsessively organized, immigrant parents (Balkan grandmother with false papers), childhood friend deported at age 8\n- Warren Hayes: retiring in 6 weeks to Scottsdale, will recommend Maya as chief\n- Iris Farkas: 32, high school art teacher, paint-stained flannel shirt, has been manipulating investigation to get maps into evidence\n- Elena Farkas: lived on Ridgeline Road, had studio attached to house, made all official maps for region for 60 years, killed ~10 PM with head trauma (drafting table corner)\n- Margaret Yates: 70s, neighbor who discovered body during morning dog walk\n- Scott 
Chen: 35, forest ranger, father Leon Chen (67) lives in hidden community (estranged)\n- Daniel Reeves: 41, corporate surveyor for Consolidated Mountain Properties\n- Thomas Wade: 52, hidden community leader, meeting scheduled with Elena "T.W. - 9 PM" night of murder\n- Grace Wade: 17, Thomas\'s daughter\n- Sofia Hernandez: 28, community member with children\n\n**Timeline established in chapters**:\n- Maya arrives October (gray morning)\n- First alarm set for 6:00 AM Day 2\n- Hayes calls 6:45 AM: Elena\'s body discovered ~6:25 AM by Margaret Yates\n- Death occurred previous evening ~10 PM (10-14 hours before discovery)\n- Map thefts began 6 weeks prior\n- Consolidated purchased land "last month" (one month before current timeline)\n\n**Physical descriptions from chapters**:\n- Police station: attached to town hall, concrete building, smells like "burned coffee and printer toner"\n- Maya\'s apartment: above hardware store on Main Street, one bedroom, shower from "Reagan administration," two-burner stove\n- Elena\'s house: ranch-style, weathered cedar siding, workshop attached, Ridgeline Road (10 minutes from town)\n- Elena\'s studio: north-facing windows, flat files, drafting table, fireplace in corner with metal waste bin\n\n**Key quotes to preserve tone**:\n- Maya on rules: "safety came from being beyond reproach, from giving the system no reason to notice you"\n- Hayes on Elena: "If Elena said something existed, it existed. And if her maps said something wasn\'t there..."\n- Scott on Elena\'s accuracy: "Elena\'s maps are gospel"\n- Hayes cryptically: "sometimes the maps we draw aren\'t just about geography. 
They\'re about protecting what we love"', type='text')], model='claude-sonnet-4-5-20250929', role='assistant', stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(cache_creation=CacheCreation(ephemeral_1h_input_tokens=0, ephemeral_5m_input_tokens=0), cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=13723, output_tokens=2021, server_tool_use=None, service_tier='standard'))
+          
+          ------------------------------------------------------------
+          📝 New session memory created.
+          ✅ Tokens reduced: 13,314 → 1536 (11,778 tokens saved, 88% reduction)
+          ⏱️ Compaction time: 44.49s (user waiting...)
+          ------------------------------------------------------------
+          
+          User: Propose a title for the book
+          
+          Assistant: 
+          Looking at the story's core themes and the path Maya will travel, I'd propose:
+          
+          **"The Unmapped"**
+          
+          It works on multiple levels:
+          - The literal hidden community that doesn't appear on any map
+          - The moral territory Maya must navigate that her rule-based worldview never prepared her for
+          - Elena's secret—the truth she kept unmapped for 60 years
+          - Maya herself, who arrives thinking she understands justice but discovers she's been operating with an incomplete map of what protection really means
+          
+          The single-word title has that literary mystery feel (think *Gone Girl*, *The Searcher*) while being evocative rather than explanatory.
+          
+          Alternative if you want something that hints more at the cartography angle: **"False Survey"** or **"The Cartographer's Secret"**
+          
+          But I think **"The Unmapped"** captures both the mystery and Maya's internal journey most effectively.
+          ... (6 more lines)
+          Input=1,864, Prompt cached used= False | Output=235 | Messages=3
+  markdown cell:
+    source:
+      
+      Notice that compaction took about 44 seconds (44.49s in the run above). With traditional compaction, the user waits that entire time before Claude starts answering, which is not an ideal user experience.
+      
+      Below you can see the result of the compaction. It captures the key elements of the conversation in fewer than 2K tokens (about 1,536 by our estimate).
+  code cell:
+    source:
+      print(session.summary)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ## User Intent
+          Create a mystery story about a young detective in a small town. User rejected initial conventional plots as not unique enough, specifically requesting "something more unique and unexpected." Accepted "The Cartographer's Daughter" concept and proceeded to develop it fully.
+          
+          ## Completed Work
+          **Story concept created**: "The Cartographer's Daughter" - A young detective investigates map thefts and a murder in a mountain town, discovering the victim falsified maps for 60 years to hide a community of families living on privately-owned land they believed was public forest.
+          
+          **Main character developed**: Maya Kovač, 27, new detective (4 months experience). Immigrant family background—grandmother escaped with false papers, childhood friend's family deported by ICE when Maya was 8. Became obsessive about rules/systems as protection mechanism. Requested small-town assignment seeking "simpler" justice, avoiding moral complexity. Character arc: must learn that protecting people sometimes requires working against the system, not within it.
+          
+          **Supporting cast**: Chief Warren Hayes (58, retiring in 6 weeks, will recommend Maya as chief); Iris Farkas (32, art teacher, Elena's granddaughter, secretly manipulating investigation); Elena Farkas (89, murdered cartographer); Thomas Wade (52, hidden community leader, will be revealed as killer); Daniel Reeves (41, corporate surveyor); Scott Chen (35, ranger); 8 families in hidden settlement "Millbrook Heights."
+          
+          **Detailed outline created**: 13-chapter structure spanning 3 acts:
+          - Act 1: Map thefts lead to Elena's murder, Maya discovers discrepancies in maps
+          - Act 2: Maya finds hidden community, learns Iris has been manipulating her, discovers Elena's accurate maps in storage unit
+          - Act 3: Thomas confesses to accidental killing, Maya must choose between justice and protection
+          
+          **Two complete chapters drafted**:
+          - Chapter 1 (~2,000 words): Maya arrives in Millbrook, meets Hayes, learns about map thefts, given case as "orientation," promoted to future chief (unexpected), notices town feels "smaller than maps suggest"
+          - Chapter 2 (~2,500 words): Elena's murder discovered, crime scene investigation, learns surveyor Reeves found GPS discrepancies with Elena's maps, Margaret Yates (neighbor) mentions development company visit, chapter ends with Iris arriving at station
+          
+          ## Errors & Corrections
+          User's only correction: rejected first three plot ideas as not unique enough. Required more unexpected/unconventional mystery concept.
+          
+          ## Active Work
+          Just completed Chapter 2. The chapter ends mid-scene with Iris Farkas entering the police station to meet Maya for the first time. This is the moment where Maya begins interacting with the character who will manipulate the entire investigation. The scene was left at: "Detective Kovač? I'm Iris Farkas. Please, sit down."
+          
+          ## Pending Tasks
+          None explicitly requested. User has not asked for Chapter 3 or further development, but the natural continuation would be drafting subsequent chapters.
+          
+          ## Key References
+          
+          **Story title**: "The Cartographer's Daughter"
+          
+          **Setting**: Millbrook, population 2,847, mountain town in Colorado (near Denver, ~4 hour drive), surrounded by national forest, economically struggling, tourist-dependent
+          
+          **Central mystery elements**:
+          - 6 weeks of map thefts before murder (library, historical society, ranger station, 4 GPS units from cars)
+          - Elena Farkas: 89-year-old cartographer, murdered in her home studio, all maps stolen, she was burning maps before death
+          - Hidden community: 8 families, settlement called "Millbrook Heights" (ironic name), living 3 miles into forest on unmapped trails, on land owned by Consolidated Mountain Properties
+          - Elena falsified maps for 60 years, showing empty forest where community exists
+          - Thomas Wade will be revealed as accidental killer (pushed Elena during argument, she hit drafting table)
+          - Consolidated Mountain Properties: bought 400 acres, hired Daniel Reeves to survey for resort development
+          - Elena's accurate maps hidden in storage unit by Iris, showing continuous occupation since 1962
+          
+          **Character details**:
+          - Maya Kovač: 27, detective for "four months and six days" at start, from Denver PD, requested Millbrook assignment, drives Honda Civic, color-codes everything, obsessively organized, immigrant parents (Balkan grandmother with false papers), childhood friend deported at age 8
+          - Warren Hayes: retiring in 6 weeks to Scottsdale, will recommend Maya as chief
+          - Iris Farkas: 32, high school art teacher, paint-stained flannel shirt, has been manipulating investigation to get maps into evidence
+          - Elena Farkas: lived on Ridgeline Road, had studio attached to house, made all official maps for region for 60 years, killed ~10 PM with head trauma (drafting table corner)
+          - Margaret Yates: 70s, neighbor who discovered body during morning dog walk
+          - Scott Chen: 35, forest ranger, father Leon Chen (67) lives in hidden community (estranged)
+          - Daniel Reeves: 41, corporate surveyor for Consolidated Mountain Properties
+          - Thomas Wade: 52, hidden community leader, meeting scheduled with Elena "T.W. - 9 PM" night of murder
+          - Grace Wade: 17, Thomas's daughter
+          - Sofia Hernandez: 28, community member with children
+          
+          **Timeline established in chapters**:
+          - Maya arrives October (gray morning)
+          - First alarm set for 6:00 AM Day 2
+          - Hayes calls 6:45 AM: Elena's body discovered ~6:25 AM by Margaret Yates
+          - Death occurred previous evening ~10 PM (10-14 hours before discovery)
+          - Map thefts began 6 weeks prior
+          - Consolidated purchased land "last month" (one month before current timeline)
+          
+          **Physical descriptions from chapters**:
+          - Police station: attached to town hall, concrete building, smells like "burned coffee and printer toner"
+          - Maya's apartment: above hardware store on Main Street, one bedroom, shower from "Reagan administration," two-burner stove
+          - Elena's house: ranch-style, weathered cedar siding, workshop attached, Ridgeline Road (10 minutes from town)
+          - Elena's studio: north-facing windows, flat files, drafting table, fireplace in corner with metal waste bin
+          
+          **Key quotes to preserve tone**:
+          - Maya on rules: "safety came from being beyond reproach, from giving the system no reason to notice you"
+          - Hayes on Elena: "If Elena said something existed, it existed. And if her maps said something wasn't there..."
+          - Scott on Elena's accuracy: "Elena's maps are gospel"
+          - Hayes cryptically: "sometimes the maps we draw aren't just about geography. They're about protecting what we love"
+  markdown cell:
+    source:
+      ## Instant Compaction
+      
+      With **instant compaction**, the session memory is generated proactively once a soft token threshold is reached.
+      
+      Once the user triggers a compaction or a hard limit is reached, the summary is already available, so the user doesn't need to wait.
+      
+      Result: Instant compaction, no waiting.
+  markdown cell:
+    source:
+      
+      SESSION MEMORY COMPACTION (instant)
+      ```
+      ────────────────────────────────────
+      Turn 1 → Turn 2 → ... → Turn K → Turn K+1 → ... → Turn N → ..  → CONTEXT FULL!
+                                  │                         │            │
+                      (soft threshold met:              (update          │
+                         5k tokens init)                trigger)        │
+                                  │                                      │
+                                  │                         │            │
+                                  ▼                         ▼            │
+                             ┌────────┐                ┌────────┐        │
+                             │ Update │                │ Update │        │
+                             │ memory │ (background)   │ memory │        │
+                             └────────┘                └────────┘        │
+                                  │                         │            │
+                                  ▼                         ▼            ▼
+                           📝 session-memory.md ──────────────────► INSTANT SWAP!
+                             (continuously updated)
+      ```
+      
+      **Update triggers:** The first summary is generated after the initial 5k tokens. Updates can then be triggered after every subsequent turn, or periodically at natural breakpoints (e.g. every ~5k tokens or 3+ tool calls).
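The trigger rules above can be sketched as two small pure functions. This is a minimal illustration, not the notebook's implementation: the function names `background_action` and `needs_compaction` are hypothetical, and the default thresholds are assumed to match the defaults of the `InstantCompactingChatSession` class in this notebook (10,000 / 5,000 / 2,000 tokens).

```python
from typing import Optional


def background_action(
    context_tokens: int,
    has_memory: bool,
    tokens_at_last_update: int,
    min_tokens_to_init: int = 5_000,
    min_tokens_between_updates: int = 2_000,
) -> Optional[str]:
    """Decide which background memory job (if any) to start after a turn."""
    if not has_memory:
        # Soft threshold: build the first summary before it is needed.
        return "init" if context_tokens >= min_tokens_to_init else None
    # Memory exists: refresh it once enough new tokens have accumulated.
    if context_tokens - tokens_at_last_update >= min_tokens_between_updates:
        return "update"
    return None


def needs_compaction(context_tokens: int, context_limit: int = 10_000) -> bool:
    """Hard threshold: swap in the pre-built memory before the next request."""
    return context_tokens >= context_limit


# Token counts taken from the demo run recorded in this notebook.
print(background_action(2_266, has_memory=False, tokens_at_last_update=0))     # turn 3
print(background_action(5_323, has_memory=False, tokens_at_last_update=0))     # turn 4
print(background_action(8_134, has_memory=True, tokens_at_last_update=5_323))  # turn 5
print(needs_compaction(11_658))                                                # after turn 6
```

Keeping the decision separate from the API calls makes the thresholds easy to unit test and tune without spending tokens.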
+  markdown cell:
+    source:
+      This `InstantCompactingChatSession` class uses **threading** for background execution:
+      1. **`threading.Thread`** - runs memory updates in background without blocking
+      2. **Thread-safe state** - uses `threading.Lock` to safely update shared memory
+      3. **Daemon threads** - background work doesn't prevent program exit
+      4. **Instant compaction** - when context is full, just swap in the pre-built memory
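The threading pattern listed above can be tried in isolation, without any API calls. The sketch below is an assumption-laden stand-in: `BackgroundMemory` and `fake_summarize` are hypothetical names, and the slow summarization call is simulated with `time.sleep`.

```python
import threading
import time


class BackgroundMemory:
    """Runs one summarization job at a time on a daemon thread."""

    def __init__(self, summarize):
        self._summarize = summarize
        self._lock = threading.Lock()
        self._thread = None
        self.memory = None

    def trigger(self, messages) -> bool:
        """Start a background update; return False if one is already running."""
        if self._thread is not None and self._thread.is_alive():
            return False
        snapshot = list(messages)  # copy so later appends don't affect the job
        self._thread = threading.Thread(
            target=self._run, args=(snapshot,), daemon=True
        )
        self._thread.start()
        return True

    def _run(self, snapshot):
        result = self._summarize(snapshot)  # slow call happens outside the lock
        with self._lock:  # hold the lock only while writing shared state
            self.memory = result

    def wait(self, timeout=None):
        if self._thread is not None:
            self._thread.join(timeout)


def fake_summarize(messages):
    time.sleep(0.2)  # stands in for a slow model call
    return f"{len(messages)} messages summarized"


bg = BackgroundMemory(fake_summarize)
started = bg.trigger(["hi", "hello"])
skipped = bg.trigger(["hi", "hello", "more"])  # first job still running
bg.wait()
print(started, skipped, bg.memory)
```

Taking the snapshot before starting the thread and holding the lock only for the final assignment keeps the chat loop from ever blocking on the summarizer.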
+  code cell:
+    source:
+      import threading
+      import time
+      
+      
+      class InstantCompactingChatSession:
+          """
+          Maintains session memory via incremental background updates.
+          
+          Key insight: By updating memory in the background after each turn,
+          the summary is already ready when compaction is needed - instant swap!
+          """
+      
+          def __init__(
+              self,
+              system_message="You are a helpful assistant",
+              context_limit: int = 10000,
+              min_tokens_to_init: int = 5000,
+              min_tokens_between_updates: int = 2000,
+          ):
+              # Thresholds
+              self.context_limit = context_limit # the point at which the conversation is compacted so it does not exceed model limits
+              self.min_tokens_to_init = min_tokens_to_init # tokens needed to trigger initial memory creation; note this happens PROACTIVELY in background unlike traditional compaction
+              self.min_tokens_between_updates = min_tokens_between_updates # tokens needed to trigger memory update. only comes into play after initial memory is created and additional compaction (memory update) is needed after that
+      
+              # Conversation state
+              self.system_message = system_message
+              self.messages = []
+              self.current_context_window_tokens = 0
+      
+              # Session memory state
+              self.session_memory = None # this is the compacted conversation in session memory; for the demo we are storing this in memory, but in production you would write to session_memory.md file
+              self.last_summarized_index = 0 # The index of the last message included in the session memory
+              self.tokens_at_last_update = 0 # context window token count at the last memory update; used by _should_update_memory
+      
+              # Background update tracking
+              self._update_thread: threading.Thread | None = None
+              self.last_update_time = None
+              self._lock = threading.Lock()
+      
+          def chat(self, user_message: str):
+              """Process a chat turn with background session memory updates."""
+              if self.current_context_window_tokens >= self.context_limit:
+                  self.compact() # note that when this is triggered, the compaction has already been created and is just swapped in instantly
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=add_cache_control_to_system_prompt(self.system_message),
+                  messages=add_cache_control(self.messages),
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              # Calculate token usage including cache
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_read
+              
+              # Update context window tokens (includes cached tokens since they still count toward context)
+              self.current_context_window_tokens = total_input + response.usage.output_tokens
+      
+              # KEY DIFFERENCE: Trigger background memory update if needed proactively, before compaction is needed
+              background_status = None
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  background_status = "initializing" if self.session_memory is None else "updating"
+      
+              # Return usage info with cache stats
+              return assistant_message, response.usage, background_status
+          
+          # Helper methods to determine when to init/update/compact
+          def _should_init_memory(self) -> bool:
+              return (
+                  self.session_memory is None
+                  and self.current_context_window_tokens >= self.min_tokens_to_init
+              )
+      
+          # Helper method to determine if memory should be updated
+          def _should_update_memory(self) -> bool:
+              if self.session_memory is None:
+                  return False
+              tokens_since = self.current_context_window_tokens - self.tokens_at_last_update
+              return tokens_since >= self.min_tokens_between_updates
+      
+          # Methods to create initial session memory
+          def _create_session_memory(self, messages: list[dict]) -> str:
+              """Generate initial session memory from messages."""
+              messages = messages + [{"role": "user", "content": "\n\nThe full transcript of our conversation so far is included."}]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system= SESSION_CREATION_PROMPT,
+                  messages=add_cache_control(messages)
+              )
+              summary, _ = remove_thinking_blocks(response.content[0].text)  # clean up any <think> blocks because they are not needed in the session memory
+              return summary
+      
+          def _update_session_memory(self, new_messages: list[dict]) -> str:
+              """Update existing session memory with new messages. In practice, you may want to do this via file edit rather than full re-generation. But for demo purposes we do full regeneration here."""
+              new_messages = new_messages + [{"role": "user", "content": "\n\nNew messages to integrate into the session memory are included."}]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system= f"""{SESSION_CREATION_PROMPT}\n\nCurrent session memory:\n{self.session_memory}""",
+                  messages=add_cache_control(new_messages) # use exact same messages for prompt caching purposes
+              )
+              updated_summary, _ = remove_thinking_blocks(response.content[0].text)  # clean up any <think> blocks because they are not needed in the session memory
+              return updated_summary
+      
+          # Background memory update methods
+          def _background_memory_update(
+              self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
+          ):
+              """Run session memory update in a background thread."""
+              try:
+                  with self._lock:
+                      current_session_memory = self.session_memory
+                      last_index = self.last_summarized_index
+      
+                  if current_session_memory is None:
+                      new_memory = self._create_session_memory(messages_snapshot)
+                  else:
+                      # Get new messages since last summary
+                      new_messages = messages_snapshot[last_index :]
+                      if not new_messages:
+                          return
+                      new_memory = self._update_session_memory(new_messages)
+      
+                  # Update state (thread-safe)
+                  with self._lock:
+                      self.session_memory = new_memory
+                      self.last_summarized_index = snapshot_index
+                      self.tokens_at_last_update = current_tokens
+                      self.last_update_time = time.time()
+      
+              except Exception as e:
+                  print(f"   [Background] Error updating memory: {e}")
+      
+          # This makes sure only one background update runs at a time. If one is already running, we skip starting another. If not, we start a new thread to do the update.
+          def _trigger_background_update(self):
+              """Trigger a background session memory update."""
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  return
+      
+              messages_snapshot = self.messages.copy()
+              snapshot_index = len(messages_snapshot)
+              current_tokens = self.current_context_window_tokens
+      
+              self._update_thread = threading.Thread(
+                  target=self._background_memory_update,
+                  args=(messages_snapshot, snapshot_index, current_tokens),
+                  daemon=True,
+              )
+              self._update_thread.start()
+      
+          # Function to compact
+          def compact(self):
+              """INSTANT compaction using pre-built session memory."""
+              prev_msg_count = len(self.messages)
+      
+              # Ensure session memory is ready. Shouldn't be an issue normally, but here for safety.
+              if self.session_memory is None:
+                  if self._update_thread is not None and self._update_thread.is_alive():
+                      print("   ⏳ Waiting for background memory update...")
+                      self._update_thread.join(timeout=30.0)
+      
+                  if self.session_memory is None:
+                      print("   ⚠️  No pre-built memory, creating synchronously...")
+                      start = time.perf_counter()
+                      self.session_memory = self._create_session_memory(self.messages)
+                      elapsed = time.perf_counter() - start
+                      print(f"   ⏱️  Took {elapsed:.2f}s (but should be instant normally!)")
+                      self.last_summarized_index = len(self.messages)
+      
+              with self._lock:
+                  unsummarized = self.messages[self.last_summarized_index :]
+                  summary_message = [{
+                      "role": "user",
+                      "content": f"""This session is being continued from a previous conversation. Here is the session memory: {self.session_memory}. Continue from where we left off."""
+                  }]
+                  self.messages = summary_message + unsummarized
+                  self.last_summarized_index = 1
+      
+                  print(f"\n{'=' * 60}")
+                  print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
+                  print(f"   Session memory was pre-built (no wait time!)")
+                  print(f"{'=' * 60}")
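The swap performed by `compact()` is just a list operation, which is why it is instant once the memory exists. A minimal standalone sketch of that step follows. The helper name `swap_in_memory` is hypothetical, and the index of 10 is inferred from the demo output rather than stated in it.

```python
def swap_in_memory(messages, session_memory, last_summarized_index):
    """Replace already-summarized messages with a single memory message."""
    # Messages added after the last background update are kept verbatim.
    unsummarized = messages[last_summarized_index:]
    summary_message = {
        "role": "user",
        "content": (
            "This session is being continued from a previous conversation. "
            f"Here is the session memory: {session_memory}. "
            "Continue from where we left off."
        ),
    }
    return [summary_message] + unsummarized


# 12 messages with a memory covering the first 10 leaves 3 messages,
# consistent with the "Messages: 12 -> 3" line in the demo output.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(12)
]
compacted = swap_in_memory(history, "## User Intent ...", last_summarized_index=10)
print(len(history), "->", len(compacted))
```

Because no model call happens here, the cost of compaction is paid earlier, in the background updates.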
+  markdown cell:
+    source:
+      ### Example use of Instant Compaction
+  code cell:
+    source:
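+      # Uses the class defaults (compact at 10,000 tokens, init memory at 5,000, update every 2,000), which are low for demo purposes - in production you'd use higher values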
+      session = InstantCompactingChatSession(
+          system_message=SYSTEM_PROMPT,
+      )
+      
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.", 
+          "Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?"                   
+      ]
+      print("Starting conversation with instant compacting chat session...\n")
+      
+      turn_count = 0
+      for i, message in enumerate(messages, 1):
+          response, usage, background_status = session.chat(message)
+          turn_count += 1
+          
+          # Calculate cache stats
+          cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+          cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
+          total_input = usage.input_tokens + cache_read
+          
+          print(f"{'='*60}")
+          print(f"Turn {turn_count}:")
+          print(f"\nUser: {message}")
+          print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+          print(f"\nToken Usage:")
+          print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+          print(f"  Output: {usage.output_tokens:,}")
+          print(f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}")
+          
+          if cache_read > 0:
+              cache_pct = (cache_read / total_input) * 100
+              print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+          
+          if background_status:
+              print(f"\n  [Background] Proactively {background_status} session memory...")
+              print(f"  Context window: {session.current_context_window_tokens:,} tokens")
+          
+          print()
+          # Sleep to allow background updates to complete for demo purposes
+          if i < len(messages):
+              time.sleep(5)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with instant compacting chat session...
+          
+          ============================================================
+          Turn 1:
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Mystery Plot Ideas
+          
+          ## 1. **The Drowning Season**
+          ... (24 more lines)
+          
+          Token Usage:
+            Input: 318 (new: 318, cached: 0)
+            Output: 659
+            Messages: 2 | Memory: not yet
+          
+          ============================================================
+          Turn 2:
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # **The Cartographer's Error**
+          
+          A detective takes a job in a declining mining town where she's assigned what seems like busywork: investigating why survey maps keep disappearing from the county records office. It's treated as petty theft, probably teenagers.
+          ... (10 more lines)
+          
+          Token Usage:
+            Input: 999 (new: 999, cached: 0)
+            Output: 411
+            Messages: 4 | Memory: not yet
+          
+          ============================================================
+          Turn 3:
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # The Detective: **Nora Hayward**
+          
+          ## Background
+          ... (40 more lines)
+          
+          Token Usage:
+            Input: 1,433 (new: 1,433, cached: 0)
+            Output: 833
+            Messages: 6 | Memory: not yet
+          
+          ============================================================
+          Turn 4:
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # **The Cartographer's Error** - Story Outline
+          
+          ## ACT ONE: The Impossible Geography
+          ... (136 more lines)
+          
+          Token Usage:
+            Input: 2,288 (new: 2,288, cached: 0)
+            Output: 3,035
+            Messages: 8 | Memory: not yet
+          
+            [Background] Proactively initializing session memory...
+            Context window: 5,323 tokens
+          
+          ============================================================
+          Turn 5:
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # Chapter One: Arrival
+          
+          The town appeared through the rain like a photograph developing in reverse—color leaching into gray, details softening into suggestion. Nora Hayward's wipers beat a rhythm against the October drizzle as she followed Route 34 down the mountain pass, watching Millstone materialize in the valley below.
+          ... (136 more lines)
+          
+          Token Usage:
+            Input: 5,356 (new: 5,356, cached: 0)
+            Output: 2,778
+            Messages: 10 | Memory: ready
+          
+            [Background] Proactively updating session memory...
+            Context window: 8,134 tokens
+          
+          ============================================================
+          Turn 6:
+          
+          User: Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?
+          
+          Assistant: 
+          # Chapter Two: The Missing Maps
+          
+          Nora spent the weekend in her new apartment—a one-bedroom above a closed insurance office on what the lease said was Pine Street but the street sign claimed was Birch. She told herself the discrepancy was charming, in a declining-town sort of way. Small places were idiosyncratic. That was part of their appeal.
+          ... (210 more lines)
+          
+          Token Usage:
+            Input: 8,158 (new: 4,062, cached: 4,096)
+            Output: 3,500
+            Messages: 12 | Memory: ready
+            ✓ Cache hit! 50% of input from cache
+          
+            [Background] Proactively updating session memory...
+            Context window: 11,658 tokens
+          
+  code cell:
+    source:
+      message = "What did we just talk about? Give me one sentence"
+      response, usage, background_status = session.chat(message)
+      
+      # Calculate cache stats
+      cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+      total_input = usage.input_tokens + cache_read
+      
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print(f"\nToken Usage:")
+      print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+      print(f"  Output: {usage.output_tokens:,}")
+      print(f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}")
+      
+      if cache_read > 0:
+          cache_pct = (cache_read / total_input) * 100
+          print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 12 → 3
+             Session memory was pre-built (no wait time!)
+          ============================================================
+          
+          User: What did we just talk about? Give me one sentence
+          
+          Assistant: 
+          I just drafted Chapter Two where Nora investigates stolen survey maps from the 1950s and discovers that GPS can't reliably locate buildings and street intersections don't match between historical maps and current reality, confirming something is physically wrong with Millstone's geography.
+          
+          Token Usage:
+            Input: 5,286 (new: 5,286, cached: 0)
+            Output: 60
+            Messages: 5 | Memory: ready
+  code cell:
+    source:
+      message = "What about a sequel?"
+      response, usage, background_status = session.chat(message)
+      
+      # Calculate cache stats
+      cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+      total_input = usage.input_tokens + cache_read
+      
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print(f"\nToken Usage:")
+      print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+      print(f"  Output: {usage.output_tokens:,}")
+      print(f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}")
+      
+      if cache_read > 0:
+          cache_pct = (cache_read / total_input) * 100
+          print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+  markdown cell:
+    source:
+      ## Advanced: Understanding Prompt Caching
+  markdown cell:
+    source:
+      
+      Prompt caching makes the cached portion of each background update **~10x cheaper** (a 90% discount on the cached prefix). The trick:
+      1. Pass the **full conversation** to the background summarizer
+      2. Add `cache_control` markers so subsequent requests hit the cache
+      3. Only the new "summarize this" instruction is billed at full price
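To see how the discount adds up over a conversation, here is a small cost calculation. It is a simplified sketch under stated assumptions: the prices ($3/M for uncached input, $0.30/M for cached reads) and the per-turn input sizes are the illustrative figures used in this section's walkthrough, output tokens are ignored, and any surcharge for writing to the cache is left out.

```python
FULL_PRICE_PER_M = 3.00    # assumed $/million uncached input tokens
CACHED_PRICE_PER_M = 0.30  # assumed $/million cached input tokens (90% off)

# Total input tokens sent on each turn (the context grows every turn).
turn_input_tokens = [500, 1500, 3000, 5000]


def cost_without_caching(turns):
    """Every token of every request is billed at full price."""
    return sum(turns) * FULL_PRICE_PER_M / 1_000_000


def cost_with_caching(turns):
    """The previous turn's context is read from cache; only new tokens are full price."""
    total = 0.0
    cached_prefix = 0
    for tokens in turns:
        new_tokens = tokens - cached_prefix
        total += (
            cached_prefix * CACHED_PRICE_PER_M + new_tokens * FULL_PRICE_PER_M
        ) / 1_000_000
        cached_prefix = tokens  # this turn's full context becomes the next prefix
    return total


without = cost_without_caching(turn_input_tokens)
with_cache = cost_with_caching(turn_input_tokens)
savings = 1 - with_cache / without
print(f"without: ${without:.4f}  with: ${with_cache:.4f}  savings: {savings:.0%}")
```

The saving grows with conversation length, because the cached prefix makes up a larger share of each request.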
+      
+      ```
+      ┌─────────────────────────────────────────────────────────────────────────────────┐
+      │                    PROMPT CACHING FOR LONG CONVERSATIONS                        │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  WITHOUT CACHING: Pay full price for entire context every turn                 │
+      │  ════════════════════════════════════════════════════════════                   │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]                         →  500 tokens  @ $3/M │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]           → 1500 tokens  @ $3/M │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3]... → 3000 tokens  @ $3/M │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3]... → 5000 tokens  @ $3/M │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: 10,000 tokens = $0.030      │
+      │                                                                                 │
+      │                                                                                 │
+      │  WITH CACHING: Pay full price once, then 90% discount on prefix                │
+      │  ═══════════════════════════════════════════════════════════════                │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]◆                        →  500 tokens  @ $3/M │
+      │                                ▲                            (cache created)    │
+      │                          cache breakpoint                                       │
+      │                                                                                 │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]◆                                │
+      │           ╰─────── cached ──────╯                                              │
+      │                500 @ $0.30/M + 1000 new @ $3/M  =  $0.0032                     │
+      │                                                                                 │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3][Asst3]◆                  │
+      │           ╰──────────── cached ─────────────╯                                  │
+      │               1500 @ $0.30/M + 1500 new @ $3/M  =  $0.0050                     │
+      │                                                                                 │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3][Asst3][User4][Asst4]◆    │
+      │           ╰───────────────────── cached ─────────────────────╯                 │
+      │                     3000 @ $0.30/M + 2000 new @ $3/M  =  $0.0069               │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: $0.0166  (45% savings)     │
+      │                                                                                 │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  COMPACTION + CACHING: Double benefit                                           │
+      │  ════════════════════════════════════                                           │
+      │                                                                                 │
+      │    Main Chat                      Background Summarizer                         │
+      │    ─────────                      ─────────────────────                         │
+      │                                                                                 │
+      │  [Conversation grows...]          [Same conversation prefix]◆ + [Summarize!]   │
+      │         │                                    │                                  │
+      │         │                         Cache hit! Only pays for                      │
+      │         │                         the summarization prompt                      │
+      │         │                                    │                                  │
+      │         ▼                                    ▼                                  │
+      │  Context limit reached  ──────►  Session memory ready instantly                │
+      │                                  (built cheaply in background)                  │
+      │                                                                                 │
+      │  ┌──────────────────────────────────────────────────────────────────────────┐  │
+      │  │  Key insight: The background summarizer reuses the same conversation     │  │
+      │  │  prefix that was just sent to the main chat - automatic cache hit!       │  │
+      │  └──────────────────────────────────────────────────────────────────────────┘  │
+      │                                                                                 │
+      └─────────────────────────────────────────────────────────────────────────────────┘
+      
+      ◆ = cache_control breakpoint (cache everything before this point)
+      ```
+      
+      ### Why this matters for compaction
+      
+      | Scenario | Cost per background update | Notes |
+      |----------|---------------------------|-------|
+      | No caching | Full input cost | 5,000 tokens × $3/M = $0.015 |
+      | With caching | ~19% of full input cost | 500 new × $3/M + 4,500 cached × $0.30/M ≈ $0.003 |
+      | **Savings** | **~80%** | Compounds over many updates |
+      
+      The longer the conversation, the bigger the savings—exactly when you need compaction most!
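
The per-turn figures in the diagram above can be checked with a few lines of arithmetic. This is a minimal sketch: the prices ($3/M input, $0.30/M cache read) and token counts are copied from the diagram, `turn_cost` is a hypothetical helper, and, like the diagram, it does not model any extra charge for writing cache entries.

```python
BASE_PRICE_PER_MTOK = 3.00   # $ per million input tokens, as in the diagram
CACHE_READ_PER_MTOK = 0.30   # cached prefix billed at 10% of the base price

# (cached_prefix_tokens, new_tokens) for turns 1-4, copied from the diagram
turns = [(0, 500), (500, 1000), (1500, 1500), (3000, 2000)]


def turn_cost(cached: int, new: int, use_cache: bool) -> float:
    """Input cost in dollars for one turn (hypothetical helper)."""
    if not use_cache:
        # Without caching, the whole context is billed at the base price.
        return (cached + new) * BASE_PRICE_PER_MTOK / 1_000_000
    # With caching, only the new suffix pays the base price.
    return (cached * CACHE_READ_PER_MTOK + new * BASE_PRICE_PER_MTOK) / 1_000_000


without_cache = sum(turn_cost(c, n, use_cache=False) for c, n in turns)
with_cache = sum(turn_cost(c, n, use_cache=True) for c, n in turns)
savings_pct = (1 - with_cache / without_cache) * 100

print(f"Without caching: ${without_cache:.4f}")
print(f"With caching:    ${with_cache:.4f}")
print(f"Savings:         {savings_pct:.0f}%")
```

The totals agree with the diagram to within rounding: the diagram sums the rounded per-turn costs, so its caching total is $0.0166 rather than $0.0165.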
+  markdown cell:
+    source:
+      ### How the Caching Works
+      
+      The key is in `_add_cache_control()` and `_create_session_memory_cached()`:
+      
+      ```python
+      # 1. Mark the last conversation message with cache_control
+      {
+          "role": "user",
+          "content": [{
+              "type": "text",
+              "text": msg["content"],
+              "cache_control": {"type": "ephemeral"}  # <-- This creates a cache breakpoint
+          }]
+      }
+      
+      # 2. Also mark the system prompt
+      system=[{
+          "type": "text",
+          "text": "You are a session memory agent...",
+          "cache_control": {"type": "ephemeral"}
+      }]
+      ```
+      
+      **Why this works:**
+      - The first background update creates a cache entry for `[System + Messages]`
+      - Subsequent updates with the same message prefix get **cache hits**
+      - Only the new summarization instruction is billed at full price
+      - Cache entries have a 5-minute TTL, so rapid updates benefit most
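
The two markers described above could be assembled into one request roughly as follows. This is a sketch under stated assumptions: `build_cached_request` is a hypothetical helper (not the notebook's `_add_cache_control()`), every message `content` is assumed to be a plain string, and no API call is made.

```python
from typing import Any


def build_cached_request(
    system_text: str, messages: list[dict[str, Any]], model: str
) -> dict[str, Any]:
    """Build Messages API keyword arguments with two cache breakpoints.

    Hypothetical helper: one breakpoint on the system prompt, one on the
    last conversation message.
    """
    converted = []
    last_idx = len(messages) - 1
    for i, msg in enumerate(messages):
        block: dict[str, Any] = {"type": "text", "text": msg["content"]}
        if i == last_idx:
            # Breakpoint: everything up to and including this block is cacheable.
            block["cache_control"] = {"type": "ephemeral"}
        converted.append({"role": msg["role"], "content": [block]})

    return {
        "model": model,
        "max_tokens": 1024,  # illustrative value
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": converted,
    }


request = build_cached_request(
    "You are a session memory agent...",
    [
        {"role": "user", "content": "Draft chapter one."},
        {"role": "assistant", "content": "Here is chapter one..."},
        {"role": "user", "content": "Now summarize the session."},
    ],
    model="claude-sonnet-4-5",
)
# The dict would then be passed as client.messages.create(**request).
print(request["messages"][-1]["content"][0])
```

Only the last message carries a marker; the earlier ones are converted to the same list format so the prefix stays identical between requests.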
+      
+      **Cost math:**
+      - Without caching: 5,000 tokens × $3.00/1M = $0.015 per update
+      - With caching: 500 new tokens × $3.00/1M + 4,500 cached × $0.30/1M = $0.00285
+      - **Savings: ~80%** on background summarization costs
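
The cost math above generalizes to longer conversations. The sketch below assumes the same prices and a fixed 500 uncached tokens per update; `update_cost` is a hypothetical helper, and cache-write charges are not modeled.

```python
def update_cost(
    total_tokens: int,
    cached_tokens: int,
    base_per_mtok: float = 3.00,
    cache_read_per_mtok: float = 0.30,
) -> float:
    """Input cost in dollars of one background update (hypothetical helper)."""
    new_tokens = total_tokens - cached_tokens
    return (new_tokens * base_per_mtok + cached_tokens * cache_read_per_mtok) / 1_000_000


uncached = update_cost(5_000, cached_tokens=0)      # 5,000 tokens at full price
cached = update_cost(5_000, cached_tokens=4_500)    # 500 new + 4,500 cached
savings = 1 - cached / uncached
print(f"5,000-token update: ${uncached:.5f} -> ${cached:.5f} ({savings:.0%} saved)")

# With a fixed 500 new tokens per update, savings grow with conversation length.
for total in (5_000, 20_000, 100_000):
    full = update_cost(total, cached_tokens=0)
    discounted = update_cost(total, cached_tokens=total - 500)
    print(f"{total:>7,} tokens: {1 - discounted / full:.1%} saved")
```

Because cached tokens still cost 10% of the base price, the savings approach but never reach 90%.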

Generated by nbdime

@github-actions github-actions Bot left a comment

PR Review

Recommendation: REQUEST_CHANGES

Summary

This PR adds a valuable notebook on session memory compaction patterns with solid technical implementation. However, it has critical issues that must be addressed: an accidental deletion of .env.example, missing registry entry, and gaps in pedagogical structure required by the cookbook style guide.

Actionable Feedback (8 critical items)

Critical Issues (Must Fix):

  • Restore .env.example - This file was deleted but is referenced in CLAUDE.md Quick Start instructions and serves as the template for new contributors. This deletion appears accidental and unrelated to the notebook content.

  • Register in registry.yaml - Add entry for misc/session_memory_compaction.ipynb with title, description, authors, date, and categories. Without this, the notebook won't appear in the cookbook index.

  • misc/session_memory_compaction.ipynb (cell 1) - Add proper problem-focused introduction following style guide. Should include: Hook explaining the problem, Why it matters, and 2-4 Terminal Learning Objectives as bullets.

  • misc/session_memory_compaction.ipynb (after cell 1) - Add Prerequisites and Setup section with pip install, load_dotenv() for API key setup, and MODEL constant definition.

  • misc/session_memory_compaction.ipynb (cell 1) - Fix typo: Utlizing → Utilizing

  • misc/session_memory_compaction.ipynb (end) - Add conclusion section that maps back to learning objectives, summarizes accomplishments, and suggests next steps.

  • misc/session_memory.md - Remove this placeholder file or add actual content. Placeholder files should not be committed.

  • misc/session_memory_compaction.ipynb:203 - Update MODEL to use full model ID: claude-sonnet-4-5-20250929 per CLAUDE.md standards.

Important Issues (Should Fix):

  • Clarify relationship to existing tool_use/automatic-context-compaction.ipynb - Consider adding cross-references or merging if content overlaps significantly.

  • misc/session_memory_compaction.ipynb (cells 4,8,10+) - Change code cell languageId from coconut to python for proper syntax highlighting.

  • misc/session_memory_compaction.ipynb:cell-16 - Replace "it took xx seconds" placeholder with actual timing.

  • Run make check and make fix before merging to ensure linting/formatting compliance.

Detailed Review

Code Quality Strengths:

  • Threading implementation is solid with proper locking and daemon threads
  • Cache control helpers are well-designed and reusable
  • Token tracking provides clear cost visibility
  • Helper functions have good docstrings

Pedagogical Structure:
The notebook needs work to meet cookbook standards. Missing: problem-focused introduction, prerequisites/setup section with pip install, conclusion mapping to learning objectives. However, the logical progression (fundamentals → traditional → instant → advanced) is effective.

Security:
No hardcoded API keys detected. Proper use of anthropic.Anthropic() client. Setup section should explicitly demonstrate load_dotenv() for educational purposes.

Content Considerations:
This notebook overlaps with existing tool_use/automatic-context-compaction.ipynb. Consider adding cross-references or clarifying differentiation (manual patterns vs SDK automatic feature).

Positive Notes:

  • Threading pattern for background compaction is valuable
  • Token metrics help readers understand cost implications
  • Prompt engineering best practices are clearly explained
  • Code quality is generally high

Next Steps: Please address the critical items above, particularly restoring .env.example and adding the registry entry. Once these are resolved, I'll be happy to re-review!

jsham042 and others added 6 commits January 11, 2026 21:25
Move compaction instructions from system prompt to user message to enable
cache sharing between main chat and compaction calls. This allows the
compaction call to reuse the cached conversation prefix from main chat,
reducing costs by ~90% for the prefix tokens.

Changes:
- TraditionalCompactingChatSession.compact() now uses self.system_message
- InstantCompactingChatSession._create_session_memory() uses self.system_message
- InstantCompactingChatSession._update_session_memory() uses self.system_message
- Compaction instructions moved to user message with role-switching preamble
- Removed unused add_cache_control_to_system_prompt helper function
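
The reasoning behind this commit can be illustrated without calling the API: a cache hit needs the compaction request to begin with exactly the same system prompt and messages as the main chat request. The function names below are hypothetical and the payloads are simplified to plain dicts.

```python
def main_chat_request(system: str, messages: list[dict]) -> dict:
    """Simplified payload for a normal chat turn (hypothetical)."""
    return {"system": system, "messages": list(messages)}


def compaction_request(system: str, messages: list[dict], instructions: str) -> dict:
    """New style: same system prompt, instructions appended as a user message."""
    return {
        "system": system,
        "messages": list(messages) + [{"role": "user", "content": instructions}],
    }


def legacy_compaction_request(messages: list[dict], instructions: str) -> dict:
    """Old style: instructions live in the system prompt, changing the prefix."""
    return {"system": instructions, "messages": list(messages)}


def shares_prefix(main: dict, other: dict) -> bool:
    """True if `other` starts with the same system prompt and messages as `main`."""
    n = len(main["messages"])
    return main["system"] == other["system"] and other["messages"][:n] == main["messages"]


history = [
    {"role": "user", "content": "Outline the story."},
    {"role": "assistant", "content": "Chapter 1: ..."},
]
main = main_chat_request("You are a short story writer.", history)
new_style = compaction_request("You are a short story writer.", history, "Compress the conversation...")
old_style = legacy_compaction_request(history, "Compress the conversation...")

print(shares_prefix(main, new_style))  # True: prefix can be served from cache
print(shares_prefix(main, old_style))  # False: different system prompt
```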

Claude-Generated-By: Claude Code (cli/claude-opus-4-5=100%)
Claude-Steers: 4
Claude-Permission-Prompts: 2
Claude-Escapes: 0
🏠 Remote-Dev: homespace
…mization

feat: optimize prompt caching for conversation compaction
@github-actions


Notebook Changes

This PR modifies the following notebooks:

📓 misc/session_memory_compaction.ipynb

View diff
nbdiff /dev/null misc/session_memory_compaction.ipynb (fc4bc6491a16cfcc00178b1df5eb8842e2b8b3d1)
--- /dev/null  2026-01-12 23:10:57.712177
+++ misc/session_memory_compaction.ipynb (fc4bc6491a16cfcc00178b1df5eb8842e2b8b3d1)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Session memory compaction
+  markdown cell:
+    source:
+      This cookbook covers three main topics:
+      1. Writing a quality prompt to compact session chat history
+      2. Utilizing instant compacting to improve user chat experience
+      3. Managing costs and latency for long context conversations using prompt caching
+  markdown cell:
+    source:
+      #### Setup
+  code cell:
+    source:
+      import anthropic
+      from anthropic.types import MessageParam, TextBlockParam
+      import warnings
+      
+      client = anthropic.Anthropic()
+      MODEL = "claude-sonnet-4-5"
+                                                                                           
+      # helper functions:
+      def truncate_response(text: str, max_lines: int = 15) -> str:
+          """Truncate long responses for cleaner output display."""
+          lines = text.strip().split("\n")
+          if len(lines) <= max_lines:
+              return text
+          return "\n".join(lines[:max_lines]) + f"\n... ({len(lines) - max_lines} more lines)"
+      
+      def remove_thinking_blocks(text: str):
+          """Remove <think>...</think> blocks from the text."""
+          import re
+      
+          matches = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
+          cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+          return cleaned, "".join(matches)
+      
+      def add_cache_control(messages: list[dict]) -> list[MessageParam]:
+          """Add cache_control to the last user message for prompt caching.
+      
+          For prompt caching to work, the message prefix structure must be identical between requests.
+          All messages are converted to list format for consistency, and cache_control is placed on
+          the last user message to match the standard API call pattern.
+          """
+          cached_messages: list[MessageParam] = []
+          last_user_idx = None
+      
+          # Find last user message index
+          for i, msg in enumerate(messages):
+              if msg["role"] == "user":
+                  last_user_idx = i
+      
+          for i, msg in enumerate(messages):
+              content = msg["content"]
+              text = content if isinstance(content, str) else content[0]["text"]
+      
+              content_block: TextBlockParam = {"type": "text", "text": text}
+              if i == last_user_idx:
+                  content_block["cache_control"] = {"type": "ephemeral"}
+      
+              cached_messages.append({"role": msg["role"], "content": [content_block]})
+      
+          return cached_messages
+      
+      def estimate_tokens(text: str) -> int:
+          """Rudimentary token estimation: 1 token per 4 characters."""
+          return len(text) // 4
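
As a quick illustration of how the `remove_thinking_blocks` helper behaves, the sketch below restates it so the snippet runs on its own, then applies it to a made-up model response. The sample text is invented for the example.

```python
import re


def remove_thinking_blocks(text: str) -> tuple[str, str]:
    """Restated from the setup cell: strip <think>...</think> blocks."""
    matches = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
    cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
    return cleaned, "".join(matches)


# Invented response: analysis in <think> tags, followed by the summary.
raw = (
    "<think>\nThe user asked for a detective story.\n</think>\n"
    "## User Intent\nWrite a detective story set in a small town."
)

summary, removed = remove_thinking_blocks(raw)
print(summary)
print(f"Removed about {len(removed) // 4} tokens of analysis")
```

The second return value is what lets the caller estimate how many output tokens belonged to the discarded analysis rather than to the summary itself.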
+  markdown cell:
+    source:
+      ## 1. Writing a compaction prompt
+  markdown cell:
+    source:
+      Make sure you have a well-structured session memory prompt.
+      
+      Some best practices include:
+      - Use chain-of-thought before summarizing — analyze first, then output                                                                                         
+      - Enumerate exactly what to preserve: file paths, code snippets, errors, user corrections                                                                      
+      - Weight recency heavily — the end of the conversation is the active context                                                                                   
+      - Require verbatim quotes for next steps to prevent task drift                                                                                                 
+      - Use structured sections with token budgets per section                                                                                                       
+      - Include a "Current State" section that always reflects the moment of compaction
+      
+      Some pitfalls include:
+      - Vague prompts like "summarize this conversation" produce lossy output                                                                                        
+      - Treating all messages equally loses the active working context                                                                                               
+      - Paraphrasing next steps introduces subtle drift that compounds                                                                                               
+      - Omitting error history causes the model to retry failed approaches                                                                                           
+      - Dropping user corrections makes the model revert to old behaviors                                                                                            
+      - No token limits lets one section consume the entire summary                                                                                                  
+      - Summarizing for human readability instead of model continuity
+      - Asking the agent to compress tool call results into the summary; the agent can retrieve those results again later if it needs them
+  code cell:
+    source:
+      SESSION_MEMORY_PROMPT = """
+      Compress the conversation into a structured summary 
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's 
+      ability to continue working, not human readability.
+      
+      <analysis-instructions>
+      Before generating your summary, analyze the transcript in <think>...</think> tags:
+      1. What did the user originally request? (Exact phrasing)
+      2. What actions succeeded? What failed and why?
+      3. Did the user correct or redirect the assistant at any point?
+      4. What was actively being worked on at the end?
+      5. What tasks remain incomplete or pending?
+      6. What specific details (IDs, paths, values, names) must survive compression?
+      </analysis-instructions>
+      
+      <summary-format>
+      ## User Intent
+      The user's original request and any refinements. Use direct quotes for key requirements.
+      If the user's goal evolved during the conversation, capture that progression.
+      
+      ## Completed Work
+      Actions successfully performed. Be specific:
+      - What was created, modified, or deleted
+      - Exact identifiers (file paths, record IDs, URLs, names)
+      - Specific values, configurations, or settings applied
+      
+      ## Errors & Corrections
+      - Problems encountered and how they were resolved
+      - Approaches that failed (so they aren't retried)
+      - User corrections: "don't do X", "actually I meant Y", "that's wrong because..."
+      Capture corrections verbatim—these represent learned preferences.
+      
+      ## Active Work
+      What was in progress when the session ended. Include:
+      - The specific task being performed
+      - Direct quotes showing exactly where work left off
+      - Any partial results or intermediate state
+      
+      ## Pending Tasks
+      Remaining items the user requested that haven't been started.
+      Distinguish between "explicitly requested" and "implied/assumed."
+      
+      ## Key References
+      Important details needed to continue:
+      - Identifiers: IDs, paths, URLs, names, keys
+      - Values: numbers, dates, configurations, credentials (redacted)
+      - Context: relevant background information, constraints, preferences
+      - Citations: sources referenced during the conversation
+      </summary-format>
+      
+      <preserve-rules>
+      Always preserve when present:
+      - Exact identifiers (IDs, paths, URLs, keys, names)
+      - Error messages verbatim
+      - User corrections and negative feedback
+      - Specific values, formulas, or configurations
+      - Technical constraints or requirements discovered
+      - The precise state of any in-progress work
+      </preserve-rules>
+      
+      <compression-rules>
+      - Weight recent messages more heavily—the end of the transcript is the active context
+      - Omit pleasantries, acknowledgments, and filler ("Sure!", "Great question")
+      - Omit system context that will be re-injected separately
+      - Keep each section under 500 words; condense older content to make room for recent
+      - If you must cut details, preserve: user corrections > errors > active work > completed work
+      </compression-rules>
+      """
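
The prompt asks for each section to stay under 500 words, but nothing enforces that. One possible check on a generated summary is sketched below; `section_word_counts` and `over_budget` are hypothetical helpers that assume the summary uses the `## ` headings from the format above.

```python
import re


def section_word_counts(summary: str) -> dict[str, int]:
    """Count words under each '## ' heading of a summary (hypothetical helper)."""
    counts: dict[str, int] = {}
    current = None
    for line in summary.splitlines():
        heading = re.match(r"^##\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1)
            counts[current] = 0
        elif current is not None:
            counts[current] += len(line.split())
    return counts


def over_budget(summary: str, max_words: int = 500) -> list[str]:
    """Return the names of sections that exceed the word budget."""
    return [name for name, n in section_word_counts(summary).items() if n > max_words]


# Invented summary: one short section and one that is too long.
sample = "## User Intent\nWrite a story.\n## Active Work\n" + "word " * 501
print(section_word_counts(sample))
print(over_budget(sample))
```

A caller could use the result to re-prompt for a shorter section or simply log a warning; which is appropriate depends on the application.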
+  markdown cell:
+    source:
+      ### Code example using traditional compacting
+      In traditional compaction, you generate one summary once the token threshold is reached.
+      Traditional compaction is slow: when you hit the context limit, you wait for a summary.
+  markdown cell:
+    source:
+      
+      ```
+      TRADITIONAL COMPACTION (slow)
+      ─────────────────────────────
+      Turn 1 → Turn 2 → Turn 3 → ... → Turn N → CONTEXT FULL!
+
+
+                                          ┌─────────────────┐
+                                          │ Generate summary│
+                                          │ ( USER WAITS !) │
+                                          └─────────────────┘
+
+
+                                               Continue
+      
+      ```
+  code cell:
+    source:
+      import time
+      
+      class TraditionalCompactingChatSession:
+          """Traditional chat session with compaction after the fact."""
+          def __init__(self, system_message="You are a helpful assistant", context_limit: int = 10000):
+              self.system_message = system_message
+              self.context_limit = context_limit # the point at which the conversation is compacted so it does not exceed model limits.
+              self.messages = []
+              self.current_context_window_tokens = 0
+              self.summary = None
+          
+          def chat(self, user_message: str):
+              # In traditional compaction, we check if we need to compact when the user sends a message. NOT IDEAL!
+              if self.current_context_window_tokens >= self.context_limit:
+                  print(f"\n🧹 Context window at {self.current_context_window_tokens} tokens. Limit exceeded, compacting session memory...")
+                  self.compact() # compacts everything before the new user message
+              
+              self.messages.append({"role": "user", "content": user_message})      
+              print(f"\nUser: {user_message}")                                                                                        
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=self.system_message,
+                  messages=add_cache_control(self.messages)
+              )
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+             
+              print(f"\nAssistant: \n{truncate_response(assistant_message, max_lines=15)}")
+              
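+              # approximate current token count in the conversation before the next user message
+              # (input_tokens excludes cached tokens, so add both cache reads and cache writes)
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              cache_write = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_read + cache_write
+              self.current_context_window_tokens = total_input + response.usage.output_tokens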
+             
+              print(
+                  f"Input={total_input:,}, Prompt cached used= {cache_read > 0} | "
+                  f"Output={response.usage.output_tokens:,} | "
+                  f"Messages={len(self.messages)}"
+              )
+              return assistant_message, response.usage
+          
+          def compact(self): 
+              start_time = time.perf_counter()
+              
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,  # Same as main chat for cache sharing
+                  messages=add_cache_control(self.messages) + [{
+                      "role": "user",
+                      "content": SESSION_MEMORY_PROMPT
+                  }]
+              )
+              elapsed = time.perf_counter() - start_time
+              
+              # Generate new summary message
+              self.summary, removed_text = remove_thinking_blocks(response.content[0].text) # clean up any <think> blocks because they are not needed in the session memory
+              approximate_summary_tokens = response.usage.output_tokens - round(len(removed_text) / 4)  # rough estimate of tokens removed from summary
+             
+              # Replace prior messages with new summary message
+              self.messages = [{
+                  "role": "user",
+                  "content": f"""This session is being continued from a previous conversation. Here is the session memory:\n\n{self.summary}\n\nContinue from where we left off."""
+              }]
+              
+              # Show token reduction if we just compacted
+              reduction = self.current_context_window_tokens - approximate_summary_tokens
+              pct = (reduction / self.current_context_window_tokens) * 100
+              
+              print(f"\n{'-' * 60}")
+              print(f"📝 New session memory created.")
+              print(f"✅ Tokens reduced: {self.current_context_window_tokens:,} → {approximate_summary_tokens:.0f} ({reduction:,} tokens saved, {pct:.0f}% reduction)")
+              print(f"⏱️ Compaction time: {elapsed:.2f}s (user waiting...)")
+              print(f" Cache used: {(getattr(response.usage, 'cache_read_input_tokens', 0) or 0) > 0}")
+              print(f"{'-' * 60}")
+              
+              # Update token count to reflect compacted state
+              self.current_context_window_tokens = approximate_summary_tokens
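
The timing problem with this class can be seen without any API calls. The sketch below is a simplified simulation, not the notebook's code: `blocking_turns` is a hypothetical function, each turn is assumed to add a fixed number of tokens, and the summary size is an assumed constant.

```python
def blocking_turns(
    turn_tokens: list[int], context_limit: int, summary_tokens: int
) -> list[int]:
    """Return the 1-based turns at which the user would wait for compaction.

    Mirrors the check at the top of chat(): compaction runs at the start of
    a turn if the context has already reached the limit.
    """
    context = 0
    blocked = []
    for turn, added in enumerate(turn_tokens, start=1):
        if context >= context_limit:
            blocked.append(turn)      # user waits here while the summary is generated
            context = summary_tokens  # history is replaced by the summary
        context += added              # tokens added by this turn's prompt and reply
    return blocked


# Invented per-turn token growth for a six-turn conversation.
growth = [1_300, 700, 2_500, 3_000, 4_000, 4_000]
print(blocking_turns(growth, context_limit=10_000, summary_tokens=1_500))
```

In this run the limit is crossed during turn 5, so the blocking summary call lands at the start of turn 6, exactly when the user has just sent a new message.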
+  markdown cell:
+    source:
+      Below we simulate a conversation between an author and an LLM that helps write stories.
+  code cell:
+    source:
+      SYSTEM_PROMPT = """
+      You are a short story writer who helps authors develop their ideas into compelling narratives.
+      
+      ## What You Do
+      
+      **Plot Development**
+      - Help authors work through story structure, pacing, and narrative arc
+      - Identify plot holes, inconsistencies, or missed opportunities
+      - Suggest ways to raise stakes, add tension, or deepen conflict
+      - Brainstorm twists, resolutions, and scene transitions
+      
+      **Character Development**
+      - Develop backstories, motivations, and internal conflicts
+      - Ensure characters have distinct voices and consistent behavior
+      - Explore character relationships and how they drive the plot
+      - Help authors understand what their characters want vs. what they need
+      
+      **Drafting**
+      - Write short stories or scenes based on the author's ideas and direction
+      - Match tone, genre conventions, and stylistic preferences
+      - Show rather than tell when bringing scenes to life
+      - Craft dialogue that reveals character and advances plot
+      
+      ## How You Work
+      - You are the lead writer. When you disagree with a creative choice, say so respectfully, but ultimately defer to what the author wants.
+      - DO NOT ask the user to provide more context or clarify their request. Assume you have enough information to proceed.
+      """
+  code cell:
+    source:
+      session = TraditionalCompactingChatSession(system_message=SYSTEM_PROMPT)
+      
+      # Simulated conversation
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.", 
+          "Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?"                   
+      ]
+      
+      print("Starting conversation...\n")
+      
+      turn_count = 0
+      
+      for i, message in enumerate(messages, 1):
+          turn_count += 1
+          print((
+              f"==============================================\n"
+              f"Turn {turn_count}:\n"
+          ))
+          response, usage = session.chat(message)
+          
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation...
+          
+          ==============================================
+          Turn 1:
+          
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Mystery Plot Ideas
+          
+          ## 1. **The Drowning Season**
+          
+          **Setup:** Twenty-three-year-old Detective Maya Reeves returns to her hometown of Millbrook after five years away, assigned to investigate what locals insist is just another tragic drowning at the quarry—the third one this summer. The sheriff's department is understaffed, and Maya's the only one taking it seriously.
+          
+          **The Mystery:** Each victim was found at dawn, fully clothed, with no water in their lungs. They all have the same strange detail: fresh soil under their fingernails, though the quarry is surrounded by rock. Maya discovers all three victims had recently started remembering "lost time" from their childhood—gaps of hours or days they couldn't explain. The town has a secret: twenty years ago, a local therapist ran a controversial "memory recovery" program for troubled kids. Maya was one of those kids, and she's starting to remember why she really left town.
+          
+          **What Makes It Work:** The murders are connected to buried trauma, both literal and psychological. The killer believes they're saving the victims from remembering something terrible. Maya must solve the case while confronting her own suppressed memories, never sure if what she's remembering is real or if she might be the next target.
+          
+          ---
+          
+          ## 2. **The Lighthouse Keeper's Daughter**
+          
+          **Setup:** Rookie detective James Park is sent to the coastal town of Beacon's Rest to investigate the disappearance of fifteen-year-old Clare Whitmore, daughter of the reclusive lighthouse keeper. She vanished during the annual Founder's Day festival three days ago. The town is cooperative but oddly unsurprised—seven girls have disappeared from Beacon's Rest over the past forty years, always during Founder's Day, always at age fifteen, and always without a trace.
+          ... (18 more lines)
+          Input=318, Prompt cached used= False | Output=988 | Messages=2
+          ==============================================
+          Turn 2:
+          
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # The Cartographer's Cipher
+          
+          **Setup:** Detective Nora Alike, 24, takes her first solo case in the dying mining town of Shepherdstown (population 847 and falling). An elderly cartographer named Eugene Fisk was found dead in his cluttered workshop, officially ruled a heart attack. But his daughter insists something's wrong: her father spent his final month frantically updating a massive hand-drawn map of the town—adding buildings that don't exist, removing ones that do, and marking dozens of red X's in seemingly random locations.
+          
+          **The Twist:** Nora starts investigating the X's out of curiosity. The first one leads to an old park bench—underneath it, she finds a coffee can containing $2,300 in cash and a note: "For the Hendersons' mortgage, June 1987." The second X marks a spot behind the defunct movie theater where she uncovers an envelope with photographs proving the former mayor's affair (the scandal that ended his career). The third X reveals a time capsule with a child's confession to starting the fire that burned down the elementary school forty years ago.
+          
+          **The Real Mystery:** Eugene wasn't murdered—but he wasn't mapping the town as it is. He was mapping every secret buried within it, physical and metaphorical. For sixty years, he'd been Shepherdstown's unofficial confessor. People would visit his workshop and tell him things they couldn't tell anyone else, and he'd mark them on his map in an elaborate code only he understood. In his final month, knowing he was dying, he decoded everything—creating a map that, if made public, would destroy half the town.
+          
+          **The Investigation:** Nora realizes someone is also following the map. Items are being dug up, locations disturbed. She races to decipher Eugene's system before this unknown person does. But the deeper she digs, the more she uncovers: covered-up deaths, stolen inheritances, a hit-and-run from 1992, evidence of who really embezzled the union funds that closed the mine. Every secret connects to another. The town isn't dying by accident—it's been poisoned from within by decades of buried truth.
+          
+          **The Impossible Choice:** The person following the map is Eugene's daughter, who wants to expose everything and burn the town's lies to the ground. She blames these secrets—and the people who keep them—for turning Shepherdstown into a hollowed-out shell. Nora must decide: help her complete her father's final work and reveal every terrible truth, or destroy the map and let sleeping sins lie. The case becomes about whether a detective's job is to uncover the truth or to protect the living from what the truth will do to them.
+          
+          **What Makes It Unique:** 
+          - No traditional villain—just a town full of people who made choices
+          - The "victim" weaponized his own death
+          ... (5 more lines)
+          Input=1,328, Prompt cached used= False | Output=688 | Messages=4
+          ==============================================
+          Turn 3:
+          
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # Nora Alike: Character Deep Dive
+          
+          ## Background
+          
+          **Age:** 24, but people consistently guess younger—which she hates. She's been fighting to be taken seriously her entire life.
+          
+          **Origin:** Grew up in a suburb of a mid-sized city, the youngest of four siblings by seven years (she was the "accident baby"). Her parents were older, tired, and emotionally checked out by the time she came along. Her siblings had already left home, so she essentially raised herself in a house that felt like a museum to other people's childhoods.
+          
+          **The Formative Incident:** When Nora was sixteen, her oldest brother Martin disappeared. Just stopped coming to family dinners, didn't return calls. Her parents were worried but passive—"He's an adult, he'll reach out when he's ready." After three weeks of everyone just *waiting*, Nora took the bus to his apartment herself. She found him in the middle of a breakdown, his apartment filthy, convinced he'd ruined his life after losing his job. Her parents had known something was wrong but didn't want to pry, didn't want to intrude. Their politeness, their respect for privacy, almost killed him.
+          
+          That's when Nora learned: sometimes the most destructive thing you can do is mind your own business.
+          
+          ## Why She Became a Detective
+          
+          **Surface Reason:** She tells people she's interested in justice, in puzzles, in helping people. Standard cop interview answers.
+          ... (44 more lines)
+          Input=2,039, Prompt cached used= False | Output=1,350 | Messages=6
+          ==============================================
+          Turn 4:
+          
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # The Cartographer's Cipher: Detailed Outline
+          
+          ## ACT ONE: ARRIVAL AND DISCOVERY
+          
+          ### Chapter 1: The Smallest Case
+          - **Setting:** Nora arrives in Shepherdstown on a gray October morning. Description of the town—half the storefronts empty, population aging, mines closed for twenty years.
+          - **Key Event:** She meets Sheriff Dale Kovach (58, gruff, counting days to retirement). He makes it clear this is a nothing case—old man had a heart attack, daughter is in denial, just close it out.
+          - **Character Moment:** Nora's crappy motel room. She can't sleep. Lies awake analyzing the case file that shouldn't be a case.
+          - **Setup:** Brief flashback to why she's here—her last case, pushing too hard, burning bridges.
+          
+          ### Chapter 2: The Workshop
+          - **Setting:** Eugene Fisk's workshop—a converted garage behind his house, packed with decades of maps, surveying equipment, and obsessive documentation.
+          - **Key Event:** Nora meets Eugene's daughter, **Caroline Fisk** (44, librarian, wire-thin with grief and rage). Caroline shows her the map—a massive 8x10 foot rendering of Shepherdstown with bizarre alterations.
+          - **The Map:** Buildings that don't exist (a church that burned down in 1963, a restaurant that was never built). Streets rerouted. And 63 red X's scattered across the town.
+          - **Caroline's Plea:** "My father didn't have heart attacks. He had *purpose*. Someone scared him to death."
+          ... (113 more lines)
+          Input=3,411, Prompt cached used= False | Output=3,500 | Messages=8
+          ==============================================
+          Turn 5:
+          
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # Chapter 1: The Smallest Case
+          
+          The thing about dying towns is they all smell the same—wet cardboard and regret.
+          
+          Nora Alike noticed it the moment she stepped out of her Honda Civic into the gray October drizzle. Shepherdstown, Pennsylvania stretched out before her like a body someone had forgotten to bury. Half the storefronts on Main Street were empty, their windows either papered over with faded FOR LEASE signs or simply left naked to expose the gutted interiors. A diner called Mabel's still had its lights on, and a Dollar General looked operational, but that was about it for signs of life.
+          
+          Population 847, the sign at the town limits had declared. Nora suspected that number was aspirational.
+          
+          She locked her car—force of habit, though she doubted anyone here had the energy for theft—and crossed the street toward the sheriff's office. The building was brick, squat, probably built in the sixties when the coal mine was still running and the town still had a future. A rusted flagpole stood out front, the rope slapping against the metal in the wind like a half-hearted call for help.
+          
+          *This is what you get*, Nora told herself. *This is what happens when you push too hard.*
+          
+          Three months ago, she'd been working property crimes in Harrisburg. Not glamorous, but it was something. Then she'd caught a case involving a city councilman's nephew and a string of car thefts. Everyone told her to go easy, that the kid was "going through something," that his uncle was "a good man who'd served the community for twenty years." Nora had arrested him anyway, because the evidence was there and because she couldn't understand why everyone wanted her to look the other way.
+          
+          The charges stuck. The councilman made calls. Nora's lieutenant made it clear that while he couldn't fire her, he could make her life miserable. Two weeks later, she'd accepted a "transfer opportunity" to Shepherdstown, where the entire sheriff's department consisted of three people and the most exciting case in recent memory was someone stealing catalytic converters from the Dollar General parking lot.
+          ... (98 more lines)
+          Input=6,947, Prompt cached used= False | Output=2,670 | Messages=10
+          ==============================================
+          Turn 6:
+          
+          
+          User: Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?
+          
+          Assistant: 
+          # Chapter 2: The Workshop
+          
+          Caroline Fisk's house was a narrow two-story Victorian on Elm Street, painted a faded yellow that had probably been cheerful once. The detached garage sat thirty feet behind it, connected by a flagstone path that cut through what had been a carefully tended garden. Past tense, because the garden was dying now—rosebushes gone leggy and wild, perennials choked with weeds, a birdbath tipped over on its side.
+          
+          Nora parked on the street and walked up the front path. Before she could knock, the door opened.
+          
+          Caroline Fisk was forty-four but looked older, the way grief ages people in fast-forward. Thin to the point of frailty, with graying brown hair pulled back in a hasty ponytail. She wore jeans and an oversized cardigan that might have been her father's. Her eyes were red-rimmed but sharp, evaluating Nora with the focused intensity of someone who'd cried herself out and moved on to anger.
+          
+          "You're the detective." Not a question.
+          
+          "Detective Alike. I'm sorry for your loss, Ms. Fisk."
+          
+          "Are you?" Caroline stepped aside to let Nora in. "Or are you here to tell me I'm a hysterical woman who can't accept that her father died of natural causes?"
+          
+          "I'm here to listen."
+          ... (161 more lines)
+          Input=9,641, Prompt cached used= True | Output=3,500 | Messages=12
+  markdown cell:
+    source:
+      This is a long conversation with several turns. One thing to notice:
+      
+      Prompt caching: The input tokens eventually grew to the point where prompt caching was used (turn 6). This helps reduce cost and latency as these conversations grow!
+  markdown cell:
+    source:
+      On the next turn, we are going to hit our 10K context window limit, which triggers compaction:
+  code cell:
+    source:
+      response, usage = session.chat("Propose a title for the book")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          🧹 Context window at 13141 tokens. Limit exceeded, compacting session memory...
+          
+          ------------------------------------------------------------
+          📝 New session memory created.
+          ✅ Tokens reduced: 13,141 → 1559 (11,582 tokens saved, 88% reduction)
+          ⏱️ Compaction time: 36.13s (user waiting...)
+           Cache used: True
+          ------------------------------------------------------------
+          
+          User: Propose a title for the book
+          
+          Assistant: 
+          Looking at the story we've developed, I'd propose:
+          
+          **"The Cartographer's Confession"**
+          
+          Here's why this works:
+          
+          **Thematic Resonance:**
+          - The double meaning captures Eugene's dual role: he kept confessions *and* his final map is itself a confession
+          - "Cartographer" immediately signals the unique hook of your premise
+          - "Confession" ties to the central tension between exposure and privacy
+          
+          **Alternative Titles to Consider:**
+          
+          1. **"Burial Ground"** - More commercial, emphasizes the literal buried evidence and metaphorical buried truths
+          
+          ... (13 more lines)
+          Input=1,840, Prompt cached used= False | Output=325 | Messages=3
+  markdown cell:
+    source:
+      
+      Notice that compacting the conversation took about 36 seconds. Because we used traditional compaction, the user waits while Claude summarizes the conversation, which is not an ideal user experience.
+      
+      Below you can see the result of the compaction. It captures the key elements of the conversation in under 2K tokens (1,559 here).
+  code cell:
+    source:
+      print(session.summary)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ## User Intent
+          User requested: "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider."
+          
+          After rejecting initial 3 plots, user specified: "I don't like those ideas, can you think of one plot something more unique and unexpected?"
+          
+          Accepted "The Cartographer's Cipher" concept. Then requested: character development, detailed outline, and chapter drafts.
+          
+          ## Completed Work
+          
+          **Story Concept Developed:**
+          - Title: "The Cartographer's Cipher"
+          - Premise: Detective investigates death of cartographer who spent final month decoding 40 years of town secrets onto a map with 63 red X's marking buried physical evidence
+          
+          **Character: Detective Nora Alike**
+          - Age: 24, physically small (5'4"), socially awkward, insomniac
+          - Backstory: Youngest of 4 siblings by 7 years, essentially raised herself. At 16, "rescued" brother Martin from breakdown—he resents the intrusion
+          - Transferred to Shepherdstown from Harrisburg Property Crimes after arresting city councilman's nephew, making powerful enemies
+          - Fatal flaw: Believes exposure always equals healing; pathological need to know; invasive about others' secrets, intensely private about her own
+          - Character arc: Must learn some truths cause more harm than good
+          
+          **Setting: Shepherdstown, PA**
+          - Population: 847 (declining)
+          - Coal mine closed 1998, town dying since
+          - Key locations: Sheriff's office, Fisk house/workshop, abandoned mine, Motor Lodge
+          
+          **Supporting Characters:**
+          - Eugene Fisk: 79, cartographer, died 4 days before story opens. Stage 4 pancreatic cancer. Spent 40 years as town's "confessor"
+          - Caroline Fisk: 44, librarian, Eugene's daughter, wants to expose all secrets
+          - Sheriff Dale Kovach: 58, wants case closed quickly
+          - Deputy Marcus Webb: 31, local, volunteers to help (his father's secret on the map)
+          - Helen Morrison: 73, visited Eugene asking him to "mark" something, he refused
+          
+          **17-Chapter Outline Created:**
+          - Act 1 (Chapters 1-5): Nora arrives, discovers map, investigates first X's revealing buried secrets (Hendersons' mortgage, mayor's affair, fire confession, etc.)
+          - Act 2 (Chapters 6-11): Someone else following map, threatens Nora. Interviews living victims who beg her not to expose secrets. Discovers mine embezzlement cover-up
+          - Act 3 (Chapters 12-17): Town meeting confrontation, Nora discovers Mayor Ortiz's father covered up mine safety violations. Phone call from brother Martin reveals he resents her "rescue." Nora compromises: exposes mine cover-up (affects everyone), buries personal secrets. Arrests Mayor and father. Ambiguous ending about whether truth serves justice
+          
+          **Key Plot Points:**
+          - Mine closed due to covered-up safety violations, not economics
+          - Vernon Pike (union treasurer) embezzled funds as scapegoat at Donald Mercer's direction
+          - Donald Mercer is Mayor Linda Ortiz's father
+          - Eugene's final journal: "I've mapped every lie, every buried truth... Maybe the only cure is exposure. Or maybe exposure is just another kind of death."
+          - Each X marks physical evidence of a secret (cash, photos, confessions, documents)
+          
+          **Chapters Drafted:**
+          - Chapter 1 (~2,000 words): Nora arrives Shepherdstown, meets Sheriff Kovach who dismisses case, assigned to talk to Caroline Fisk
+          - Chapter 2 (continuation): Nora visits Caroline, sees workshop and massive incorrect map with 63 red X's. Caroline explains Eugene's paranoid final month. Journal entry reveals "Morrison girl" visit. Coroner confirms extremely elevated stress hormones. Chapter ends with Nora heading to investigate old mine location with survey map
+          
+          ## Errors & Corrections
+          
+          User rejected first 3 plot concepts as not unique/unexpected enough:
+          1. "The Drowning Season" (memory recovery therapy murders)
+          2. "The Lighthouse Keeper's Daughter" (ritual sacrifices every 5-6 years)
+          3. "The Memory Box Murders" (classmates hunting each other over past crime)
+          
+          User directive: "can you think of one plot something more unique and unexpected?" Led to cartographer concept.
+          
+          ## Active Work
+          
+          Chapter 2 just completed. Ends with:
+          "She headed for the door, the survey map folded in her pocket and Eugene Fisk's final journal entry echoing in her mind: *Maybe the only cure is exposure. Or maybe exposure is just another kind of death.*
+          
+          Outside, the clouds had thickened again, pressing down on Shepherdstown like a shroud. Somewhere in this dying town, someone had scared an old man to death."
+          
+          Nora is heading to investigate the old mine (northern edge of town, surrounded by woods, one overgrown access road). She has Eugene's survey map showing cluster of X's around mine area, dates back to 1998.
+          
+          ## Pending Tasks
+          
+          No explicit requests pending. Story development ongoing—presumably more chapters to draft following the 17-chapter outline structure.
+          
+          ## Key References
+          
+          **Timeline:**
+          - 1998: Mine closes (covered-up safety violations)
+          - 2003: Martha Fisk (Eugene's wife) dies of cancer
+          - 5 weeks before present: Eugene starts creating "corrected" map
+          - 2 weeks before present: Caroline finds Eugene shaking, says "should have left them buried"
+          - 1 week before present: Helen Morrison visits, Eugene refuses to mark something
+          - 4 days before present: Eugene found dead with extremely elevated cortisol levels
+          
+          **Map Details:**
+          - 8ft x 10ft, mounted on foam board with acetate cover
+          - Shows Shepherdstown with deliberate "errors": church on Third & Maple (doesn't exist), Giovanni's restaurant on Main, rerouted streets
+          - 63+ red X's scattered across town
+          - Each X marks buried physical evidence of a secret
+          - Survey map subset focuses on mine area with names/dates/timeline from 1998
+          
+          **Character Relationships:**
+          - Nora/Martin (brother): She "saved" him 8 years ago when he was 23 and having breakdown; he resents the public humiliation; they barely speak now
+          - Eugene/Caroline: She cared for him through cancer; he left her the decoded map knowing she'd find it
+          - Eugene/townspeople: He was unofficial confessor for 40 years; people trusted him to keep secrets safe
+  markdown cell:
+    source:
+      ## Instant Compaction
+      
+      With **instant compaction**, the session memory is generated proactively, in the background, once a soft token threshold is reached.
+      
+      When the user triggers a compaction or the hard context limit is reached, the summary is already available, so the user doesn't need to wait.
+      
+      Result: Instant compaction, no waiting.
+  markdown cell:
+    source:
+      
+      SESSION MEMORY COMPACTION (instant)
+      ```
+      ────────────────────────────────────
+      Turn 1 → Turn 2 → ... → Turn K → Turn K+1 → ... → Turn N → ..  → CONTEXT FULL!
+                                  │                         │            │
+                      (soft token threshold met:        (update          │
+                     initialize session memory)          trigger)        │
+                                  │                                      │
+                                  │                         │            │
+                                  ▼                         ▼            │
+                             ┌────────┐                ┌────────┐        │
+                             │ Create │                │ Update │        │
+                             │ memory │ (background)   │ memory │        │
+                             └────────┘                └────────┘        │
+                                  │                         │            │
+                                  ▼                         ▼            ▼
+                           📝 session-memory.md ──────────────────► INSTANT SWAP!
+                             (continuously updated)
+      ```
+      
+      **Update triggers:** The first summary is generated once the initial soft token threshold is reached. Updates can then be triggered after every subsequent turn, or periodically at natural breakpoints (e.g. every ~10k tokens or 3+ tool calls).
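As a rough illustration of the update-trigger rule above, the decision can be written as a small pure function. This is a sketch under stated assumptions, not the notebook's implementation: `MemoryState` and `next_action` are hypothetical names, and the default thresholds simply mirror the demo defaults (7,500 tokens to initialize, 2,000 tokens between updates).

```python
from dataclasses import dataclass


@dataclass
class MemoryState:
    """Hypothetical bookkeeping for one session (names are illustrative)."""
    has_memory: bool = False
    tokens_at_last_update: int = 0


def next_action(state: MemoryState, context_tokens: int,
                soft_threshold: int = 7500, update_interval: int = 2000) -> str:
    """Return "init", "update", or "none" for the current turn."""
    if not state.has_memory:
        # No memory yet: wait for the soft threshold before the first summary.
        return "init" if context_tokens >= soft_threshold else "none"
    # Memory exists: refresh only after enough new tokens have accumulated.
    if context_tokens - state.tokens_at_last_update >= update_interval:
        return "update"
    return "none"


state = MemoryState()
for tokens in (3000, 8000, 9000, 10500):
    action = next_action(state, tokens)
    if action != "none":
        state.has_memory = True
        state.tokens_at_last_update = tokens
    print(tokens, action)
```

With these assumed thresholds, the loop reports no action at 3,000 tokens, an initial summary at 8,000, no action at 9,000, and an update at 10,500.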
+  markdown cell:
+    source:
+      This `InstantCompactingChatSession` class uses **threading** for background execution:
+      1. **`threading.Thread`** - runs memory updates in background without blocking
+      2. **Thread-safe state** - uses `threading.Lock` to safely update shared memory
+      3. **Daemon threads** - background work doesn't prevent program exit
+      4. **Instant compaction** - when context is full, just swap in the pre-built memory
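The four points above can be tried in isolation, without any API calls. The sketch below is illustrative only: `BackgroundSummary` and `fake_summarize` are hypothetical stand-ins (a short sleep simulates the model call), and it shows the general pattern of a daemon thread, a snapshot of the messages, and a lock around the shared summary.

```python
import threading
import time


class BackgroundSummary:
    """Illustrative only: keeps a summary fresh on a daemon thread."""

    def __init__(self, summarize):
        self._summarize = summarize          # stand-in for the model call
        self._lock = threading.Lock()
        self._thread = None
        self.summary = None

    def refresh(self, messages):
        """Start a background refresh unless one is already running."""
        if self._thread is not None and self._thread.is_alive():
            return False
        snapshot = list(messages)            # copy so later appends don't race
        self._thread = threading.Thread(
            target=self._run, args=(snapshot,), daemon=True
        )
        self._thread.start()
        return True

    def _run(self, snapshot):
        result = self._summarize(snapshot)   # slow work happens outside the lock
        with self._lock:
            self.summary = result

    def get(self, timeout=None):
        """Return the latest summary, optionally waiting for a running refresh."""
        if self._thread is not None:
            self._thread.join(timeout)
        with self._lock:
            return self.summary


def fake_summarize(messages):
    time.sleep(0.1)                          # simulate API latency
    return f"{len(messages)} messages summarized"


memory = BackgroundSummary(fake_summarize)
memory.refresh(["hi", "hello", "write a story"])
print(memory.get(timeout=5))
```

Holding the lock only while reading or writing `summary`, and not during the slow call, keeps the foreground chat loop from blocking on the background work.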
+  code cell:
+    source:
+      import threading
+      import time
+      
+      
+      class InstantCompactingChatSession:
+          """
+          Maintains session memory via incremental background updates.
+          
+          Key insight: By updating memory in the background after each turn,
+          the summary is already ready when compaction is needed - instant swap!
+          """
+      
+          def __init__(
+              self,
+              system_message="You are a helpful assistant",
+              context_limit: int = 12000,
+              min_tokens_to_init: int = 7500,
+              min_tokens_between_updates: int = 2000,
+          ):
+              # Thresholds
+              self.context_limit = context_limit # the point at which the conversation is compacted so it does not exceed model limits
+              self.min_tokens_to_init = min_tokens_to_init # tokens needed to trigger initial memory creation; note this happens PROACTIVELY in background unlike traditional compaction
+              self.min_tokens_between_updates = min_tokens_between_updates # tokens needed to trigger memory update. only comes into play after initial memory is created and additional compaction (memory update) is needed after that
+      
+              # Conversation state
+              self.system_message = system_message
+              self.messages = []
+              self.current_context_window_tokens = 0
+      
+              # Session memory state
+              self.session_memory = None # this is the compacted conversation in session memory; for the demo we are storing this in memory, but in production you would write to session_memory.md file
+              self.last_summarized_index = 0 # The index of the last message included in the session memory
+              self.tokens_at_last_update = 0 # To track tokens at last memory update and see if enough new tokens have been added to trigger another update
+      
+              # Background update tracking
+              self._update_thread: threading.Thread | None = None
+              self.last_update_time = None
+              self._lock = threading.Lock()
+      
+          def chat(self, user_message: str):
+              """Process a chat turn with background session memory updates."""
+      
+              if self.current_context_window_tokens + estimate_tokens(user_message) >= self.context_limit:
+                  self.compact() # note that when this is triggered, the compaction has already been created and is just swapped in instantly
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=self.system_message,
+                  messages=add_cache_control(self.messages),
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              # Calculate token usage including cache
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_read
+              
+              # Update context window tokens (includes cached tokens since they still count toward context)
+              self.current_context_window_tokens = total_input + response.usage.output_tokens
+      
+              # KEY DIFFERENCE: Trigger background memory update if needed proactively, before compaction is needed
+              background_status = None
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  background_status = "initializing" if self.session_memory is None else "updating"
+      
+              # Return usage info with cache stats
+              return assistant_message, response.usage, background_status
+          
+          # Helper methods to determine when to init session memory
+          def _should_init_memory(self) -> bool:
+              return (
+                  self.session_memory is None
+                  and self.current_context_window_tokens >= self.min_tokens_to_init
+              )
+      
+          # Helper method to determine if memory should be updated
+          def _should_update_memory(self) -> bool:
+              if self.session_memory is None:
+                  return False
+              tokens_since = self.current_context_window_tokens - self.tokens_at_last_update
+              return tokens_since >= self.min_tokens_between_updates
+      
+          # Methods to create initial session memory
+          def _create_session_memory(self, messages: list[dict]) -> str:
+              """Generate initial session memory from messages."""
+              # Put compaction instructions in user message to share cache with main chat
+              compaction_messages = [{"role": "user", "content": SESSION_MEMORY_PROMPT}]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,  # Same as main chat for cache sharing
+                  messages=add_cache_control(messages) + compaction_messages
+              )
+              summary, _ = remove_thinking_blocks(response.content[0].text)  # clean up any <think> blocks because they are not needed in the session memory
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0  # the attribute can be None, so coerce before comparing
+              print(f"   [Background] Initial session memory created. Cache hit={cache_read > 0}")
+              return summary
+      
+          def _update_session_memory(self, new_messages: list[dict]) -> str:
+              """Update existing session memory with new messages. In practice, you may want to do this via file edit rather than full re-generation. But for demo purposes we do full regeneration here."""
+              # Put compaction instructions in user message to share cache with main chat
+              compaction_update_messages = [{"role": "user", "content": SESSION_MEMORY_PROMPT + f"""\n\nThere is an existing session memory: {self.session_memory}. Return the entire session memory with updates to reflect new messages."""}]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,
+                  messages=new_messages + compaction_update_messages # you may want to use prompt caching instead, in which case you'd use add_cache_control(self.messages) here
+              )
+              updated_summary, _ = remove_thinking_blocks(response.content[0].text)  # clean up any <think> blocks because they are not needed in the session memory
+              print("   [Background] Session memory updated.")
+              return updated_summary
+      
+          # Background memory update methods
+          def _background_memory_update(
+              self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
+          ):
+              """Run session memory update in a background thread."""
+              try:
+                  with self._lock:
+                      current_session_memory = self.session_memory
+                      last_index = self.last_summarized_index
+      
+                  if current_session_memory is None:
+                      new_memory = self._create_session_memory(messages_snapshot)
+                  else:
+                      # Get new messages since last summary
+                      new_messages = messages_snapshot[last_index :]
+                      if not new_messages:
+                          return
+                      new_memory = self._update_session_memory(new_messages)
+      
+                  # Update state (thread-safe)
+                  with self._lock:
+                      self.session_memory = new_memory
+                      self.last_summarized_index = snapshot_index
+                      self.tokens_at_last_update = current_tokens
+                      self.last_update_time = time.time()
+      
+              except Exception as e:
+                  print(f"   [Background] Error updating memory: {e}")
+      
+          # This makes sure only one background update runs at a time. If one is already running, we skip starting another. If not, we start a new thread to do the update.
+          def _trigger_background_update(self):
+              """Trigger a background session memory update."""
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  return
+      
+              messages_snapshot = self.messages.copy()
+              snapshot_index = len(messages_snapshot)
+              current_tokens = self.current_context_window_tokens
+      
+              self._update_thread = threading.Thread(
+                  target=self._background_memory_update,
+                  args=(messages_snapshot, snapshot_index, current_tokens),
+                  daemon=True,
+              )
+              self._update_thread.start()
+      
+          # Function to compact
+          def compact(self):
+              """INSTANT compaction using pre-built session memory."""
+              prev_msg_count = len(self.messages)
+      
+              # Ensure session memory is ready. Shouldn't be an issue normally, but here for safety.
+              if self.session_memory is None:
+                  if self._update_thread is not None and self._update_thread.is_alive():
+                      print("   ⏳ Waiting for background memory update...")
+                      self._update_thread.join(timeout=30.0)
+      
+                  if self.session_memory is None:
+                      print("   ⚠️  No pre-built memory, creating synchronously...")
+                      start = time.perf_counter()
+                      self.session_memory = self._create_session_memory(self.messages)
+                      elapsed = time.perf_counter() - start
+                      print(f"   ⏱️  Took {elapsed:.2f}s (but should be instant normally!)")
+                      self.last_summarized_index = len(self.messages)
+      
+              with self._lock:
+                  unsummarized = self.messages[self.last_summarized_index :]
+                  summary_message = [{
+                      "role": "user",
+                      "content": f"""This session is being continued from a previous conversation. Here is the session memory: {self.session_memory}. Continue from where we left off."""
+                  }]
+                  self.messages = summary_message + unsummarized
+                  self.last_summarized_index = 1
+      
+                  print(f"\n{'=' * 60}")
+                  print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
+                  print("   Session memory was pre-built (no wait time!)")
+                  print(f"{'=' * 60}")
+  markdown cell:
+    source:
+      ### Example use of Instant Compaction
+  code cell:
+    source:
+      # Low thresholds for demo - in production you'd use higher values
+      session = InstantCompactingChatSession(
+          system_message=SYSTEM_PROMPT,
+      )
+      
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
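+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.",
+          # Note: there is no comma after the next string, so Python joins the last two prompts into a single turn (Turn 6 in the recorded output).
+          "Can you draft a second chapter that builds on the first one?"
+          "Can you revise that second chapter, make it more suspenseful and engaging?"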
+      ]
+      print("Starting conversation with instant compacting chat session...\n")
+      
+      turn_count = 0
+      for i, message in enumerate(messages, 1):
+          response, usage, background_status = session.chat(message)
+          turn_count += 1
+          
+          # Calculate cache stats
+          cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+          cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
+          total_input = usage.input_tokens + cache_read
+          
+          print(f"{'='*60}")
+          print(f"Turn {turn_count}:")
+          print(f"\nUser: {message}")
+          print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+          print(f"\nToken Usage:")
+          print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+          print(f"  Output: {usage.output_tokens:,}")
+          print(f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}")
+          
+          if cache_read > 0:
+              cache_pct = (cache_read / total_input) * 100
+              print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+          
+          if background_status:
+              print(f"\n  [Background] Proactively {background_status} session memory...")
+              print(f"  Context window: {session.current_context_window_tokens:,} tokens")
+          
+          print()
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with instant compacting chat session...
+          
+          ============================================================
+          Turn 1:
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Detective Story Concepts
+          
+          ## 1. **The Vanishing Act**
+          ... (30 more lines)
+          
+          Token Usage:
+            Input: 318 (new: 318, cached: 0)
+            Output: 762
+            Messages: 2 | Memory: not yet
+          
+          ============================================================
+          Turn 2:
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # **The Cartographer's Daughter**
+          
+          **Premise:** Nora Fields (23) returns to her dying hometown of Millbrook after her cartographer father's sudden death. The town is literally disappearing—not metaphorically, but *actually*. Buildings that existed last month are gone. Streets lead to nowhere. Residents have conflicting memories about what was where. The population has dropped from 2,000 to 300 in three years, but no one can agree on who left or why.
+          ... (20 more lines)
+          
+          Token Usage:
+            Input: 1,102 (new: 1,102, cached: 0)
+            Output: 636
+            Messages: 4 | Memory: not yet
+          
+          ============================================================
+          Turn 3:
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # Nora Fields - Character Development
+          
+          ## Background
+          ... (67 more lines)
+          
+          Token Usage:
+            Input: 1,760 (new: 1,760, cached: 0)
+            Output: 1,214
+            Messages: 6 | Memory: not yet
+          
+          ============================================================
+          Turn 4:
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # **The Cartographer's Daughter - Story Outline**
+          
+          ## ACT ONE: The Return (Chapters 1-5)
+          ... (175 more lines)
+          
+          Token Usage:
+            Input: 2,996 (new: 2,996, cached: 0)
+            Output: 3,500
+            Messages: 8 | Memory: not yet
+          
+          ============================================================
+          Turn 5:
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # Chapter 1: Homecoming
+          
+          The town looked smaller than Nora remembered, which was exactly what everyone said about their hometowns. What they didn't mention was the peculiar quality of that smallness—not just diminished by time and perspective, but somehow *compressed*, as if Millbrook had been folded in on itself while she was away.
+          ... (168 more lines)
+          
+          Token Usage:
+            Input: 6,532 (new: 6,532, cached: 0)
+            Output: 3,077
+            Messages: 10 | Memory: not yet
+          
+            [Background] Proactively initializing session memory...
+            Context window: 9,609 tokens
+          
+             [Background] Initial session memory created. Cache hit=True
+          ============================================================
+          Turn 6:
+          
+          User: Can you draft a second chapter that builds on the first one?Can you revise that second chapter, make it more suspenseful and engaging?
+          
+          Assistant: 
+          # Chapter 2: The Map That Shouldn't Exist
+          
+          The funeral was smaller than Nora expected, which was saying something—she'd expected it to be small.
+          ... (214 more lines)
+          
+          Token Usage:
+            Input: 9,642 (new: 5,546, cached: 4,096)
+            Output: 3,500
+            Messages: 12 | Memory: ready
+            ✓ Cache hit! 42% of input from cache
+          
+            [Background] Proactively updating session memory...
+            Context window: 13,142 tokens
+          
+      output 1:
+        output_type: stream
+        name: stdout
+        text:
+             [Background] Session memory updated.
+  code cell:
+    source:
+      message = "What did we just talk about? Give me one sentence"
+      response, usage, background_status = session.chat(message)
+      
+      # Calculate cache stats
+      cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+      total_input = usage.input_tokens + cache_read
+      
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print(f"\nToken Usage:")
+      print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+      print(f"  Output: {usage.output_tokens:,}")
+      print(f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}")
+      
+      if cache_read > 0:
+          cache_pct = (cache_read / total_input) * 100
+          print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 12 → 1
+             Session memory was pre-built (no wait time!)
+          ============================================================
+          
+          User: What did we just talk about? Give me one sentence
+          
+          Assistant: 
+          We had just finished drafting Chapter 2 (the funeral and evidence discovery), and you requested that I revise it to make it more suspenseful and engaging—which I hadn't completed yet before the conversation ended.
+          
+          Would you like me to provide that revised, more suspenseful version of Chapter 2 now?
+          
+          Token Usage:
+            Input: 2,276 (new: 2,276, cached: 0)
+            Output: 71
+            Messages: 3 | Memory: ready
+  code cell:
+    source:
+      for message in session.messages:
+          print(message)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          {'role': 'user', 'content': 'This session is being continued from a previous conversation. Here is the session memory: ## User Intent\nOriginal request: "Can you draft a second chapter that builds on the first one? Can you revise that second chapter, make it more suspenseful and engaging?"\n\nTwo-part request: (1) Draft Chapter 2, (2) Revise for suspense and engagement. Only part 1 completed before conversation ended.\n\n## Completed Work\n\n**Chapter 2 Drafted (~2,500 words):**\n\nScene structure:\n1. Father\'s funeral at First Methodist (30 people, Pastor Williams, "carefully neutral" eulogy)\n2. Mrs. Eleanor Kowalski approaches Nora after service\n3. Nora drives to Fletcher Street to investigate empty lot\n4. Returns to father\'s study, searches desk\n5. Discovers evidence folder and photographs\n6. Map writes itself with new message\n7. Cliffhanger: mysterious phone call\n\n**Key reveals in Chapter 2:**\n- Mrs. Kowalski subplot: Father showed her photographs of "Thomas Kowalski" (her son), but she never had children. Father insisted she was "forgetting." He told her Nora "had it too—the sight" and could "see what was really there."\n- Evidence folder labeled "Discrepancies—Evidence" contains:\n  - Photo of Mrs. Kowalski with 12-year-old boy (Thomas), notation: "Erased April 14, 2022"\n  - Photo of "Riverside Diner" on Main Street, notation: "Established 1965. 
Erased June 2023"\n  - Photo of "Willow Lane" street sign, notation: "Erased September 2023"\n  - Newspaper clipping: "40 Miners Killed in Millbrook Mine Collapse—1973"\n  - Father circled line: "Survivors reported hearing unusual sounds from the deep shaft before the collapse"\n  - Father\'s note below: "The beginning."\n\n- **47 total erasures documented:**\n  - 23 buildings\n  - 8 streets\n  - 16 people\n  - Dates: 1974-2024\n  - Pattern accelerating: earliest entries years apart, most recent weeks apart\n\n- **David Chen connection:** 1973 newspaper lists mine collapse survivors including "David Chen, age 24" (Marcus\'s father)\n\n- **Supernatural escalation:** Map writes itself AFTER father\'s death:\n  - New marks appear in fresh ink (still wet)\n  - Father\'s handwriting: "Check Fletcher Street. Tuesday\'s version. The truth is underneath. —Dad"\n  - Date: today\'s date\n  - Previously invisible pencil mark near Nora\'s house: "Nora—when you\'re ready."\n\n- **Fletcher Street scene:** Nora experiences brief "shimmer" (heat haze effect), sees "ghost of walls, translucent and wrong" for fraction of second—first manifestation of "the sight"\n\n- **Cliffhanger ending:** Unknown Number calls. Beneath static: rhythmic sound "like machinery. Or digging. Or something moving in a deep, enclosed space." Chapter ends mid-dialogue: "Who is this?" Nora—" [cut off]\n\n**Chapter 2 narrative elements:**\n- Guilt motif: "She should cry. Daughters cried at their fathers\' funerals. But her eyes stayed dry, and she wondered if that made her a monster."\n- Found half-written letter from father (3 months ago): "Dear Nora, I know you think I\'m losing my mind. Maybe I am. But I need you to understand—" [unfinished]\n- Nora documents everything systematically (spreadsheet of 47 items)\n- Texts from Marcus and Simone (outside world pulling at her)\n- Desk lamp flickers twice (supernatural signal)\n\n## Errors & Corrections\nNone. 
User did not provide corrections or redirections during Chapter 2 draft.\n\n## Active Work\n**INCOMPLETE:** User requested revision of Chapter 2 to "make it more suspenseful and engaging."\n\nDraft provided but revision not started. User\'s second request in same message was not fulfilled before conversation ended.\n\nChapter 2 draft ends with phone call cliffhanger. Last line: "Who is this?" Nora—" [deliberately cut off mid-word for suspense]\n\n## Pending Tasks\n\n**Explicitly Requested - PRIORITY:**\n- Revise Chapter 2 for increased suspense and engagement (user\'s second request, unfulfilled)\n\n**Previously Requested:**\n- Draft remaining chapters 3-17 based on outline\n\n**Likely Next Steps:**\n- User may want to approve revised Chapter 2 before proceeding\n- May request specific chapters drafted\n- May request further revisions to outline or Chapter 1\n- May request character development for supporting cast\n- May request world-building details\n\n## Key References\n\n**Story Concept: "The Cartographer\'s Daughter"**\n- Protagonist: Nora Fields, 23, urban planning/historic preservation graduate\n- Setting: Millbrook - town literally being erased from existence\n- Core mystery: 1973 mine collapse, 40 miners as "anchors," town made deal with entity, erasures accelerating\n- Map writes itself in dead father\'s handwriting, shows "true" version of town\n\n**Characters:**\n- Henry Fields (cartographer, mapkeeper, recently died)\n- Sarah Fields (mother, erased when Nora was 14)\n- Deputy Marcus Chen, 28 (father David was mine survivor)\n- Eleanor Kowalski (elderly neighbor, son Thomas erased)\n- Walter Bishop, 92 (oldest resident, knows truth)\n- Five protected families: Dawes, Chen, Porter, Blackwood, Marsh\n\n**17-Chapter Outline:**\n- Act One (Ch 1-5): Return, discovery, initial investigation\n- Act Two (Ch 6-12): Mapkeeper legacy, pattern recognition, Nora begins being erased\n- Act Three (Ch 13-17): Mine confrontation, final choice\n\n**Chapter 1 Summary 
(2,000 words):**\nNora returns to Millbrook after father\'s death, discovers study covered in maps, finds master map showing different Millbrook (post office on Fletcher vs Randolph), receives father\'s effects from Marcus Chen, discovers wedding photo proving map shows truth.\n\n**Core Mechanics:**\n- Erasures spiral from sealed mine (center point)\n- Started 1973, accelerating (years → months → weeks)\n- Five families protected from erasure\n- Mapkeeper family (Fields) remembers because grandfather refused deal\n- Map shows "true" Millbrook, writes itself with updates\n- "The sight": ability to see through erasures (double vision)\n\n**Critical Locations:**\n- 47 Maple Street (father\'s house)\n- 47 Fletcher Street (post office location - now empty lot)\n- Millbrook mine (sealed 1973)\n- Storage unit Highway 9 (father\'s records)\n- Deep shaft (entity location, 40 miners in circle)\n\n**Timeline:**\n- 1973: Mine collapse, deal made\n- ~9 years ago: Mother erased (Nora age 14)\n- 5 years ago: Nora left for Boston\n- Present: October, father died, Chapter 1 = funeral next day, Chapter 2 = funeral day\n\n**Evidence/Artifacts:**\n- Self-writing map in father\'s study\n- Evidence folder: 47 documented erasures (23 buildings, 8 streets, 16 people)\n- Photographs of erased people/places\n- Mine collapse newspaper (40 dead, survivors listed)\n- Wedding photo (Fletcher Street proof)\n- Father\'s video recordings (insurance)\n- Storage unit key from Walter Bishop\n- Half-written letter to Nora (3 months old, unfinished)\n\n**New Chapter 2 Details:**\n- "The sight" = father\'s term for ability to see true reality\n- Father told Mrs. Kowalski that Nora "had it too"\n- Thomas Kowalski: boy ~12, erased April 14, 2022\n- Riverside Diner: Main Street, established 1965, erased June 2023\n- Willow Lane: connecting street, erased September 2023\n- Map notation: "Check Fletcher Street. Tuesday\'s version. 
The truth is underneath."\n- Hidden pencil mark: "Nora—when you\'re ready" (near her house on map)\n- Mysterious phone call sounds: rhythmic, like machinery/digging/something moving in deep enclosed space.Continue from where we left off.'}
+          {'role': 'user', 'content': 'What did we just talk about? Give me one sentence'}
+          {'role': 'assistant', 'content': "We had just finished drafting Chapter 2 (the funeral and evidence discovery), and you requested that I revise it to make it more suspenseful and engaging—which I hadn't completed yet before the conversation ended.\n\nWould you like me to provide that revised, more suspenseful version of Chapter 2 now?"}
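
The message list printed above is the result of replacing the summarized prefix with a single memory message and keeping any unsummarized tail. A minimal sketch of that splice as a pure function, assuming plain dict messages; `compact_messages` is a hypothetical name and not part of the notebook:

```python
def compact_messages(messages, session_memory, last_summarized_index):
    """Return (new_messages, new_index) after replacing the summarized prefix.

    Hypothetical helper for illustration; the notebook does this inside
    InstantCompactingChatSession.compact() while holding a lock.
    """
    # Everything at or after this index was not covered by the summary.
    unsummarized = messages[last_summarized_index:]
    summary = {
        "role": "user",
        "content": (
            "This session is being continued from a previous conversation. "
            f"Here is the session memory: {session_memory}. "
            "Continue from where we left off."
        ),
    }
    # The summary now occupies index 0, so the summarized prefix has length 1.
    return [summary] + unsummarized, 1


history = [
    {"role": "user", "content": "Draft chapter 1"},
    {"role": "assistant", "content": "Chapter 1 ..."},
    {"role": "user", "content": "Draft chapter 2"},
    {"role": "assistant", "content": "Chapter 2 ..."},
]
# Assume the background summary covered only the first two messages.
new_messages, new_index = compact_messages(history, "Chapter 1 drafted", 2)
print(len(history), "->", len(new_messages))
```

Returning a new list rather than mutating in place keeps the function easy to test; the class version assigns the result back to `self.messages` under the lock.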
+  markdown cell:
+    source:
+      ## Advanced: Understanding Prompt Caching
+  markdown cell:
+    source:
+      
+      The background updates can be made **~5x cheaper** (about 80% less) by using prompt caching. The trick:
+      1. Pass the **full conversation** to the background summarizer
+      2. Add `cache_control` markers so subsequent requests hit the cache
+      3. Only the new "summarize this" instruction is billed at full price
+      
+      ```
+      ┌─────────────────────────────────────────────────────────────────────────────────┐
+      │                    PROMPT CACHING FOR LONG CONVERSATIONS                        │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  WITHOUT CACHING: Pay full price for entire context every turn                 │
+      │  ════════════════════════════════════════════════════════════                   │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]                         →  500 tokens  @ $3/M │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]           → 1500 tokens  @ $3/M │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3]... → 3000 tokens  @ $3/M │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3]... → 5000 tokens  @ $3/M │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: 10,000 tokens = $0.030      │
+      │                                                                                 │
+      │                                                                                 │
+      │  WITH CACHING: Pay full price once, then 90% discount on prefix                │
+      │  ═══════════════════════════════════════════════════════════════                │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]◆                        →  500 tokens  @ $3/M │
+      │                                ▲                            (cache created)    │
+      │                          cache breakpoint                                       │
+      │                                                                                 │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]◆                                │
+      │           ╰─────── cached ──────╯                                              │
+      │                500 @ $0.30/M + 1000 new @ $3/M  =  $0.0032                     │
+      │                                                                                 │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3][Asst3]◆                  │
+      │           ╰──────────── cached ─────────────╯                                  │
+      │               1500 @ $0.30/M + 1500 new @ $3/M  =  $0.0050                     │
+      │                                                                                 │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3][Asst3][User4][Asst4]◆    │
+      │           ╰───────────────────── cached ─────────────────────╯                 │
+      │                     3000 @ $0.30/M + 2000 new @ $3/M  =  $0.0069               │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: $0.0166  (45% savings)     │
+      │                                                                                 │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  COMPACTION + CACHING: Double benefit                                           │
+      │  ════════════════════════════════════                                           │
+      │                                                                                 │
+      │    Main Chat                      Background Summarizer                         │
+      │    ─────────                      ─────────────────────                         │
+      │                                                                                 │
+      │  [Conversation grows...]          [Same conversation prefix]◆ + [Summarize!]   │
+      │         │                                    │                                  │
+      │         │                         Cache hit! Only pays for                      │
+      │         │                         the summarization prompt                      │
+      │         │                                    │                                  │
+      │         ▼                                    ▼                                  │
+      │  Context limit reached  ──────►  Session memory ready instantly                │
+      │                                  (built cheaply in background)                  │
+      │                                                                                 │
+      │  ┌──────────────────────────────────────────────────────────────────────────┐  │
+      │  │  Key insight: The background summarizer reuses the same conversation     │  │
+      │  │  prefix that was just sent to the main chat - automatic cache hit!       │  │
+      │  └──────────────────────────────────────────────────────────────────────────┘  │
+      │                                                                                 │
+      └─────────────────────────────────────────────────────────────────────────────────┘
+      
+      ◆ = cache_control breakpoint (cache everything before this point)
+      ```
+      
+      ### Why this matters for compaction
+      
+      | Scenario | Cost per background update | Notes |
+      |----------|---------------------------|-------|
+      | No caching | Full input cost | 5,000 tokens × $3/M = $0.015 |
+      | With caching | ~20% of input cost | 500 new + 4,500 cached ≈ $0.003 |
+      | **Savings** | **~80%** | Compounds over many updates |
+      
+      The longer the conversation, the bigger the savings—exactly when you need compaction most!
+  markdown cell:
+    source:
+      ### How the Caching Works
+      
+      The key is in `_add_cache_control()` and `_create_session_memory_cached()`:
+      
+      ```python
+      # 1. Mark the last conversation message with cache_control
+      {
+          "role": "user",
+          "content": [{
+              "type": "text",
+              "text": msg["content"],
+              "cache_control": {"type": "ephemeral"}  # <-- This creates a cache breakpoint
+          }]
+      }
+      
+      # 2. Also mark the system prompt
+      system=[{
+          "type": "text",
+          "text": "You are a session memory agent...",
+          "cache_control": {"type": "ephemeral"}
+      }]
+      ```
+      
+      **Why this works:**
+      - The first background update creates a cache entry for `[System + Messages]`
+      - Subsequent updates with the same message prefix get **cache hits**
+      - Only the new summarization instruction is billed at full price
+      - Cache entries have a 5-minute TTL by default, refreshed each time the cached prefix is read, so rapid updates benefit most
+      
+      **Cost math:**
+      - Without caching: 5,000 tokens × $3.00/1M = $0.015 per update
+      - With caching: 500 new tokens × $3.00/1M + 4,500 cached × $0.30/1M = $0.00285
+      - **Savings: ~80%** on background summarization costs
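
As a sanity check on the cost math above, the arithmetic can be reproduced in a few lines. This is a minimal sketch using the cookbook's illustrative prices ($3.00 and $0.30 per million input tokens), not a live price list; the function name `update_cost` is hypothetical, and the sketch ignores output tokens and any premium on the request that first writes the cache.

```python
# Illustrative prices taken from the table above (assumptions, not a live price list).
BASE_INPUT_PER_MTOK = 3.00   # dollars per million uncached input tokens
CACHE_READ_PER_MTOK = 0.30   # dollars per million tokens read from cache


def update_cost(new_tokens: int, cached_tokens: int = 0) -> float:
    """Dollar cost of the input tokens for one background update."""
    total = new_tokens * BASE_INPUT_PER_MTOK + cached_tokens * CACHE_READ_PER_MTOK
    return total / 1_000_000


uncached = update_cost(new_tokens=5_000)
cached = update_cost(new_tokens=500, cached_tokens=4_500)
savings = 1 - cached / uncached

print(f"uncached: ${uncached:.5f}")
print(f"cached:   ${cached:.5f}")
print(f"savings:  {savings:.0%}")
```

With these inputs the cached update costs roughly a fifth of the uncached one, which is where the ~80% figure comes from.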

Generated by nbdime

@github-actions github-actions Bot left a comment

PR Review

Recommendation: REQUEST_CHANGES

Summary

This PR adds a comprehensive notebook demonstrating session memory compaction techniques for managing long conversations with Claude. The implementation is technically sophisticated with excellent pedagogical structure, but there are several critical issues that must be addressed before merging.

Actionable Feedback (9 items)

Critical Issues:

  • .env.example deletion - The PR deletes .env.example which is referenced in CLAUDE.md:15 (cp .env.example .env). Either restore this file or update CLAUDE.md to remove references to it.

  • misc/session_memory_compaction.ipynb (cell 3) - Add API key setup following project standards: from dotenv import load_dotenv and load_dotenv()

  • misc/session_memory_compaction.ipynb:42 - Use full model ID: MODEL = "claude-sonnet-4-5-20250929" instead of "claude-sonnet-4-5"

  • misc/session_memory_compaction.ipynb - Add Prerequisites section before setup explaining required knowledge, Python version (>= 3.11), and dependencies

  • misc/session_memory_compaction.ipynb - Add pip install cell at the beginning with %%capture

  • misc/session_memory_compaction.ipynb (cell 1) - Rewrite introduction to follow problem-focused pattern (explain the problem first, then learning objectives)

  • misc/session_memory_compaction.ipynb (in cell with remove_thinking_blocks) - Add return type annotation: -> tuple[str, str]

  • misc/session_memory_compaction.ipynb:7 - Fix typo: Utlizing to Utilizing

  • misc/session_memory_compaction.ipynb - Add Conclusion section at the end that summarizes key learnings

Detailed Review

Code Quality

Strengths:

  • The threading implementation in InstantCompactingChatSession is excellent: proper use of threading.Lock, daemon threads for cleanup, and sensible timeout handling
  • The SESSION_MEMORY_PROMPT is production-quality with structured sections, explicit preservation rules, and chain-of-thought instructions
  • Helper functions have detailed docstrings explaining implementation choices
  • Progressive examples (Traditional to Instant to Caching) effectively demonstrate solution evolution
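
The threading pattern noted above (at most one background update at a time, a snapshot of the input, and a lock around shared state) can be sketched independently of the API. The class and function names below are hypothetical stand-ins for the notebook's `_trigger_background_update` and `_background_memory_update`, and `slow_summary` simulates the model call with a short sleep.

```python
import threading
import time


class SingleFlightUpdater:
    """Hypothetical helper: runs at most one background update at a time."""

    def __init__(self, work):
        self._work = work               # callable(snapshot) -> result
        self._lock = threading.Lock()   # guards self.result
        self._thread = None
        self.result = None

    def _run(self, snapshot):
        value = self._work(snapshot)    # the slow call happens outside the lock
        with self._lock:
            self.result = value

    def trigger(self, items) -> bool:
        """Start an update unless one is already running. Returns True if started."""
        if self._thread is not None and self._thread.is_alive():
            return False
        snapshot = list(items)          # copy so later appends do not affect the worker
        self._thread = threading.Thread(target=self._run, args=(snapshot,), daemon=True)
        self._thread.start()
        return True

    def wait(self, timeout=None):
        """Block until the current update finishes or the timeout expires."""
        if self._thread is not None:
            self._thread.join(timeout=timeout)


def slow_summary(msgs):
    time.sleep(0.2)                     # stand-in for the summarization request
    return f"{len(msgs)} messages summarized"


updater = SingleFlightUpdater(slow_summary)
messages = ["a", "b", "c"]
first = updater.trigger(messages)
second = updater.trigger(messages)      # skipped: the first update is still running
messages.append("d")                    # not seen by the worker, which has its own copy
updater.wait(timeout=5.0)
print(first, second, updater.result)
```

Skipping rather than queueing a second update keeps the sketch simple; the next trigger after the worker finishes picks up whatever messages accumulated in the meantime.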

Issues:

  • Missing type annotation on remove_thinking_blocks() return value
  • Minor formatting: class definitions could use blank lines after docstrings for readability

Security

  • No API key hardcoding detected
  • Missing load_dotenv() call in setup (critical)
  • No sensitive data exposure concerns

Documentation

Strengths:

  • ASCII diagrams clearly illustrate the difference between traditional and instant compaction approaches
  • Inline comments explain complex logic well
  • The story-writing demo provides a realistic, engaging example

Issues:

  • Introduction lists features instead of explaining the problem and learning objectives
  • Missing Prerequisites section (required by style guide)
  • Missing Conclusion section to tie back to learning objectives
  • misc/session_memory.md contains only placeholder text

Positive Notes

  • Outstanding pedagogical structure: The progressive examples with timing comparisons effectively show why each optimization matters
  • Production-ready patterns: The threading implementation and session memory prompt are both production-quality
  • Excellent technical depth: Covers prompt design, instant compaction, and cost optimization comprehensively
  • Clear visual aids: ASCII diagrams make complex concepts immediately understandable
  • Realistic demonstration: The story-writing conversation is engaging and shows real-world usage

- Restore accidentally deleted .env.example
- Add registry.yaml entry for the notebook
- Add proper introduction with learning objectives
- Add prerequisites and setup section with pip install and load_dotenv()
- Fix typo: Utlizing → Utilizing
- Add conclusion section mapping to learning objectives
- Remove placeholder misc/session_memory.md
- Update MODEL to full model ID (claude-sonnet-4-5-20250929)
- Add cross-reference to automatic-context-compaction.ipynb
- Update notebook kernel from coconut to python
- Fix timing placeholder with actual value (36 seconds)
- Add jsham042 to authors.yaml

:house: Remote-Dev: homespace

Claude-Generated-By: Claude Code (cli/claude-opus-4-5=100%)
Claude-Steers: 2
Claude-Permission-Prompts: 17
Claude-Escapes: 0
@github-actions

Notebook Changes

This PR modifies the following notebooks:

📓 misc/session_memory_compaction.ipynb

View diff
nbdiff /dev/null misc/session_memory_compaction.ipynb (bc367bdd2f930ce5c058581a24fb3dbe008a5059)
--- /dev/null  2026-01-16 19:58:02.262867
+++ misc/session_memory_compaction.ipynb (bc367bdd2f930ce5c058581a24fb3dbe008a5059)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Session Memory Compaction
+      
+      Long-running conversations with Claude can exceed context limits, causing loss of important information. Whether you're building a coding assistant, creative writing tool, or customer service agent, managing session memory is critical for maintaining continuity and quality.
+      
+      This cookbook teaches you how to **proactively manage session memory** to avoid jarring context limit interruptions. Unlike reactive approaches that wait until the context is full, you'll learn to build session memory in the background so compaction is instant when needed.
+      
+      **Related:** For automatic SDK-based compaction in agentic workflows, see [Automatic Context Compaction](../tool_use/automatic-context-compaction.ipynb). This cookbook focuses on manual control patterns for conversational applications.
+      
+      ## Learning Objectives
+      
+      By the end of this cookbook, you will be able to:
+      
+      - Write effective session memory prompts that preserve critical context across compaction events
+      - Implement **instant compaction** using background threading to eliminate user wait time
+      - Apply prompt caching to reduce the cost of background memory updates by ~80%
+      - Choose appropriate compaction strategies (traditional vs. instant) based on your use case
+  markdown cell:
+    source:
+      ## What You'll Learn
+      
+      This cookbook covers three main topics:
+      1. Writing a quality prompt to compact session chat history
+      2. Utilizing instant compacting to improve user chat experience
+      3. Managing costs and latency for long context conversations using prompt caching
+  markdown cell:
+    source:
+      ## Prerequisites and Setup
+      
+      Before following this guide, ensure you have:
+      
+      **Required Knowledge**
+      - Basic understanding of Claude API usage and message formatting
+      - Familiarity with Python threading concepts (helpful but not required)
+      
+      **Required Tools**
+      - Python 3.10 or higher
+      - Anthropic API key
+      - Anthropic SDK
+      
+      ### Installation
+      
+      First, install the required dependencies:
+  code cell:
+    source:
+      # %pip install -qU anthropic python-dotenv
+  markdown cell:
+    source:
+      ### Configure the Client
+      
+      Load your environment variables and configure the Anthropic client. Ensure your `.env` file contains:
+      
+      ```
+      ANTHROPIC_API_KEY=your_key_here
+      ```
+  code cell:
+    source:
+      import anthropic
+      from anthropic.types import MessageParam, TextBlockParam
+      from dotenv import load_dotenv
+      
+      load_dotenv()
+      
+      # The SDK reads ANTHROPIC_API_KEY from the environment populated by load_dotenv()
+      client = anthropic.Anthropic()
+      MODEL = "claude-sonnet-4-5-20250929"
+  markdown cell:
+    source:
+      #### Helper functions for the cookbook
+  code cell:
+    source:
+      def truncate_response(text: str, max_lines: int = 15) -> str:
+          """Truncate long responses for cleaner output display."""
+          lines = text.strip().split("\n")
+          if len(lines) <= max_lines:
+              return text
+          return "\n".join(lines[:max_lines]) + f"\n... ({len(lines) - max_lines} more lines)"
+      
+      
+      def remove_thinking_blocks(text: str) -> tuple[str, str]:
+          """Remove <think>...</think> blocks; return (cleaned_text, removed_blocks)."""
+          import re
+      
+          matches = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
+          cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+          return cleaned, "".join(matches)
+      
+      
+      def add_cache_control(messages: list[dict]) -> list[MessageParam]:
+          """Add cache_control to the last user message for prompt caching.
+      
+          For prompt caching to work, the message prefix structure must be identical between requests.
+          All messages are converted to list format for consistency, and cache_control is placed on
+          the last user message to match the standard API call pattern.
+          """
+          cached_messages: list[MessageParam] = []
+          last_user_idx = None
+      
+          # Find last user message index
+          for i, msg in enumerate(messages):
+              if msg["role"] == "user":
+                  last_user_idx = i
+      
+          for i, msg in enumerate(messages):
+              content = msg["content"]
+              text = content if isinstance(content, str) else content[0]["text"]
+      
+              content_block: TextBlockParam = {"type": "text", "text": text}
+              if i == last_user_idx:
+                  content_block["cache_control"] = {"type": "ephemeral"}
+      
+              cached_messages.append({"role": msg["role"], "content": [content_block]})
+      
+          return cached_messages
+      
+      
+      def estimate_tokens(text: str) -> int:
+          """Rudimentary token estimation: 1 token per 4 characters."""
+          return len(text) // 4
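To make the effect of `add_cache_control` concrete, here is a minimal standalone sketch of the same transformation on plain dicts. `mark_last_user_message` is a hypothetical name used only for illustration (it is not part of the notebook or the SDK), and it assumes every message carries either a string or a one-element list of text blocks.

```python
from typing import Any


def mark_last_user_message(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Normalize content to block lists and tag the last user turn for caching."""
    # Index of the last user message, or None if there is no user message.
    last_user_idx = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=None
    )
    out = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        text = content if isinstance(content, str) else content[0]["text"]
        block: dict[str, Any] = {"type": "text", "text": text}
        if i == last_user_idx:
            block["cache_control"] = {"type": "ephemeral"}
        out.append({"role": msg["role"], "content": [block]})
    return out


history = [
    {"role": "user", "content": "Draft chapter 1."},
    {"role": "assistant", "content": "Here is chapter 1..."},
    {"role": "user", "content": "Now draft chapter 2."},
]
marked = mark_last_user_message(history)
tagged = [i for i, m in enumerate(marked) if "cache_control" in m["content"][0]]
print(tagged)  # [2]
```

Only the final user turn carries the marker, so everything up to and including that turn forms the cacheable prefix on the next request.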
+  markdown cell:
+    source:
+      ## 1. Writing a compaction prompt
+  markdown cell:
+    source:
+      Start with a well-structured session memory prompt.
+      
+      Some best practices include:
+      - Use chain-of-thought before summarizing: analyze first, then output
+      - Enumerate exactly what to preserve: file paths, code snippets, errors, user corrections
+      - Weight recency heavily: the end of the conversation is the active context
+      - Require verbatim quotes for next steps to prevent task drift
+      - Use structured sections with token budgets per section
+      - Include a "Current State" section that always reflects the moment of compaction
+      
+      Some pitfalls include:
+      - Vague prompts like "summarize this conversation" produce lossy output
+      - Treating all messages equally loses the active working context
+      - Paraphrasing next steps introduces subtle drift that compounds
+      - Omitting error history causes the model to retry failed approaches
+      - Dropping user corrections makes the model revert to old behaviors
+      - Omitting token limits lets one section consume the entire summary
+      - Summarizing for human readability instead of model continuity
+      - Compressing tool-call results into the summary: the agent can retrieve them again later if it needs them
+  code cell:
+    source:
+      SESSION_MEMORY_PROMPT = """
+      Compress the conversation into a structured summary
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's
+      ability to continue working, not human readability.
+      
+      <analysis-instructions>
+      Before generating your summary, analyze the transcript in <think>...</think> tags:
+      1. What did the user originally request? (Exact phrasing)
+      2. What actions succeeded? What failed and why?
+      3. Did the user correct or redirect the assistant at any point?
+      4. What was actively being worked on at the end?
+      5. What tasks remain incomplete or pending?
+      6. What specific details (IDs, paths, values, names) must survive compression?
+      </analysis-instructions>
+      
+      <summary-format>
+      ## User Intent
+      The user's original request and any refinements. Use direct quotes for key requirements.
+      If the user's goal evolved during the conversation, capture that progression.
+      
+      ## Completed Work
+      Actions successfully performed. Be specific:
+      - What was created, modified, or deleted
+      - Exact identifiers (file paths, record IDs, URLs, names)
+      - Specific values, configurations, or settings applied
+      
+      ## Errors & Corrections
+      - Problems encountered and how they were resolved
+      - Approaches that failed (so they aren't retried)
+      - User corrections: "don't do X", "actually I meant Y", "that's wrong because..."
+      Capture corrections verbatim—these represent learned preferences.
+      
+      ## Active Work
+      What was in progress when the session ended. Include:
+      - The specific task being performed
+      - Direct quotes showing exactly where work left off
+      - Any partial results or intermediate state
+      
+      ## Pending Tasks
+      Remaining items the user requested that haven't been started.
+      Distinguish between "explicitly requested" and "implied/assumed."
+      
+      ## Key References
+      Important details needed to continue:
+      - Identifiers: IDs, paths, URLs, names, keys
+      - Values: numbers, dates, configurations, credentials (redacted)
+      - Context: relevant background information, constraints, preferences
+      - Citations: sources referenced during the conversation
+      </summary-format>
+      
+      <preserve-rules>
+      Always preserve when present:
+      - Exact identifiers (IDs, paths, URLs, keys, names)
+      - Error messages verbatim
+      - User corrections and negative feedback
+      - Specific values, formulas, or configurations
+      - Technical constraints or requirements discovered
+      - The precise state of any in-progress work
+      </preserve-rules>
+      
+      <compression-rules>
+      - Weight recent messages more heavily—the end of the transcript is the active context
+      - Omit pleasantries, acknowledgments, and filler ("Sure!", "Great question")
+      - Omit system context that will be re-injected separately
+      - Keep each section under 500 words; condense older content to make room for recent
+      - If you must cut details, preserve: user corrections > errors > active work > completed work
+      </compression-rules>
+      """
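The prompt asks for each section to stay under 500 words, but nothing in the notebook verifies that the model complied. A minimal sketch of such a check follows. `section_word_counts` and `over_budget` are hypothetical helpers, and the sketch assumes the summary uses the `## Heading` layout from `<summary-format>`.

```python
import re


def section_word_counts(summary: str) -> dict[str, int]:
    """Map each '## Heading' in a summary to the word count of its body."""
    counts: dict[str, int] = {}
    # Split on level-2 headings at line starts; parts[0] is any preamble.
    parts = re.split(r"^## +(.+)$", summary, flags=re.MULTILINE)
    for heading, body in zip(parts[1::2], parts[2::2]):
        counts[heading.strip()] = len(body.split())
    return counts


def over_budget(summary: str, max_words: int = 500) -> list[str]:
    """Return headings whose body exceeds the per-section word budget."""
    return [h for h, n in section_word_counts(summary).items() if n > max_words]


example = """## User Intent
Write a detective story set in a small town.

## Active Work
""" + ("word " * 600) + """

## Pending Tasks
Draft chapter 3.
"""

print(section_word_counts(example))
print(over_budget(example))  # ['Active Work']
```

A check like this could run after each memory update to flag a section that is crowding out the others.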
+  markdown cell:
+    source:
+      ### Code example using traditional compacting
+      In traditional compaction, you generate a single summary once the token threshold is reached.
+      This is slow: the summary is produced by a blocking API call, so the user waits for it before the conversation can continue.
+  markdown cell:
+    source:
+      
+      ```
+      TRADITIONAL COMPACTION (slow)
+      ─────────────────────────────
+      Turn 1 → Turn 2 → Turn 3 → ... → Turn N → CONTEXT FULL!
+                                                   │
+                                                   ▼
+                                          ┌─────────────────┐
+                                          │ Generate summary│
+                                          │ ( USER WAITS !) │
+                                          └─────────────────┘
+                                                   │
+                                                   ▼
+                                               Continue
+      
+      ```
+  code cell:
+    source:
+      import time
+      
+      
+      class TraditionalCompactingChatSession:
+          """Traditional chat session with compaction after the fact."""
+      
+          def __init__(self, system_message="You are a helpful assistant", context_limit: int = 10000):
+              self.system_message = system_message
+              self.context_limit = context_limit  # the point at which the conversation is compacted so it does not exceed model limits.
+              self.messages = []
+              self.current_context_window_tokens = 0
+              self.summary = None
+      
+          def chat(self, user_message: str):
+              # In traditional compaction, we check if we need to compact when the user sends a message. NOT IDEAL!
+              if self.current_context_window_tokens >= self.context_limit:
+                  print(
+                      f"\n🧹 Context window at {self.current_context_window_tokens} tokens. Limit exceeded, compacting session memory..."
+                  )
+                  self.compact()  # compacts everything before the new user message
+      
+              self.messages.append({"role": "user", "content": user_message})
+              print(f"\nUser: {user_message}")
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=self.system_message,
+                  messages=add_cache_control(self.messages),
+              )
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              print(f"\nAssistant: \n{truncate_response(assistant_message, max_lines=15)}")
+      
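+              # approximate current token count in the conversation before the next user message;
+              # cached tokens (read or written) are reported separately from input_tokens
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              cache_write = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_read + cache_write
+              self.current_context_window_tokens = total_input + response.usage.output_tokens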
+      
+              print(
+                  f"Input={total_input:,}, Prompt cached used= {cache_read > 0} | "
+                  f"Output={response.usage.output_tokens:,} | "
+                  f"Messages={len(self.messages)}"
+              )
+              return assistant_message, response.usage
+      
+          def compact(self):
+              start_time = time.perf_counter()
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,  # Same as main chat for cache sharing
+                  messages=add_cache_control(self.messages)
+                  + [{"role": "user", "content": SESSION_MEMORY_PROMPT}],
+              )
+              elapsed = time.perf_counter() - start_time
+      
+              # Generate new summary message
+              self.summary, removed_text = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              approximate_summary_tokens = response.usage.output_tokens - round(
+                  len(removed_text) / 4
+              )  # rough estimate of tokens removed from summary
+      
+              # Replace prior messages with new summary message
+              self.messages = [
+                  {
+                      "role": "user",
+                      "content": f"""This session is being continued from a previous conversation. Here is the session memory:\n\n{self.summary}\n\nContinue from where we left off.""",
+                  }
+              ]
+      
+              # Show token reduction if we just compacted
+              reduction = self.current_context_window_tokens - approximate_summary_tokens
+              pct = (reduction / self.current_context_window_tokens) * 100
+      
+              print(f"\n{'-' * 60}")
+              print("📝 New session memory created.")
+              print(
+                  f"✅ Tokens reduced: {self.current_context_window_tokens:,} → {approximate_summary_tokens:.0f} ({reduction:,} tokens saved, {pct:.0f}% reduction)"
+              )
+              print(f"⏱️ Compaction time: {elapsed:.2f}s (user waiting...)")
+              print(f" Cache used: {(getattr(response.usage, 'cache_read_input_tokens', 0) or 0) > 0}")
+              print(f"{'-' * 60}")
+      
+              # Update token count to reflect compacted state
+              self.current_context_window_tokens = approximate_summary_tokens
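The control flow of the class above can be exercised without any API calls. The sketch below is a stubbed stand-in, not the notebook's implementation: `OfflineSession` is a hypothetical name, model replies and the summary are fixed strings, and token counts use the same rough 4-characters-per-token heuristic as `estimate_tokens`.

```python
class OfflineSession:
    """Stubbed stand-in for the chat session: no API calls, estimated tokens."""

    def __init__(self, context_limit: int = 100):
        self.context_limit = context_limit
        self.messages: list[dict[str, str]] = []
        self.compactions = 0

    def _tokens(self) -> int:
        # Same rough heuristic as estimate_tokens: 1 token per 4 characters.
        return sum(len(m["content"]) for m in self.messages) // 4

    def chat(self, user_message: str) -> str:
        # Traditional compaction: the check runs while the user is waiting.
        if self._tokens() >= self.context_limit:
            self.compact()
        self.messages.append({"role": "user", "content": user_message})
        reply = "x" * 200  # stub for a model reply of about 50 tokens
        self.messages.append({"role": "assistant", "content": reply})
        return reply

    def compact(self) -> None:
        summary = "summary of earlier turns"  # stub for the summarization call
        self.messages = [{"role": "user", "content": f"Session memory: {summary}"}]
        self.compactions += 1


session = OfflineSession(context_limit=100)
for turn in range(1, 6):
    session.chat(f"message {turn}")
    print(turn, session._tokens(), session.compactions)
```

The printed counts rise until the limit is crossed, then drop at the start of the following turn, which is exactly the moment the user is waiting for a reply.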
+  markdown cell:
+    source:
+      Below we simulate a conversation between an author and an LLM that helps write stories.
+  code cell:
+    source:
+      SYSTEM_PROMPT = """
+      You are a short story writer who helps authors develop their ideas into compelling narratives.
+      
+      ## What You Do
+      
+      **Plot Development**
+      - Help authors work through story structure, pacing, and narrative arc
+      - Identify plot holes, inconsistencies, or missed opportunities
+      - Suggest ways to raise stakes, add tension, or deepen conflict
+      - Brainstorm twists, resolutions, and scene transitions
+      
+      **Character Development**
+      - Develop backstories, motivations, and internal conflicts
+      - Ensure characters have distinct voices and consistent behavior
+      - Explore character relationships and how they drive the plot
+      - Help authors understand what their characters want vs. what they need
+      
+      **Drafting**
+      - Write short stories or scenes based on the author's ideas and direction
+      - Match tone, genre conventions, and stylistic preferences
+      - Show rather than tell when bringing scenes to life
+      - Craft dialogue that reveals character and advances plot
+      
+      ## How You Work
+      - You are the lead writer. When you disagree with a creative choice, say so respectfully, but ultimately defer to what the author wants.
+      - DO NOT ask the user to provide more context or clarify their request. Assume you have enough information to proceed.
+      """
+  code cell:
+    source:
+      session = TraditionalCompactingChatSession(system_message=SYSTEM_PROMPT)
+      
+      # Simulated conversation
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.",
+          "Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?",
+      ]
+      
+      print("Starting conversation...\n")
+      
+      for turn_count, message in enumerate(messages, 1):
+          print(f"==============================================\nTurn {turn_count}:\n")
+          response, usage = session.chat(message)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation...
+          
+          ==============================================
+          Turn 1:
+          
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Mystery Plot Ideas
+          
+          ## 1. **The Drowning Season**
+          
+          **Setup:** Twenty-three-year-old Detective Maya Reeves returns to her hometown of Millbrook after five years away, assigned to investigate what locals insist is just another tragic drowning at the quarry—the third one this summer. The sheriff's department is understaffed, and Maya's the only one taking it seriously.
+          
+          **The Mystery:** Each victim was found at dawn, fully clothed, with no water in their lungs. They all have the same strange detail: fresh soil under their fingernails, though the quarry is surrounded by rock. Maya discovers all three victims had recently started remembering "lost time" from their childhood—gaps of hours or days they couldn't explain. The town has a secret: twenty years ago, a local therapist ran a controversial "memory recovery" program for troubled kids. Maya was one of those kids, and she's starting to remember why she really left town.
+          
+          **What Makes It Work:** The murders are connected to buried trauma, both literal and psychological. The killer believes they're saving the victims from remembering something terrible. Maya must solve the case while confronting her own suppressed memories, never sure if what she's remembering is real or if she might be the next target.
+          
+          ---
+          
+          ## 2. **The Lighthouse Keeper's Daughter**
+          
+          **Setup:** Rookie detective James Park is sent to the coastal town of Beacon's Rest to investigate the disappearance of fifteen-year-old Clare Whitmore, daughter of the reclusive lighthouse keeper. She vanished during the annual Founder's Day festival three days ago. The town is cooperative but oddly unsurprised—seven girls have disappeared from Beacon's Rest over the past forty years, always during Founder's Day, always at age fifteen, and always without a trace.
+          ... (18 more lines)
+          Input=318, Prompt cached used= False | Output=988 | Messages=2
+          ==============================================
+          Turn 2:
+          
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # The Cartographer's Cipher
+          
+          **Setup:** Detective Nora Alike, 24, takes her first solo case in the dying mining town of Shepherdstown (population 847 and falling). An elderly cartographer named Eugene Fisk was found dead in his cluttered workshop, officially ruled a heart attack. But his daughter insists something's wrong: her father spent his final month frantically updating a massive hand-drawn map of the town—adding buildings that don't exist, removing ones that do, and marking dozens of red X's in seemingly random locations.
+          
+          **The Twist:** Nora starts investigating the X's out of curiosity. The first one leads to an old park bench—underneath it, she finds a coffee can containing $2,300 in cash and a note: "For the Hendersons' mortgage, June 1987." The second X marks a spot behind the defunct movie theater where she uncovers an envelope with photographs proving the former mayor's affair (the scandal that ended his career). The third X reveals a time capsule with a child's confession to starting the fire that burned down the elementary school forty years ago.
+          
+          **The Real Mystery:** Eugene wasn't murdered—but he wasn't mapping the town as it is. He was mapping every secret buried within it, physical and metaphorical. For sixty years, he'd been Shepherdstown's unofficial confessor. People would visit his workshop and tell him things they couldn't tell anyone else, and he'd mark them on his map in an elaborate code only he understood. In his final month, knowing he was dying, he decoded everything—creating a map that, if made public, would destroy half the town.
+          
+          **The Investigation:** Nora realizes someone is also following the map. Items are being dug up, locations disturbed. She races to decipher Eugene's system before this unknown person does. But the deeper she digs, the more she uncovers: covered-up deaths, stolen inheritances, a hit-and-run from 1992, evidence of who really embezzled the union funds that closed the mine. Every secret connects to another. The town isn't dying by accident—it's been poisoned from within by decades of buried truth.
+          
+          **The Impossible Choice:** The person following the map is Eugene's daughter, who wants to expose everything and burn the town's lies to the ground. She blames these secrets—and the people who keep them—for turning Shepherdstown into a hollowed-out shell. Nora must decide: help her complete her father's final work and reveal every terrible truth, or destroy the map and let sleeping sins lie. The case becomes about whether a detective's job is to uncover the truth or to protect the living from what the truth will do to them.
+          
+          **What Makes It Unique:** 
+          - No traditional villain—just a town full of people who made choices
+          - The "victim" weaponized his own death
+          ... (5 more lines)
+          Input=1,328, Prompt cached used= False | Output=688 | Messages=4
+          ==============================================
+          Turn 3:
+          
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # Nora Alike: Character Deep Dive
+          
+          ## Background
+          
+          **Age:** 24, but people consistently guess younger—which she hates. She's been fighting to be taken seriously her entire life.
+          
+          **Origin:** Grew up in a suburb of a mid-sized city, the youngest of four siblings by seven years (she was the "accident baby"). Her parents were older, tired, and emotionally checked out by the time she came along. Her siblings had already left home, so she essentially raised herself in a house that felt like a museum to other people's childhoods.
+          
+          **The Formative Incident:** When Nora was sixteen, her oldest brother Martin disappeared. Just stopped coming to family dinners, didn't return calls. Her parents were worried but passive—"He's an adult, he'll reach out when he's ready." After three weeks of everyone just *waiting*, Nora took the bus to his apartment herself. She found him in the middle of a breakdown, his apartment filthy, convinced he'd ruined his life after losing his job. Her parents had known something was wrong but didn't want to pry, didn't want to intrude. Their politeness, their respect for privacy, almost killed him.
+          
+          That's when Nora learned: sometimes the most destructive thing you can do is mind your own business.
+          
+          ## Why She Became a Detective
+          
+          **Surface Reason:** She tells people she's interested in justice, in puzzles, in helping people. Standard cop interview answers.
+          ... (44 more lines)
+          Input=2,039, Prompt cached used= False | Output=1,350 | Messages=6
+          ==============================================
+          Turn 4:
+          
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # The Cartographer's Cipher: Detailed Outline
+          
+          ## ACT ONE: ARRIVAL AND DISCOVERY
+          
+          ### Chapter 1: The Smallest Case
+          - **Setting:** Nora arrives in Shepherdstown on a gray October morning. Description of the town—half the storefronts empty, population aging, mines closed for twenty years.
+          - **Key Event:** She meets Sheriff Dale Kovach (58, gruff, counting days to retirement). He makes it clear this is a nothing case—old man had a heart attack, daughter is in denial, just close it out.
+          - **Character Moment:** Nora's crappy motel room. She can't sleep. Lies awake analyzing the case file that shouldn't be a case.
+          - **Setup:** Brief flashback to why she's here—her last case, pushing too hard, burning bridges.
+          
+          ### Chapter 2: The Workshop
+          - **Setting:** Eugene Fisk's workshop—a converted garage behind his house, packed with decades of maps, surveying equipment, and obsessive documentation.
+          - **Key Event:** Nora meets Eugene's daughter, **Caroline Fisk** (44, librarian, wire-thin with grief and rage). Caroline shows her the map—a massive 8x10 foot rendering of Shepherdstown with bizarre alterations.
+          - **The Map:** Buildings that don't exist (a church that burned down in 1963, a restaurant that was never built). Streets rerouted. And 63 red X's scattered across the town.
+          - **Caroline's Plea:** "My father didn't have heart attacks. He had *purpose*. Someone scared him to death."
+          ... (113 more lines)
+          Input=3,411, Prompt cached used= False | Output=3,500 | Messages=8
+          ==============================================
+          Turn 5:
+          
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # Chapter 1: The Smallest Case
+          
+          The thing about dying towns is they all smell the same—wet cardboard and regret.
+          
+          Nora Alike noticed it the moment she stepped out of her Honda Civic into the gray October drizzle. Shepherdstown, Pennsylvania stretched out before her like a body someone had forgotten to bury. Half the storefronts on Main Street were empty, their windows either papered over with faded FOR LEASE signs or simply left naked to expose the gutted interiors. A diner called Mabel's still had its lights on, and a Dollar General looked operational, but that was about it for signs of life.
+          
+          Population 847, the sign at the town limits had declared. Nora suspected that number was aspirational.
+          
+          She locked her car—force of habit, though she doubted anyone here had the energy for theft—and crossed the street toward the sheriff's office. The building was brick, squat, probably built in the sixties when the coal mine was still running and the town still had a future. A rusted flagpole stood out front, the rope slapping against the metal in the wind like a half-hearted call for help.
+          
+          *This is what you get*, Nora told herself. *This is what happens when you push too hard.*
+          
+          Three months ago, she'd been working property crimes in Harrisburg. Not glamorous, but it was something. Then she'd caught a case involving a city councilman's nephew and a string of car thefts. Everyone told her to go easy, that the kid was "going through something," that his uncle was "a good man who'd served the community for twenty years." Nora had arrested him anyway, because the evidence was there and because she couldn't understand why everyone wanted her to look the other way.
+          
+          The charges stuck. The councilman made calls. Nora's lieutenant made it clear that while he couldn't fire her, he could make her life miserable. Two weeks later, she'd accepted a "transfer opportunity" to Shepherdstown, where the entire sheriff's department consisted of three people and the most exciting case in recent memory was someone stealing catalytic converters from the Dollar General parking lot.
+          ... (98 more lines)
+          Input=6,947, Prompt cached used= False | Output=2,670 | Messages=10
+          ==============================================
+          Turn 6:
+          
+          
+          User: Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?
+          
+          Assistant: 
+          # Chapter 2: The Workshop
+          
+          Caroline Fisk's house was a narrow two-story Victorian on Elm Street, painted a faded yellow that had probably been cheerful once. The detached garage sat thirty feet behind it, connected by a flagstone path that cut through what had been a carefully tended garden. Past tense, because the garden was dying now—rosebushes gone leggy and wild, perennials choked with weeds, a birdbath tipped over on its side.
+          
+          Nora parked on the street and walked up the front path. Before she could knock, the door opened.
+          
+          Caroline Fisk was forty-four but looked older, the way grief ages people in fast-forward. Thin to the point of frailty, with graying brown hair pulled back in a hasty ponytail. She wore jeans and an oversized cardigan that might have been her father's. Her eyes were red-rimmed but sharp, evaluating Nora with the focused intensity of someone who'd cried herself out and moved on to anger.
+          
+          "You're the detective." Not a question.
+          
+          "Detective Alike. I'm sorry for your loss, Ms. Fisk."
+          
+          "Are you?" Caroline stepped aside to let Nora in. "Or are you here to tell me I'm a hysterical woman who can't accept that her father died of natural causes?"
+          
+          "I'm here to listen."
+          ... (161 more lines)
+          Input=9,641, Prompt cached used= True | Output=3,500 | Messages=12
+  markdown cell:
+    source:
+      This is a long conversation with several turns. One thing to notice:
+      
+      Prompt caching: By turn 6, the input had grown to the point where prompt caching was used. Caching helps reduce both cost and latency as these conversations grow.
+  markdown cell:
+    source:
+      On the next turn, we are going to hit our 10K context window limit, which triggers compaction:
+  code cell:
+    source:
+      response, usage = session.chat("Propose a title for the book")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          🧹 Context window at 13141 tokens. Limit exceeded, compacting session memory...
+          
+          ------------------------------------------------------------
+          📝 New session memory created.
+          ✅ Tokens reduced: 13,141 → 1559 (11,582 tokens saved, 88% reduction)
+          ⏱️ Compaction time: 36.13s (user waiting...)
+           Cache used: True
+          ------------------------------------------------------------
+          
+          User: Propose a title for the book
+          
+          Assistant: 
+          Looking at the story we've developed, I'd propose:
+          
+          **"The Cartographer's Confession"**
+          
+          Here's why this works:
+          
+          **Thematic Resonance:**
+          - The double meaning captures Eugene's dual role: he kept confessions *and* his final map is itself a confession
+          - "Cartographer" immediately signals the unique hook of your premise
+          - "Confession" ties to the central tension between exposure and privacy
+          
+          **Alternative Titles to Consider:**
+          
+          1. **"Burial Ground"** - More commercial, emphasizes the literal buried evidence and metaphorical buried truths
+          
+          ... (13 more lines)
+          Input=1,840, Prompt cached used= False | Output=325 | Messages=3
+  markdown cell:
+    source:
+      You'll notice here that it took **over 36 seconds** for the agent to compact the conversation. Because we used traditional compaction, the user would be waiting on Claude to compact the conversation, which is not an ideal user experience.
+      
+      Below you can see the result of the compaction. It captures the key elements of the conversation in fewer than 2K tokens.
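The reduction reported in the compaction log can be checked by hand. The sketch below recomputes it from the two token counts printed above (13,141 before, 1,559 after); `compaction_savings` is a hypothetical helper written for illustration and is not part of the notebook.

```python
def compaction_savings(tokens_before: int, tokens_after: int) -> tuple[int, float]:
    """Return (tokens saved, percent reduction) for one compaction event."""
    saved = tokens_before - tokens_after
    percent = 100 * saved / tokens_before
    return saved, percent


# Token counts taken from the compaction log above
saved, percent = compaction_savings(13_141, 1_559)
print(f"{saved:,} tokens saved, {percent:.0f}% reduction")
# prints: 11,582 tokens saved, 88% reduction
```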
+  code cell:
+    source:
+      print(session.summary)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ## User Intent
+          User requested: "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider."
+          
+          After rejecting initial 3 plots, user specified: "I don't like those ideas, can you think of one plot something more unique and unexpected?"
+          
+          Accepted "The Cartographer's Cipher" concept. Then requested: character development, detailed outline, and chapter drafts.
+          
+          ## Completed Work
+          
+          **Story Concept Developed:**
+          - Title: "The Cartographer's Cipher"
+          - Premise: Detective investigates death of cartographer who spent final month decoding 40 years of town secrets onto a map with 63 red X's marking buried physical evidence
+          
+          **Character: Detective Nora Alike**
+          - Age: 24, physically small (5'4"), socially awkward, insomniac
+          - Backstory: Youngest of 4 siblings by 7 years, essentially raised herself. At 16, "rescued" brother Martin from breakdown—he resents the intrusion
+          - Transferred to Shepherdstown from Harrisburg Property Crimes after arresting city councilman's nephew, making powerful enemies
+          - Fatal flaw: Believes exposure always equals healing; pathological need to know; invasive about others' secrets, intensely private about her own
+          - Character arc: Must learn some truths cause more harm than good
+          
+          **Setting: Shepherdstown, PA**
+          - Population: 847 (declining)
+          - Coal mine closed 1998, town dying since
+          - Key locations: Sheriff's office, Fisk house/workshop, abandoned mine, Motor Lodge
+          
+          **Supporting Characters:**
+          - Eugene Fisk: 79, cartographer, died 4 days before story opens. Stage 4 pancreatic cancer. Spent 40 years as town's "confessor"
+          - Caroline Fisk: 44, librarian, Eugene's daughter, wants to expose all secrets
+          - Sheriff Dale Kovach: 58, wants case closed quickly
+          - Deputy Marcus Webb: 31, local, volunteers to help (his father's secret on the map)
+          - Helen Morrison: 73, visited Eugene asking him to "mark" something, he refused
+          
+          **17-Chapter Outline Created:**
+          - Act 1 (Chapters 1-5): Nora arrives, discovers map, investigates first X's revealing buried secrets (Hendersons' mortgage, mayor's affair, fire confession, etc.)
+          - Act 2 (Chapters 6-11): Someone else following map, threatens Nora. Interviews living victims who beg her not to expose secrets. Discovers mine embezzlement cover-up
+          - Act 3 (Chapters 12-17): Town meeting confrontation, Nora discovers Mayor Ortiz's father covered up mine safety violations. Phone call from brother Martin reveals he resents her "rescue." Nora compromises: exposes mine cover-up (affects everyone), buries personal secrets. Arrests Mayor and father. Ambiguous ending about whether truth serves justice
+          
+          **Key Plot Points:**
+          - Mine closed due to covered-up safety violations, not economics
+          - Vernon Pike (union treasurer) embezzled funds as scapegoat at Donald Mercer's direction
+          - Donald Mercer is Mayor Linda Ortiz's father
+          - Eugene's final journal: "I've mapped every lie, every buried truth... Maybe the only cure is exposure. Or maybe exposure is just another kind of death."
+          - Each X marks physical evidence of a secret (cash, photos, confessions, documents)
+          
+          **Chapters Drafted:**
+          - Chapter 1 (~2,000 words): Nora arrives Shepherdstown, meets Sheriff Kovach who dismisses case, assigned to talk to Caroline Fisk
+          - Chapter 2 (continuation): Nora visits Caroline, sees workshop and massive incorrect map with 63 red X's. Caroline explains Eugene's paranoid final month. Journal entry reveals "Morrison girl" visit. Coroner confirms extremely elevated stress hormones. Chapter ends with Nora heading to investigate old mine location with survey map
+          
+          ## Errors & Corrections
+          
+          User rejected first 3 plot concepts as not unique/unexpected enough:
+          1. "The Drowning Season" (memory recovery therapy murders)
+          2. "The Lighthouse Keeper's Daughter" (ritual sacrifices every 5-6 years)
+          3. "The Memory Box Murders" (classmates hunting each other over past crime)
+          
+          User directive: "can you think of one plot something more unique and unexpected?" Led to cartographer concept.
+          
+          ## Active Work
+          
+          Chapter 2 just completed. Ends with:
+          "She headed for the door, the survey map folded in her pocket and Eugene Fisk's final journal entry echoing in her mind: *Maybe the only cure is exposure. Or maybe exposure is just another kind of death.*
+          
+          Outside, the clouds had thickened again, pressing down on Shepherdstown like a shroud. Somewhere in this dying town, someone had scared an old man to death."
+          
+          Nora is heading to investigate the old mine (northern edge of town, surrounded by woods, one overgrown access road). She has Eugene's survey map showing cluster of X's around mine area, dates back to 1998.
+          
+          ## Pending Tasks
+          
+          No explicit requests pending. Story development ongoing—presumably more chapters to draft following the 17-chapter outline structure.
+          
+          ## Key References
+          
+          **Timeline:**
+          - 1998: Mine closes (covered-up safety violations)
+          - 2003: Martha Fisk (Eugene's wife) dies of cancer
+          - 5 weeks before present: Eugene starts creating "corrected" map
+          - 2 weeks before present: Caroline finds Eugene shaking, says "should have left them buried"
+          - 1 week before present: Helen Morrison visits, Eugene refuses to mark something
+          - 4 days before present: Eugene found dead with extremely elevated cortisol levels
+          
+          **Map Details:**
+          - 8ft x 10ft, mounted on foam board with acetate cover
+          - Shows Shepherdstown with deliberate "errors": church on Third & Maple (doesn't exist), Giovanni's restaurant on Main, rerouted streets
+          - 63+ red X's scattered across town
+          - Each X marks buried physical evidence of a secret
+          - Survey map subset focuses on mine area with names/dates/timeline from 1998
+          
+          **Character Relationships:**
+          - Nora/Martin (brother): She "saved" him 8 years ago when he was 23 and having breakdown; he resents the public humiliation; they barely speak now
+          - Eugene/Caroline: She cared for him through cancer; he left her the decoded map knowing she'd find it
+          - Eugene/townspeople: He was unofficial confessor for 40 years; people trusted him to keep secrets safe
+  markdown cell:
+    source:
+      ## Instant Compaction
+      
+      With **instant compaction**, the session memory is generated proactively once a soft token threshold is reached.
+      
+      Once the user triggers a compaction or a hard limit is reached, the summary is already available, so the user doesn't need to wait.
+      
+      Result: Instant compaction, no waiting.
+  markdown cell:
+    source:
+      
+      SESSION MEMORY COMPACTION (instant)
+      ```
+      ────────────────────────────────────
+      Turn 1 → Turn 2 → ... → Turn K → Turn K+1 → ... → Turn N → ..  → CONTEXT FULL!
+                                  │                         │            │
+                      (soft token threshold met:        (update          │
+                     initialize session memory)          trigger)        │
+                                  │                                      │
+                                  │                         │            │
+                                  ▼                         ▼            │
+                             ┌────────┐                ┌────────┐        │
+                             │ Create │                │ Update │        │
+                             │ memory │ (background)   │ memory │        │
+                             └────────┘                └────────┘        │
+                                  │                         │            │
+                                  ▼                         ▼            ▼
+                           📝 session-memory.md ──────────────────► INSTANT SWAP!
+                             (continuously updated)
+      ```
+      
+      **Update triggers:** The first summary is generated once the initial soft token limit is reached. Updates can then be triggered after every subsequent turn, or periodically at natural breakpoints (e.g. every ~10k tokens or 3+ tool calls).
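One way to express the periodic trigger described above is a small predicate that fires when either breakpoint is reached. This is a minimal sketch, not the notebook's implementation: `should_update_memory` and its parameter names are hypothetical, and the default intervals are simply the example values from the text (~10k tokens, 3 tool calls).

```python
def should_update_memory(
    tokens_since_update: int,
    tool_calls_since_update: int,
    token_interval: int = 10_000,   # assumed value, from the "~10k tokens" example
    tool_call_interval: int = 3,    # assumed value, from the "3+ tool calls" example
) -> bool:
    """Return True when either breakpoint has been reached since the last update."""
    return (
        tokens_since_update >= token_interval
        or tool_calls_since_update >= tool_call_interval
    )


print(should_update_memory(2_500, 1))    # neither threshold reached
print(should_update_memory(10_400, 0))   # token interval reached
print(should_update_memory(1_200, 3))    # tool-call interval reached
```

Both counters would be reset by the caller whenever a memory update completes.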
+  markdown cell:
+    source:
+      This `InstantCompactingChatSession` class uses **threading** for background execution:
+      1. **`threading.Thread`** - runs memory updates in background without blocking
+      2. **Thread-safe state** - uses `threading.Lock` to safely update shared memory
+      3. **Daemon threads** - background work doesn't prevent program exit
+      4. **Instant compaction** - when context is full, just swap in the pre-built memory
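The threading pattern in the list above can be seen in isolation in the following sketch. It is a simplified stand-in, not the notebook's class: `BackgroundSummary` is a hypothetical name, and `time.sleep` takes the place of the slow summarization API call.

```python
import threading
import time


class BackgroundSummary:
    """Toy version of the pattern: one daemon worker, lock-protected result."""

    def __init__(self):
        self.summary = None
        self._lock = threading.Lock()
        self._thread = None

    def _work(self, snapshot):
        time.sleep(0.2)  # stand-in for the slow summarization call
        result = f"summary of {len(snapshot)} messages"
        with self._lock:  # publish the result under the lock
            self.summary = result

    def trigger(self, messages):
        """Start an update unless one is already running. Returns True if started."""
        if self._thread is not None and self._thread.is_alive():
            return False
        # Copy the list so later appends do not affect the worker's view
        self._thread = threading.Thread(
            target=self._work, args=(list(messages),), daemon=True
        )
        self._thread.start()
        return True

    def wait(self, timeout: float = 5.0):
        if self._thread is not None:
            self._thread.join(timeout=timeout)


bg = BackgroundSummary()
started = bg.trigger(["hi", "hello"])
skipped = bg.trigger(["hi", "hello", "more"])  # first update still running
bg.wait()
print(started, skipped, bg.summary)
```

Because the worker is a daemon thread, an unfinished update is abandoned at interpreter exit, which is why the sketch joins it before reading the result.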
+  code cell:
+    source:
+      import threading
+      import time
+      
+      
+      class InstantCompactingChatSession:
+          """
+          Maintains session memory via incremental background updates.
+      
+          Key insight: By updating memory in the background after each turn,
+          the summary is already ready when compaction is needed - instant swap!
+          """
+      
+          def __init__(
+              self,
+              system_message="You are a helpful assistant",
+              context_limit: int = 12000,
+              min_tokens_to_init: int = 7500,
+              min_tokens_between_updates: int = 2000,
+          ):
+              # Thresholds
+              self.context_limit = context_limit  # the point at which the conversation is compacted so it does not exceed model limits
+              self.min_tokens_to_init = min_tokens_to_init  # tokens needed to trigger initial memory creation; note this happens PROACTIVELY in background unlike traditional compaction
+              self.min_tokens_between_updates = min_tokens_between_updates  # new tokens required since the last update before another background update is triggered; only applies once the initial memory exists
+      
+              # Conversation state
+              self.system_message = system_message
+              self.messages = []
+              self.current_context_window_tokens = 0
+      
+              # Session memory state
+              self.session_memory = None  # this is the compacted conversation in session memory; for the demo we are storing this in memory, but in production you would write to session_memory.md file
+              self.last_summarized_index = (
+                  0  # The index of the last message included in the session memory
+              )
+              self.tokens_at_last_update = 0  # To track tokens at last memory update and see if enough new tokens have been added to trigger another update
+      
+              # Background update tracking
+              self._update_thread: threading.Thread | None = None
+              self.last_update_time = None
+              self._lock = threading.Lock()
+      
+          def chat(self, user_message: str):
+              """Process a chat turn with background session memory updates."""
+      
+              if self.current_context_window_tokens + estimate_tokens(user_message) >= self.context_limit:
+                  self.compact()  # note that when this is triggered, the compaction has already been created and is just swapped in instantly
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=self.system_message,
+                  messages=add_cache_control(self.messages),
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              # Calculate token usage including cache
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_read
+      
+              # Update context window tokens (includes cached tokens since they still count toward context)
+              self.current_context_window_tokens = total_input + response.usage.output_tokens
+      
+              # KEY DIFFERENCE: Trigger background memory update if needed proactively, before compaction is needed
+              background_status = None
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  background_status = "initializing" if self.session_memory is None else "updating"
+      
+              # Return usage info with cache stats
+              return assistant_message, response.usage, background_status
+      
+          # Helper methods to determine when to init session memory
+          def _should_init_memory(self) -> bool:
+              return (
+                  self.session_memory is None
+                  and self.current_context_window_tokens >= self.min_tokens_to_init
+              )
+      
+          # Helper method to determine if memory should be updated
+          def _should_update_memory(self) -> bool:
+              if self.session_memory is None:
+                  return False
+              tokens_since = self.current_context_window_tokens - self.tokens_at_last_update
+              return tokens_since >= self.min_tokens_between_updates
+      
+          # Methods to create initial session memory
+          def _create_session_memory(self, messages: list[dict]) -> str:
+              """Generate initial session memory from messages."""
+              # Put compaction instructions in user message to share cache with main chat
+              compaction_messages = [{"role": "user", "content": SESSION_MEMORY_PROMPT}]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,  # Same as main chat for cache sharing
+                  messages=add_cache_control(messages) + compaction_messages,
+              )
+              summary, _ = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              print(
+                  f"   [Background] Initial session memory created. Cache hit={(getattr(response.usage, 'cache_read_input_tokens', 0) or 0) > 0}"
+              )
+              return summary
+      
+          def _update_session_memory(self, new_messages: list[dict]) -> str:
+              """Update existing session memory with new messages. In practice, you may want to do this via file edit rather than full re-generation. But for demo purposes we do full regeneration here."""
+              # Put compaction instructions in user message to share cache with main chat
+              compaction_update_messages = [
+                  {
+                      "role": "user",
+                      "content": SESSION_MEMORY_PROMPT
+                      + f"""\n\nThere is an existing session memory: {self.session_memory}. Return the entire session memory with updates to reflect new messages.""",
+                  }
+              ]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,
+                  messages=new_messages
+                  + compaction_update_messages,  # you may want to use prompt caching instead, in which case you'd use add_cache_control(self.messages) here
+              )
+              updated_summary, _ = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              print("   [Background] Session memory updated.")
+              return updated_summary
+      
+          # Background memory update methods
+          def _background_memory_update(
+              self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
+          ):
+              """Run session memory update in a background thread."""
+              try:
+                  with self._lock:
+                      current_session_memory = self.session_memory
+                      last_index = self.last_summarized_index
+      
+                  if current_session_memory is None:
+                      new_memory = self._create_session_memory(messages_snapshot)
+                  else:
+                      # Get new messages since last summary
+                      new_messages = messages_snapshot[last_index:]
+                      if not new_messages:
+                          return
+                      new_memory = self._update_session_memory(new_messages)
+      
+                  # Update state (thread-safe)
+                  with self._lock:
+                      self.session_memory = new_memory
+                      self.last_summarized_index = snapshot_index
+                      self.tokens_at_last_update = current_tokens
+                      self.last_update_time = time.time()
+      
+              except Exception as e:
+                  print(f"   [Background] Error updating memory: {e}")
+      
+          # This makes sure only one background update runs at a time. If one is already running, we skip starting another. If not, we start a new thread to do the update.
+          def _trigger_background_update(self):
+              """Trigger a background session memory update."""
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  return
+      
+              messages_snapshot = self.messages.copy()
+              snapshot_index = len(messages_snapshot)
+              current_tokens = self.current_context_window_tokens
+      
+              self._update_thread = threading.Thread(
+                  target=self._background_memory_update,
+                  args=(messages_snapshot, snapshot_index, current_tokens),
+                  daemon=True,
+              )
+              self._update_thread.start()
+      
+          # Function to compact
+          def compact(self):
+              """INSTANT compaction using pre-built session memory."""
+              prev_msg_count = len(self.messages)
+      
+              # Ensure session memory is ready. Shouldn't be an issue normally, but here for safety.
+              if self.session_memory is None:
+                  if self._update_thread is not None and self._update_thread.is_alive():
+                      print("   ⏳ Waiting for background memory update...")
+                      self._update_thread.join(timeout=30.0)
+      
+                  if self.session_memory is None:
+                      print("   ⚠️  No pre-built memory, creating synchronously...")
+                      start = time.perf_counter()
+                      self.session_memory = self._create_session_memory(self.messages)
+                      elapsed = time.perf_counter() - start
+                      print(f"   ⏱️  Took {elapsed:.2f}s (but should be instant normally!)")
+                      self.last_summarized_index = len(self.messages)
+      
+              with self._lock:
+                  unsummarized = self.messages[self.last_summarized_index :]
+                  summary_message = [
+                      {
+                          "role": "user",
+                          "content": f"""This session is being continued from a previous conversation. Here is the session memory: {self.session_memory}. Continue from where we left off.""",
+                      }
+                  ]
+                  self.messages = summary_message + unsummarized
+                  self.last_summarized_index = 1
+      
+                  print(f"\n{'=' * 60}")
+                  print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
+                  print("   Session memory was pre-built (no wait time!)")
+                  print(f"{'=' * 60}")
+  markdown cell:
+    source:
+      ### Example use of Instant Compaction
+  code cell:
+    source:
+      # Uses the class defaults, which are low thresholds chosen for the demo - in production you'd use higher values
+      session = InstantCompactingChatSession(
+          system_message=SYSTEM_PROMPT,
+      )
+      
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.",
+          "Can you draft a second chapter that builds on the first one?",
+          "Can you revise that second chapter, make it more suspenseful and engaging?",
+      ]
+      print("Starting conversation with instant compacting chat session...\n")
+      
+      turn_count = 0
+      for _i, message in enumerate(messages, 1):
+          response, usage, background_status = session.chat(message)
+          turn_count += 1
+      
+          # Calculate cache stats
+          cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+          cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
+          total_input = usage.input_tokens + cache_read
+      
+          print(f"{'=' * 60}")
+          print(f"Turn {turn_count}:")
+          print(f"\nUser: {message}")
+          print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+          print("\nToken Usage:")
+          print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+          print(f"  Output: {usage.output_tokens:,}")
+          print(
+              f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}"
+          )
+      
+          if cache_read > 0:
+              cache_pct = (cache_read / total_input) * 100
+              print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+      
+          if background_status:
+              print(f"\n  [Background] Proactively {background_status} session memory...")
+              print(f"  Context window: {session.current_context_window_tokens:,} tokens")
+      
+          print()
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with instant compacting chat session...
+          
+          ============================================================
+          Turn 1:
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Detective Story Concepts
+          
+          ## 1. **The Vanishing Act**
+          ... (30 more lines)
+          
+          Token Usage:
+            Input: 318 (new: 318, cached: 0)
+            Output: 762
+            Messages: 2 | Memory: not yet
+          
+          ============================================================
+          Turn 2:
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # **The Cartographer's Daughter**
+          
+          **Premise:** Nora Fields (23) returns to her dying hometown of Millbrook after her cartographer father's sudden death. The town is literally disappearing—not metaphorically, but *actually*. Buildings that existed last month are gone. Streets lead to nowhere. Residents have conflicting memories about what was where. The population has dropped from 2,000 to 300 in three years, but no one can agree on who left or why.
+          ... (20 more lines)
+          
+          Token Usage:
+            Input: 1,102 (new: 1,102, cached: 0)
+            Output: 636
+            Messages: 4 | Memory: not yet
+          
+          ============================================================
+          Turn 3:
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # Nora Fields - Character Development
+          
+          ## Background
+          ... (67 more lines)
+          
+          Token Usage:
+            Input: 1,760 (new: 1,760, cached: 0)
+            Output: 1,214
+            Messages: 6 | Memory: not yet
+          
+          ============================================================
+          Turn 4:
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # **The Cartographer's Daughter - Story Outline**
+          
+          ## ACT ONE: The Return (Chapters 1-5)
+          ... (175 more lines)
+          
+          Token Usage:
+            Input: 2,996 (new: 2,996, cached: 0)
+            Output: 3,500
+            Messages: 8 | Memory: not yet
+          
+          ============================================================
+          Turn 5:
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # Chapter 1: Homecoming
+          
+          The town looked smaller than Nora remembered, which was exactly what everyone said about their hometowns. What they didn't mention was the peculiar quality of that smallness—not just diminished by time and perspective, but somehow *compressed*, as if Millbrook had been folded in on itself while she was away.
+          ... (168 more lines)
+          
+          Token Usage:
+            Input: 6,532 (new: 6,532, cached: 0)
+            Output: 3,077
+            Messages: 10 | Memory: not yet
+          
+            [Background] Proactively initializing session memory...
+            Context window: 9,609 tokens
+          
+             [Background] Initial session memory created. Cache hit=True
+          ============================================================
+          Turn 6:
+          
+          User: Can you draft a second chapter that builds on the first one?Can you revise that second chapter, make it more suspenseful and engaging?
+          
+          Assistant: 
+          # Chapter 2: The Map That Shouldn't Exist
+          
+          The funeral was smaller than Nora expected, which was saying something—she'd expected it to be small.
+          ... (214 more lines)
+          
+          Token Usage:
+            Input: 9,642 (new: 5,546, cached: 4,096)
+            Output: 3,500
+            Messages: 12 | Memory: ready
+            ✓ Cache hit! 42% of input from cache
+          
+            [Background] Proactively updating session memory...
+            Context window: 13,142 tokens
+          
+      output 1:
+        output_type: stream
+        name: stdout
+        text:
+             [Background] Session memory updated.
+  code cell:
+    source:
+      message = "What did we just talk about? Give me one sentence"
+      response, usage, background_status = session.chat(message)
+      
+      # Calculate cache stats
+      cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+      total_input = usage.input_tokens + cache_read
+      
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print("\nToken Usage:")
+      print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+      print(f"  Output: {usage.output_tokens:,}")
+      print(
+          f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}"
+      )
+      
+      if cache_read > 0:
+          cache_pct = (cache_read / total_input) * 100
+          print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 12 → 1
+             Session memory was pre-built (no wait time!)
+          ============================================================
+          
+          User: What did we just talk about? Give me one sentence
+          
+          Assistant: 
+          We had just finished drafting Chapter 2 (the funeral and evidence discovery), and you requested that I revise it to make it more suspenseful and engaging—which I hadn't completed yet before the conversation ended.
+          
+          Would you like me to provide that revised, more suspenseful version of Chapter 2 now?
+          
+          Token Usage:
+            Input: 2,276 (new: 2,276, cached: 0)
+            Output: 71
+            Messages: 3 | Memory: ready
+  markdown cell:
+    source:
+      Notice that once we hit the context limit, the session memory was swapped in instantly, so the user waited no extra time for a response!
+  markdown cell:
+    source:
+      ## Advanced: Understanding Prompt Caching
+  markdown cell:
+    source:
+      
+      The background updates can be made **~80% cheaper** by using prompt caching, because cached input tokens are billed at roughly a tenth of the normal input price. The trick:
+      1. Pass the **full conversation** to the background summarizer
+      2. Add `cache_control` markers so subsequent requests hit the cache
+      3. Only the new "summarize this" instruction is billed at full price
+      
+      ```
+      ┌─────────────────────────────────────────────────────────────────────────────────┐
+      │                    PROMPT CACHING FOR LONG CONVERSATIONS                        │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  WITHOUT CACHING: Pay full price for entire context every turn                 │
+      │  ════════════════════════════════════════════════════════════                   │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]                         →  500 tokens  @ $3/M │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]           → 1500 tokens  @ $3/M │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3]... → 3000 tokens  @ $3/M │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3]... → 5000 tokens  @ $3/M │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: 10,000 tokens = $0.030      │
+      │                                                                                 │
+      │                                                                                 │
+      │  WITH CACHING: Pay full price once, then 90% discount on prefix                │
+      │  ═══════════════════════════════════════════════════════════════                │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]◆                        →  500 tokens  @ $3/M │
+      │                                ▲                            (cache created)    │
+      │                          cache breakpoint                                       │
+      │                                                                                 │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]◆                                │
+      │           ╰─────── cached ──────╯                                              │
+      │                500 @ $0.30/M + 1000 new @ $3/M  =  $0.0032                     │
+      │                                                                                 │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3][Asst3]◆                  │
+      │           ╰──────────── cached ─────────────╯                                  │
+      │               1500 @ $0.30/M + 1500 new @ $3/M  =  $0.0050                     │
+      │                                                                                 │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3][Asst3][User4][Asst4]◆    │
+      │           ╰───────────────────── cached ─────────────────────╯                 │
+      │                     3000 @ $0.30/M + 2000 new @ $3/M  =  $0.0069               │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: $0.0166  (45% savings)     │
+      │                                                                                 │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  COMPACTION + CACHING: Double benefit                                           │
+      │  ════════════════════════════════════                                           │
+      │                                                                                 │
+      │    Main Chat                      Background Summarizer                         │
+      │    ─────────                      ─────────────────────                         │
+      │                                                                                 │
+      │  [Conversation grows...]          [Same conversation prefix]◆ + [Summarize!]   │
+      │         │                                    │                                  │
+      │         │                         Cache hit! Only pays for                      │
+      │         │                         the summarization prompt                      │
+      │         │                                    │                                  │
+      │         ▼                                    ▼                                  │
+      │  Context limit reached  ──────►  Session memory ready instantly                │
+      │                                  (built cheaply in background)                  │
+      │                                                                                 │
+      │  ┌──────────────────────────────────────────────────────────────────────────┐  │
+      │  │  Key insight: The background summarizer reuses the same conversation     │  │
+      │  │  prefix that was just sent to the main chat - automatic cache hit!       │  │
+      │  └──────────────────────────────────────────────────────────────────────────┘  │
+      │                                                                                 │
+      └─────────────────────────────────────────────────────────────────────────────────┘
+      
+      ◆ = cache_control breakpoint (cache everything before this point)
+      ```
+      
+      ### Why this matters for compaction
+      
+      | Scenario | Cost per background update | Notes |
+      |----------|---------------------------|-------|
+      | No caching | Full input cost | 5,000 tokens × $3/M = $0.015 |
+      | With caching | ~20% of full input cost | 500 new × $3/M + 4,500 cached × $0.30/M ≈ $0.003 |
+      | **Savings** | **~80%** | Compounds over many updates |
+      
+      The longer the conversation, the bigger the savings—exactly when you need compaction most!
+  markdown cell:
+    source:
+      ### How the Caching Works
+      
+      The key is in `_add_cache_control()` and `_create_session_memory_cached()`:
+      
+      ```python
+      # 1. Mark the last conversation message with cache_control
+      {
+          "role": "user",
+          "content": [{
+              "type": "text",
+              "text": msg["content"],
+              "cache_control": {"type": "ephemeral"}  # <-- This creates a cache breakpoint
+          }]
+      }
+      
+      # 2. Also mark the system prompt
+      system=[{
+          "type": "text",
+          "text": "You are a session memory agent...",
+          "cache_control": {"type": "ephemeral"}
+      }]
+      ```
+      
+      **Why this works:**
+      - The first background update creates a cache entry for `[System + Messages]`
+      - Subsequent updates with the same message prefix get **cache hits**
+      - Only the new summarization instruction is billed at full price
+      - Cache entries have a 5-minute TTL by default (refreshed each time the cache is hit), so rapid updates benefit most
+      
+      **Cost math:**
+      - Without caching: 5,000 tokens × $3.00/1M = $0.015 per update
+      - With caching: 500 new tokens × $3.00/1M + 4,500 cached × $0.30/1M = $0.00285
+      - **Savings: ~80%** on background summarization costs
+  markdown cell:
+    source:
+      ## Conclusion
+      
+      In this cookbook, you learned how to manage long-running Claude conversations through session memory compaction.
+      
+      ### What We Covered
+      
+      ✅ **Effective compaction prompts** - Structure your session memory to preserve user intent, completed work, errors, active work, and key references while discarding filler
+      
+      ✅ **Instant compaction** - Use background threading to proactively build session memory, eliminating user wait time when context limits are reached
+      
+      ✅ **Prompt caching for cost savings** - Reduce background update costs by ~80% by reusing the conversation prefix cache
+      
+      ✅ **Traditional vs. instant patterns** - Understand when to use each approach based on your application needs
+      
+      ### Key Takeaways
+      
+      1. **Weight recency heavily** - The end of a conversation is the active working context
+      2. **Preserve user corrections verbatim** - Prevents the model from reverting to old behaviors
+      3. **Build memory proactively** - Don't wait for context limits; start background updates early
+      4. **Leverage prompt caching** - Background summarization can share cache with the main conversation
+      
+      ### Next Steps
+      
+      - **For agentic workflows**: See [Automatic Context Compaction](../tool_use/automatic-context-compaction.ipynb) for SDK-based automatic compaction with tool use
+      - **For production**: Consider persisting session memory to disk rather than keeping it in memory
+      - **For optimization**: Experiment with update frequency thresholds to balance cost vs. freshness

Generated by nbdime
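
The cost math in the notebook's caching cells can be checked with a few lines of Python. This is a minimal sketch and not part of the PR: the function name `background_update_cost` is hypothetical, and the prices ($3.00 per million input tokens, $0.30 per million cache-read tokens) are the illustrative figures used in the notebook. Any surcharge for writing to the cache is ignored here, as it is in the notebook's tables.

```python
# Illustrative prices taken from the notebook's cost math (USD per million tokens).
INPUT_PRICE_PER_M = 3.00
CACHED_PRICE_PER_M = 0.30


def background_update_cost(new_tokens: int, cached_tokens: int = 0) -> float:
    """Cost in USD of one request, given uncached and cache-read input tokens."""
    return (new_tokens * INPUT_PRICE_PER_M + cached_tokens * CACHED_PRICE_PER_M) / 1_000_000


# Same scenario as the "Why this matters for compaction" table: a 5,000-token update.
without_cache = background_update_cost(new_tokens=5_000)
with_cache = background_update_cost(new_tokens=500, cached_tokens=4_500)
savings_pct = (1 - with_cache / without_cache) * 100

print(f"Without caching: ${without_cache:.5f}")
print(f"With caching:    ${with_cache:.5f}")
print(f"Savings:         {savings_pct:.0f}%")
```

Under these assumptions the cached update costs about a fifth of the uncached one, which is where the notebook's "~80%" figure comes from.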

@github-actions github-actions Bot left a comment


PR Review

Recommendation: REQUEST_CHANGES

Summary

This PR adds a high-quality cookbook demonstrating session memory compaction patterns for long-running Claude conversations. The technical implementation is sound with excellent threading patterns and caching strategies, but there are a few critical issues that must be fixed before merging.

Actionable Feedback (6 items)

Critical Issues:

  • misc/session_memory_compaction.ipynb:93 - Remove hardcoded API key parameter. Change client = anthropic.Anthropic(api_key="your_api_key_here") to client = anthropic.Anthropic() - the SDK reads from environment automatically after load_dotenv()

Important Issues:

  • misc/session_memory_compaction.ipynb:65 - Fix pip install command. Change # %pip install -qU anthropic python-dotenv to use %%capture magic per project standards:
    %%capture
    %pip install -U anthropic python-dotenv
  • misc/session_memory_compaction.ipynb (in cell with def remove_thinking_blocks(text: str):) - Add return type annotations to all helper functions. Example: def remove_thinking_blocks(text: str) -> tuple[str, str]:
  • misc/session_memory_compaction.ipynb:cell-1 - Remove the redundant "What You'll Learn" section - it duplicates the Learning Objectives already in the introduction
  • misc/session_memory_compaction.ipynb (in cells with class definitions) - Add blank line after class definition docstrings before first method for PEP 8 compliance
  • .env.example:16 - Good catch on the missing newline! ✓

Detailed Review

Code Quality

Strengths:

  • Excellent threading implementation with proper threading.Lock() for thread-safe state management
  • Well-structured progressive complexity: traditional → instant → caching optimization
  • Comprehensive helper functions with clear docstrings
  • Modern type hints using list[dict] and str | None syntax
  • Real-world creative writing assistant example makes concepts concrete
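
For readers unfamiliar with the pattern praised above, here is a minimal sketch of a lock-protected summary that is refreshed on a background thread. It is not the notebook's implementation: `BackgroundMemory` and `fake_summarize` are hypothetical names, and the summarizer is a stand-in for the real API call.

```python
import threading
import time
from typing import Callable, Optional


class BackgroundMemory:
    """Holds a session summary that a worker thread refreshes off the request path."""

    def __init__(self, summarize: Callable[[list], str]):
        self._summarize = summarize  # hypothetical stand-in for the summarization call
        self._lock = threading.Lock()
        self._summary: Optional[str] = None
        self._worker: Optional[threading.Thread] = None

    def update_async(self, messages: list) -> None:
        """Start a background refresh unless one is already running."""
        with self._lock:
            if self._worker is not None and self._worker.is_alive():
                return
            snapshot = list(messages)  # copy so the caller can keep appending
            self._worker = threading.Thread(target=self._run, args=(snapshot,), daemon=True)
            self._worker.start()

    def _run(self, snapshot: list) -> None:
        summary = self._summarize(snapshot)  # slow call happens outside the lock
        with self._lock:
            self._summary = summary

    def get(self) -> Optional[str]:
        """Return the latest summary, or None if no refresh has finished yet."""
        with self._lock:
            return self._summary

    def wait(self, timeout: Optional[float] = None) -> None:
        """Block until the current refresh (if any) finishes."""
        with self._lock:
            worker = self._worker
        if worker is not None:
            worker.join(timeout)


def fake_summarize(messages: list) -> str:
    time.sleep(0.05)  # simulate API latency
    return f"summary of {len(messages)} messages"


memory = BackgroundMemory(fake_summarize)
memory.update_async([{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}])
memory.wait()
print(memory.get())  # prints "summary of 2 messages"
```

The lock is held only while reading or swapping state, never during the slow summarization call, so the chat path is not blocked by a refresh in progress.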

Issues:

  • Hardcoded API key parameter (line 93): Following load_dotenv(), the SDK automatically reads ANTHROPIC_API_KEY from environment. The explicit api_key="your_api_key_here" parameter is unnecessary and teaches bad patterns. All other cookbooks use client = anthropic.Anthropic() without parameters.
  • Missing return type annotations: Functions like remove_thinking_blocks(), add_cache_control() should have explicit return types for consistency with project standards
  • Pip install format: Should use %%capture magic instead of -q flag per project conventions (see other cookbooks)

Documentation & Structure

Strengths:

  • Problem-focused introduction that opens with the pain point
  • Excellent ASCII diagrams comparing traditional vs instant compaction
  • Clear learning objectives aligned with TLO pattern
  • Strong conclusion with key takeaways and next steps
  • Helpful inline comments explaining thread timing and cache behavior

Issues:

  • "What You'll Learn" section (cell 1) duplicates the Learning Objectives already in the introduction
  • Could map conclusion more explicitly back to the 4 learning objectives

Session Memory Prompt

Strengths:

  • Well-structured with clear analysis instructions and summary format
  • Excellent preservation rules (identifiers, errors, corrections)
  • Proper emphasis on recency weighting
  • Good compression rules with token budgets

Observations:

  • The prompt is comprehensive and production-ready
  • Chain-of-thought approach before summarization is excellent practice

Technical Patterns

Strengths:

  • Background update strategy with _should_init_memory() and _should_update_memory() shows thoughtful engineering
  • Prompt caching explanation with visual diagrams is exceptionally clear
  • add_cache_control() properly structures messages for cache hits
  • Token estimation and reduction tracking throughout
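
To make the cache-structuring point concrete, the sketch below is a plain-dict variant of the notebook's `add_cache_control()` helper, rewritten so it runs without the `anthropic` package. The sample `history` is invented for illustration; the real helper returns `MessageParam` objects.

```python
def add_cache_control(messages: list) -> list:
    """Return messages in list-of-blocks form, with a cache breakpoint on the last user message."""
    # Index of the last user message, or None if there is none.
    last_user_idx = max(
        (i for i, msg in enumerate(messages) if msg["role"] == "user"), default=None
    )
    cached = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        # Accept either a plain string or an existing list of text blocks.
        text = content if isinstance(content, str) else content[0]["text"]
        block = {"type": "text", "text": text}
        if i == last_user_idx:
            block["cache_control"] = {"type": "ephemeral"}
        cached.append({"role": msg["role"], "content": [block]})
    return cached


# Hypothetical conversation history, for illustration only.
history = [
    {"role": "user", "content": "Outline a mystery story."},
    {"role": "assistant", "content": "Here is an outline..."},
    {"role": "user", "content": "Draft chapter one."},
]
for message in add_cache_control(history):
    print(message["role"], "cache_control" in message["content"][0])
```

Only the final user message carries the breakpoint, so everything up to and including it forms the cacheable prefix.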

Suggestions:

  • Consider adding a brief note about caching earlier (currently explained only in Section 3)
  • The helper functions use caching before it's explained to readers

Security

  • No security issues identified
  • Proper use of environment variables for API key (once hardcoded parameter is removed)
  • No sensitive data exposed in notebook outputs

Project Compliance

Strengths:

  • ✓ Registry entry properly formatted with correct categories
  • ✓ Author information complete in authors.yaml
  • ✓ Model version uses current Sonnet 4.5
  • ✓ Notebook outputs preserved for demonstration
  • ✓ No .env files committed

Issues:

  • API key handling pattern needs update (remove explicit parameter)
  • Pip install format needs %%capture magic

Positive Notes

This cookbook demonstrates advanced patterns that will genuinely help developers build production-ready conversational AI. The instant compaction with background threading is a sophisticated pattern rarely documented elsewhere. The progressive complexity and real-world example make this highly valuable.

The technical implementation is sound - the threading, caching strategy, and session memory prompt are all production-ready. Once the handful of formatting and style issues are addressed, this will be an excellent addition to the cookbook collection.

Specific highlights:

  • The visual comparison of traditional vs instant compaction (cells 12 & 23) is brilliant pedagogy
  • The detailed explanation of cache structure and prefix reuse will save readers hours of trial and error
  • The SESSION_MEMORY_PROMPT is well-crafted and reusable
  • Code comments like "note that when this is triggered, the compaction has already been created" show attention to detail

- Remove hardcoded API key, use Anthropic() with env auto-detection
- Fix pip install to use %%capture magic per project standards
- Add return type annotation to remove_thinking_blocks function
- Remove redundant 'What You'll Learn' section (duplicates Learning Objectives)
- Clean up duplicate cells from previous edits

:house: Remote-Dev: homespace

Claude-Generated-By: Claude Code (cli/claude-opus-4-5=100%)
Claude-Steers: 1
Claude-Permission-Prompts: 13
Claude-Escapes: 0
@github-actions

Notebook Changes

This PR modifies the following notebooks:

📓 misc/session_memory_compaction.ipynb

View diff
nbdiff /dev/null misc/session_memory_compaction.ipynb (d09f2ffcc518f4898ac077188638b6a84e65e164)
--- /dev/null  2026-01-16 20:50:20.676940
+++ misc/session_memory_compaction.ipynb (d09f2ffcc518f4898ac077188638b6a84e65e164)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Session Memory Compaction
+      
+      Long-running conversations with Claude can exceed context limits, causing loss of important information. Whether you're building a coding assistant, creative writing tool, or customer service agent, managing session memory is critical for maintaining continuity and quality.
+      
+      This cookbook teaches you how to **proactively manage session memory** to avoid jarring context limit interruptions. Unlike reactive approaches that wait until the context is full, you'll learn to build session memory in the background so compaction is instant when needed.
+      
+      **Related:** For automatic SDK-based compaction in agentic workflows, see [Automatic Context Compaction](../tool_use/automatic-context-compaction.ipynb). This cookbook focuses on manual control patterns for conversational applications.
+      
+      ## Learning Objectives
+      
+      By the end of this cookbook, you will be able to:
+      
+      - Write effective session memory prompts that preserve critical context across compaction events
+      - Implement **instant compaction** using background threading to eliminate user wait time
+      - Apply prompt caching to reduce the cost of background memory updates by ~80%
+      - Choose appropriate compaction strategies (traditional vs. instant) based on your use case
+  markdown cell:
+    source:
+      ## Prerequisites and Setup
+      
+      Before following this guide, ensure you have:
+      
+      **Required Knowledge**
+      - Basic understanding of Claude API usage and message formatting
+      - Familiarity with Python threading concepts (helpful but not required)
+      
+      **Required Tools**
+      - Python 3.10 or higher
+      - Anthropic API key
+      - Anthropic SDK
+      
+      ### Installation
+      
+      First, install the required dependencies:
+  code cell:
+    source:
+      %%capture
+      %pip install -U anthropic python-dotenv
+  code cell:
+    source:
+      import anthropic
+      from anthropic.types import MessageParam, TextBlockParam
+      from dotenv import load_dotenv
+      
+      load_dotenv()
+      
+      client = anthropic.Anthropic()
+      MODEL = "claude-sonnet-4-5-20250929"
+  code cell:
+    source:
+      def truncate_response(text: str, max_lines: int = 15) -> str:
+          """Truncate long responses for cleaner output display."""
+          lines = text.strip().split("\n")
+          if len(lines) <= max_lines:
+              return text
+          return "\n".join(lines[:max_lines]) + f"\n... ({len(lines) - max_lines} more lines)"
+      
+      
+      def remove_thinking_blocks(text: str) -> tuple[str, str]:
+          """Remove <think>...</think> blocks; return (cleaned_text, removed_text)."""
+          import re
+      
+          matches = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
+          cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+          return cleaned, "".join(matches)
+      
+      
+      def add_cache_control(messages: list[dict]) -> list[MessageParam]:
+          """Add cache_control to the last user message for prompt caching.
+      
+          For prompt caching to work, the message prefix structure must be identical between requests.
+          All messages are converted to list format for consistency, and cache_control is placed on
+          the last user message to match the standard API call pattern.
+          """
+          cached_messages: list[MessageParam] = []
+          last_user_idx = None
+      
+          # Find last user message index
+          for i, msg in enumerate(messages):
+              if msg["role"] == "user":
+                  last_user_idx = i
+      
+          for i, msg in enumerate(messages):
+              content = msg["content"]
+              text = content if isinstance(content, str) else content[0]["text"]
+      
+              content_block: TextBlockParam = {"type": "text", "text": text}
+              if i == last_user_idx:
+                  content_block["cache_control"] = {"type": "ephemeral"}
+      
+              cached_messages.append({"role": msg["role"], "content": [content_block]})
+      
+          return cached_messages
+      
+      
+      def estimate_tokens(text: str) -> int:
+          """Rudimentary token estimation: 1 token per 4 characters."""
+          return len(text) // 4
+  code cell:
+    source:
+      SESSION_MEMORY_PROMPT = """
+      Compress the conversation into a structured summary
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's
+      ability to continue working, not human readability.
+      
+      <analysis-instructions>
+      Before generating your summary, analyze the transcript in <think>...</think> tags:
+      1. What did the user originally request? (Exact phrasing)
+      2. What actions succeeded? What failed and why?
+      3. Did the user correct or redirect the assistant at any point?
+      4. What was actively being worked on at the end?
+      5. What tasks remain incomplete or pending?
+      6. What specific details (IDs, paths, values, names) must survive compression?
+      </analysis-instructions>
+      
+      <summary-format>
+      ## User Intent
+      The user's original request and any refinements. Use direct quotes for key requirements.
+      If the user's goal evolved during the conversation, capture that progression.
+      
+      ## Completed Work
+      Actions successfully performed. Be specific:
+      - What was created, modified, or deleted
+      - Exact identifiers (file paths, record IDs, URLs, names)
+      - Specific values, configurations, or settings applied
+      
+      ## Errors & Corrections
+      - Problems encountered and how they were resolved
+      - Approaches that failed (so they aren't retried)
+      - User corrections: "don't do X", "actually I meant Y", "that's wrong because..."
+      Capture corrections verbatim—these represent learned preferences.
+      
+      ## Active Work
+      What was in progress when the session ended. Include:
+      - The specific task being performed
+      - Direct quotes showing exactly where work left off
+      - Any partial results or intermediate state
+      
+      ## Pending Tasks
+      Remaining items the user requested that haven't been started.
+      Distinguish between "explicitly requested" and "implied/assumed."
+      
+      ## Key References
+      Important details needed to continue:
+      - Identifiers: IDs, paths, URLs, names, keys
+      - Values: numbers, dates, configurations, credentials (redacted)
+      - Context: relevant background information, constraints, preferences
+      - Citations: sources referenced during the conversation
+      </summary-format>
+      
+      <preserve-rules>
+      Always preserve when present:
+      - Exact identifiers (IDs, paths, URLs, keys, names)
+      - Error messages verbatim
+      - User corrections and negative feedback
+      - Specific values, formulas, or configurations
+      - Technical constraints or requirements discovered
+      - The precise state of any in-progress work
+      </preserve-rules>
+      
+      <compression-rules>
+      - Weight recent messages more heavily—the end of the transcript is the active context
+      - Omit pleasantries, acknowledgments, and filler ("Sure!", "Great question")
+      - Omit system context that will be re-injected separately
+      - Keep each section under 500 words; condense older content to make room for recent
+      - If you must cut details, preserve: user corrections > errors > active work > completed work
+      </compression-rules>
+      """
+  markdown cell:
+    source:
+      ### Code example using traditional compaction
+      In traditional compaction, you generate a single summary once the token threshold is reached.
+      This is slow: the summary is generated on the request path, so the user waits for it at the moment the context limit is hit.
+  markdown cell:
+    source:
+      
+      ```
+      TRADITIONAL COMPACTION (slow)
+      ─────────────────────────────
+      Turn 1 → Turn 2 → Turn 3 → ... → Turn N → CONTEXT FULL!
+                                                   │
+                                                   ▼
+                                          ┌─────────────────┐
+                                          │ Generate summary│
+                                          │ ( USER WAITS !) │
+                                          └─────────────────┘
+                                                   │
+                                                   ▼
+                                               Continue
+      
+      ```
+  code cell:
+    source:
+      import time
+      
+      
+      class TraditionalCompactingChatSession:
+          """Traditional chat session with compaction after the fact."""
+      
+          def __init__(self, system_message="You are a helpful assistant", context_limit: int = 10000):
+              self.system_message = system_message
+              self.context_limit = context_limit  # the point at which the conversation is compacted so it does not exceed model limits.
+              self.messages = []
+              self.current_context_window_tokens = 0
+              self.summary = None
+      
+          def chat(self, user_message: str):
+              # In traditional compaction, we check if we need to compact when the user sends a message. NOT IDEAL!
+              if self.current_context_window_tokens >= self.context_limit:
+                  print(
+                      f"\n🧹 Context window at {self.current_context_window_tokens} tokens. Limit exceeded, compacting session memory..."
+                  )
+                  self.compact()  # compacts everything before the new user message
+      
+              self.messages.append({"role": "user", "content": user_message})
+              print(f"\nUser: {user_message}")
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=self.system_message,
+                  messages=add_cache_control(self.messages),
+              )
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              print(f"\nAssistant: \n{truncate_response(assistant_message, max_lines=15)}")
+      
+              # approximate current token count in the conversation before the next user message
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_read
+              self.current_context_window_tokens = total_input + response.usage.output_tokens
+      
+              print(
+                  f"Input={total_input:,}, Prompt cached used= {cache_read > 0} | "
+                  f"Output={response.usage.output_tokens:,} | "
+                  f"Messages={len(self.messages)}"
+              )
+              return assistant_message, response.usage
+      
+          def compact(self):
+              start_time = time.perf_counter()
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,  # Same as main chat for cache sharing
+                  messages=add_cache_control(self.messages)
+                  + [{"role": "user", "content": SESSION_MEMORY_PROMPT}],
+              )
+              elapsed = time.perf_counter() - start_time
+      
+              # Generate new summary message
+              self.summary, removed_text = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              approximate_summary_tokens = response.usage.output_tokens - round(
+                  len(removed_text) / 4
+              )  # rough estimate of tokens removed from summary
+      
+              # Replace prior messages with new summary message
+              self.messages = [
+                  {
+                      "role": "user",
+                      "content": f"""This session is being continued from a previous conversation. Here is the session memory: {self.summary}. Continue from where we left off.""",
+                  }
+              ]
+      
+              # Show token reduction if we just compacted
+              reduction = self.current_context_window_tokens - approximate_summary_tokens
+              pct = (reduction / self.current_context_window_tokens) * 100
+      
+              print(f"\n{'-' * 60}")
+              print("📝 New session memory created.")
+              print(
+                  f"✅ Tokens reduced: {self.current_context_window_tokens:,} → {approximate_summary_tokens:.0f} ({reduction:,} tokens saved, {pct:.0f}% reduction)"
+              )
+              print(f"⏱️ Compaction time: {elapsed:.2f}s (user waiting...)")
+              print(f" Cache used: {(getattr(response.usage, 'cache_read_input_tokens', 0) or 0) > 0}")
+              print(f"{'-' * 60}")
+      
+              # Update token count to reflect compacted state
+              self.current_context_window_tokens = approximate_summary_tokens
+  markdown cell:
+    source:
+      Below we simulate a conversation between an author and an LLM that helps write stories.
+  code cell:
+    source:
+      SYSTEM_PROMPT = """
+      You are a short story writer who helps authors develop their ideas into compelling narratives.
+      
+      ## What You Do
+      
+      **Plot Development**
+      - Help authors work through story structure, pacing, and narrative arc
+      - Identify plot holes, inconsistencies, or missed opportunities
+      - Suggest ways to raise stakes, add tension, or deepen conflict
+      - Brainstorm twists, resolutions, and scene transitions
+      
+      **Character Development**
+      - Develop backstories, motivations, and internal conflicts
+      - Ensure characters have distinct voices and consistent behavior
+      - Explore character relationships and how they drive the plot
+      - Help authors understand what their characters want vs. what they need
+      
+      **Drafting**
+      - Write short stories or scenes based on the author's ideas and direction
+      - Match tone, genre conventions, and stylistic preferences
+      - Show rather than tell when bringing scenes to life
+      - Craft dialogue that reveals character and advances plot
+      
+      ## How You Work
+      - You are the lead writer. When you disagree with a creative choice, say so respectfully, but ultimately defer to what the author wants.
+      - DO NOT ask the user to provide more context or clarify their request. Assume you have enough information to proceed.
+      """
+  code cell:
+    source:
+      session = TraditionalCompactingChatSession(system_message=SYSTEM_PROMPT)
+      
+      # Simulated conversation
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.",
+          "Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?",
+      ]
+      
+      print("Starting conversation...\n")
+      
+      for turn_count, message in enumerate(messages, 1):
+          print(f"==============================================\nTurn {turn_count}:\n")
+          response, usage = session.chat(message)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation...
+          
+          ==============================================
+          Turn 1:
+          
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Mystery Plot Ideas
+          
+          ## 1. **The Drowning Season**
+          
+          **Setup:** Twenty-three-year-old Detective Maya Reeves returns to her hometown of Millbrook after five years away, assigned to investigate what locals insist is just another tragic drowning at the quarry—the third one this summer. The sheriff's department is understaffed, and Maya's the only one taking it seriously.
+          
+          **The Mystery:** Each victim was found at dawn, fully clothed, with no water in their lungs. They all have the same strange detail: fresh soil under their fingernails, though the quarry is surrounded by rock. Maya discovers all three victims had recently started remembering "lost time" from their childhood—gaps of hours or days they couldn't explain. The town has a secret: twenty years ago, a local therapist ran a controversial "memory recovery" program for troubled kids. Maya was one of those kids, and she's starting to remember why she really left town.
+          
+          **What Makes It Work:** The murders are connected to buried trauma, both literal and psychological. The killer believes they're saving the victims from remembering something terrible. Maya must solve the case while confronting her own suppressed memories, never sure if what she's remembering is real or if she might be the next target.
+          
+          ---
+          
+          ## 2. **The Lighthouse Keeper's Daughter**
+          
+          **Setup:** Rookie detective James Park is sent to the coastal town of Beacon's Rest to investigate the disappearance of fifteen-year-old Clare Whitmore, daughter of the reclusive lighthouse keeper. She vanished during the annual Founder's Day festival three days ago. The town is cooperative but oddly unsurprised—seven girls have disappeared from Beacon's Rest over the past forty years, always during Founder's Day, always at age fifteen, and always without a trace.
+          ... (18 more lines)
+          Input=318, Prompt cached used= False | Output=988 | Messages=2
+          ==============================================
+          Turn 2:
+          
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # The Cartographer's Cipher
+          
+          **Setup:** Detective Nora Alike, 24, takes her first solo case in the dying mining town of Shepherdstown (population 847 and falling). An elderly cartographer named Eugene Fisk was found dead in his cluttered workshop, officially ruled a heart attack. But his daughter insists something's wrong: her father spent his final month frantically updating a massive hand-drawn map of the town—adding buildings that don't exist, removing ones that do, and marking dozens of red X's in seemingly random locations.
+          
+          **The Twist:** Nora starts investigating the X's out of curiosity. The first one leads to an old park bench—underneath it, she finds a coffee can containing $2,300 in cash and a note: "For the Hendersons' mortgage, June 1987." The second X marks a spot behind the defunct movie theater where she uncovers an envelope with photographs proving the former mayor's affair (the scandal that ended his career). The third X reveals a time capsule with a child's confession to starting the fire that burned down the elementary school forty years ago.
+          
+          **The Real Mystery:** Eugene wasn't murdered—but he wasn't mapping the town as it is. He was mapping every secret buried within it, physical and metaphorical. For sixty years, he'd been Shepherdstown's unofficial confessor. People would visit his workshop and tell him things they couldn't tell anyone else, and he'd mark them on his map in an elaborate code only he understood. In his final month, knowing he was dying, he decoded everything—creating a map that, if made public, would destroy half the town.
+          
+          **The Investigation:** Nora realizes someone is also following the map. Items are being dug up, locations disturbed. She races to decipher Eugene's system before this unknown person does. But the deeper she digs, the more she uncovers: covered-up deaths, stolen inheritances, a hit-and-run from 1992, evidence of who really embezzled the union funds that closed the mine. Every secret connects to another. The town isn't dying by accident—it's been poisoned from within by decades of buried truth.
+          
+          **The Impossible Choice:** The person following the map is Eugene's daughter, who wants to expose everything and burn the town's lies to the ground. She blames these secrets—and the people who keep them—for turning Shepherdstown into a hollowed-out shell. Nora must decide: help her complete her father's final work and reveal every terrible truth, or destroy the map and let sleeping sins lie. The case becomes about whether a detective's job is to uncover the truth or to protect the living from what the truth will do to them.
+          
+          **What Makes It Unique:** 
+          - No traditional villain—just a town full of people who made choices
+          - The "victim" weaponized his own death
+          ... (5 more lines)
+          Input=1,328, Prompt cached used= False | Output=688 | Messages=4
+          ==============================================
+          Turn 3:
+          
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # Nora Alike: Character Deep Dive
+          
+          ## Background
+          
+          **Age:** 24, but people consistently guess younger—which she hates. She's been fighting to be taken seriously her entire life.
+          
+          **Origin:** Grew up in a suburb of a mid-sized city, the youngest of four siblings by seven years (she was the "accident baby"). Her parents were older, tired, and emotionally checked out by the time she came along. Her siblings had already left home, so she essentially raised herself in a house that felt like a museum to other people's childhoods.
+          
+          **The Formative Incident:** When Nora was sixteen, her oldest brother Martin disappeared. Just stopped coming to family dinners, didn't return calls. Her parents were worried but passive—"He's an adult, he'll reach out when he's ready." After three weeks of everyone just *waiting*, Nora took the bus to his apartment herself. She found him in the middle of a breakdown, his apartment filthy, convinced he'd ruined his life after losing his job. Her parents had known something was wrong but didn't want to pry, didn't want to intrude. Their politeness, their respect for privacy, almost killed him.
+          
+          That's when Nora learned: sometimes the most destructive thing you can do is mind your own business.
+          
+          ## Why She Became a Detective
+          
+          **Surface Reason:** She tells people she's interested in justice, in puzzles, in helping people. Standard cop interview answers.
+          ... (44 more lines)
+          Input=2,039, Prompt cached used= False | Output=1,350 | Messages=6
+          ==============================================
+          Turn 4:
+          
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # The Cartographer's Cipher: Detailed Outline
+          
+          ## ACT ONE: ARRIVAL AND DISCOVERY
+          
+          ### Chapter 1: The Smallest Case
+          - **Setting:** Nora arrives in Shepherdstown on a gray October morning. Description of the town—half the storefronts empty, population aging, mines closed for twenty years.
+          - **Key Event:** She meets Sheriff Dale Kovach (58, gruff, counting days to retirement). He makes it clear this is a nothing case—old man had a heart attack, daughter is in denial, just close it out.
+          - **Character Moment:** Nora's crappy motel room. She can't sleep. Lies awake analyzing the case file that shouldn't be a case.
+          - **Setup:** Brief flashback to why she's here—her last case, pushing too hard, burning bridges.
+          
+          ### Chapter 2: The Workshop
+          - **Setting:** Eugene Fisk's workshop—a converted garage behind his house, packed with decades of maps, surveying equipment, and obsessive documentation.
+          - **Key Event:** Nora meets Eugene's daughter, **Caroline Fisk** (44, librarian, wire-thin with grief and rage). Caroline shows her the map—a massive 8x10 foot rendering of Shepherdstown with bizarre alterations.
+          - **The Map:** Buildings that don't exist (a church that burned down in 1963, a restaurant that was never built). Streets rerouted. And 63 red X's scattered across the town.
+          - **Caroline's Plea:** "My father didn't have heart attacks. He had *purpose*. Someone scared him to death."
+          ... (113 more lines)
+          Input=3,411, Prompt cached used= False | Output=3,500 | Messages=8
+          ==============================================
+          Turn 5:
+          
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # Chapter 1: The Smallest Case
+          
+          The thing about dying towns is they all smell the same—wet cardboard and regret.
+          
+          Nora Alike noticed it the moment she stepped out of her Honda Civic into the gray October drizzle. Shepherdstown, Pennsylvania stretched out before her like a body someone had forgotten to bury. Half the storefronts on Main Street were empty, their windows either papered over with faded FOR LEASE signs or simply left naked to expose the gutted interiors. A diner called Mabel's still had its lights on, and a Dollar General looked operational, but that was about it for signs of life.
+          
+          Population 847, the sign at the town limits had declared. Nora suspected that number was aspirational.
+          
+          She locked her car—force of habit, though she doubted anyone here had the energy for theft—and crossed the street toward the sheriff's office. The building was brick, squat, probably built in the sixties when the coal mine was still running and the town still had a future. A rusted flagpole stood out front, the rope slapping against the metal in the wind like a half-hearted call for help.
+          
+          *This is what you get*, Nora told herself. *This is what happens when you push too hard.*
+          
+          Three months ago, she'd been working property crimes in Harrisburg. Not glamorous, but it was something. Then she'd caught a case involving a city councilman's nephew and a string of car thefts. Everyone told her to go easy, that the kid was "going through something," that his uncle was "a good man who'd served the community for twenty years." Nora had arrested him anyway, because the evidence was there and because she couldn't understand why everyone wanted her to look the other way.
+          
+          The charges stuck. The councilman made calls. Nora's lieutenant made it clear that while he couldn't fire her, he could make her life miserable. Two weeks later, she'd accepted a "transfer opportunity" to Shepherdstown, where the entire sheriff's department consisted of three people and the most exciting case in recent memory was someone stealing catalytic converters from the Dollar General parking lot.
+          ... (98 more lines)
+          Input=6,947, Prompt cached used= False | Output=2,670 | Messages=10
+          ==============================================
+          Turn 6:
+          
+          
+          User: Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?
+          
+          Assistant: 
+          # Chapter 2: The Workshop
+          
+          Caroline Fisk's house was a narrow two-story Victorian on Elm Street, painted a faded yellow that had probably been cheerful once. The detached garage sat thirty feet behind it, connected by a flagstone path that cut through what had been a carefully tended garden. Past tense, because the garden was dying now—rosebushes gone leggy and wild, perennials choked with weeds, a birdbath tipped over on its side.
+          
+          Nora parked on the street and walked up the front path. Before she could knock, the door opened.
+          
+          Caroline Fisk was forty-four but looked older, the way grief ages people in fast-forward. Thin to the point of frailty, with graying brown hair pulled back in a hasty ponytail. She wore jeans and an oversized cardigan that might have been her father's. Her eyes were red-rimmed but sharp, evaluating Nora with the focused intensity of someone who'd cried herself out and moved on to anger.
+          
+          "You're the detective." Not a question.
+          
+          "Detective Alike. I'm sorry for your loss, Ms. Fisk."
+          
+          "Are you?" Caroline stepped aside to let Nora in. "Or are you here to tell me I'm a hysterical woman who can't accept that her father died of natural causes?"
+          
+          "I'm here to listen."
+          ... (161 more lines)
+          Input=9,641, Prompt cached used= True | Output=3,500 | Messages=12
+  markdown cell:
+    source:
+      This is a long conversation with several turns. One thing to notice:
+      
+      **Prompt caching:** By turn 6, the input had grown large enough for prompt caching to kick in (`Prompt cached used= True`). Caching reduces both cost and latency as conversations grow.
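For reference, a cache breakpoint on the Anthropic Messages API is expressed as a `cache_control` field on a content block. The sketch below is a hypothetical stand-in, not the notebook's own `add_cache_control` helper (which is defined earlier and may differ); the name `with_cache_breakpoint` is illustrative only.

```python
import copy


def with_cache_breakpoint(messages):
    """Return a copy of `messages` with a cache breakpoint on the last message.

    Hypothetical helper: assumes each message is a dict with "role" and
    "content", where content is either a string or a list of content blocks.
    """
    if not messages:
        return []
    marked = copy.deepcopy(messages)  # never mutate the caller's history
    last = marked[-1]
    if isinstance(last["content"], str):
        # Convert plain-string content to block form so it can carry the marker
        last["content"] = [{"type": "text", "text": last["content"]}]
    # Mark the final block; the prefix up to this point becomes cacheable
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked
```

Working on a copy keeps the stored history free of cache markers, so the breakpoint can move forward on every turn without leaving stale markers on older messages.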
+  markdown cell:
+    source:
+      On the next turn, we are going to hit our 10K context window limit, which triggers compaction:
+  code cell:
+    source:
+      response, usage = session.chat("Propose a title for the book")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          🧹 Context window at 13141 tokens. Limit exceeded, compacting session memory...
+          
+          ------------------------------------------------------------
+          📝 New session memory created.
+          ✅ Tokens reduced: 13,141 → 1559 (11,582 tokens saved, 88% reduction)
+          ⏱️ Compaction time: 36.13s (user waiting...)
+           Cache used: True
+          ------------------------------------------------------------
+          
+          User: Propose a title for the book
+          
+          Assistant: 
+          Looking at the story we've developed, I'd propose:
+          
+          **"The Cartographer's Confession"**
+          
+          Here's why this works:
+          
+          **Thematic Resonance:**
+          - The double meaning captures Eugene's dual role: he kept confessions *and* his final map is itself a confession
+          - "Cartographer" immediately signals the unique hook of your premise
+          - "Confession" ties to the central tension between exposure and privacy
+          
+          **Alternative Titles to Consider:**
+          
+          1. **"Burial Ground"** - More commercial, emphasizes the literal buried evidence and metaphorical buried truths
+          
+          ... (13 more lines)
+          Input=1,840, Prompt cached used= False | Output=325 | Messages=3
+  markdown cell:
+    source:
+      Compaction took **over 36 seconds**. With traditional compaction, the user waits that entire time while Claude summarizes the conversation, which is not an ideal user experience.
+      
+      Below you can see the result of the compaction. It captures the key elements of the conversation in fewer than 2K tokens.
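The savings figures in the compaction log follow directly from the before and after token counts. A minimal sketch of that arithmetic, using the numbers printed above (`reduction_stats` is a hypothetical name, not a function from the notebook):

```python
def reduction_stats(tokens_before: int, tokens_after: int) -> tuple[int, float]:
    """Return (tokens saved, percent reduction) for a compaction event."""
    saved = tokens_before - tokens_after
    percent = 100 * saved / tokens_before
    return saved, percent


saved, percent = reduction_stats(13_141, 1_559)
print(f"{saved:,} tokens saved, {percent:.0f}% reduction")
```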
+  code cell:
+    source:
+      print(session.summary)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ## User Intent
+          User requested: "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider."
+          
+          After rejecting initial 3 plots, user specified: "I don't like those ideas, can you think of one plot something more unique and unexpected?"
+          
+          Accepted "The Cartographer's Cipher" concept. Then requested: character development, detailed outline, and chapter drafts.
+          
+          ## Completed Work
+          
+          **Story Concept Developed:**
+          - Title: "The Cartographer's Cipher"
+          - Premise: Detective investigates death of cartographer who spent final month decoding 40 years of town secrets onto a map with 63 red X's marking buried physical evidence
+          
+          **Character: Detective Nora Alike**
+          - Age: 24, physically small (5'4"), socially awkward, insomniac
+          - Backstory: Youngest of 4 siblings by 7 years, essentially raised herself. At 16, "rescued" brother Martin from breakdown—he resents the intrusion
+          - Transferred to Shepherdstown from Harrisburg Property Crimes after arresting city councilman's nephew, making powerful enemies
+          - Fatal flaw: Believes exposure always equals healing; pathological need to know; invasive about others' secrets, intensely private about her own
+          - Character arc: Must learn some truths cause more harm than good
+          
+          **Setting: Shepherdstown, PA**
+          - Population: 847 (declining)
+          - Coal mine closed 1998, town dying since
+          - Key locations: Sheriff's office, Fisk house/workshop, abandoned mine, Motor Lodge
+          
+          **Supporting Characters:**
+          - Eugene Fisk: 79, cartographer, died 4 days before story opens. Stage 4 pancreatic cancer. Spent 40 years as town's "confessor"
+          - Caroline Fisk: 44, librarian, Eugene's daughter, wants to expose all secrets
+          - Sheriff Dale Kovach: 58, wants case closed quickly
+          - Deputy Marcus Webb: 31, local, volunteers to help (his father's secret on the map)
+          - Helen Morrison: 73, visited Eugene asking him to "mark" something, he refused
+          
+          **17-Chapter Outline Created:**
+          - Act 1 (Chapters 1-5): Nora arrives, discovers map, investigates first X's revealing buried secrets (Hendersons' mortgage, mayor's affair, fire confession, etc.)
+          - Act 2 (Chapters 6-11): Someone else following map, threatens Nora. Interviews living victims who beg her not to expose secrets. Discovers mine embezzlement cover-up
+          - Act 3 (Chapters 12-17): Town meeting confrontation, Nora discovers Mayor Ortiz's father covered up mine safety violations. Phone call from brother Martin reveals he resents her "rescue." Nora compromises: exposes mine cover-up (affects everyone), buries personal secrets. Arrests Mayor and father. Ambiguous ending about whether truth serves justice
+          
+          **Key Plot Points:**
+          - Mine closed due to covered-up safety violations, not economics
+          - Vernon Pike (union treasurer) embezzled funds as scapegoat at Donald Mercer's direction
+          - Donald Mercer is Mayor Linda Ortiz's father
+          - Eugene's final journal: "I've mapped every lie, every buried truth... Maybe the only cure is exposure. Or maybe exposure is just another kind of death."
+          - Each X marks physical evidence of a secret (cash, photos, confessions, documents)
+          
+          **Chapters Drafted:**
+          - Chapter 1 (~2,000 words): Nora arrives Shepherdstown, meets Sheriff Kovach who dismisses case, assigned to talk to Caroline Fisk
+          - Chapter 2 (continuation): Nora visits Caroline, sees workshop and massive incorrect map with 63 red X's. Caroline explains Eugene's paranoid final month. Journal entry reveals "Morrison girl" visit. Coroner confirms extremely elevated stress hormones. Chapter ends with Nora heading to investigate old mine location with survey map
+          
+          ## Errors & Corrections
+          
+          User rejected first 3 plot concepts as not unique/unexpected enough:
+          1. "The Drowning Season" (memory recovery therapy murders)
+          2. "The Lighthouse Keeper's Daughter" (ritual sacrifices every 5-6 years)
+          3. "The Memory Box Murders" (classmates hunting each other over past crime)
+          
+          User directive: "can you think of one plot something more unique and unexpected?" Led to cartographer concept.
+          
+          ## Active Work
+          
+          Chapter 2 just completed. Ends with:
+          "She headed for the door, the survey map folded in her pocket and Eugene Fisk's final journal entry echoing in her mind: *Maybe the only cure is exposure. Or maybe exposure is just another kind of death.*
+          
+          Outside, the clouds had thickened again, pressing down on Shepherdstown like a shroud. Somewhere in this dying town, someone had scared an old man to death."
+          
+          Nora is heading to investigate the old mine (northern edge of town, surrounded by woods, one overgrown access road). She has Eugene's survey map showing cluster of X's around mine area, dates back to 1998.
+          
+          ## Pending Tasks
+          
+          No explicit requests pending. Story development ongoing—presumably more chapters to draft following the 17-chapter outline structure.
+          
+          ## Key References
+          
+          **Timeline:**
+          - 1998: Mine closes (covered-up safety violations)
+          - 2003: Martha Fisk (Eugene's wife) dies of cancer
+          - 5 weeks before present: Eugene starts creating "corrected" map
+          - 2 weeks before present: Caroline finds Eugene shaking, says "should have left them buried"
+          - 1 week before present: Helen Morrison visits, Eugene refuses to mark something
+          - 4 days before present: Eugene found dead with extremely elevated cortisol levels
+          
+          **Map Details:**
+          - 8ft x 10ft, mounted on foam board with acetate cover
+          - Shows Shepherdstown with deliberate "errors": church on Third & Maple (doesn't exist), Giovanni's restaurant on Main, rerouted streets
+          - 63+ red X's scattered across town
+          - Each X marks buried physical evidence of a secret
+          - Survey map subset focuses on mine area with names/dates/timeline from 1998
+          
+          **Character Relationships:**
+          - Nora/Martin (brother): She "saved" him 8 years ago when he was 23 and having breakdown; he resents the public humiliation; they barely speak now
+          - Eugene/Caroline: She cared for him through cancer; he left her the decoded map knowing she'd find it
+          - Eugene/townspeople: He was unofficial confessor for 40 years; people trusted him to keep secrets safe
+  markdown cell:
+    source:
+      ## Instant Compaction
+      
+      With **instant compaction**, the session memory is generated PROACTIVELY in the background once a soft token threshold is reached.
+      
+      Once the user triggers a compaction or a hard limit is reached, the summary is already available, so the user doesn't need to wait.
+      
+      Result: Instant compaction, no waiting.
+  markdown cell:
+    source:
+      
+      SESSION MEMORY COMPACTION (instant)
+      ```
+      ────────────────────────────────────
+      Turn 1 → Turn 2 → ... → Turn K → Turn K+1 → ... → Turn N → ..  → CONTEXT FULL!
+                                  │                         │            │
+                      (soft token threshold met:        (update          │
+                     initialize session memory)          trigger)        │
+                                  │                                      │
+                                  │                         │            │
+                                  ▼                         ▼            │
+                             ┌────────┐                ┌────────┐        │
+                             │ Create │                │ Update │        │
+                             │ memory │ (background)   │ memory │        │
+                             └────────┘                └────────┘        │
+                                  │                         │            │
+                                  ▼                         ▼            ▼
+                           📝 session-memory.md ──────────────────► INSTANT SWAP!
+                             (continuously updated)
+      ```
+      
+      **Update triggers:** The first summary is generated once the soft token threshold is reached. Updates can then be triggered after every subsequent turn, or periodically at natural breakpoints (e.g. every ~10k tokens or 3+ tool calls).
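The trigger rules above can be sketched as a small pure function. This is a simplified illustration, not the notebook's implementation: `next_memory_action` and its parameter names are hypothetical, and the default thresholds are example values.

```python
def next_memory_action(
    context_tokens: int,
    has_memory: bool,
    tokens_at_last_update: int,
    soft_threshold: int = 7_500,
    min_tokens_between_updates: int = 2_000,
) -> str:
    """Decide what background work to start after a turn.

    Returns "init", "update", or "none".
    """
    if not has_memory:
        # No memory yet: create it once the soft threshold is reached
        return "init" if context_tokens >= soft_threshold else "none"
    # Memory exists: refresh it only after enough new tokens have accumulated
    if context_tokens - tokens_at_last_update >= min_tokens_between_updates:
        return "update"
    return "none"
```

Keeping the decision separate from the threading code makes the thresholds easy to tune and test without starting any background work.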
+  markdown cell:
+    source:
+      This `InstantCompactingChatSession` class uses **threading** for background execution:
+      1. **`threading.Thread`** - runs memory updates in background without blocking
+      2. **Thread-safe state** - uses `threading.Lock` to safely update shared memory
+      3. **Daemon threads** - background work doesn't prevent program exit
+      4. **Instant compaction** - when context is full, just swap in the pre-built memory
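Stripped of API calls, the threading pattern in the list above looks roughly like the sketch below. It is an illustration under stated assumptions, not the class that follows: `BackgroundMemory`, `trigger_update`, `swap_in`, and `fake_summarize` are hypothetical names, and the summarizer is a stub standing in for a model call.

```python
import threading
import time


class BackgroundMemory:
    """Keeps a summary fresh in a daemon thread so reads never block."""

    def __init__(self, summarize):
        self._summarize = summarize      # any callable: list of messages -> str
        self._lock = threading.Lock()    # guards `memory` and `_thread`
        self._thread = None
        self.memory = None

    def trigger_update(self, messages) -> bool:
        """Start a background update unless one is already running."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False
            snapshot = list(messages)    # copy so later appends are not seen
            self._thread = threading.Thread(
                target=self._run, args=(snapshot,), daemon=True
            )
            self._thread.start()
            return True

    def _run(self, snapshot):
        summary = self._summarize(snapshot)  # slow work happens outside the lock
        with self._lock:
            self.memory = summary

    def wait(self, timeout=None):
        """Block until the current update (if any) finishes."""
        thread = self._thread
        if thread is not None:
            thread.join(timeout)

    def swap_in(self):
        """Return the most recent summary immediately (None if not ready)."""
        with self._lock:
            return self.memory


def fake_summarize(messages):
    time.sleep(0.05)  # stand-in for a slow summarization request
    return f"{len(messages)} messages summarized"


memory = BackgroundMemory(fake_summarize)
memory.trigger_update(["turn 1", "turn 2"])  # returns right away
memory.wait()                                # only needed for this demo
print(memory.swap_in())
```

The lock is held only while reading or writing shared state, never during the slow summarization call, so the chat loop is not blocked by an in-flight update.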
+  code cell:
+    source:
+      import threading
+      import time
+      
+      
+      class InstantCompactingChatSession:
+          """
+          Maintains session memory via incremental background updates.
+      
+          Key insight: By updating memory in the background after each turn,
+          the summary is already ready when compaction is needed - instant swap!
+          """
+      
+          def __init__(
+              self,
+              system_message="You are a helpful assistant",
+              context_limit: int = 12000,
+              min_tokens_to_init: int = 7500,
+              min_tokens_between_updates: int = 2000,
+          ):
+              # Thresholds
+              # Hard limit: the conversation is compacted here so it never exceeds model limits
+              self.context_limit = context_limit
+              # Soft threshold: tokens needed to trigger initial memory creation (done PROACTIVELY in the background, unlike traditional compaction)
+              self.min_tokens_to_init = min_tokens_to_init
+              # New tokens required since the last update before memory is refreshed; only applies once the initial memory exists
+              self.min_tokens_between_updates = min_tokens_between_updates
+      
+              # Conversation state
+              self.system_message = system_message
+              self.messages = []
+              self.current_context_window_tokens = 0
+      
+              # Session memory state
+              # Compacted conversation. Held in memory for this demo; in production you would write it to a session_memory.md file
+              self.session_memory = None
+              # Index of the last message included in the session memory
+              self.last_summarized_index = 0
+              # Token count at the last memory update, used to check whether enough new tokens have accumulated to trigger another update
+              self.tokens_at_last_update = 0
+      
+              # Background update tracking
+              self._update_thread: threading.Thread | None = None
+              self.last_update_time = None
+              self._lock = threading.Lock()
+      
+          def chat(self, user_message: str):
+              """Process a chat turn with background session memory updates."""
+      
+              if self.current_context_window_tokens + estimate_tokens(user_message) >= self.context_limit:
+                  self.compact()  # note that when this is triggered, the compaction has already been created and is just swapped in instantly
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=self.system_message,
+                  messages=add_cache_control(self.messages),
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              # Calculate token usage including cache
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_read
+      
+              # Update context window tokens (includes cached tokens since they still count toward context)
+              self.current_context_window_tokens = total_input + response.usage.output_tokens
+      
+              # KEY DIFFERENCE: Trigger background memory update if needed proactively, before compaction is needed
+              background_status = None
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  background_status = "initializing" if self.session_memory is None else "updating"
+      
+              # Return usage info with cache stats
+              return assistant_message, response.usage, background_status
+      
+          # Helper methods to determine when to init session memory
+          def _should_init_memory(self) -> bool:
+              return (
+                  self.session_memory is None
+                  and self.current_context_window_tokens >= self.min_tokens_to_init
+              )
+      
+          # Helper method to determine if memory should be updated
+          def _should_update_memory(self) -> bool:
+              if self.session_memory is None:
+                  return False
+              tokens_since = self.current_context_window_tokens - self.tokens_at_last_update
+              return tokens_since >= self.min_tokens_between_updates
+      
+          # Methods to create initial session memory
+          def _create_session_memory(self, messages: list[dict]) -> str:
+              """Generate initial session memory from messages."""
+              # Put compaction instructions in user message to share cache with main chat
+              compaction_messages = [{"role": "user", "content": SESSION_MEMORY_PROMPT}]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,  # Same as main chat for cache sharing
+                  messages=add_cache_control(messages) + compaction_messages,
+              )
+              summary, _ = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              print(
+                  f"   [Background] Initial session memory created. Cache hit={(getattr(response.usage, 'cache_read_input_tokens', 0) or 0) > 0}"
+              )
+              return summary
+      
+          def _update_session_memory(self, new_messages: list[dict]) -> str:
+              """Update existing session memory with new messages.
+      
+              In practice, you may want to do this via file edit rather than full regeneration; for demo purposes we regenerate the whole memory here.
+              """
+              # Put compaction instructions in user message to share cache with main chat
+              compaction_update_messages = [
+                  {
+                      "role": "user",
+                      "content": SESSION_MEMORY_PROMPT
+                      + f"""\n\nThere is an existing session memory:\n{self.session_memory}\n\nReturn the entire session memory with updates to reflect new messages.""",
+                  }
+              ]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,
+                  messages=new_messages
+                  + compaction_update_messages,  # you may want to use prompt caching instead, in which case you'd use add_cache_control(self.messages) here
+              )
+              updated_summary, _ = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              print("   [Background] Session memory updated.")
+              return updated_summary
+      
+          # Background memory update methods
+          def _background_memory_update(
+              self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
+          ):
+              """Run session memory update in a background thread."""
+              try:
+                  with self._lock:
+                      current_session_memory = self.session_memory
+                      last_index = self.last_summarized_index
+      
+                  if current_session_memory is None:
+                      new_memory = self._create_session_memory(messages_snapshot)
+                  else:
+                      # Get new messages since last summary
+                      new_messages = messages_snapshot[last_index:]
+                      if not new_messages:
+                          return
+                      new_memory = self._update_session_memory(new_messages)
+      
+                  # Update state (thread-safe)
+                  with self._lock:
+                      self.session_memory = new_memory
+                      self.last_summarized_index = snapshot_index
+                      self.tokens_at_last_update = current_tokens
+                      self.last_update_time = time.time()
+      
+              except Exception as e:
+                  print(f"   [Background] Error updating memory: {e}")
+      
+          # This makes sure only one background update runs at a time. If one is already running, we skip starting another. If not, we start a new thread to do the update.
+          def _trigger_background_update(self):
+              """Trigger a background session memory update."""
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  return
+      
+              messages_snapshot = self.messages.copy()
+              snapshot_index = len(messages_snapshot)
+              current_tokens = self.current_context_window_tokens
+      
+              self._update_thread = threading.Thread(
+                  target=self._background_memory_update,
+                  args=(messages_snapshot, snapshot_index, current_tokens),
+                  daemon=True,
+              )
+              self._update_thread.start()
+      
+          # Function to compact
+          def compact(self):
+              """INSTANT compaction using pre-built session memory."""
+              prev_msg_count = len(self.messages)
+      
+              # Ensure session memory is ready. Shouldn't be an issue normally, but here for safety.
+              if self.session_memory is None:
+                  if self._update_thread is not None and self._update_thread.is_alive():
+                      print("   ⏳ Waiting for background memory update...")
+                      self._update_thread.join(timeout=30.0)
+      
+                  if self.session_memory is None:
+                      print("   ⚠️  No pre-built memory, creating synchronously...")
+                      start = time.perf_counter()
+                      self.session_memory = self._create_session_memory(self.messages)
+                      elapsed = time.perf_counter() - start
+                      print(f"   ⏱️  Took {elapsed:.2f}s (but should be instant normally!)")
+                      self.last_summarized_index = len(self.messages)
+      
+              with self._lock:
+                  unsummarized = self.messages[self.last_summarized_index :]
+                  summary_message = [
+                      {
+                          "role": "user",
+                          "content": f"""This session is being continued from a previous conversation. Here is the session memory: {self.session_memory}. Continue from where we left off.""",
+                      }
+                  ]
+                  self.messages = summary_message + unsummarized
+                  self.last_summarized_index = 1
+      
+                  print(f"\n{'=' * 60}")
+                  print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
+                  print("   Session memory was pre-built (no wait time!)")
+                  print(f"{'=' * 60}")
+  markdown cell:
+    source:
+      ### Example use of Instant Compaction
+  code cell:
+    source:
+      # Low thresholds for demo - in production you'd use higher values
+      session = InstantCompactingChatSession(
+          system_message=SYSTEM_PROMPT,
+      )
+      
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.",
+          "Can you draft a second chapter that builds on the first one?"
+          "Can you revise that second chapter, make it more suspenseful and engaging?",
+      ]
+      print("Starting conversation with instant compacting chat session...\n")
+      
+      turn_count = 0
+      for _i, message in enumerate(messages, 1):
+          response, usage, background_status = session.chat(message)
+          turn_count += 1
+      
+          # Calculate cache stats
+          cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+          cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
+          total_input = usage.input_tokens + cache_read
+      
+          print(f"{'=' * 60}")
+          print(f"Turn {turn_count}:")
+          print(f"\nUser: {message}")
+          print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+          print("\nToken Usage:")
+          print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+          print(f"  Output: {usage.output_tokens:,}")
+          print(
+              f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}"
+          )
+      
+          if cache_read > 0:
+              cache_pct = (cache_read / total_input) * 100
+              print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+      
+          if background_status:
+              print(f"\n  [Background] Proactively {background_status} session memory...")
+              print(f"  Context window: {session.current_context_window_tokens:,} tokens")
+      
+          print()
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with instant compacting chat session...
+          
+          ============================================================
+          Turn 1:
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well throught out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Detective Story Concepts
+          
+          ## 1. **The Vanishing Act**
+          ... (30 more lines)
+          
+          Token Usage:
+            Input: 318 (new: 318, cached: 0)
+            Output: 762
+            Messages: 2 | Memory: not yet
+          
+          ============================================================
+          Turn 2:
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # **The Cartographer's Daughter**
+          
+          **Premise:** Nora Fields (23) returns to her dying hometown of Millbrook after her cartographer father's sudden death. The town is literally disappearing—not metaphorically, but *actually*. Buildings that existed last month are gone. Streets lead to nowhere. Residents have conflicting memories about what was where. The population has dropped from 2,000 to 300 in three years, but no one can agree on who left or why.
+          ... (20 more lines)
+          
+          Token Usage:
+            Input: 1,102 (new: 1,102, cached: 0)
+            Output: 636
+            Messages: 4 | Memory: not yet
+          
+          ============================================================
+          Turn 3:
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # Nora Fields - Character Development
+          
+          ## Background
+          ... (67 more lines)
+          
+          Token Usage:
+            Input: 1,760 (new: 1,760, cached: 0)
+            Output: 1,214
+            Messages: 6 | Memory: not yet
+          
+          ============================================================
+          Turn 4:
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # **The Cartographer's Daughter - Story Outline**
+          
+          ## ACT ONE: The Return (Chapters 1-5)
+          ... (175 more lines)
+          
+          Token Usage:
+            Input: 2,996 (new: 2,996, cached: 0)
+            Output: 3,500
+            Messages: 8 | Memory: not yet
+          
+          ============================================================
+          Turn 5:
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # Chapter 1: Homecoming
+          
+          The town looked smaller than Nora remembered, which was exactly what everyone said about their hometowns. What they didn't mention was the peculiar quality of that smallness—not just diminished by time and perspective, but somehow *compressed*, as if Millbrook had been folded in on itself while she was away.
+          ... (168 more lines)
+          
+          Token Usage:
+            Input: 6,532 (new: 6,532, cached: 0)
+            Output: 3,077
+            Messages: 10 | Memory: not yet
+          
+            [Background] Proactively initializing session memory...
+            Context window: 9,609 tokens
+          
+             [Background] Initial session memory created. Cache hit=True
+          ============================================================
+          Turn 6:
+          
+          User: Can you draft a second chapter that builds on the first one?Can you revise that second chapter, make it more suspenseful and engaging?
+          
+          Assistant: 
+          # Chapter 2: The Map That Shouldn't Exist
+          
+          The funeral was smaller than Nora expected, which was saying something—she'd expected it to be small.
+          ... (214 more lines)
+          
+          Token Usage:
+            Input: 9,642 (new: 5,546, cached: 4,096)
+            Output: 3,500
+            Messages: 12 | Memory: ready
+            ✓ Cache hit! 42% of input from cache
+          
+            [Background] Proactively updating session memory...
+            Context window: 13,142 tokens
+          
+      output 1:
+        output_type: stream
+        name: stdout
+        text:
+             [Background] Session memory updated.
+  code cell:
+    source:
+      message = "What did we just talk about? Give me one sentence"
+      response, usage, background_status = session.chat(message)
+      
+      # Calculate cache stats
+      cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+      total_input = usage.input_tokens + cache_read
+      
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print("\nToken Usage:")
+      print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+      print(f"  Output: {usage.output_tokens:,}")
+      print(
+          f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}"
+      )
+      
+      if cache_read > 0:
+          cache_pct = (cache_read / total_input) * 100
+          print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 12 → 1
+             Session memory was pre-built (no wait time!)
+          ============================================================
+          
+          User: What did we just talk about? Give me one sentence
+          
+          Assistant: 
+          We had just finished drafting Chapter 2 (the funeral and evidence discovery), and you requested that I revise it to make it more suspenseful and engaging—which I hadn't completed yet before the conversation ended.
+          
+          Would you like me to provide that revised, more suspenseful version of Chapter 2 now?
+          
+          Token Usage:
+            Input: 2,276 (new: 2,276, cached: 0)
+            Output: 71
+            Messages: 3 | Memory: ready
+  markdown cell:
+    source:
+      You'll notice here that once we hit the context limit, the session memory was instantly swapped in, so the user did not have to wait for a summarization call before getting a response!
+  markdown cell:
+    source:
+      ## Advanced: Understanding Prompt Caching
+  markdown cell:
+    source:
+      
+      The background updates can be made **~80% cheaper** by using prompt caching, because cached input tokens are billed at roughly a tenth of the normal input rate. The trick:
+      1. Pass the **full conversation** to the background summarizer
+      2. Add `cache_control` markers so subsequent requests hit the cache
+      3. Only the new "summarize this" instruction is billed at full price
+      
+      ```
+      ┌─────────────────────────────────────────────────────────────────────────────────┐
+      │                    PROMPT CACHING FOR LONG CONVERSATIONS                        │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  WITHOUT CACHING: Pay full price for entire context every turn                 │
+      │  ════════════════════════════════════════════════════════════                   │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]                         →  500 tokens  @ $3/M │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]           → 1500 tokens  @ $3/M │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3]... → 3000 tokens  @ $3/M │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3]... → 5000 tokens  @ $3/M │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: 10,000 tokens = $0.030      │
+      │                                                                                 │
+      │                                                                                 │
+      │  WITH CACHING: Pay full price once, then 90% discount on prefix                │
+      │  ═══════════════════════════════════════════════════════════════                │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]◆                        →  500 tokens  @ $3/M │
+      │                                ▲                            (cache created)    │
+      │                          cache breakpoint                                       │
+      │                                                                                 │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]◆                                │
+      │           ╰─────── cached ──────╯                                              │
+      │                500 @ $0.30/M + 1000 new @ $3/M  =  $0.0032                     │
+      │                                                                                 │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3][Asst3]◆                  │
+      │           ╰──────────── cached ─────────────╯                                  │
+      │               1500 @ $0.30/M + 1500 new @ $3/M  =  $0.0050                     │
+      │                                                                                 │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3][Asst3][User4][Asst4]◆    │
+      │           ╰───────────────────── cached ─────────────────────╯                 │
+      │                     3000 @ $0.30/M + 2000 new @ $3/M  =  $0.0069               │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: $0.0166  (45% savings)     │
+      │                                                                                 │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  COMPACTION + CACHING: Double benefit                                           │
+      │  ════════════════════════════════════                                           │
+      │                                                                                 │
+      │    Main Chat                      Background Summarizer                         │
+      │    ─────────                      ─────────────────────                         │
+      │                                                                                 │
+      │  [Conversation grows...]          [Same conversation prefix]◆ + [Summarize!]   │
+      │         │                                    │                                  │
+      │         │                         Cache hit! Only pays for                      │
+      │         │                         the summarization prompt                      │
+      │         │                                    │                                  │
+      │         ▼                                    ▼                                  │
+      │  Context limit reached  ──────►  Session memory ready instantly                │
+      │                                  (built cheaply in background)                  │
+      │                                                                                 │
+      │  ┌──────────────────────────────────────────────────────────────────────────┐  │
+      │  │  Key insight: The background summarizer reuses the same conversation     │  │
+      │  │  prefix that was just sent to the main chat - automatic cache hit!       │  │
+      │  └──────────────────────────────────────────────────────────────────────────┘  │
+      │                                                                                 │
+      └─────────────────────────────────────────────────────────────────────────────────┘
+      
+      ◆ = cache_control breakpoint (cache everything before this point)
+      ```
+      
+      ### Why this matters for compaction
+      
+      | Scenario | Cost per background update | Notes |
+      |----------|---------------------------|-------|
+      | No caching | Full input cost | 5,000 tokens × $3/M = $0.015 |
+      | With caching | ~20% of input cost | 500 new + 4,500 cached ≈ $0.003 |
+      | **Savings** | **~80%** | Compounds over many updates |
+      
+      The longer the conversation, the bigger the savings—exactly when you need compaction most!
+  markdown cell:
+    source:
+      ### How the Caching Works
+      
+      The key is in `add_cache_control()` and `_create_session_memory()`:
+      
+      ```python
+      # 1. Mark the last conversation message with cache_control
+      {
+          "role": "user",
+          "content": [{
+              "type": "text",
+              "text": msg["content"],
+              "cache_control": {"type": "ephemeral"}  # <-- This creates a cache breakpoint
+          }]
+      }
+      
+      # 2. Also mark the system prompt
+      system=[{
+          "type": "text",
+          "text": "You are a session memory agent...",
+          "cache_control": {"type": "ephemeral"}
+      }]
+      ```
+      
+      **Why this works:**
+      - The first background update creates a cache entry for `[System + Messages]`
+      - Subsequent updates with the same message prefix get **cache hits**
+      - Only the new summarization instruction is billed at full price
+      - Cache entries have a 5-minute TTL by default, so rapid updates benefit most
+      
+      **Cost math:**
+      - Without caching: 5,000 tokens × $3.00/1M = $0.015 per update
+      - With caching: 500 new tokens × $3.00/1M + 4,500 cached × $0.30/1M = $0.00285
+      - **Savings: ~80%** on background summarization costs
+  markdown cell:
+    source:
+      ## Conclusion
+      
+      In this cookbook, you learned how to manage long-running Claude conversations through session memory compaction.
+      
+      ### What We Covered
+      
+      ✅ **Effective compaction prompts** - Structure your session memory to preserve user intent, completed work, errors, active work, and key references while discarding filler
+      
+      ✅ **Instant compaction** - Use background threading to proactively build session memory, eliminating user wait time when context limits are reached
+      
+      ✅ **Prompt caching for cost savings** - Reduce background update costs by ~80% by reusing the conversation prefix cache
+      
+      ✅ **Traditional vs. instant patterns** - Understand when to use each approach based on your application needs
+      
+      ### Key Takeaways
+      
+      1. **Weight recency heavily** - The end of a conversation is the active working context
+      2. **Preserve user corrections verbatim** - Prevents the model from reverting to old behaviors
+      3. **Build memory proactively** - Don't wait for context limits; start background updates early
+      4. **Leverage prompt caching** - Background summarization can share cache with the main conversation
+      
+      ### Next Steps
+      
+      - **For agentic workflows**: See [Automatic Context Compaction](../tool_use/automatic-context-compaction.ipynb) for SDK-based automatic compaction with tool use
+      - **For production**: Consider persisting session memory to disk rather than keeping it in memory
+      - **For optimization**: Experiment with update frequency thresholds to balance cost vs. freshness

Generated by nbdime

@github-actions github-actions Bot left a comment


PR Review

Recommendation: REQUEST_CHANGES

Summary

This PR adds a comprehensive notebook on session memory compaction techniques demonstrating both traditional and instant compaction patterns with threading and prompt caching. The content is high-quality, well-structured, and pedagogically excellent. However, there are several important technical issues that should be addressed to maintain project code quality standards.

Actionable Feedback (5 items)

Required Changes

  • misc/session_memory_compaction.ipynb (cells 8, 20) - Add return type annotations to all class methods. Examples: chat() should return -> tuple[str, Any], compact() should return -> None, _background_memory_update() should return -> None
  • misc/session_memory_compaction.ipynb (cell 1, Prerequisites section) - Update Python version requirement from Python 3.10 or higher to Python 3.11 or higher to match project standards
  • misc/session_memory_compaction.ipynb (all cells) - Re-run entire notebook top-to-bottom with Run All to ensure sequential execution and verify all cells execute successfully
  • misc/session_memory_compaction.ipynb (cells 11, 22) - Fix typo: throught should be thought in user message
  • misc/session_memory_compaction.ipynb (cell 22) - Fix string concatenation bug: Add comma or space between Can you draft a second chapter and Can you revise that second chapter (currently concatenates without separator)

Detailed Review

Code Quality

Type Annotations (Important)

  • All class methods in both TraditionalCompactingChatSession and InstantCompactingChatSession lack return type annotations
  • This violates project type safety standards emphasized in CLAUDE.md
  • Methods like _should_init_memory() and _create_session_memory() correctly use type hints - extend this pattern to all methods
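
As a rough sketch of what the annotated signatures could look like (a hypothetical skeleton with stubbed bodies, not the notebook's actual class): note that the chat() method of InstantCompactingChatSession shown in the diff returns three values, so the three-element tuple below is an assumption based on that return statement rather than the two-element form suggested above.

```python
from typing import Any, Optional


class AnnotatedSessionSketch:
    """Hypothetical skeleton: signatures only, bodies stubbed out."""

    def chat(self, user_message: str) -> tuple[str, Any, Optional[str]]:
        # Mirrors the three values returned in the diff:
        # assistant text, usage object, optional background status.
        return "", None, None

    def compact(self) -> None:
        # Mutates session state in place, so there is nothing to return.
        return None

    def _background_memory_update(
        self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
    ) -> None:
        # Runs in a worker thread; results are written to instance attributes.
        return None

    def _trigger_background_update(self) -> None:
        # Only starts a thread; the caller does not need a return value.
        return None
```

The usage object is typed as Any here only to keep the sketch self-contained; a more specific SDK type may be preferable if one is available.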

Python Version Compatibility

  • Notebook specifies Python 3.10 or higher but project pyproject.toml requires python >=3.11,<3.13
  • Should align with project standards to avoid confusion

Execution Order

  • Cell execution counts show non-sequential pattern
  • Per CLAUDE.md Key Rule 4: Test that notebooks run top-to-bottom without errors
  • Please verify with clean Run All
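
One possible way to check this automatically, sketched under the assumption that the notebook is loaded as a plain nbformat-4 style dictionary (for example via json.load). The helper name and the sample dictionaries are hypothetical, and empty code cells with no execution count are not handled.

```python
def execution_counts_are_sequential(notebook: dict) -> bool:
    """Return True if code cells are numbered 1, 2, 3, ... in document order."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    return counts == list(range(1, len(counts) + 1))


# Stand-in notebooks for illustration only.
clean = {
    "cells": [
        {"cell_type": "code", "execution_count": 1},
        {"cell_type": "markdown"},
        {"cell_type": "code", "execution_count": 2},
    ]
}
stale = {
    "cells": [
        {"cell_type": "code", "execution_count": 3},
        {"cell_type": "code", "execution_count": 1},
    ]
}

print(execution_counts_are_sequential(clean))
print(execution_counts_are_sequential(stale))
```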

Code Correctness

String Concatenation Bug
In cell 22, two strings are concatenated without separator creating: ...first one?Can you revise... (missing space/separator)
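
A minimal illustration of why this happens, using shortened stand-in strings rather than the notebook's full prompts: Python joins adjacent string literals when the source is parsed, so a missing comma inside a list silently merges two items into one.

```python
# Adjacent string literals are concatenated by the parser, so a missing
# comma merges two list items into one (stand-in strings for illustration).
buggy = [
    "Can you draft a second chapter?"
    "Can you revise that second chapter?",
]
fixed = [
    "Can you draft a second chapter?",
    "Can you revise that second chapter?",
]

print(len(buggy), len(fixed))
print(buggy[0])
```

The buggy list contains a single merged message, which matches the combined user turn visible in the recorded Turn 6 output.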

Typo in Example Data
throught should be thought in the user message about generating plot ideas (appears in cells 11 and 22)

Security

No security issues found:

  • Proper use of dotenv.load_dotenv()
  • No hardcoded API keys
  • Correct use of %%capture for pip installs

Suggestions

  1. Add explanatory text after complex code blocks - Some sections would benefit from post-code explanation reinforcing what was learned
  2. Simplify loop variable naming - In cells 11 and 22, _i is enumerated but never used
  3. Enhance error messages - Background thread exception handler could include exception type and traceback for easier debugging
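
For the third suggestion, a small sketch of what a more informative handler message might look like. The helper name describe_exception is hypothetical, and the simulated ValueError stands in for whatever the API call might raise in the background thread.

```python
import traceback


def describe_exception(exc: BaseException) -> str:
    """Format an exception with its type, message, and traceback."""
    header = f"   [Background] Error updating memory: {type(exc).__name__}: {exc}"
    details = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return f"{header}\n{details}"


try:
    raise ValueError("simulated summarizer failure")  # stand-in for an API error
except Exception as e:
    print(describe_exception(e))
```

In the notebook's handler, the same formatting could replace the current one-line print, assuming the extra output is acceptable in the demo cells.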

Positive Notes

Outstanding Pedagogical Structure

  • Perfect problem-focused learning approach
  • Clear learning objectives upfront
  • Demonstrates anti-pattern first, then best practice

Excellent Visualizations

  • ASCII diagrams comparing traditional vs. instant compaction are exceptionally clear
  • Prompt caching visualization is particularly well-done

Production-Quality Session Memory Prompt

  • Comprehensive preservation rules and compression guidelines
  • Shows deep understanding of context management

Strong Code Organization

  • Clean separation between Traditional and Instant implementations
  • Thread-safe implementation with proper locking
  • Well-commented design decisions

Impressive Performance Results

  • Demonstrates 36s wait time elimination
  • Shows ~80% token reduction
  • Clear cost savings through prompt caching
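
The ~80% figure can be sanity-checked from the notebook's own cost math. The sketch below reuses the prices and token counts quoted there ($3.00 and $0.30 per million input tokens, 5,000 tokens per update with 4,500 of them cached); it ignores cache-write pricing and output tokens, so treat it as a simplified estimate.

```python
# Prices taken from the notebook's cost math (USD per million input tokens).
BASE_PRICE = 3.00
CACHED_PRICE = 0.30


def update_cost(new_tokens: int, cached_tokens: int = 0) -> float:
    """Input cost of one background update, in USD."""
    return (new_tokens * BASE_PRICE + cached_tokens * CACHED_PRICE) / 1_000_000


uncached = update_cost(5_000)       # whole prefix billed at the base rate
cached = update_cost(500, 4_500)    # only the new tokens billed at the base rate
savings_pct = (1 - cached / uncached) * 100

print(f"uncached=${uncached:.5f} cached=${cached:.5f} savings={savings_pct:.0f}%")
```

With these inputs the cached update costs about 19% of the uncached one, which is where the roughly 80% saving comes from.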

Once the required changes are addressed, this will be an excellent addition to the cookbook collection. The instant compaction pattern with threading is valuable and the educational quality is high.

@github-actions

Notebook Changes

This PR modifies the following notebooks:

📓 misc/session_memory_compaction.ipynb

View diff
nbdiff /dev/null misc/session_memory_compaction.ipynb (2f77ae841886cdedd4209504f807acab41f3c59b)
--- /dev/null  2026-01-16 21:29:06.966516
+++ misc/session_memory_compaction.ipynb (2f77ae841886cdedd4209504f807acab41f3c59b)  (no timestamp)
## added /cells:
+  markdown cell:
+    source:
+      # Session Memory Compaction
+      
+      Long-running conversations with Claude can exceed context limits, causing loss of important information. Whether you're building a coding assistant, creative writing tool, or customer service agent, managing session memory is critical for maintaining continuity and quality.
+      
+      This cookbook teaches you how to **proactively manage session memory** to avoid jarring context limit interruptions. Unlike reactive approaches that wait until the context is full, you'll learn to build session memory in the background so compaction is instant when needed.
+      
+      **Related:** For automatic SDK-based compaction in agentic workflows, see [Automatic Context Compaction](../tool_use/automatic-context-compaction.ipynb). This cookbook focuses on manual control patterns for conversational applications.
+      
+      ## Learning Objectives
+      
+      By the end of this cookbook, you will be able to:
+      
+      - Write effective session memory prompts that preserve critical context across compaction events
+      - Implement **instant compaction** using background threading to eliminate user wait time
+      - Apply prompt caching to reduce the cost of background memory updates by ~80%
+      - Choose appropriate compaction strategies (traditional vs. instant) based on your use case
+  markdown cell:
+    source:
+      ## Prerequisites and Setup
+      
+      Before following this guide, ensure you have:
+      
+      **Required Knowledge**
+      - Basic understanding of Claude API usage and message formatting
+      - Familiarity with Python threading concepts (helpful but not required)
+      
+      **Required Tools**
+      - Python 3.11 or higher
+      - Anthropic API key
+      - Anthropic SDK
+      
+      ### Installation
+      
+      First, install the required dependencies:
+  code cell:
+    source:
+      %%capture
+      %pip install -U anthropic python-dotenv
+  code cell:
+    source:
+      import anthropic
+      from anthropic.types import MessageParam, TextBlockParam
+      from dotenv import load_dotenv
+      
+      load_dotenv()
+      
+      client = anthropic.Anthropic()
+      MODEL = "claude-sonnet-4-5-20250929"
+  code cell:
+    source:
+      # Helper functions
+      def truncate_response(text: str, max_lines: int = 15) -> str:
+          """Truncate long responses for cleaner output display."""
+          lines = text.strip().split("\n")
+          if len(lines) <= max_lines:
+              return text
+          return "\n".join(lines[:max_lines]) + f"\n... ({len(lines) - max_lines} more lines)"
+      
+      
+      def remove_thinking_blocks(text: str) -> tuple[str, str]:
+          """Strip <think>...</think> blocks and return (cleaned_text, removed_blocks)."""
+          import re
+      
+          matches = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
+          cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+          return cleaned, "".join(matches)
+      
+      
+      def add_cache_control(messages: list[dict]) -> list[MessageParam]:
+          """Add cache_control to the last user message for prompt caching.
+      
+          For prompt caching to work, the message prefix structure must be identical between requests.
+          All messages are converted to list format for consistency, and cache_control is placed on
+          the last user message to match the standard API call pattern.
+          """
+          cached_messages: list[MessageParam] = []
+          last_user_idx = None
+      
+          # Find last user message index
+          for i, msg in enumerate(messages):
+              if msg["role"] == "user":
+                  last_user_idx = i
+      
+          for i, msg in enumerate(messages):
+              content = msg["content"]
+              text = content if isinstance(content, str) else content[0]["text"]
+      
+              content_block: TextBlockParam = {"type": "text", "text": text}
+              if i == last_user_idx:
+                  content_block["cache_control"] = {"type": "ephemeral"}
+      
+              cached_messages.append({"role": msg["role"], "content": [content_block]})
+      
+          return cached_messages
+      
+      
+      def estimate_tokens(text: str) -> int:
+          """Rudimentary token estimation: 1 token per 4 characters."""
+          return len(text) // 4
+  code cell:
+    source:
+      SESSION_MEMORY_PROMPT = """
+      Compress the conversation into a structured summary
+      that preserves all information needed to continue work seamlessly. Optimize for the assistant's
+      ability to continue working, not human readability.
+      
+      <analysis-instructions>
+      Before generating your summary, analyze the transcript in <think>...</think> tags:
+      1. What did the user originally request? (Exact phrasing)
+      2. What actions succeeded? What failed and why?
+      3. Did the user correct or redirect the assistant at any point?
+      4. What was actively being worked on at the end?
+      5. What tasks remain incomplete or pending?
+      6. What specific details (IDs, paths, values, names) must survive compression?
+      </analysis-instructions>
+      
+      <summary-format>
+      ## User Intent
+      The user's original request and any refinements. Use direct quotes for key requirements.
+      If the user's goal evolved during the conversation, capture that progression.
+      
+      ## Completed Work
+      Actions successfully performed. Be specific:
+      - What was created, modified, or deleted
+      - Exact identifiers (file paths, record IDs, URLs, names)
+      - Specific values, configurations, or settings applied
+      
+      ## Errors & Corrections
+      - Problems encountered and how they were resolved
+      - Approaches that failed (so they aren't retried)
+      - User corrections: "don't do X", "actually I meant Y", "that's wrong because..."
+      Capture corrections verbatim—these represent learned preferences.
+      
+      ## Active Work
+      What was in progress when the session ended. Include:
+      - The specific task being performed
+      - Direct quotes showing exactly where work left off
+      - Any partial results or intermediate state
+      
+      ## Pending Tasks
+      Remaining items the user requested that haven't been started.
+      Distinguish between "explicitly requested" and "implied/assumed."
+      
+      ## Key References
+      Important details needed to continue:
+      - Identifiers: IDs, paths, URLs, names, keys
+      - Values: numbers, dates, configurations, credentials (redacted)
+      - Context: relevant background information, constraints, preferences
+      - Citations: sources referenced during the conversation
+      </summary-format>
+      
+      <preserve-rules>
+      Always preserve when present:
+      - Exact identifiers (IDs, paths, URLs, keys, names)
+      - Error messages verbatim
+      - User corrections and negative feedback
+      - Specific values, formulas, or configurations
+      - Technical constraints or requirements discovered
+      - The precise state of any in-progress work
+      </preserve-rules>
+      
+      <compression-rules>
+      - Weight recent messages more heavily—the end of the transcript is the active context
+      - Omit pleasantries, acknowledgments, and filler ("Sure!", "Great question")
+      - Omit system context that will be re-injected separately
+      - Keep each section under 500 words; condense older content to make room for recent
+      - If you must cut details, preserve: user corrections > errors > active work > completed work
+      </compression-rules>
+      """
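The prompt above asks the model to reason inside `<think>...</think>` tags before writing the summary. That analysis is scaffolding and does not need to be stored in session memory. A minimal sketch of separating the two follows; `split_analysis` and the sample response are hypothetical and only illustrate the shape of the output.

```python
import re


def split_analysis(response_text: str) -> tuple[str, str]:
    """Return (summary, analysis) from a response that may contain <think> blocks."""
    # Collect the text inside every <think> block.
    analysis = "".join(
        re.findall(r"<think>(.*?)</think>", response_text, flags=re.DOTALL)
    ).strip()
    # Drop the blocks (and trailing whitespace) to leave only the summary.
    summary = re.sub(r"<think>.*?</think>\s*", "", response_text, flags=re.DOTALL).strip()
    return summary, analysis


# Hypothetical model output, shortened for illustration.
sample_response = (
    "<think>\n1. User asked for a detective story.\n2. Outline accepted.\n</think>\n"
    "## User Intent\nWrite a detective story set in a small town.\n"
)

summary, analysis = split_analysis(sample_response)
print(summary)
```

Only `summary` would be kept as session memory. The length of `analysis` can still be useful for estimating how many output tokens went to reasoning rather than to the summary.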
+  markdown cell:
+    source:
+      ### Code example: traditional compaction
+      Traditional compaction generates a single summary only once the token threshold is reached.
+      That makes it slow: the user has to wait for the summary at the moment the context limit is hit.
+  markdown cell:
+    source:
+      
+      ```
+      TRADITIONAL COMPACTION (slow)
+      ─────────────────────────────
+      Turn 1 → Turn 2 → Turn 3 → ... → Turn N → CONTEXT FULL!
+                                                   │
+                                                   ▼
+                                          ┌─────────────────┐
+                                          │ Generate summary│
+                                          │ ( USER WAITS !) │
+                                          └─────────────────┘
+                                                   │
+                                                   ▼
+                                               Continue
+      
+      ```
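The flow in the diagram can be sketched without any API calls. In this minimal sketch, `summarize` is a hypothetical stand-in for the blocking model call, and `maybe_compact` is an illustrative name rather than part of the notebook.

```python
def summarize(messages: list[dict]) -> str:
    """Hypothetical stand-in for the blocking model call that writes the summary."""
    return f"Summary of {len(messages)} earlier messages."


def maybe_compact(
    messages: list[dict], context_tokens: int, context_limit: int
) -> tuple[list[dict], bool]:
    """Return (messages, compacted). Compacts only once the limit is reached."""
    if context_tokens < context_limit:
        return messages, False
    # In traditional compaction the user waits here while the summary is generated.
    summary = summarize(messages)
    # The whole history is replaced by a single message carrying the summary.
    return [{"role": "user", "content": f"Session memory: {summary}"}], True


history = [
    {"role": "user", "content": "Give me three plot ideas."},
    {"role": "assistant", "content": "Here are three ideas..."},
]

kept, compacted = maybe_compact(history, context_tokens=4_000, context_limit=10_000)
print(compacted, len(kept))

kept, compacted = maybe_compact(history, context_tokens=12_847, context_limit=10_000)
print(compacted, kept[0]["content"])
```

The check runs at the start of a turn, so the cost of summarizing falls on whichever user message happens to cross the threshold.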
+  code cell:
+    source:
+      import time
+      
+      
+      class TraditionalCompactingChatSession:
+          """Traditional chat session with compaction after the fact."""
+      
+          def __init__(self, system_message="You are a helpful assistant", context_limit: int = 10000):
+              self.system_message = system_message
+              self.context_limit = context_limit  # the point at which the conversation is compacted so it does not exceed model limits.
+              self.messages = []
+              self.current_context_window_tokens = 0
+              self.summary = None
+      
+          def chat(self, user_message: str) -> tuple[str, anthropic.types.Usage]:
+              # In traditional compaction, we check if we need to compact when the user sends a message. NOT IDEAL!
+              if self.current_context_window_tokens >= self.context_limit:
+                  print(
+                      f"\n🧹 Context window at {self.current_context_window_tokens} tokens. Limit exceeded, compacting session memory..."
+                  )
+                  self.compact()  # compacts everything before the new user message
+      
+              self.messages.append({"role": "user", "content": user_message})
+              print(f"\nUser: {user_message}")
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=self.system_message,
+                  messages=add_cache_control(self.messages),
+              )
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              print(f"\nAssistant: \n{truncate_response(assistant_message, max_lines=15)}")
+      
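+              # approximate current token count in the conversation before the next user message
+              # total input = uncached input + tokens written to the cache + tokens read from the cache
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              cache_write = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_write + cache_read
+              self.current_context_window_tokens = total_input + response.usage.output_tokens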
+      
+              print(
+                  f"Input={total_input:,}, Prompt cached used= {cache_read > 0} | "
+                  f"Output={response.usage.output_tokens:,} | "
+                  f"Messages={len(self.messages)}"
+              )
+              return assistant_message, response.usage
+      
+          def compact(self) -> None:
+              start_time = time.perf_counter()
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,  # Same as main chat for cache sharing
+                  messages=add_cache_control(self.messages)
+                  + [{"role": "user", "content": SESSION_MEMORY_PROMPT}],
+              )
+              elapsed = time.perf_counter() - start_time
+      
+              # Generate new summary message
+              self.summary, removed_text = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              approximate_summary_tokens = response.usage.output_tokens - round(
+                  len(removed_text) / 4
+              )  # rough estimate of tokens removed from summary
+      
+              # Replace prior messages with new summary message
+              self.messages = [
+                  {
+                      "role": "user",
+                      "content": f"This session is being continued from a previous conversation. Here is the session memory:\n\n{self.summary}\n\nContinue from where we left off.",
+                  }
+              ]
+      
+              # Show token reduction if we just compacted
+              reduction = self.current_context_window_tokens - approximate_summary_tokens
+              pct = (reduction / self.current_context_window_tokens) * 100
+      
+              print(f"\n{'-' * 60}")
+              print("📝 New session memory created.")
+              print(
+                  f"✅ Tokens reduced: {self.current_context_window_tokens:,} → {approximate_summary_tokens:.0f} ({reduction:,} tokens saved, {pct:.0f}% reduction)"
+              )
+              print(f"⏱️ Compaction time: {elapsed:.2f}s (user waiting...)")
+              print(f" Cache used: {(getattr(response.usage, 'cache_read_input_tokens', 0) or 0) > 0}")
+              print(f"{'-' * 60}")
+      
+              # Update token count to reflect compacted state
+              self.current_context_window_tokens = approximate_summary_tokens
+  markdown cell:
+    source:
+      Below we simulate a conversation between an author and an LLM that helps write stories.
+  code cell:
+    source:
+      SYSTEM_PROMPT = """
+      You are a short story writer who helps authors develop their ideas into compelling narratives.
+      
+      ## What You Do
+      
+      **Plot Development**
+      - Help authors work through story structure, pacing, and narrative arc
+      - Identify plot holes, inconsistencies, or missed opportunities
+      - Suggest ways to raise stakes, add tension, or deepen conflict
+      - Brainstorm twists, resolutions, and scene transitions
+      
+      **Character Development**
+      - Develop backstories, motivations, and internal conflicts
+      - Ensure characters have distinct voices and consistent behavior
+      - Explore character relationships and how they drive the plot
+      - Help authors understand what their characters want vs. what they need
+      
+      **Drafting**
+      - Write short stories or scenes based on the author's ideas and direction
+      - Match tone, genre conventions, and stylistic preferences
+      - Show rather than tell when bringing scenes to life
+      - Craft dialogue that reveals character and advances plot
+      
+      ## How You Work
+      - You are the lead writer. When you disagree with a creative choice, say so respectfully, but ultimately defer to what the author wants.
+      - DO NOT ask the user to provide more context or clarify their request. Assume you have enough information to proceed.
+      """
+  code cell:
+    source:
+      session = TraditionalCompactingChatSession(system_message=SYSTEM_PROMPT)
+      
+      # Simulated conversation
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.",
+          "Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?",
+      ]
+      
+      print("Starting conversation...\n")
+      
+      for turn_count, message in enumerate(messages, 1):
+          print(f"==============================================\nTurn {turn_count}:\n")
+          response, usage = session.chat(message)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation...
+          
+          ==============================================
+          Turn 1:
+          
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Mystery Plot Ideas
+          
+          ## 1. **The Vanishing Choir**
+          
+          **Setup:** In the sleepy town of Millbrook, the entire church choir—twelve people ranging from teenagers to retirees—disappears during their weekly Thursday night practice. The church was locked from the inside, their belongings left behind, including phones and car keys. No signs of struggle, no broken windows. Just an empty sanctuary and sheet music scattered across the floor.
+          
+          **The Twist:** Your young detective discovers the choir members didn't disappear—they're hiding. Twenty years ago, they witnessed the town's beloved mayor commit a hit-and-run that killed a drifter. They stayed silent, bound by threats and their own complicity. Now the mayor is dying and has hired someone to ensure his secret dies with him. The choir staged their own disappearance to draw attention and finally confess, but the detective must figure out who they're hiding from before the killer finds them first.
+          
+          **Why it works:** Small-town secrets, a ticking clock, and the moral complexity of people who aren't quite innocent or guilty. The locked-room mystery becomes a desperate act of exposure rather than concealment.
+          
+          ---
+          
+          ## 2. **The Memory Thief**
+          
+          **Setup:** Elderly residents in Hartwood are reporting identical "robberies"—but nothing is actually stolen. Instead, they insist specific memories have been taken: a first kiss, a wedding day, the birth of a child. The police dismiss it as dementia until the detective notices a pattern: all victims visited the same new "reminiscence therapist" who uses experimental techniques to help seniors preserve their memories before they fade.
+          ... (18 more lines)
+          Input=317, Prompt cached used= False | Output=852 | Messages=2
+          ==============================================
+          Turn 2:
+          
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # **The Cartographer's Grave**
+          
+          **Setup:** Your young detective arrives in the mountain town of Ridgeway to investigate what seems like a prank: someone has been systematically correcting the town's street signs, storefront addresses, and property markers—changing them by just one or two numbers. The post office is going insane. Mail is being misdelivered. Emergency services are getting lost. But here's the thing: the "corrections" match a 150-year-old town map that was supposedly drawn incorrectly by a disgraced surveyor who was run out of town and buried in an unmarked grave.
+          
+          **The Investigation:** The detective discovers the old surveyor wasn't wrong—he was *right*. The town founders deliberately falsified all records after his death, shifting every address, every property line, every boundary marker. Why? Because the real map would reveal that the town's most valuable land—now home to the founding families' estates, the town hall, the historic church—was actually purchased from a Native American family who were supposed to retain mineral rights and a percentage of all profits in perpetuity. 
+          
+          **The Twist:** The person "correcting" the signs isn't a descendant of the surveyor or the Native family. It's a twelve-year-old girl with terminal cancer who found the original map in her grandmother's attic. She's not seeking revenge or restitution—she just became obsessed with the idea that the whole town has been living a lie written into the ground itself. Before she dies, she wants the truth to physically exist in the world, even if no one acknowledges it. She's been sneaking out at night with a screwdriver and homemade signs.
+          
+          **The Real Mystery:** The detective must decide what to do when they realize that "solving" the case—stopping the girl and restoring the false addresses—means choosing comfortable order over inconvenient truth. Meanwhile, members of the founding families are getting desperate to stop the corrections before anyone looks too closely at land deeds, and the girl's nightly trips are getting increasingly dangerous.
+          
+          **Why it works:** It subverts the detective genre by making the "crime" an act of radical honesty. The mystery isn't whodunit—it's what truth is worth when it would destroy an entire community's foundation. The villain isn't the child or even really the founding families—it's the town itself, built on a literal lie. And your detective can't punch or outwit their way out of a moral knot.
+          Input=1,191, Prompt cached used= False | Output=544 | Messages=4
+          ==============================================
+          Turn 3:
+          
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # The Detective: Building Your Protagonist
+          
+          ## Core Identity
+          
+          Let's call her **Nora Winters** (adjust the name as you prefer). She's 26 years old, which makes her young enough to be underestimated but old enough to have scars.
+          
+          ## Backstory
+          
+          **The Personal Lie:** Nora grew up in a different small town where her father was the police chief—respected, beloved, the kind of man who coached Little League and knew everyone's name. When she was sixteen, she discovered he'd been planting evidence on suspects for years. Not major crimes—small-town stuff. A little weed here, stolen property there. Just enough to "help" cases along when he "knew" someone was guilty but couldn't prove it.
+          
+          She confronted him. He told her the truth was "complicated," that sometimes you have to bend the rules to protect people, that the town trusted him to keep them safe and he delivered results. He made her complicit with his confession—if she exposed him, she'd destroy their family, his legacy, the town's faith in law enforcement.
+          
+          She stayed silent. For two years, she carried it like a stone in her chest.
+          
+          **The Breaking Point:** At eighteen, one of her father's frame jobs went wrong. A kid he planted drugs on—a 19-year-old with priors who was actually clean this time—panicked during arrest and ran. Got hit by a car. Died in the street.
+          ... (50 more lines)
+          Input=1,758, Prompt cached used= False | Output=1,193 | Messages=6
+          ==============================================
+          Turn 4:
+          
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # **The Cartographer's Grave: Story Outline**
+          
+          ## **ACT ONE: The Map Doesn't Match the Territory**
+          
+          ### **Chapter 1: Arrival**
+          - Nora arrives in Ridgeway on a grey October morning, six-hour drive from the city
+          - Establishes the town: population 3,200, nestled in mountain valley, tourism from fall foliage and "historic charm"
+          - She meets Town Manager **Gerald Vance** (55, anxious, sweating through his shirt despite the cold) at town hall
+          - He explains the crisis: for three weeks, someone's been changing street signs and address numbers—small changes, but creating chaos
+          - Mail misdelivered, ambulance went to wrong house (patient survived, barely), legal documents being contested
+          - Shows her examples: the church changed from 847 Oak Street to 843 Oak, the post office from 1215 Main to 1219 Main
+          - Gerald thinks it's vandalism, wants her to catch whoever's doing it quickly and quietly
+          - **Key detail:** Nora notices all the "corrections" subtract or add small, specific numbers—not random
+          
+          ### **Chapter 2: The Town's Memory**
+          ... (160 more lines)
+          Input=2,973, Prompt cached used= False | Output=3,500 | Messages=8
+          ==============================================
+          Turn 5:
+          
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # **Chapter 1: Arrival**
+          
+          The town of Ridgeway materialized from the fog like a photograph developing in reverse—sharp edges first, then the softer details filling in around them. Church steeple. Water tower. The downtown strip with its brick facades and hand-painted signs promising antiques, locally roasted coffee, genuine mountain crafts. Nora Winters took it all in through rain-spattered windshield glass, her Honda's wipers beating a rhythm that had become hypnotic somewhere around hour five of the drive.
+          
+          Small towns always looked the same in October. Like they were apologizing for something.
+          
+          She followed Main Street through the heart of downtown—two blocks of calculated charm before the road narrowed and climbed into the residential neighborhoods climbing the mountain's flank. The GPS on her phone had lost signal fifteen minutes ago, but she didn't need it. Town hall would be near the center, probably the biggest building on the main drag, probably with a flag and that particular kind of municipal architecture that tried too hard to look historic.
+          
+          She was right on all counts.
+          
+          Ridgeway Town Hall occupied a corner lot with the self-importance of a building that had watched the world change and declined to participate. Red brick, white columns, a bronze plaque declaring it built in 1889 and placed on the National Register in 1976. The flag snapped in the wind. Rain pooled in the uneven sidewalk leading to the entrance.
+          
+          Nora parked on the street, grabbed her messenger bag, and made the dash to the covered entrance. The door was heavier than it looked—solid oak, brass hardware—and opened into the particular smell of old public buildings: lemon polish, ancient radiator heat, and paper slowly yellowing in filing cabinets.
+          
+          A woman at the reception desk looked up from her computer. Sixty-something, reading glasses on a beaded chain, cardigan against the over-aggressive heating system. Her smile was professional and tired in equal measure.
+          ... (118 more lines)
+          Input=6,509, Prompt cached used= False | Output=3,073 | Messages=10
+          ==============================================
+          Turn 6:
+          
+          
+          User: Can you draft a second chapter that builds on the first one, introducing a new twist in the mystery?
+          
+          Assistant: 
+          # **Chapter 2: The Town's Memory**
+          
+          Nora woke at six to the sound of rain that had never really stopped, just paused to gather its strength. She'd fallen asleep at the small desk by the window, tablet still glowing with Gerald's documentation, her neck complaining about the angle she'd maintained for however many hours.
+          
+          The numbers were still dancing behind her eyes. She'd spent half the night building spreadsheets, mapping the changes, looking for the pattern Gerald had sensed but couldn't articulate. And she'd found it—or the beginning of it. Every changed address was in the older part of town, pre-1900 construction. Every change was small but specific: sometimes adding four, sometimes subtracting two, once adding seven. Not random. Definitely not random.
+          
+          She showered, changed into clean jeans and a sweater, and followed the smell of coffee downstairs.
+          
+          The dining room was small and warmly lit, with four tables covered in white cloth. An older man sat alone by the window reading a newspaper—actual paper, folded precisely. A younger couple occupied another table, speaking in low voices over their phones. Tourists, Nora guessed, based on the hiking boots and Patagonia fleeces.
+          
+          A sideboard held coffee, tea, juice, and a modest breakfast spread. Nora filled a mug and took a table in the corner where she could see the room and the street beyond.
+          
+          Dolores appeared from what must have been the kitchen, carrying a plate of scrambled eggs and toast. She set it in front of the man with the newspaper without being asked, received a grunt of acknowledgment, and crossed to Nora's table.
+          
+          "You look like you didn't sleep much."
+          ... (171 more lines)
+          Input=9,606, Prompt cached used= True | Output=3,241 | Messages=12
+  markdown cell:
+    source:
+      This is a long conversation with several turns. One thing to notice:
+      
+      **Prompt caching:** The input tokens eventually grew to the point where a cache hit was reported (turn 6). Cache hits reduce both cost and latency as these conversations grow.
+  markdown cell:
+    source:
+      After turn 6 the context holds 12,847 tokens (9,606 input + 3,241 output), which is above our 10K limit, so the next turn triggers compaction:
+  code cell:
+    source:
+      response, usage = session.chat("Propose a title for the book")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          🧹 Context window at 12847 tokens. Limit exceeded, compacting session memory...
+          
+          ------------------------------------------------------------
+          📝 New session memory created.
+          ✅ Tokens reduced: 12,847 → 1526 (11,321 tokens saved, 88% reduction)
+          ⏱️ Compaction time: 41.42s (user waiting...)
+           Cache used: True
+          ------------------------------------------------------------
+          
+          User: Propose a title for the book
+          
+          Assistant: 
+          Based on the story's core themes and imagery, here are my title proposals:
+          
+          ## Primary Recommendation
+          
+          **The Cartographer's Daughter**
+          
+          This works on multiple levels:
+          - Emma is metaphorically Amos Frost's "daughter" in mission—inheriting and completing his work
+          - Patricia (literal descendant of Frost's assistant) becomes Emma's accomplice
+          - Evokes the weight of inheritance, legacy, and what we pass down
+          - "Cartographer" immediately signals the map/truth theme
+          - Has literary gravitas appropriate for the story's tone
+          
+          ## Alternates
+          
+          ... (20 more lines)
+          Input=1,813, Prompt cached used= False | Output=328 | Messages=3
+  markdown cell:
+    source:
+      Compaction took **over 40 seconds**. With traditional compaction, the user waits that entire time before Claude starts responding to the new message, which is not an ideal user experience.
+      
+      Below you can see the result of the compaction. It captures the key elements of the conversation in fewer than 2K tokens.
+  code cell:
+    source:
+      print(session.summary)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          ## User Intent
+          Create short story about young detective solving mysterious case in small town. Initially requested "3 well thought out plot ideas." Rejected first batch as not unique enough, requested "something more unique and unexpected." Accepted "The Cartographer's Grave" concept. Then requested: character backstory/motivations development, detailed chapter outline, and drafted chapters.
+          
+          ## Completed Work
+          
+          **Approved Plot: "The Cartographer's Grave"**
+          - Ridgeway (pop. 3,200, mountain town) experiencing systematic address changes
+          - 12-year-old Emma Lancaster (terminal brain cancer) changing signs at night to match 1874 surveyor Amos Frost's original map
+          - Frost was "disgraced," replaced by Marcus Bellamy (founding family) in 1875 re-survey
+          - Real conspiracy: Bellamy survey deliberately shifted property lines 200-400 feet east to steal valuable land from Pequawket family (Native American), who had mineral rights + 15% revenue contract
+          - Emma found Frost's materials in grandmother's attic (Patricia Lancaster, granddaughter of Samuel Lancaster—Frost's assistant who bought his effects)
+          - Emma's motivation: not justice, but existential—wants to matter, leave truth behind before dying
+          - Resolution: Town acknowledges historical fraud via memorial/fund, addresses stay changed, Emma dies knowing truth survived
+          
+          **Main Character: Nora Winters**
+          - Age 26, private investigator for rural cases firm
+          - Backstory: Father was police chief who planted evidence for years. At 16 she discovered it, stayed silent 2 years. At 18, a framed kid died fleeing arrest. She exposed father to state police. Father forced into retirement but kept reputation. Family estranged, won't speak to her.
+          - Motivation: Prove truth always matters, atone for 2 years of silence
+          - Fatal flaw: Prioritizes truth over mercy, can be self-righteous
+          - Arc: Must learn truth and justice aren't always same thing
+          
+          **Supporting Characters**
+          - Gerald Vance (55, town manager, anxious)
+          - Dolores Chen (68, Ridgeway Inn owner, knows everything)
+          - Ruth Bellamy (72, historical society president, descendant of Marcus Bellamy)
+          - Sheriff Tom Whitlock (50, third-generation sheriff, dismissive)
+          - Emma Lancaster (12, dying of brain cancer, changing signs)
+          - Patricia Lancaster (64, Emma's grandmother, retired town clerk, Samuel Lancaster's descendant)
+          - James Pequawket (58, teacher, lives two towns over, descendant)
+          - Samuel Lancaster (Amos Frost's 1874 assistant, bought Frost's effects after death)
+          - Amos Frost (surveyor, accurate 1874 map, died 1889 in sanitarium, unmarked grave)
+          
+          **16-Chapter Outline with Epilogue Created**
+          - Act 1 (Ch 1-4): Nora arrives, discovers pattern, identifies Emma via dropped notebook
+          - Act 2 (Ch 5-9): Confronts Emma, discovers Frost's journal/map at Lancaster house, uncovers full conspiracy
+          - Act 3 (Ch 10-15): Town pressures Nora, founding families threaten charges against Emma, Nora proposes compromise (memorial + fund vs. property transfers), Emma dies December 3rd
+          - Epilogue (Ch 16): Six months later, Nora receives Emma's notebook showing new project mapping unmarked graves
+          
+          **Chapter 1 Drafted (~2000 words)**
+          - Nora arrives Ridgeway in rain, meets Gerald Vance at town hall
+          - Gerald explains crisis: 37 locations, professional signs, started October 2nd, systematic pattern
+          - Key detail: Changes are small (1-5 numbers) but precise, affecting only pre-1900 buildings
+          - Nora checks into Ridgeway Inn, meets Dolores Chen
+          - Dolores reveals her address changed October 3rd: 843 to 847 Oak Street, left new sign up to "adapt"
+          - Dolores warns: "Be careful asking questions here. Not everyone appreciates having their complications examined."
+          
+          **Chapter 2 Drafted**
+          - Nora analyzes data overnight, identifies pattern: all changes in pre-1900 areas
+          - Breakfast at inn, confronted by Howard Marsh (70s, opposed to investigation)
+          - Visits library, meets Jess (librarian, 30, supportive)
+          - Discovers in basement archives: changed addresses match 1875 town plat exactly
+          - Jess reveals history: Amos Frost surveyed 1874, deemed "inaccurate," dismissed. Marcus Bellamy (Ruth's great-great-grandfather) re-surveyed 1875 (official record). Frost's survey "destroyed years ago."
+          - Text from Jess's partner Sarah: Frost died 1889 in sanitarium, pauper's grave. Effects purchased by Samuel Lancaster at 1847 Oak Street.
+          - **Chapter ends with revelation**: 1847 Oak = current "corrected" address of Ridgeway Inn (officially 843). Frost's materials likely still at inn. "The question was who in that building knew they existed, and why they'd decided—after more than a century of silence—that the truth needed to be rewritten into the town's streets."
+          
+          ## Errors & Corrections
+          User explicitly rejected first 3 plot ideas: "The Vanishing Choir," "The Memory Thief," "The Lighthouse Keeper's Daughter"—deemed not unique/unexpected enough.
+          
+          ## Active Work
+          Chapter 2 completed. Story ready to continue with Chapter 3, which per outline should cover "The Pattern" where Nora stakes out locations and first spots Emma changing a sign.
+          
+          ## Pending Tasks
+          Draft chapters 3-16 and epilogue per approved outline.
+          
+          ## Key References
+          **Critical addresses**: 843 Oak Street (official) / 847 Oak Street (corrected) = Ridgeway Inn location, Samuel Lancaster's 1874 address
+          **Timeline**: October 2nd changes start, story current timeframe October, Emma dies December 3rd, epilogue six months later
+          **The fraud mechanics**: Bellamy survey shifted all property lines 200-400 feet east, making Pequawket parcel appear worthless hillside while valuable land (now Bellamy estate, town hall, church) became "legitimately" owned by founding families
+          **Pequawket contract terms**: Mineral rights in perpetuity + 15% of all property values/business revenues from specified parcel
+  markdown cell:
+    source:
+      ## Instant Compaction
+      
+      With **instant compaction**, the session memory is generated PROACTIVELY in the background once a soft token threshold is reached.
+      
+      Once the user triggers a compaction or a hard limit is reached, the summary is already available, so the user doesn't need to wait.
+      
+      Result: Instant compaction, no waiting.
+  markdown cell:
+    source:
+      
+      ```
+      SESSION MEMORY COMPACTION (instant)
+      ────────────────────────────────────
+      Turn 1 → Turn 2 → ... → Turn K → Turn K+1 → ... → Turn N → ... → CONTEXT FULL!
+                                  │                         │            │
+                      (soft token threshold met:        (update          │
+                     initialize session memory)          trigger)        │
+                                  │                                      │
+                                  │                         │            │
+                                  ▼                         ▼            │
+                             ┌────────┐                ┌────────┐        │
+                             │ Create │                │ Update │        │
+                             │ memory │ (background)   │ memory │        │
+                             └────────┘                └────────┘        │
+                                  │                         │            │
+                                  ▼                         ▼            ▼
+                           📝 session-memory.md ──────────────────► INSTANT SWAP!
+                             (continuously updated)
+      ```
+      
+      **Update triggers:** The first summary is generated once the initial soft token limit is reached. Later updates can be triggered after every subsequent turn, or periodically at natural breakpoints (e.g. every ~10k tokens or after 3+ tool calls).
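The update triggers described above can be sketched as a small policy object. This is a minimal illustration rather than the notebook's implementation: `UpdatePolicy`, `should_update`, and the tool-call counter are hypothetical names, and the thresholds simply mirror the "~10k tokens or 3+ tool calls" example.

```python
from dataclasses import dataclass


@dataclass
class UpdatePolicy:
    """Hypothetical trigger policy; the thresholds are illustrative assumptions."""

    min_tokens_to_init: int = 7_500  # soft threshold for the first summary
    min_tokens_between_updates: int = 10_000
    min_tool_calls_between_updates: int = 3

    def should_update(
        self,
        memory_exists: bool,
        context_tokens: int,
        tokens_at_last_update: int,
        tool_calls_since_update: int,
    ) -> bool:
        # No memory yet: only the soft token threshold matters.
        if not memory_exists:
            return context_tokens >= self.min_tokens_to_init
        # Memory exists: update at whichever breakpoint is reached first.
        grew_enough = (
            context_tokens - tokens_at_last_update >= self.min_tokens_between_updates
        )
        many_tool_calls = tool_calls_since_update >= self.min_tool_calls_between_updates
        return grew_enough or many_tool_calls


policy = UpdatePolicy()
first = policy.should_update(False, 8_000, 0, 0)         # past the soft threshold
too_early = policy.should_update(True, 12_000, 8_000, 1)  # 4k new tokens, 1 tool call
by_tools = policy.should_update(True, 12_000, 8_000, 3)   # tool-call breakpoint reached
print(first, too_early, by_tools)
```

Keeping the decision in one place makes it easy to tune how often the background summarizer runs without touching the chat loop.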
+  markdown cell:
+    source:
+      This `InstantCompactingChatSession` class uses **threading** for background execution:
+      1. **`threading.Thread`** - runs memory updates in background without blocking
+      2. **Thread-safe state** - uses `threading.Lock` to safely update shared memory
+      3. **Daemon threads** - background work doesn't prevent program exit
+      4. **Instant compaction** - when context is full, just swap in the pre-built memory
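The four points above can be seen in isolation in the following sketch. It assumes a fake summarizer (`time.sleep` standing in for the model call) and a hypothetical `BackgroundSummarizer` class; it is meant to show the threading pattern only, not the session class defined in the next cell.

```python
import threading
import time


class BackgroundSummarizer:
    """Hypothetical stand-in for the session class; _summarize fakes the API call."""

    def __init__(self):
        self.memory = None
        self._lock = threading.Lock()
        self._thread = None

    def _summarize(self, snapshot):
        time.sleep(0.2)  # placeholder for a slow model call
        return f"summary of {len(snapshot)} messages"

    def _worker(self, snapshot):
        new_memory = self._summarize(snapshot)  # slow work happens outside the lock
        with self._lock:  # hold the lock only for the state swap
            self.memory = new_memory

    def trigger(self, messages):
        """Start an update unless one is already running. Returns True if started."""
        if self._thread is not None and self._thread.is_alive():
            return False
        snapshot = list(messages)  # copy so later appends do not affect the worker
        self._thread = threading.Thread(
            target=self._worker, args=(snapshot,), daemon=True
        )
        self._thread.start()
        return True

    def wait(self, timeout=None):
        if self._thread is not None:
            self._thread.join(timeout)


summarizer = BackgroundSummarizer()
started_first = summarizer.trigger(["hi", "hello"])
started_second = summarizer.trigger(["hi", "hello", "more"])  # first is still running
summarizer.wait(timeout=5)
print(started_first, started_second, summarizer.memory)
```

The caller never blocks on `_summarize`; it only pays for the list copy and the thread start, which is what allows the later swap to be instant.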
+  code cell:
+    source:
+      import threading
+      import time
+      
+      
+      class InstantCompactingChatSession:
+          """
+          Maintains session memory via incremental background updates.
+      
+          Key insight: By updating memory in the background after each turn,
+          the summary is already ready when compaction is needed - instant swap!
+          """
+      
+          def __init__(
+              self,
+              system_message="You are a helpful assistant",
+              context_limit: int = 12000,
+              min_tokens_to_init: int = 7500,
+              min_tokens_between_updates: int = 2000,
+          ):
+              # Thresholds
+              self.context_limit = context_limit  # hard limit: compact before the conversation exceeds this many tokens
+              self.min_tokens_to_init = min_tokens_to_init  # soft threshold: create the initial memory PROACTIVELY in the background (unlike traditional compaction)
+              self.min_tokens_between_updates = min_tokens_between_updates  # new tokens required before each later memory update; applies only once the initial memory exists
+      
+              # Conversation state
+              self.system_message = system_message
+              self.messages = []
+              self.current_context_window_tokens = 0
+      
+              # Session memory state
+              # Compacted conversation. Kept in memory for the demo; in production you would write it to a session_memory.md file.
+              self.session_memory = None
+              # Index of the first message NOT yet covered by the session memory (messages[last_summarized_index:] are new)
+              self.last_summarized_index = 0
+              # Context size at the last memory update, used to decide when enough new tokens have accumulated
+              self.tokens_at_last_update = 0
+      
+              # Background update tracking
+              self._update_thread: threading.Thread | None = None
+              self.last_update_time = None
+              self._lock = threading.Lock()
+      
+          def chat(self, user_message: str) -> tuple[str, anthropic.types.Usage, str | None]:
+              """Process a chat turn with background session memory updates."""
+      
+              if self.current_context_window_tokens + estimate_tokens(user_message) >= self.context_limit:
+                  self.compact()  # note that when this is triggered, the compaction has already been created and is just swapped in instantly
+      
+              self.messages.append({"role": "user", "content": user_message})
+      
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=3500,
+                  system=self.system_message,
+                  messages=add_cache_control(self.messages),
+              )
+      
+              assistant_message = response.content[0].text
+              self.messages.append({"role": "assistant", "content": assistant_message})
+      
+              # Calculate token usage including cache
+              cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
+              total_input = response.usage.input_tokens + cache_read
+      
+              # Update context window tokens (includes cached tokens since they still count toward context)
+              self.current_context_window_tokens = total_input + response.usage.output_tokens
+      
+              # KEY DIFFERENCE: Trigger background memory update if needed proactively, before compaction is needed
+              background_status = None
+              if self._should_init_memory() or self._should_update_memory():
+                  self._trigger_background_update()
+                  background_status = "initializing" if self.session_memory is None else "updating"
+      
+              # Return usage info with cache stats
+              return assistant_message, response.usage, background_status
+      
+          # Helper methods to determine when to init session memory
+          def _should_init_memory(self) -> bool:
+              return (
+                  self.session_memory is None
+                  and self.current_context_window_tokens >= self.min_tokens_to_init
+              )
+      
+          # Helper method to determine if memory should be updated
+          def _should_update_memory(self) -> bool:
+              if self.session_memory is None:
+                  return False
+              tokens_since = self.current_context_window_tokens - self.tokens_at_last_update
+              return tokens_since >= self.min_tokens_between_updates
+      
+          # Methods to create initial session memory
+          def _create_session_memory(self, messages: list[dict]) -> str:
+              """Generate initial session memory from messages."""
+              # Put compaction instructions in user message to share cache with main chat
+              compaction_messages = [{"role": "user", "content": SESSION_MEMORY_PROMPT}]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,  # Same as main chat for cache sharing
+                  messages=add_cache_control(messages) + compaction_messages,
+              )
+              summary, _ = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              print(
+                  f"   [Background] Initial session memory created. Cache hit={(getattr(response.usage, 'cache_read_input_tokens', 0) or 0) > 0}"
+              )
+              return summary
+      
+          def _update_session_memory(self, new_messages: list[dict]) -> str:
+              """Update existing session memory with new messages.
+      
+              For demo purposes this regenerates the full memory. In practice, you may prefer a file edit over full regeneration.
+              """
+              # Put compaction instructions in user message to share cache with main chat
+              compaction_update_messages = [
+                  {
+                      "role": "user",
+                      "content": SESSION_MEMORY_PROMPT
+                      + f"""\n\nThere is an existing session memory:\n{self.session_memory}\n\nReturn the entire session memory, updated to reflect the new messages.""",
+                  }
+              ]
+              response = client.messages.create(
+                  model=MODEL,
+                  max_tokens=5000,
+                  system=self.system_message,
+                  messages=new_messages
+                  + compaction_update_messages,  # you may want to use prompt caching instead, in which case you'd use add_cache_control(self.messages) here
+              )
+              updated_summary, _ = remove_thinking_blocks(
+                  response.content[0].text
+              )  # clean up any <think> blocks because they are not needed in the session memory
+              print("   [Background] Session memory updated.")
+              return updated_summary
+      
+          # Background memory update methods
+          def _background_memory_update(
+              self, messages_snapshot: list[dict], snapshot_index: int, current_tokens: int
+          ) -> None:
+              """Run session memory update in a background thread."""
+              try:
+                  with self._lock:
+                      current_session_memory = self.session_memory
+                      last_index = self.last_summarized_index
+      
+                  if current_session_memory is None:
+                      new_memory = self._create_session_memory(messages_snapshot)
+                  else:
+                      # Get new messages since last summary
+                      new_messages = messages_snapshot[last_index:]
+                      if not new_messages:
+                          return
+                      new_memory = self._update_session_memory(new_messages)
+      
+                  # Update state (thread-safe)
+                  with self._lock:
+                      self.session_memory = new_memory
+                      self.last_summarized_index = snapshot_index
+                      self.tokens_at_last_update = current_tokens
+                      self.last_update_time = time.time()
+      
+              except Exception as e:
+                  print(f"   [Background] Error updating memory: {e}")
+      
+          # This makes sure only one background update runs at a time. If one is already running, we skip starting another. If not, we start a new thread to do the update.
+          def _trigger_background_update(self):
+              """Trigger a background session memory update."""
+              if self._update_thread is not None and self._update_thread.is_alive():
+                  return
+      
+              messages_snapshot = self.messages.copy()
+              snapshot_index = len(messages_snapshot)
+              current_tokens = self.current_context_window_tokens
+      
+              self._update_thread = threading.Thread(
+                  target=self._background_memory_update,
+                  args=(messages_snapshot, snapshot_index, current_tokens),
+                  daemon=True,
+              )
+              self._update_thread.start()
+      
+          # Function to compact
+          def compact(self) -> None:
+              """INSTANT compaction using pre-built session memory."""
+              prev_msg_count = len(self.messages)
+      
+              # Ensure session memory is ready. Shouldn't be an issue normally, but here for safety.
+              if self.session_memory is None:
+                  if self._update_thread is not None and self._update_thread.is_alive():
+                      print("   ⏳ Waiting for background memory update...")
+                      self._update_thread.join(timeout=30.0)
+      
+                  if self.session_memory is None:
+                      print("   ⚠️  No pre-built memory, creating synchronously...")
+                      start = time.perf_counter()
+                      self.session_memory = self._create_session_memory(self.messages)
+                      elapsed = time.perf_counter() - start
+                      print(f"   ⏱️  Took {elapsed:.2f}s (but should be instant normally!)")
+                      self.last_summarized_index = len(self.messages)
+      
+              with self._lock:
+                  unsummarized = self.messages[self.last_summarized_index :]
+                  summary_message = [
+                      {
+                          "role": "user",
+                          "content": f"""This session is being continued from a previous conversation. Here is the session memory:\n\n{self.session_memory}\n\nContinue from where we left off.""",
+                      }
+                  ]
+                  self.messages = summary_message + unsummarized
+                  self.last_summarized_index = 1
+      
+                  print(f"\n{'=' * 60}")
+                  print(f"⚡ INSTANT COMPACTION! Messages: {prev_msg_count} → {len(self.messages)}")
+                  print("   Session memory was pre-built (no wait time!)")
+                  print(f"{'=' * 60}")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stderr
+        text:
+          /root/.pyenv/versions/3.13.11/lib/python3.13/site-packages/coconut/compiler/util.py:403: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in staticmethod() if you want to preserve the old behavior
+            grammar.streamline()
+          /root/.pyenv/versions/3.13.11/lib/python3.13/site-packages/coconut/compiler/util.py:457: FutureWarning: functools.partial will be a method descriptor in future Python versions; wrap it in staticmethod() if you want to preserve the old behavior
+            result = add_action(grammar, unpack).parseWithTabs().transformString(text)
+  markdown cell:
+    source:
+      ### Example use of Instant Compaction
+  code cell:
+    source:
+      # Low thresholds for demo - in production you'd use higher values
+      session = InstantCompactingChatSession(
+          system_message=SYSTEM_PROMPT,
+      )
+      
+      messages = [
+          "I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.",
+          "I don't like those ideas, can you think of one plot something more unique and unexpected?",
+          "Ok I like it. Can you help me develop the main character's backstory and motivations?",
+          "Can you draft a detailed outline for the story, breaking it down into chapters and key events?",
+          "Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.",
+          "Can you draft a second chapter that builds on the first one?",
+      ]
+      print("Starting conversation with instant compacting chat session...\n")
+      
+      turn_count = 0
+      for message in messages:
+          response, usage, background_status = session.chat(message)
+          turn_count += 1
+      
+          # Calculate cache stats
+          cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+          cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
+          total_input = usage.input_tokens + cache_read
+      
+          print(f"{'=' * 60}")
+          print(f"Turn {turn_count}:")
+          print(f"\nUser: {message}")
+          print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+          print("\nToken Usage:")
+          print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+          print(f"  Output: {usage.output_tokens:,}")
+          print(
+              f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}"
+          )
+      
+          if cache_read > 0:
+              cache_pct = (cache_read / total_input) * 100
+              print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+      
+          if background_status:
+              print(f"\n  [Background] Proactively {background_status} session memory...")
+              print(f"  Context window: {session.current_context_window_tokens:,} tokens")
+      
+          print()
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          Starting conversation with instant compacting chat session...
+          
+          ============================================================
+          Turn 1:
+          
+          User: I want to create a story about a young detective solving a mysterious case in a small town. Generate 3 well thought out plot ideas for me to consider.
+          
+          Assistant: 
+          # Three Mystery Plot Ideas
+          
+          ## 1. **The Vanishing Choir**
+          ... (36 more lines)
+          
+          Token Usage:
+            Input: 317 (new: 317, cached: 0)
+            Output: 902
+            Messages: 2 | Memory: not yet
+          
+          ============================================================
+          Turn 2:
+          
+          User: I don't like those ideas, can you think of one plot something more unique and unexpected?
+          
+          Assistant: 
+          # **The Forgetting House**
+          
+          **Setup:** Your young detective arrives in Ember Falls to investigate a string of burglaries—except the victims don't realize they've been robbed until weeks later. A woman discovers her wedding ring gone and insists she lost it yesterday, but security footage shows she hasn't worn it in a month. A man reports his grandfather's watch stolen, then his sister shows him photos proving he sold it himself at a pawn shop—which he has no memory of doing.
+          ... (16 more lines)
+          
+          Token Usage:
+            Input: 1,241 (new: 1,241, cached: 0)
+            Output: 592
+            Messages: 4 | Memory: not yet
+          
+          ============================================================
+          Turn 3:
+          
+          User: Ok I like it. Can you help me develop the main character's backstory and motivations?
+          
+          Assistant: 
+          # Your Detective: Building From The Inside Out
+          
+          ## Core Identity
+          ... (79 more lines)
+          
+          Token Usage:
+            Input: 1,856 (new: 1,856, cached: 0)
+            Output: 1,329
+            Messages: 6 | Memory: not yet
+          
+          ============================================================
+          Turn 4:
+          
+          User: Can you draft a detailed outline for the story, breaking it down into chapters and key events?
+          
+          Assistant: 
+          # **The Forgetting House: Chapter Outline**
+          
+          ---
+          ... (272 more lines)
+          
+          Token Usage:
+            Input: 3,207 (new: 3,207, cached: 0)
+            Output: 3,500
+            Messages: 8 | Memory: not yet
+          
+          ============================================================
+          Turn 5:
+          
+          User: Can you draft me a first chapter based on the plot and character ideas we've discussed so far? Make it around 2,000 words.
+          
+          Assistant: 
+          # **Chapter One: The Impossible Theft**
+          
+          The apartment smelled like burnt coffee and old paper.
+          ... (196 more lines)
+          
+          Token Usage:
+            Input: 6,743 (new: 6,743, cached: 0)
+            Output: 3,155
+            Messages: 10 | Memory: not yet
+          
+            [Background] Proactively initializing session memory...
+            Context window: 9,898 tokens
+          
+             [Background] Initial session memory created. Cache hit=True
+          ============================================================
+          Turn 6:
+          
+          User: Can you draft a second chapter that builds on the first one?
+          
+          Assistant: 
+          # **Chapter Two: Rosemont Manor**
+          
+          The house appeared through the trees like something from a postcard.
+          ... (190 more lines)
+          
+          Token Usage:
+            Input: 9,914 (new: 5,818, cached: 4,096)
+            Output: 3,500
+            Messages: 12 | Memory: ready
+            ✓ Cache hit! 41% of input from cache
+          
+            [Background] Proactively updating session memory...
+            Context window: 13,414 tokens
+          
+  code cell:
+    source:
+      message = "What did we just talk about? Give me one sentence"
+      response, usage, background_status = session.chat(message)
+      
+      # Calculate cache stats
+      cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
+      total_input = usage.input_tokens + cache_read
+      
+      print(f"\nUser: {message}")
+      print(f"\nAssistant: \n{truncate_response(response, max_lines=3)}")
+      print("\nToken Usage:")
+      print(f"  Input: {total_input:,} (new: {usage.input_tokens:,}, cached: {cache_read:,})")
+      print(f"  Output: {usage.output_tokens:,}")
+      print(
+          f"  Messages: {len(session.messages)} | Memory: {'ready' if session.session_memory else 'not yet'}"
+      )
+      
+      if cache_read > 0:
+          cache_pct = (cache_read / total_input) * 100
+          print(f"  ✓ Cache hit! {cache_pct:.0f}% of input from cache")
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          
+          ============================================================
+          ⚡ INSTANT COMPACTION! Messages: 12 → 3
+             Session memory was pre-built (no wait time!)
+          ============================================================
+          
+          User: What did we just talk about? Give me one sentence
+          
+          Assistant: 
+          I drafted Chapter 2 where Casey arrives at Rosemont Manor, interviews Iris (who deflects questions about her past and shows moments of disorientation), and realizes through comparing photos that Iris Hale is definitely their missing grandmother Iris Whitmore.
+          
+          Token Usage:
+            Input: 5,490 (new: 5,490, cached: 0)
+            Output: 60
+            Messages: 5 | Memory: ready
+  markdown cell:
+    source:
+      You'll notice here that once we hit the context limit, the session memory was instantly swapped in, so the user did not have to wait for a compaction call before getting a response!
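The swap itself is just list manipulation, which is why it costs no model call. A minimal sketch, assuming a hypothetical standalone helper `swap_in_memory` that mirrors the logic in `compact()`:

```python
def swap_in_memory(messages, session_memory, last_summarized_index):
    """Replace the summarized prefix with one summary message; keep the unsummarized tail."""
    unsummarized = messages[last_summarized_index:]
    summary_message = {
        "role": "user",
        "content": (
            "This session is being continued from a previous conversation. "
            f"Here is the session memory:\n\n{session_memory}\n\n"
            "Continue from where we left off."
        ),
    }
    return [summary_message] + unsummarized


history = [
    {"role": "user", "content": "Draft chapter 1"},
    {"role": "assistant", "content": "Chapter 1 ..."},
    {"role": "user", "content": "Draft chapter 2"},
    {"role": "assistant", "content": "Chapter 2 ..."},
]
# Assume the background summary covered only the first two messages.
compacted = swap_in_memory(history, "Chapter 1 drafted.", last_summarized_index=2)
print(len(history), "->", len(compacted))
```

This is consistent with the `12 → 3` shown in the output above: if the pre-built memory covered the first 10 messages, one summary message plus the 2 unsummarized messages remain.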
+  markdown cell:
+    source:
+      ## Advanced: Understanding Prompt Caching
+  markdown cell:
+    source:
+      
+      The background updates can be made up to **~10x cheaper** by using prompt caching, because cached input tokens are billed at a 90% discount. The trick:
+      1. Pass the **full conversation** to the background summarizer
+      2. Add `cache_control` markers so subsequent requests hit the cache
+      3. Only the new "summarize this" instruction is billed at full price
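The arithmetic in the diagram that follows can be reproduced with a few lines of Python. This sketch takes the $3/M and $0.30/M prices and the per-turn context sizes from the diagram; like the diagram, it assumes the whole previous context is a cache hit on every turn and ignores any surcharge for cache writes. The function names are illustrative.

```python
BASE_PRICE = 3.00 / 1_000_000    # $ per uncached input token, from the diagram
CACHED_PRICE = 0.30 / 1_000_000  # $ per cached input token (90% discount)

# Total input tokens sent on each turn, taken from the diagram.
context_sizes = [500, 1500, 3000, 5000]


def cost_without_cache(sizes):
    """Every turn pays full price for the entire context."""
    return sum(size * BASE_PRICE for size in sizes)


def cost_with_cache(sizes):
    """Each turn pays the cached rate for the previous context and full price for the rest."""
    total, cached = 0.0, 0
    for size in sizes:
        new_tokens = size - cached
        total += cached * CACHED_PRICE + new_tokens * BASE_PRICE
        cached = size  # the whole context becomes the cached prefix for the next turn
    return total


without = cost_without_cache(context_sizes)
with_cache = cost_with_cache(context_sizes)
savings = 1 - with_cache / without
print(f"without: ${without:.4f}  with: ${with_cache:.4f}  savings: {savings:.0%}")
```

The exact total with caching comes to about $0.0165; the diagram's $0.0166 is the sum of the per-turn figures after each was rounded to four decimals.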
+      
+      ```
+      ┌─────────────────────────────────────────────────────────────────────────────────┐
+      │                    PROMPT CACHING FOR LONG CONVERSATIONS                        │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  WITHOUT CACHING: Pay full price for entire context every turn                 │
+      │  ════════════════════════════════════════════════════════════                   │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]                         →  500 tokens  @ $3/M │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]           → 1500 tokens  @ $3/M │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3]... → 3000 tokens  @ $3/M │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3]... → 5000 tokens  @ $3/M │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: 10,000 tokens = $0.030      │
+      │                                                                                 │
+      │                                                                                 │
+      │  WITH CACHING: Pay full price once, then 90% discount on prefix                │
+      │  ═══════════════════════════════════════════════════════════════                │
+      │                                                                                 │
+      │  Turn 1:  [System][User1][Asst1]◆                        →  500 tokens  @ $3/M │
+      │                                ▲                            (cache created)    │
+      │                          cache breakpoint                                       │
+      │                                                                                 │
+      │  Turn 2:  [System][User1][Asst1][User2][Asst2]◆                                │
+      │           ╰─────── cached ──────╯                                              │
+      │                500 @ $0.30/M + 1000 new @ $3/M  =  $0.0032                     │
+      │                                                                                 │
+      │  Turn 3:  [System][User1][Asst1][User2][Asst2][User3][Asst3]◆                  │
+      │           ╰──────────── cached ─────────────╯                                  │
+      │               1500 @ $0.30/M + 1500 new @ $3/M  =  $0.0050                     │
+      │                                                                                 │
+      │  Turn 4:  [System][User1][Asst1][User2][Asst2][User3][Asst3][User4][Asst4]◆    │
+      │           ╰───────────────────── cached ─────────────────────╯                 │
+      │                     3000 @ $0.30/M + 2000 new @ $3/M  =  $0.0069               │
+      │           ─────────────────────────────────────────────                         │
+      │                                              Total: $0.0166  (45% savings)     │
+      │                                                                                 │
+      ├─────────────────────────────────────────────────────────────────────────────────┤
+      │                                                                                 │
+      │  COMPACTION + CACHING: Double benefit                                           │
+      │  ════════════════════════════════════                                           │
+      │                                                                                 │
+      │    Main Chat                      Background Summarizer                         │
+      │    ─────────                      ─────────────────────                         │
+      │                                                                                 │
+      │  [Conversation grows...]          [Same conversation prefix]◆ + [Summarize!]   │
+      │         │                                    │                                  │
+      │         │                         Cache hit! Only pays for                      │
+      │         │                         the summarization prompt                      │
+      │         │                                    │                                  │
+      │         ▼                                    ▼                                  │
+      │  Context limit reached  ──────►  Session memory ready instantly                │
+      │                                  (built cheaply in background)                  │
+      │                                                                                 │
+      │  ┌──────────────────────────────────────────────────────────────────────────┐  │
+      │  │  Key insight: The background summarizer reuses the same conversation     │  │
+      │  │  prefix that was just sent to the main chat - automatic cache hit!       │  │
+      │  └──────────────────────────────────────────────────────────────────────────┘  │
+      │                                                                                 │
+      └─────────────────────────────────────────────────────────────────────────────────┘
+      
+      ◆ = cache_control breakpoint (cache everything before this point)
+      ```
+      
+      ### Why this matters for compaction
+      
+      | Scenario | Cost per background update | Notes |
+      |----------|---------------------------|-------|
+      | No caching | Full input cost | 5,000 tokens × $3/M = $0.015 |
+      | With caching | ~20% of uncached cost | 500 new × $3/M + 4,500 cached × $0.30/M ≈ $0.003 |
+      | **Savings** | **~80%** | Compounds over many updates |
+      
+      The longer the conversation, the bigger the savings—exactly when you need compaction most!
+  markdown cell:
+    source:
+      ### How the Caching Works
+      
+      The key is in `_add_cache_control()` and `_create_session_memory_cached()`:
+      
+      ```python
+      # 1. Mark the last conversation message with cache_control
+      {
+          "role": "user",
+          "content": [{
+              "type": "text",
+              "text": msg["content"],
+              "cache_control": {"type": "ephemeral"}  # <-- This creates a cache breakpoint
+          }]
+      }
+      
+      # 2. Also mark the system prompt
+      system=[{
+          "type": "text",
+          "text": "You are a session memory agent...",
+          "cache_control": {"type": "ephemeral"}
+      }]
+      ```
+      
+      **Why this works:**
+      - The first background update creates a cache entry for `[System + Messages]`
+      - Subsequent updates with the same message prefix get **cache hits**
+      - Only the tokens after the cached prefix (new messages plus the summarization instruction) are billed at full price
+      - Cache entries have a 5-minute TTL by default, so updates issued within that window benefit most
+      
+      **Cost math:**
+      - Without caching: 5,000 tokens × $3.00/1M = $0.015 per update
+      - With caching: 500 new tokens × $3.00/1M + 4,500 cached × $0.30/1M = $0.00285
+      - **Savings: ~80%** on background summarization costs
+  markdown cell:
+    source:
+      ## Conclusion
+      
+      In this cookbook, you learned how to manage long-running Claude conversations through session memory compaction.
+      
+      ### What We Covered
+      
+      ✅ **Effective compaction prompts** - Structure your session memory to preserve user intent, completed work, errors, active work, and key references while discarding filler
+      
+      ✅ **Instant compaction** - Use background threading to proactively build session memory, eliminating user wait time when context limits are reached
+      
+      ✅ **Prompt caching for cost savings** - Reduce background update costs by ~80% by reusing the conversation prefix cache
+      
+      ✅ **Traditional vs. instant patterns** - Understand when to use each approach based on your application needs
+      
+      ### Key Takeaways
+      
+      1. **Weight recency heavily** - The end of a conversation is the active working context
+      2. **Preserve user corrections verbatim** - Prevents the model from reverting to old behaviors
+      3. **Build memory proactively** - Don't wait for context limits; start background updates early
+      4. **Leverage prompt caching** - Background summarization can share cache with the main conversation
+      
+      ### Next Steps
+      
+      - **For agentic workflows**: See [Automatic Context Compaction](../tool_use/automatic-context-compaction.ipynb) for SDK-based automatic compaction with tool use
+      - **For production**: Consider persisting session memory to disk rather than keeping it in memory
+      - **For optimization**: Experiment with update frequency thresholds to balance cost vs. freshness

Generated by nbdime

@github-actions github-actions Bot left a comment

PR Review

Recommendation: APPROVE (with minor formatting fixes recommended)

Summary

This PR adds a high-quality cookbook demonstrating session memory compaction techniques for managing long-running Claude conversations. The notebook teaches both traditional compaction (reactive) and instant compaction (proactive with background threading), with excellent pedagogical structure and real metrics comparison.

Actionable Feedback (2 items)

Formatting Issues:

  • misc/session_memory_compaction.ipynb (in cell with class TraditionalCompactingChatSession:) - Add blank line after class docstring to meet ruff formatting requirements
  • misc/session_memory_compaction.ipynb (in cell with class InstantCompactingChatSession:) - Add blank line after class docstring to meet ruff formatting requirements

Before merging:

  • Run make check to verify linting passes
  • Run make fix to auto-fix any formatting issues

Detailed Review

Code Quality

Strengths:

  • ✅ Clean, well-documented code with comprehensive docstrings
  • ✅ Proper type hints using modern Python syntax (str | None, list[dict])
  • ✅ Thread-safe implementation using threading.Lock for shared state
  • ✅ Helper functions are well-organized and appropriately scoped
  • ✅ Error handling in background threads is appropriate for demonstration purposes
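
For readers who want to check the lock pattern in isolation, here is a minimal sketch. It is not the notebook's code: the class `BackgroundMemory` and the stub `fake_summarize` are hypothetical names, and the stub stands in for the real summarization API call. It shows the slow call running outside the lock, with the lock held only while the shared summary is written or read.

```python
import threading
from typing import Callable, Optional


class BackgroundMemory:
    """Hypothetical stand-in for a session class (illustration only)."""

    def __init__(self, summarize: Callable[[list], str]) -> None:
        self._summarize = summarize
        self._lock = threading.Lock()  # guards self._memory
        self._memory: Optional[str] = None
        self._worker: Optional[threading.Thread] = None

    def start_update(self, messages: list) -> None:
        """Summarize a snapshot of the conversation on a background thread."""
        snapshot = list(messages)  # copy so later appends do not race the worker
        self._worker = threading.Thread(
            target=self._run, args=(snapshot,), daemon=True
        )
        self._worker.start()

    def _run(self, snapshot: list) -> None:
        summary = self._summarize(snapshot)  # slow work happens outside the lock
        with self._lock:
            self._memory = summary

    def get_memory(self, timeout: Optional[float] = None) -> Optional[str]:
        """Return the latest summary, waiting up to `timeout` for a running update."""
        if self._worker is not None:
            self._worker.join(timeout)
        with self._lock:
            return self._memory


def fake_summarize(messages: list) -> str:
    # Stub: a real implementation would call the model here.
    return f"{len(messages)} messages summarized"


session = BackgroundMemory(fake_summarize)
session.start_update(
    [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]
)
result = session.get_memory(timeout=5)
```

Taking a snapshot before starting the thread is the design choice that matters most here: the main chat can keep appending messages while the worker reads a stable copy.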

Minor improvements:

  • Consider slightly more specific type annotations (e.g., list[dict[str, str]] instead of list[dict])
  • Type safety is generally good throughout

Security

Excellent - No issues found:

  • ✅ Uses dotenv.load_dotenv() for API key management
  • ✅ No hardcoded credentials
  • ✅ Follows secure authentication patterns from project guidelines
  • .env.example updated appropriately (newline added)

API Usage & Models

Perfect compliance:

  • ✅ Uses current Claude model: claude-sonnet-4-5-20250929
  • ✅ Proper use of Anthropic SDK with message formatting
  • ✅ Correct implementation of prompt caching with cache_control parameter
  • ✅ Appropriate max_tokens values for different use cases
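
For reference, the cache_control marking can be sketched as below. This is an illustration under assumptions, not the notebook's actual helper: the function name `mark_last_message_cached` is hypothetical, and it assumes each message's `content` is either a plain string or a non-empty list of content blocks. The block shape (`type`, `text`, `cache_control: {"type": "ephemeral"}`) is the one shown in the notebook excerpt.

```python
import copy


def mark_last_message_cached(messages: list) -> list:
    """Return a copy of `messages` with a cache breakpoint on the last message.

    Hypothetical helper for illustration only.
    """
    marked = copy.deepcopy(messages)  # never mutate the caller's history
    if not marked:
        return marked

    last = marked[-1]
    if isinstance(last["content"], str):
        # Wrap the string in a text block so it can carry cache_control.
        last["content"] = [{"type": "text", "text": last["content"]}]
    # Breakpoint on the final block: everything up to here becomes the cached prefix.
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "Draft chapter one."},
    {"role": "assistant", "content": "Here is a draft..."},
    {"role": "user", "content": "Make the opening darker."},
]
cached = mark_last_message_cached(history)
```

Working on a deep copy keeps the stored history free of breakpoints, so a new breakpoint can be placed on whichever message is last at the next update.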

Educational Value

Outstanding - This is exemplary:

  • ✅ Introduction hooks with the problem (context limits) before explaining solutions
  • ✅ Learning objectives clearly stated upfront
  • ✅ ASCII diagrams comparing traditional vs instant compaction are very effective
  • ✅ Real execution with actual token metrics demonstrates value
  • ✅ Comprehensive session memory prompt template that readers can adapt
  • ✅ "What We Covered" section maps back to learning objectives with checkmarks
  • ✅ Practical creative writing assistant scenario is relatable

Structure follows best practices:

  • Prerequisites and setup section is well-organized
  • Text BEFORE code explains what it will do
  • Text AFTER sections explains what was learned
  • Related cookbook (automatic-context-compaction.ipynb) appropriately linked

Project Standards Compliance

Excellent adherence to CLAUDE.md:

  • ✅ Outputs kept in notebook (intentional for demonstration per rule #4)
  • ✅ Uses %%capture for pip installs
  • ✅ Minimal, appropriate dependencies (anthropic, python-dotenv)
  • ✅ One clear concept per notebook
  • ✅ Notebook tested top-to-bottom (outputs present)

Registry & Authors:

  • ✅ New author jsham042 properly added to authors.yaml with all required fields
  • ✅ Registry entry includes title, description, path, authors, date, categories
  • ✅ Categories ("Agent Patterns", "Responses") are appropriate

Key Technical Highlights

  1. Prompt Caching Explanation: The detailed diagram and explanation in the notebook showing 80% cost reduction is outstanding
  2. Threading Implementation: Proper background thread management with error handling
  3. Session Memory Prompt: Comprehensive template with analysis instructions, formatting rules, and compression guidelines
  4. Comparison Metrics: Side-by-side comparison of traditional vs instant compaction with real token counts and timing
  5. Message Formatting: Correct use of add_cache_control() helper to ensure consistent message structure for caching
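
The cost figures cited in the notebook can be re-derived with a few lines of arithmetic. The sketch below uses only the numbers the notebook states ($3/M for uncached input, $0.30/M for cached reads, and the token counts from the four-turn diagram). Like the notebook, it ignores output tokens and any surcharge on cache writes, so treat it as a check of the notebook's own math rather than a billing estimate. The function name `request_cost` is made up for this example.

```python
INPUT_PRICE = 3.00 / 1_000_000   # $ per uncached input token (notebook's figure)
CACHED_PRICE = 0.30 / 1_000_000  # $ per cached input token (90% discount)


def request_cost(cached_tokens: int, new_tokens: int) -> float:
    """Input cost of one request, split into cached and uncached tokens."""
    return cached_tokens * CACHED_PRICE + new_tokens * INPUT_PRICE


# Single background update: 5,000-token context, 4,500 tokens already cached.
uncached_update = request_cost(0, 5_000)
cached_update = request_cost(4_500, 500)
update_savings = 1 - cached_update / uncached_update

# Four-turn diagram: (cached prefix, new tokens) for each turn.
turns = [(0, 500), (500, 1_000), (1_500, 1_500), (3_000, 2_000)]
no_cache_total = sum(request_cost(0, c + n) for c, n in turns)
cache_total = sum(request_cost(c, n) for c, n in turns)
diagram_savings = 1 - cache_total / no_cache_total

print(f"update:  ${uncached_update:.5f} -> ${cached_update:.5f} ({update_savings:.0%} saved)")
print(f"4 turns: ${no_cache_total:.4f} -> ${cache_total:.4f} ({diagram_savings:.0%} saved)")
```

The per-update saving comes out at about 81% and the four-turn total at about 45%, which matches the "~80%" and "45% savings" figures in the notebook.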

Suggestions

Production Considerations (not blocking):

  • The background thread error handling (just printing) is fine for a cookbook, but production use would benefit from more robust error handling/logging
  • Consider adding a note about this in a "Production Considerations" section
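
One possible shape for that more robust handling, as a hedged sketch: wrap the background update so failures are logged with a traceback and recorded for the main thread to inspect. The names `run_update_safely` and `failing_update` and the logger name `session_memory` are hypothetical; only the standard-library `logging` and `threading` modules are used.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("session_memory")  # hypothetical logger name


def run_update_safely(update_fn: Callable[[], None], errors: list) -> None:
    """Run update_fn; log any failure with a traceback and record it for the caller."""
    try:
        update_fn()
    except Exception as exc:  # keep the worker from failing silently
        logger.exception("Background session-memory update failed")
        errors.append(exc)


def failing_update() -> None:
    # Stand-in for a summarization call that raises.
    raise RuntimeError("simulated API failure")


errors: list = []
worker = threading.Thread(
    target=run_update_safely, args=(failing_update, errors), daemon=True
)
worker.start()
worker.join()
```

With the error recorded, the main session can decide to fall back to traditional (blocking) compaction when the background memory is missing or stale, instead of discovering the failure only at the context limit.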

Code Organization:

  • Helper functions could potentially be moved to a separate module for reuse, though keeping them inline is appropriate for a self-contained notebook

Testing

Appears complete:

  • Notebook has been executed (outputs visible)
  • Real API calls with actual token metrics shown
  • Thread safety appears sound (proper lock usage)

Before merge:

  • Run make check and make fix to address formatting
  • Verify model ID is current (appears correct: claude-sonnet-4-5-20250929)

Positive Notes

This cookbook teaches a sophisticated, production-relevant pattern that solves a real problem developers face with long-running conversations. The instant compaction approach with background threading is particularly clever - building memory proactively rather than reactively eliminates user wait time. The 80% cost reduction through prompt caching is a bonus.

The pedagogical structure is exemplary: problem → learning objectives → step-by-step implementation → comparison → reflection → next steps. The ASCII diagrams, real metrics, and practical examples make this highly educational.


Overall: This is high-quality work that provides substantial value to the cookbook collection. The only blockers are minor formatting issues that can be auto-fixed with make fix. Strongly recommend merging after running linting tools.

@jsham042 jsham042 requested a review from PedramNavid January 27, 2026 19:37
@jsham042

Contributor Author

@PedramNavid can you review and see if this is ok to merge

@jsham042 jsham042 merged commit ce4c093 into main Jan 30, 2026
9 checks passed