fix(content): 13.2/13.3 cost math executes as printed, token equivalences

yeasy · yeasy · commit 6afb600b9ec9 · 2026-06-09T16:02:45.000-07:00
- 13.2.13: prose, code, and printed output now agree (Opus 4.6 $5/$25,
  avg 175K history -> traditional $878.75, cached $106.00, batch
  $439.38, breakeven 571; previous printed block did not match what the
  code computes - verified by running it); summary attribution corrected
  to Opus 4.6 and $106.00
- 1M-cache scenario: cache reads bill the 800K cached portion, not 1M
  (prose $1.50 -> $1.40, code fixed in both spots)
- chars-per-token unified: 1 hanzi ~ 1-1.5 tokens across the estimator,
  the equivalence table (was '3-4 token/char' -> 300-350K chars/1M) and
  ch13 summary
- 200K context shipped Nov 2023 (Claude 2.1), not mid-2024
- mixed-language system-prompt strings cleaned; savings range 80-94% ->
  80-90% (90% is the cache-read ceiling); 13.3 dangling sentence fixed
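
The figures quoted above can be re-derived with a short standalone sketch. It is not the code from 13.2.13: the function names `conversation_costs` and `cached_request_input_cost` are hypothetical, and the prices are the Opus 4.6 values assumed in this commit ($5 input, $25 output, $6.25 cache write, $0.50 cache read, per million tokens).

```python
# Assumed prices in USD per million tokens (Opus 4.6 figures quoted in this commit).
INPUT, OUTPUT, CACHE_WRITE, CACHE_READ = 5.0, 25.0, 6.25, 0.5
M = 1_000_000


def conversation_costs(num_messages=1000, avg_input=200, avg_output=150,
                       avg_history=175_000, cached_history=200_000):
    """Recompute the 13.2.13 comparison under the commit's stated assumptions."""
    output = avg_output * num_messages * OUTPUT / M
    # Traditional: every request resends the history (average size, current message included).
    traditional = avg_history * num_messages * INPUT / M + output
    # Cached: one cache write, then per-request new input plus cache reads.
    cached = (cached_history * CACHE_WRITE / M
              + avg_input * num_messages * INPUT / M
              + cached_history * num_messages * CACHE_READ / M
              + output)
    return {
        "traditional": round(traditional, 2),
        "cached": round(cached, 2),
        "batch": round(traditional * 0.5, 2),  # 50% Batch API discount
        "savings_percent": round((traditional - cached) / traditional * 100, 1),
        # Rough measure: messages until new tokens add up to the cached size.
        "breakeven_messages": int(cached_history / (avg_input + avg_output)),
    }


def cached_request_input_cost(total=1_000_000, cached_fraction=0.8):
    """Input-side cost of one request when only the cached share is billed as cache reads."""
    cached = int(total * cached_fraction)
    return ((total - cached) * INPUT + cached * CACHE_READ) / M


if __name__ == "__main__":
    print(conversation_costs())
    print(cached_request_input_cost())
```

With the defaults, the first call should reproduce the traditional, cached, and breakeven values listed for 13.2.13, and the second the $1.40 per-request input cost of the 1M-cache scenario.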
diff --git a/13_advanced/13.2_infinite_chats.md b/13_advanced/13.2_infinite_chats.md
@@ -149,10 +149,10 @@ class LongConversationManager:
     def estimate_tokens(self, text: str) -> int:
         """粗估文本 token 数（简化版）"""
         # 英文：约 1 token 每 4 字符
-        # 中文：约 1 token 每 1.3 字符
+        # 中文：1 个汉字约合 1-1.5 个 token（随分词器浮动，这里取 1.2）
         english_chars = sum(1 for c in text if ord(c) < 128)
         chinese_chars = len(text) - english_chars
-        return int(english_chars / 4 + chinese_chars / 1.3)
+        return int(english_chars / 4 + chinese_chars * 1.2)
 
     def should_summarize(self) -> bool:
         """检查是否应该进行总结"""
@@ -250,7 +250,7 @@ class LongConversationManager:
         response = self.client.messages.create(
             model=self.model,
             max_tokens=2000,
-            system=system_prompt or "You are a helpful assistant engaged in a 长对话.",
+            system=system_prompt or "You are a helpful assistant engaged in a long-running conversation.",
             messages=context_messages
         )
 
@@ -525,7 +525,7 @@ class RobustLongConversationManager:
                 response = self.client.messages.create(
                     model=self.model,
                     max_tokens=2000,
-                    system=system_prompt or "You are a helpful assistant engaged in a 长对话.",
+                    system=system_prompt or "You are a helpful assistant engaged in a long-running conversation.",
                     messages=context_messages,
                     timeout=60.0  # 设置超时
                 )
@@ -939,15 +939,15 @@ graph TD
 
 ### 13.2.10 1M Token 窗口的现实
 
-从 2024 年中期开始，Claude 的上下文窗口已升至 200K token。到 Claude Opus 4.6/4.7/4.8 与 Sonnet 4.6，Claude API 长上下文能力已扩展到 1M token 档位，并按标准 API token 价格计费；但 Microsoft Foundry 等平台可能仍有独立上限，账号、平台和区域可用性仍要按官方模型页与价格页核验。上下文管理的重点也从“能不能塞进去”逐步转向“如何高质量地利用超长上下文”。
+Claude 早在 2023 年 11 月（Claude 2.1）就把上下文窗口升至 200K token，并在 Claude 3 全系延续。到 Claude Opus 4.6/4.7/4.8 与 Sonnet 4.6，Claude API 长上下文能力已扩展到 1M token 档位，并按标准 API token 价格计费；但 Microsoft Foundry 等平台可能仍有独立上限，账号、平台和区域可用性仍要按官方模型页与价格页核验。上下文管理的重点也从“能不能塞进去”逐步转向“如何高质量地利用超长上下文”。
 
 **1M Token 等价于（现实估算）**
 
 换算表（基于实际测试数据）：
 
 | 内容类型 | 数量 | 说明 |
 |---------|------|------|
-| 中文文本 | 约 300,000-350,000 字 | 基于平均每个中文字符占 3-4 token |
+| 中文文本 | 约 700,000-1,000,000 字 | 1 个汉字约合 1-1.5 个 token（随分词器与中英混排浮动） |
 | 英文文本 | 约 250,000-300,000 单词 | 基于平均每个英文单词占 1.3-1.5 token |
 | 代码行数 | 约 150,000-200,000 行 | 取决于代码密度和缩进 |
 | 文档页数 | 约 2,000-2,500 页 A4 | 单倍行距、11 号字体、包含代码和图表 |
@@ -1173,8 +1173,8 @@ Question: {question}"""
 场景 3: 使用提示缓存优化
 - 假设 80% 的 token 是可缓存的系统内容（文档、代码库）
 - 缓存输入成本：800K × $6.25/百万 = $5.00（一次性，1.25x × $5）
-- 实际请求输入成本：200K × $5/百万 + 1M × $0.50/百万缓存读取 = $1.50
-- 单次请求输入侧约从 $5 降到 $1.50，节省约 70%；还要把首次缓存写入成本按复用次数摊销
+- 实际请求输入成本：200K × $5/百万 + 800K × $0.50/百万缓存读取 = $1.40
+- 单次请求输入侧约从 $5 降到 $1.40，节省约 72%；还要把首次缓存写入成本按复用次数摊销
 
 ```python
 def calculate_1m_token_cost(scenario: str = "basic") -> dict:
@@ -1193,10 +1193,11 @@ def calculate_1m_token_cost(scenario: str = "basic") -> dict:
         },
         "with_cache": {
             "cache_setup_cost": 800_000 * cache_input_price,
-            "per_request_cost": 200_000 * input_price + 1_000_000 * cache_read_price + 2000 * output_price,
+            # 缓存读取只发生在已缓存的 800K 上，未缓存的 200K 按全价输入
+            "per_request_cost": 200_000 * input_price + 800_000 * cache_read_price + 2000 * output_price,
             "monthly_50_requests": (
                 800_000 * cache_input_price +  # 一次性缓存
-                (200_000 * input_price + 1_000_000 * cache_read_price + 2000 * output_price) * 50
+                (200_000 * input_price + 800_000 * cache_read_price + 2000 * output_price) * 50
             )
         },
         "batch_api": {
@@ -1245,11 +1246,11 @@ print("Batch API:", calculate_1m_token_cost("batch_api"))
 **使用提示缓存的长对话管理**
 - 缓存早期对话内容：200K token
 - 缓存写入成本（一次性）：200K × $6.25/M = $1.25
-- 每个请求的实际输入（仅新消息）：300 token × $5/M × 1000 = $1.50
+- 每个请求的实际输入（仅新消息）：200 token × $5/M × 1000 = $1.00
 - 缓存读取成本：200K × $0.5/M × 1000 = $100
 - 总输出：150 token × $25/M × 1000 = $3.75
-- **总成本：$106.50**
-- **节省：88% ($772.25)**
+- **总成本：$106.00**
+- **节省：88% ($772.75)**
 
 **使用 Batch API 的长对话（非实时）**
 - 采用 Batch API 的 50% 折扣（应用于传统模式全部成本）
@@ -1263,8 +1264,13 @@ from typing import Dict
 
 def compare_conversation_costs(model: str, num_messages: int,
                                avg_input_tokens: int = 200,
-                               avg_output_tokens: int = 150) -> Dict:
-    """对比不同方案的成本"""
+                               avg_output_tokens: int = 150,
+                               avg_history_tokens: int = 175_000) -> Dict:
+    """对比不同方案的成本
+
+    avg_history_tokens: 传统模式下每个请求携带的平均历史规模
+    （历史从 0 线性增长到约 350K，均值约 175K，已含当前消息）
+    """
 
     prices = {
         "opus": {"input": 5.0, "output": 25.0, "cache_write": 6.25, "cache_read": 0.5},
@@ -1273,10 +1279,10 @@ def compare_conversation_costs(model: str, num_messages: int,
     }
 
     p = prices.get(model, prices["sonnet"])
-    cached_history = 200_000  # 200K token 的缓存历史
+    cached_history = 200_000  # 缓存方案中固定缓存的早期历史规模
 
-    # 传统模式：每个请求都包含全部历史
-    traditional_input = (cached_history * num_messages + avg_input_tokens * num_messages) * p["input"] / 1_000_000
+    # 传统模式：每个请求都包含全部历史（按平均规模计）
+    traditional_input = avg_history_tokens * num_messages * p["input"] / 1_000_000
     traditional_output = avg_output_tokens * num_messages * p["output"] / 1_000_000
     traditional_total = traditional_input + traditional_output
 
@@ -1301,6 +1307,7 @@ def compare_conversation_costs(model: str, num_messages: int,
         "batch_cost": round(batch_cost, 2),
         "savings_with_cache": round(savings, 2),
         "savings_percent": round(savings_percent, 1),
+        # 粗略口径：新增 token 累计达到缓存规模所需的消息数
         "breakeven_messages": int(cached_history / (avg_input_tokens + avg_output_tokens))
     }
 
@@ -1312,19 +1319,19 @@ print(compare_conversation_costs("opus", 1000))
 #   "model": "opus",
 #   "num_messages": 1000,
 #   "traditional_cost": 878.75,
-#   "cached_cost": 106.50,
+#   "cached_cost": 106.0,
 #   "batch_cost": 439.38,
-#   "savings_with_cache": 772.25,
+#   "savings_with_cache": 772.75,
 #   "savings_percent": 87.9,
-#   "breakeven_messages": 158
+#   "breakeven_messages": 571
 # }
 ```
 
 ### 13.2.14 何时使用不同的成本优化策略
 
 | 场景 | 推荐方案 | 原因 |
 |-----|--------|------|
-| 实时客服对话 | 传统模式或提示缓存 | 需要低延迟，提示缓存可节省 80-94% 成本 |
+| 实时客服对话 | 传统模式或提示缓存 | 需要低延迟，提示缓存可节省约 80-90% 成本 |
 | 研究助手（几周）| 长对话管理 + 缓存 | 长期对话，稳定历史文档可通过缓存降低重复输入成本 |
 | 批量分析任务 | Batch API | 可以接受 24 小时延迟，50% 折扣 |
 | 代码库分析（固定） | 提示缓存 | 代码库内容固定，缓存成本极低 |
@@ -1346,7 +1353,7 @@ class InfiniteChatBestPractices:
 
     @staticmethod
     def initialize_chat(topic: str, context: str) -> dict:
-        """初始化一个新的 长对话"""
+        """初始化一个新的长对话"""
 
         # 1. 明确定义话题和目标
         system_message = f"""
diff --git a/13_advanced/13.3_context_engineering.md b/13_advanced/13.3_context_engineering.md
@@ -1062,7 +1062,7 @@ if __name__ == "__main__":
         {
             "id": "doc_1",
             "title": "Claude 能力概览",
-            "content": "Claude Sonnet 4.6 是 Anthropic 推出的通用高性能模型。支持 1M token 上下文窗口，",
+            "content": "Claude Sonnet 4.6 是 Anthropic 推出的通用高性能模型，支持 1M token 上下文窗口。",
         },
         {
             "id": "doc_2",
diff --git a/13_advanced/summary.md b/13_advanced/summary.md
@@ -75,7 +75,7 @@
 
 **1000 条消息持续对话**（Claude Opus 4.6，$5/$25 计价，见 13.2.13）：
 - 传统模式：$878.75（每个请求都包含完整历史）
-- 使用缓存：$106.50（只包含相关部分，节省 88%）
+- 使用缓存：$106.00（只包含相关部分，节省 88%）
 - Batch API：$439.38（允许延迟处理，节省 50%）
 
 **关键推荐**：对于需要低延迟的应用使用提示缓存，对于非实时任务使用 Batch API
@@ -182,7 +182,7 @@ RAG（检索增强生成）是上下文工程的实践工具：
 | 方案 | 总成本 | 相对传统模式 |
 |-----|------|-----------|
 | 传统模式（无优化） | $878.75 | 100% |
-| 使用提示缓存 | $106.50 | 12% (节省 88%) |
+| 使用提示缓存 | $106.00 | 12% (节省 88%) |
 | Batch API（50% 折扣） | $439.38 | 50% (节省 50%) |
 
 **关键洞察**：对于长对话，提示缓存可节省 88% 的成本，是最经济的长对话方案。
@@ -191,7 +191,7 @@ RAG（检索增强生成）是上下文工程的实践工具：
 
 | 时期 | 窗口大小 | 等价中文字数 | 应用限制 |
 |-----|---------|-----------|--------|
-| 当前 | 1M | 约 70 万-120 万（随分词器与中英混排浮动） | 完整项目、多源融合 |
+| 当前 | 1M | 约 70 万-100 万（随分词器与中英混排浮动） | 完整项目、多源融合 |
 | 近期（预期） | 更高效的 1M 利用方式 | 同上 | 更长任务链、更复杂上下文编排 |
 | 长期 | 应用侧外部记忆 + 模型窗口 | 取决于检索和压缩 | 全知识库、完整历史需外部存储与检索 |
 
