refactor(memory): clean up IdleMaint scheduling + LLM timeout safeguards + redesign review snapshot and enable thinking (Project-N-E-K-O#977)

wehos · Hongzhi Wen · claude · web-flow · commit fea9d9ed9169 · 2026-04-27T10:19:58.000+08:00
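* refactor(memory): split IdleMaint gate + stagger background loops + switch Stage-2 tier to summary

A1. Lower Outbox replay concurrency from 4 to 2. This softens the LLM backend load spike at startup after a 24h outage.
A2. IdleMaint subtask 2 (persona contradiction review) is no longer gated by recent_memory_auto_review or REVIEW_SKIP_HISTORY_LEN. resolve_corrections does not read recent history. It is an independent contradiction-resolution pipeline and should never have shared a gate with recent.review. The review gate moves to the head of subtask 3.
A3. Add staggered _INITIAL_DELAY_* values to the 5 background loops, so their first iterations do not all land on the same startup + interval instant:
- IdleMaint: 20s (replaces the old high-frequency startup_phase polling)
- Signal extraction: 60s
- Rebuttal: 100s
- Auto-promote: 150s (50s after rebuttal)
- Archive sweep: 250s (far below INTERVAL=3600s, so users with short sessions still get at least one run)
Also fixed a latent busy loop: the `except: continue` path never slept. Every except branch now does `await asyncio.sleep(INTERVAL)`.
A4. Change the Stage-2 signal detection tier from correction to summary, to match the PR Project-N-E-K-O#972 docstring. Also move promotion merge from the summary list to the correction list in memory/__init__.py and neko-guide.md, to match the actual value of EVIDENCE_PROMOTION_MERGE_MODEL_TIER.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(memory): explicit timeout on every LLM call site + disable SDK auto-retry

Previously no memory/ LLM call passed a timeout. The worst case was therefore the OpenAI SDK default timeout of 600s per attempt × 3 attempts (1 initial + the default max_retries=2), which is 30 minutes for a single call. recent.py and facts.py also retry at the business layer (max_retries=3). Stacked together, one logical call could take up to 1.5 hours in the worst case (3 business attempts × 30 min).

Each call site now gets max_retries=0 and an explicit timeout chosen by the nature of its call path:
- recall._fine_rank: 8s (request path; upstream query_memory truncates at 5s)
- recent.compress_history / further_compress (_get_llm): 30s (request path)
- recent.review_history (_get_review_llm): 120s (background, long prompt)
- persona._resolve_corrections_locked: 90s (holds a lock that stalls the /process path)
- fact_dedup._aresolve_locked: 60s (holds a lock but only blocks the background worker)
- facts._allm_call_with_retries (Stage-1/2/negative-keyword): 60s default
- reflection._synthesize_reflections_locked: 90s (holds a lock, emits multi-field JSON)
- reflection._check_feedback_locked: 60s (background classification)
- reflection.check_feedback_for_confirmed: 60s (periodic rebuttal scan)
- reflection._allm_call_promotion_merge: 45s (short decision prompt)

max_retries=0 consolidates all retrying in the business layer (the existing _allm_call_with_retries and similar helpers). Without it, the SDK default max_retries=2 would stack on top and triple every attempt. An SDK timeout now goes straight to the business-layer retry or is caught by an outer try/except.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(memory): redesign review around snapshot + capacity; enable thinking for Stage-2 / review_history

Phase C: review scheduling changes from "cancel and restart on every /process" to "one shared spawn gate, never interrupt":
- /process, /renew and /settle no longer cancel a running review. They all call a shared maybe_spawn_review(name), which skips the spawn if a review is already in flight.
- IdleMaint subtask 3 also calls maybe_spawn_review. All inline gates are removed.
- maybe_spawn_review uses a per-name asyncio.Lock to serialize gate + spawn, and it checks 5 gates:
  - in-flight
  - review_enabled
  - history_len
  - min_interval (×2 while active)
  - user messages accumulated since the last cutoff ≥ MIN_NEW_MSGS_FOR_REVIEW (5)
- REVIEW_MIN_INTERVAL drops from 300s to 30s. MIN_NEW_MSGS=5 and the ×2 active multiplier jointly rate-limit it.

review_history takes a snapshot argument (a copy of the history taken at spawn time):
- The LLM input is the snapshot, and the current history is left untouched. /process can therefore keep appending, compressing, etc. in the meantime.
- On completion, the K=3 fingerprint of the snapshot's last messages locates cutoff_idx in the current history.
- Walking backwards from cutoff_idx yields capacity, the length of the contiguous match. The last min(capacity, len(corrected)) messages of corrected replace the slot [cutoff_idx-capacity+1, cutoff_idx]. Messages added after cutoff_idx are kept untouched.
- If the review output is shorter than capacity, the review decided to delete messages, so the result is shorter than before.
- If the cutoff no longer matches in the current history (it was compressed, or cleared by /new_dialog), review_history returns 'white' (a no-op review). The caller then sets last_reviewed_cutoff_tail to None. The next gate evaluation treats the new-message count as ∞ and lets it through, so the review reruns immediately and rebuilds the fingerprint.
- Any SystemMessage (summary memo) in the review LLM output is always discarded, to protect the compression boundary.

New persisted field: _maint_state[name].last_reviewed_cutoff_tail (K=3 fingerprint).

Phase D: enable thinking:
- Stage-2 signal detection ([memory/facts.py](memory/facts.py) _allm_detect_signals): explicitly passes extra_body=None to skip auto-resolution, so thinking models respond with their default behavior. timeout is raised to 90s. The task is judging new_fact × existing_observation relations and picking target_id. The existing defensive code already compensates for LLM hallucinations, and thinking should reduce target_id mismatches. It runs fully in the background with nobody waiting.
- review_history (recent.py:_get_review_llm): explicitly passes extra_body=None to enable thinking. After the Phase C redesign, review holds no manager lock, does not block the user path, and can safely run concurrently. Thinking is therefore pure upside here, since rewriting history requires dense judgment. timeout stays at 120s.

The _allm_call_with_retries helper gains an extra_body parameter:
- The default sentinel means "caller did not specify; let create_chat_llm resolve it".
- Explicit None means "enable thinking".
This leaves Stage-1 fact extraction and the negative keyword check unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(memory): add a semantic comment to the empty except in maybe_spawn_review

Responds to the github-code-quality bot's inline note on PR Project-N-E-K-O#977. The comment now explicitly states the intent of the `pass` when parsing last_review_ts fails: treat it as "never reviewed".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory): address three CodeRabbit review comments (review snapshot self-healing / identity check / patched fingerprint)

Issue #1 (a white review must not refresh last_review_ts, Major):
When review_history returned 'white', the old code also updated last_review_ts. Gate 4 (min_interval) would then keep blocking for 30/60s. That defeats the intent, because a white review is itself a strong signal that the cutoff is stale and the anchor should be rebuilt as soon as possible.
Now a white review only clears the fingerprint and leaves last_review_ts alone. On the next round, gate 4 uses the old ts (usually already past the threshold) and gate 5 sees ∞ and passes, so the review reruns immediately.

Issue #2 (finally cleanup could delete a newly spawned task's handles, Critical):
The old finally block unconditionally called pop()/clear(). Under concurrency, this could delete the correction_tasks / correction_cancel_flags entries that maybe_spawn_review had just written. In theory, the spawn lock plus asyncio's finally semantics already rule out this race: done() does not return True until finally completes, so maybe_spawn_review's in-flight check cannot pass. An identity check is still cheap defense.
Now entries are popped only after an identity comparison against asyncio.current_task() / cancel_event, so a task clears only its own entries.

Issue #3 (the success path must return the patched fingerprint, Major):
review_history used to return True, and the caller ran build_review_fingerprint on the snapshot. The review may have rewritten any of the last K messages, though. In that case the stored old fingerprint can never be found in the new history. The next _count_new_user_msgs_since_last_review then degenerates to an always-true ∞, gate 5 becomes a no-op, and every /process triggers a review. Now:
- review_history returns a (status, fingerprint) tuple: ('patched', new_fp) / ('white', None) / ('failed', None)
- new_fp is computed inside review_history from the tail of the patched new_history
- The caller writes it straight into maint_state instead of calling build_review_fingerprint(snapshot)

Also handled:
- An empty corrected is treated as a white review. This is the rare case where the LLM returns an empty "corrected conversation", and the change keeps the anchor from drifting outside the reviewed region.
- _run_review_in_background takes cancel_event as an explicit parameter instead of reading it from the dict, to pair with the identity check.
- Verified by hand: the OLD fingerprint (snapshot tail) is not found in the patched history, and the NEW fingerprint (patched tail) is ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory): normalize review LLM output content to str (CodeRabbit Issue #4)

The review prompt asks the LLM to return {role, content} JSON, but thinking models occasionally emit content as a list/dict (multimodal segment style). The old code passed it straight into HumanMessage(content=...). Downstream consumers crash on non-string values: recall, prompt build, and the content[:50] slice in fingerprint comparison.

The fix reuses compress_history's existing normalization: for a list, join each item's dict.text or str(item); for anything else, use str(item).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hongzhi Wen <cartabio.coder1@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>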
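The Phase C splice step (locate the cutoff by fingerprint, measure capacity, replace the slot, keep newer messages) can be sketched as follows. This is a minimal illustration under assumptions, not the actual `review_history` code:
- Messages are plain `(role, content)` tuples.
- A fingerprint is role plus `content[:50]` for each of the last K=3 messages. The commit only mentions a `content[:50]` comparison, so the exact format is assumed.
- The most recent match wins.
- `splice_review`, `locate_cutoff` and `measure_capacity` are hypothetical names.

```python
from typing import List, Optional, Tuple

Message = Tuple[str, str]  # hypothetical (role, content) stand-in for chat messages
K = 3                      # fingerprint length used by the commit


def _key(m: Message) -> Tuple[str, str]:
    """Comparison key: role plus the first 50 characters of content (assumed format)."""
    return (m[0], m[1][:50])


def fingerprint(messages: List[Message]) -> List[Tuple[str, str]]:
    """Keys of the last K messages."""
    return [_key(m) for m in messages[-K:]]


def locate_cutoff(history: List[Message], fp: List[Tuple[str, str]]) -> Optional[int]:
    """Index of the last message of the most recent window matching fp, or None."""
    k = len(fp)
    if k == 0:
        return None
    for end in range(len(history) - 1, k - 2, -1):
        if [_key(m) for m in history[end - k + 1:end + 1]] == fp:
            return end
    return None


def measure_capacity(history: List[Message], snapshot: List[Message], cutoff_idx: int) -> int:
    """Walk backwards from cutoff_idx while history still lines up with the snapshot tail."""
    capacity = 0
    while (capacity <= cutoff_idx and capacity < len(snapshot)
           and _key(history[cutoff_idx - capacity]) == _key(snapshot[-1 - capacity])):
        capacity += 1
    return capacity


def splice_review(history: List[Message], snapshot: List[Message], corrected: List[Message]):
    """Return (status, new_fingerprint, new_history); status is 'patched' or 'white'."""
    corrected = [m for m in corrected if m[0] != "system"]  # drop summary memos
    cutoff_idx = locate_cutoff(history, fingerprint(snapshot))
    if cutoff_idx is None or not corrected:
        return "white", None, history  # anchor lost or empty output: no-op review
    capacity = measure_capacity(history, snapshot, cutoff_idx)  # >= 1 once located
    take = min(capacity, len(corrected))
    replacement = corrected[len(corrected) - take:]
    start = cutoff_idx - capacity + 1
    # Messages appended after cutoff_idx during the review are kept as-is.
    new_history = history[:start] + replacement + history[cutoff_idx + 1:]
    # Fingerprint the end of the patched slot, not the newly appended messages.
    new_fp = fingerprint(new_history[:start + take])
    return "patched", new_fp, new_history


history = [("user", "a"), ("assistant", "b"), ("user", "c"), ("assistant", "d")]
snapshot = list(history)          # taken when the review is spawned
history.append(("user", "e"))     # /process keeps appending meanwhile
corrected = [("system", "memo"), ("user", "a"), ("assistant", "B!"),
             ("user", "c"), ("assistant", "d")]
status, fp, patched = splice_review(history, snapshot, corrected)
print(status, patched)  # patched, with ("user", "e") still at the end
```

The example also shows why Issue #3 matters. The snapshot's old fingerprint includes `("assistant", "b")`, which the review rewrote, so it can no longer be located in `patched`. The returned `fp` still can be.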
diff --git a/.agent/rules/neko-guide.md b/.agent/rules/neko-guide.md
@@ -13,7 +13,7 @@ trigger: always_on
 - **Auxiliary LLM call conventions (memory/ + utils/)**:
   - **Do not send `temperature`**: none of `utils.llm_client.create_chat_llm` / `ChatOpenAI` or their wrapper helpers pass `temperature=...`; the default `None` means "not written into the request body". Rationale: (1) compatibility with o1/o3/gpt-5-thinking/Claude extended-thinking; (2) per-task temperatures introduce hard-to-reproduce regressions. Enforced by `scripts/check_no_temperature.py` (CI: `.github/workflows/analyze.yml`).
   - **Take the model from the tier, no hardcoded fallback**: every LLM call obtains the model/base_url/api_key triple via `config_manager.get_model_api_config(<tier>)`. Do not write fallbacks such as `api_config.get('model', SETTING_PROPOSER_MODEL)`; `SETTING_PROPOSER_MODEL` / `SETTING_VERIFIER_MODEL` were retired in 2026-04. If a tier is not configured, let the API reject the request outright. That is safer than silently falling back to qwen-max.
-  - **memory submodules pick a tier by responsibility**: fact extraction / signal detection / reflection / promotion merge / fact dedup / recall rerank use `summary`; recent.review + persona.correction use `correction`. Do not add hardcoded model names for one-off call sites.
+  - **memory submodules pick a tier by responsibility**: fact extraction / signal detection / reflection synthesis / fact dedup / recall rerank use `summary`; recent.review + persona.correction + promotion merge use `correction`. Do not add hardcoded model names for one-off call sites.
 
 ## Code style
 
diff --git a/config/__init__.py b/config/__init__.py
@@ -1237,7 +1237,7 @@ def translate_value(val):
 # Gate 3: LLM tier selection (candidates: see the Gate 3 table in RFC §6.5)
 # "summary" = qwen-plus class; "correction" = qwen-max class; "emotion" = qwen-flash class
 EVIDENCE_EXTRACT_FACTS_MODEL_TIER = "summary"       # Stage-1 fact extraction
-EVIDENCE_DETECT_SIGNALS_MODEL_TIER = "correction"   # Stage-2 signal mapping
+EVIDENCE_DETECT_SIGNALS_MODEL_TIER = "summary"      # Stage-2 signal mapping
 EVIDENCE_NEGATIVE_TARGET_MODEL_TIER = "emotion"     # Secondary keyword check (latency-sensitive)
 EVIDENCE_PROMOTION_MERGE_MODEL_TIER = "correction"  # Promotion merge decision
 
diff --git a/memory/__init__.py b/memory/__init__.py
@@ -22,9 +22,10 @@
   masked.
 
 3. **Tiers used by memory submodules**: every live LLM path runs on the ``summary`` or ``correction``
-   tier (fact extraction / signal detection / reflection synthesis / promotion
-   merge / fact dedup / recall rerank → ``summary``; recent.review +
-   persona.correction → ``correction``). Do not introduce any new hardcoded model names.
+   tier (fact extraction / signal detection / reflection synthesis /
+   fact dedup / recall rerank → ``summary``; recent.review +
+   persona.correction + promotion merge → ``correction``). Do not introduce any new
+   hardcoded model names.
 
 If there is a very specific reason to bypass this, first remove ``scripts/check_no_temperature.py`` and
 explain why in the PR description so a reviewer can vet it.
diff --git a/memory/fact_dedup.py b/memory/fact_dedup.py
@@ -400,9 +400,13 @@ async def _aresolve_locked(self, name: str) -> int:
         try:
             set_call_type("memory_fact_dedup")
             api_config = self._config_manager.get_model_api_config('summary')
+            # timeout=60: holds the FactDedup lock, but only blocks embedding-worker enqueue
+            # (background → background), so the user path never notices.
+            # max_retries=0: disable SDK auto-retry (no business-level retry here; one attempt is final).
             llm = create_chat_llm(
                 api_config['model'],
                 api_config['base_url'], api_config['api_key'],
+                timeout=60, max_retries=0,
             )
             try:
                 resp = await llm.ainvoke(prompt)
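The commit message bounds the worst case before and after this change. Here is a small arithmetic sketch of that bound. It rests on two assumptions. First, every attempt runs until its full timeout. Second, backoff sleeps between retries are ignored. The sketch also reads the business-level `max_retries=3` as three total attempts, which is what the 1.5-hour figure implies. The helper `worst_case_seconds` is hypothetical.

```python
def worst_case_seconds(timeout: float, sdk_max_retries: int, business_attempts: int = 1) -> float:
    """Upper bound on wall time if every attempt runs until its timeout.

    The OpenAI SDK makes 1 + max_retries attempts per call, and the business
    layer repeats that whole call `business_attempts` times. Backoff sleeps
    between retries are ignored.
    """
    return timeout * (1 + sdk_max_retries) * business_attempts


# Before: SDK defaults (600s timeout, max_retries=2) under a 3-attempt business retry.
print(worst_case_seconds(600, 2) / 60)       # 30.0 (minutes per logical call)
print(worst_case_seconds(600, 2, 3) / 3600)  # 1.5 (hours with business retries)
# After, e.g. Stage-1 fact extraction: timeout=60 and SDK retries disabled.
print(worst_case_seconds(60, 0, 3))          # 180 (seconds)
```

With `max_retries=0` at the SDK level, the business-layer count is the only multiplier left. That makes the bound easy to read directly from each call site's timeout.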
diff --git a/memory/facts.py b/memory/facts.py
@@ -47,6 +47,11 @@
 _ARCHIVE_AGE_DAYS = 7          # absorbed facts created more than this many days ago are archived
 _ARCHIVE_COOLDOWN_HOURS = 24   # minimum interval between two archive attempts
 
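+# Sentinel: lets _allm_call_with_retries tell "caller did not specify extra_body"
+# (default: let create_chat_llm resolve it automatically) apart from "caller passed
+# None explicitly" (skip the extra_body auto-resolution and keep thinking on).
+# Phase D: Stage-2 signal detection passes None explicitly to enable thinking.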
+_DEFAULT_EXTRA_BODY = object()
+
 
 def safe_importance(f: dict, default: int = 5) -> int:
     """Defensively coerce ``f['importance']`` to int.
@@ -307,14 +312,26 @@ def _strip_code_fence(raw: str) -> str:
     async def _allm_call_with_retries(
         self, prompt: str, lanlan_name: str, tier: str, call_type: str,
         max_retries: int = 3,
+        timeout: float = 60,
+        extra_body=_DEFAULT_EXTRA_BODY,
     ):
         """Shared LLM helper: retry on network errors + JSON errors, same
         policy as the old `extract_facts`. Returns parsed JSON or None on
         terminal failure (caller decides whether to abort / swallow).
 
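         Note: temperature is no longer accepted. By project convention it is never sent
         (enforced by scripts/check_no_temperature.py). The model comes straight from the
-        api_config for ``tier``; the SETTING_PROPOSER_MODEL fallback is gone."""
+        api_config for ``tier``; the SETTING_PROPOSER_MODEL fallback is gone.
+
+        timeout defaults to 60s, which suits background LLM calls (Stage-1 fact
+        extract / Stage-2 signal detect / negative keyword check). Callers may raise
+        it as needed (e.g. Stage-2 passes 90s now that thinking is on). SDK
+        max_retries=0 avoids stacking two retry layers, since the business layer
+        already controls retries through its own max_retries parameter.
+
+        extra_body: the default _DEFAULT_EXTRA_BODY lets create_chat_llm resolve it
+        per model (for most providers this ends up disabling thinking). Passing None
+        explicitly means "send no extra_body", so the model keeps its default behavior
+        (thinking models enter thinking mode). Phase D: Stage-2 signal detection
+        passes None explicitly to enable thinking."""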
         from openai import APIConnectionError, InternalServerError, RateLimitError
         from utils.llm_client import create_chat_llm
 
@@ -323,9 +340,13 @@ async def _allm_call_with_retries(
             try:
                 set_call_type(call_type)
                 api_config = self._config_manager.get_model_api_config(tier)
+                _llm_kwargs = dict(timeout=timeout, max_retries=0)
+                if extra_body is not _DEFAULT_EXTRA_BODY:
+                    _llm_kwargs['extra_body'] = extra_body
                 llm = create_chat_llm(
                     api_config['model'],
                     api_config['base_url'], api_config['api_key'],
+                    **_llm_kwargs,
                 )
                 try:
                     resp = await llm.ainvoke(prompt)
@@ -704,10 +725,17 @@ async def _allm_detect_signals(
             .replace('{EXISTING_OBSERVATIONS}', obs_text) \
             .replace('{LANLAN_NAME}', lanlan_name)
 
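+        # Phase D: enable thinking for Stage-2 signal detection.
+        # The task is judging how each new_fact relates to each existing_observation
+        # and picking a target_id. The existing defensive code at
+        # [memory/facts.py:670-708](memory/facts.py:670) is already compensating for
+        # LLM hallucinations, and thinking should reduce target_id mismatches. This
+        # runs fully in the background (signal extraction loop), so nobody waits on it.
+        # timeout is raised to 90s to give thinking models headroom.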
         parsed = await self._allm_call_with_retries(
             prompt, lanlan_name,
             tier=EVIDENCE_DETECT_SIGNALS_MODEL_TIER,
             call_type="memory_signal_detection",
+            timeout=90,
+            extra_body=None,
         )
         if parsed is None:
             return None
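The `_DEFAULT_EXTRA_BODY` sentinel lets the helper tell "argument omitted" apart from an explicit `None`. The stripped-down sketch below shows the same idea. `build_llm_kwargs` is a hypothetical stand-in for the kwargs assembly inside `_allm_call_with_retries`, and it does not call `create_chat_llm`.

```python
_DEFAULT_EXTRA_BODY = object()  # private sentinel; no caller can pass this by accident


def build_llm_kwargs(timeout: float = 60, extra_body=_DEFAULT_EXTRA_BODY) -> dict:
    """Assemble kwargs for a chat-LLM factory (hypothetical helper).

    - extra_body omitted -> key absent, so the factory applies its own default
    - extra_body=None    -> key present with None, meaning "send no extra_body"
    """
    kwargs = dict(timeout=timeout, max_retries=0)
    if extra_body is not _DEFAULT_EXTRA_BODY:  # identity check, not equality
        kwargs["extra_body"] = extra_body
    return kwargs


print(build_llm_kwargs())                             # no 'extra_body' key
print(build_llm_kwargs(timeout=90, extra_body=None))  # 'extra_body': None is forwarded
```

A plain `extra_body=None` default would not work here, because `None` is itself a meaningful value (enable thinking). That is why the check uses `is` against a unique `object()`.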
diff --git a/memory/persona.py b/memory/persona.py
@@ -1605,9 +1605,13 @@ async def _resolve_corrections_locked(self, name: str) -> int:
             from utils.llm_client import create_chat_llm
             set_call_type("memory_correction")
             api_config = self._config_manager.get_model_api_config('correction')
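+            # timeout=90: holds the PersonaManager lock. While it is held, arecord_mentions /
+            # aapply_signal / aensure_persona on the /process path stall, so it needs a hard cap.
+            # max_retries=0: disable SDK auto-retry to avoid stacking (no business-level retry here; one attempt is final).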
             llm = create_chat_llm(
                 api_config['model'],
                 api_config['base_url'], api_config['api_key'],
+                timeout=90, max_retries=0,
             )
             try:
                 resp = await llm.ainvoke(prompt)
diff --git a/memory/recall.py b/memory/recall.py
@@ -385,9 +385,14 @@ async def _fine_rank(
 
         set_call_type("memory_recall_rerank")
         api_config = config_manager.get_model_api_config('summary')
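+        # timeout=8: recall sits on the query_memory request path, and upstream
+        # plugin/core/context.py cuts it off at 5s by default. 8s locally leaves room for
+        # connect plus one failure. A timeout raises APITimeoutError, and the outer
+        # try/except already degrades to coarse rank.
+        # max_retries=0: disable SDK auto-retry; degrade immediately on timeout.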
         llm = create_chat_llm(
             api_config['model'],
             api_config['base_url'], api_config['api_key'],
+            timeout=8, max_retries=0,
         )
         try:
             resp = await llm.ainvoke(prompt)
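The recall comment relies on an outer try/except to fall back to the coarse ranking when the fine-rank call times out. The sketch below shows that control flow. It swaps in `asyncio.wait_for` and `asyncio.TimeoutError` for the SDK's own timeout and `APITimeoutError`, purely for illustration. `rank_with_fallback`, `fast_rank` and `stalled_rank` are hypothetical stand-ins, not project code.

```python
import asyncio
from typing import Awaitable, Callable, List


async def rank_with_fallback(
    fine_rank: Callable[[], Awaitable[List[str]]],
    coarse: List[str],
    timeout: float = 8.0,
) -> List[str]:
    """One bounded fine-rank attempt; on timeout or error, keep the coarse order."""
    try:
        # Single attempt, no retry (the analogue of max_retries=0).
        return await asyncio.wait_for(fine_rank(), timeout=timeout)
    except Exception:
        # asyncio.TimeoutError is an Exception subclass, so a timeout lands here too.
        return coarse


async def fast_rank() -> List[str]:
    return ["m3", "m1", "m2"]


async def stalled_rank() -> List[str]:
    await asyncio.sleep(1.0)  # simulates an unresponsive LLM backend
    return ["never returned"]


coarse = ["m1", "m2", "m3"]
print(asyncio.run(rank_with_fallback(fast_rank, coarse, timeout=0.5)))      # ['m3', 'm1', 'm2']
print(asyncio.run(rank_with_fallback(stalled_rank, coarse, timeout=0.05)))  # ['m1', 'm2', 'm3']
```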
diff --git a/memory/recent.py b/memory/recent.py
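The recent.py hunks are not included in this diff. The commit message describes a content normalization for review output: for a list, join each item's dict text or str(item); for anything else, use str(item). That rule can be sketched as below. The empty-string separator and the `normalize_content` name are assumptions, not taken from compress_history.

```python
from typing import Any


def normalize_content(content: Any) -> str:
    """Coerce a message content value to str (hypothetical helper).

    list -> concatenate each item's "text" (for dicts that have one) or str(item);
    anything else -> str(content).
    """
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for item in content:
            if isinstance(item, dict) and "text" in item:
                parts.append(str(item["text"]))
            else:
                parts.append(str(item))
        return "".join(parts)  # separator is an assumption
    return str(content)


segments = [{"type": "text", "text": "Hello"}, {"type": "text", "text": ", world"}]
print(normalize_content(segments)[:50])  # a plain str, so slicing is safe
```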
diff --git a/memory/reflection.py b/memory/reflection.py
diff --git a/memory_server.py b/memory_server.py
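A3's stagger and busy-loop fix can be sketched with a single loop helper. Only the five first-run delays come from the commit message. `run_periodic`, `demo`, the stop event and the intervals are hypothetical. One difference from the commit: it adds the sleep inside each except branch, while this sketch sleeps once after the try block, which covers both paths.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# First-run delays (seconds) for the five background loops, from commit A3.
INITIAL_DELAYS: Dict[str, int] = {
    "idle_maint": 20,
    "signal_extraction": 60,
    "rebuttal": 100,
    "auto_promote": 150,
    "archive_sweep": 250,
}


async def run_periodic(
    job: Callable[[], Awaitable[None]],
    initial_delay: float,
    interval: float,
    stop: asyncio.Event,
    errors: List[Exception],
) -> None:
    """Wait initial_delay, then run job every interval seconds until stop is set."""
    await asyncio.sleep(initial_delay)  # staggered first run
    while not stop.is_set():
        try:
            await job()
        except Exception as exc:
            errors.append(exc)  # record the failure and keep the loop alive
        # Sleep on every path, including after a failure. Skipping this on the
        # error path (an `except: continue`) turns a failing job into a busy loop.
        await asyncio.sleep(interval)


async def demo() -> int:
    """Run an always-failing job for about 0.2s and return how many times it ran."""
    calls = 0
    errors: List[Exception] = []
    stop = asyncio.Event()

    async def failing_job() -> None:
        nonlocal calls
        calls += 1
        raise RuntimeError("LLM backend unavailable")

    task = asyncio.create_task(run_periodic(failing_job, 0.01, 0.05, stop, errors))
    await asyncio.sleep(0.2)
    stop.set()
    await task
    return calls


print(asyncio.run(demo()))  # a handful of runs, not thousands
```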