
Commit 93f5ac6

Himifoxwehos, Hongzhi Wen, and claude authored
feat(proactive): implement meme proactive chat with regional separation and proxy support (Project-N-E-K-O#516)
* feat: add imgflip crawler
* fix: add higher-level wrapper??
* feat: complete the meme-sending logic
* fix: improve conversation guidance
* fix: harden scheduler initialization
* feat: strengthen Meme Proxy security, optimize the mixed-source logic for proactive chat, and fix frontend bugs
* refactor: improve proactive chat logic, security, and testing based on CodeRabbitAI suggestions
* refactor: improve proactive chat link handling, meme consistency, and multilingual protocol support
  - Refactor system_router.py: separate Phase 1 candidate links from Phase 2 final returned links to prevent link leakage.
  - Improve meme consistency: Phase 1 gives the AI only a single meme asset so that the caption matches the image.
  - Complete multilingual support: add the [MEME] protocol instructions and output formats for Japanese, Korean, and Russian.
  - Fix URL encoding: quote-encode search keywords in meme_fetcher.py, resolving 404s caused by special characters such as /.
  - Improve deduplication: record only the topic IDs actually sent to the user, for more accurate dedup.
  - Compatibility fix: restore the backend music recommendation tags to match the existing frontend filtering logic.
  - Robustness fix: pre-initialize local variables to prevent UnboundLocalError on certain code paths.
* fix: separate Gemini bubble tracking logic and fix FabiaoqingFetcher connection leak
  - Frontend: separate the Gemini text bubble tracker from the attachment tracker; fix cleanup logic and merge conflicts
  - Backend: fix the FabiaoqingFetcher connection leak by switching to a local httpx.AsyncClient context
  - Maintenance: remove redundant code in system_router; sync the Russian translation in prompts_sys
* fix(proactive): resolve logical vulnerabilities and audit issues in meme & music flow
* fix(proactive): add translation service cleanup on shutdown
* refactor(meme): clean up the "Four Sins" tech debt, split out sub-methods, and standardize the global image-source whitelist and cache
* docs(skills): archive best practices for jmespath-based structured extraction of SSR hydration data
* fix(meme): address final PR review comments and code smells
* docs(skill): archive pytest async mock experience
* fix(meme/proactive): final deep audit fixes for stability, security, and UI race conditions
  - LLM Client: Implement async context manager protocol to fix crash
  - System Router: Fix meme source logic short-circuit and upgrade cache to byte-based TTLCache
  - Proactive Frontend: Fix turn-isolation race condition for attachments and dynamic submode detection
  - Prompts: Harmonize few-shots and remove contradictory instructions across multiple languages
  - Meme Fetcher: Ensure SSR extraction respects the limit
* fix(proactive): include asyncio.TimeoutError in LLM retry loop
* fix(audit): finalize project-wide deep audit for meme/proactive features
  - Patch LLM client for async context management
  - Harden proxy_meme_image with safe Content-Length parsing and byte-sized cache
  - Optimize meme fetcher with concurrent source probing
  - Resolve app-proactive.js rendering race conditions and turnId pollution
  - Fix ReferenceError in settings popup and clean up Few-shot examples across 5 languages
* fix(meme/music/ui): final zero-tolerance audit sweep - 'The Four Sins'
  - utils/meme_fetcher.py: Overhauled try_fetch_concurrent with explicit task cancellation
  - utils/meme_fetcher.py: Restored genuine SECLEVEL=1 fallback in _fetch_html
  - static/app-proactive.js: Guarded rendering with turnId validation
  - main_routers/system_router.py: Included cover field in music recommendations
* fix(meme/proxy/ui): final zero-tolerance hardening - 'The Three Sins' resolved
  1. utils/meme_fetcher.py: Added missing ssl import and overhauled concurrent fetcher with explicit cancellation.
  2. main_routers/system_router.py: Hardened meme proxy with strict Content-Type whitelist and X-Content-Type-Options: nosniff header.
  3. main_routers/system_router.py: Fixed link deduplication logic for empty URLs with metadata signature fallback.
  4. UI: Implemented Turn ID synchronization between backend and frontend to resolve proactive chat race conditions.
* fix(meme/server/ui): resolve connection leaks and lifecycle inconsistencies
  - utils/meme_fetcher.py: Patched HTTP connection leaks by ensuring temporary clients are managed via context managers.
  - main_server.py: Implemented safe reflection-based cleanup for translation service to prevent shutdown crashes.
  - static/app-websocket.js: Reset Turn ID on response discard to ensure lifecycle closure and prevent late attachment binding.
* fix(meme/logic): strictly align Phase 10 logic with user instructions
  1. system_router.py: Separated internal fallback_channel from primary source_mode for 'BOTH' mode.
  2. meme_fetcher.py: Ensured retry loop backoff is not bypassed on TLS fallback failure.
  3. Documentation: Updated walkthrough and task artifacts.
* fix: resolve race conditions and concurrency control
* fix: address code review issues
  - core.py: fix the RUF013 type annotation; turn_id is now str | None
  - system_router.py: remove the CORS wildcard header from the image proxy to prevent cross-site abuse
* fix: proactive chat race condition, meme fetcher TLS fallback, and Ruff lint fixes
* fix: address proactive race condition and CodeRabbit feedback
* Update meme_fetcher.py
* refactor(proactive): merge Phase 1 into a single LLM call + dynamic source weighting system
  - Merge the three LLM calls (web filtering, music keywords, meme keywords) into one to reduce RPM
  - Meme keywords are now generated by the LLM from the mood of the conversation (no longer random trending words)
  - Add a dynamic source weighting system: track per-channel usage frequency with exponential decay; low-weight channels are dropped before Phase 1 so that chat sources do not become monotonous
  - Web sub-modes (news/video/home/personal) are each weighted independently
  - build_proactive_response supports fine-grained recording of web sub-channels
  - Add a URL guard when iterating meme candidates so that entries without a URL are never selected
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(meme): add the proactiveMemeEnabled setting and UI toggle
  - avatar-ui-drag.js: add meme mode to CHAT_MODE_CONFIG; hasOtherSubMode now iterates the config array dynamically
  - live2d-ui-drag.js: restore to the main-branch state (the changes were already merged into common)
  - utils/preferences.py: add proactiveMemeEnabled to _ALLOWED_CONVERSATION_SETTINGS
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(proactive): inject Phase 2 music/meme instructions dynamically to avoid hallucination when no source is available
  - Extract the hard-coded rule 10 (music behavior) and rule 11 (meme behavior) from the Phase 2 generate prompt into separate dicts, _P2_MUSIC_INSTRUCTION / _P2_MEME_INSTRUCTION
  - get_proactive_generate_prompt gains has_music / has_meme parameters; the instruction text is injected only when the corresponding source is actually available, and is left empty otherwise
  - All 5 languages (zh/en/ja/ko/ru) are updated
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* i18n(proactive): complete the ja/ko/ru translations of source_instruction
  - Add the _si_ja, _si_ko, and _si_ru dicts, covering all 16 combination keys
  - The _si dispatch table now uses a separate dict per language; remove the en fallback and the TODO comments
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor(proactive): remove the [BOTH] / [SCREEN&WEB] compound tags; the AI picks a single primary source
  The [BOTH] tag had inconsistent semantics (screen+web in output_format, web+music in TAG_INSTRUCTIONS, and web+music links stuffed in by build_proactive_response), and the music supplement logic (_append_music_recommendations) already attaches music links automatically, so no compound tag is needed.
  - Remove the SCREEN&WEB case from build_proactive_response
  - Remove SCREEN&WEB from the tag-parsing regex, keeping only SCREEN/WEB/MUSIC/MEME/PASS
  - Delete PROACTIVE_SCREEN_WEB_TAG_INSTRUCTIONS and PROACTIVE_SCREEN_MUSIC_TAG_INSTRUCTIONS
  - Remove compound-tag references from PROACTIVE_MUSIC_TAG_HINT / PROACTIVE_SCREEN_MUSIC_TAG_HINT
  - Remove the [SCREEN&WEB] tag line from the _of output_format dict (5 languages)
  - Simplify music conflict handling: the SCREEN&WEB→WEB downgrade branch is no longer needed
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor(proactive): eliminate the 16-key combinatorial explosion; the tag system switches to [CHAT]
  prompts_sys.py:
  - get_proactive_format_sections changes from 16×5=80 hard-coded entries to dynamically assembling source_instruction / output_format_section, a net reduction of ~615 lines
  - Replace [SCREEN] with [CHAT] (meaning: plain-text chat with no side effects)
  - Delete the [BOTH] tag (merging screen+web has no practical meaning)
  - Delete the no-longer-needed PROACTIVE_MUSIC_TAG_HINT / PROACTIVE_SCREEN_MUSIC_TAG_HINT / PROACTIVE_MUSIC_TAG_INSTRUCTIONS
  - All new i18n fragments cover the five languages zh/en/ja/ko/ru
  system_router.py:
  - Tag-parsing regex: SCREEN → CHAT
  - build_proactive_response match branch: SCREEN → CHAT
  - _format_recent_proactive_chats filters out vision-channel records so that the AI does not hallucinate by citing stale screen content
  - Remove the imports of the deleted dicts and the manual tag-hint patch logic
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(proactive): restore the PROACTIVE_MUSIC_TAG_INSTRUCTIONS anti-confusion hint
  This dict appends an explicit reminder to the end of the prompt that music talk must use [MUSIC] rather than [WEB]/[CHAT]. It is a safety net against tag confusion by the AI and should not have been deleted. The [SCREEN] references in the hint text are updated to [CHAT] at the same time.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(prompts): unify the security watermark markers in the en prompt to Chinese
  The section markers in the en-language _UNIFIED_P1_HEADER, _UNIFIED_P1_WEB_SECTION, and proactive_screen_web_en change from English (Chat History / Aggregated Content) to Chinese (以下为对话历史 / 以下为汇总内容) so that they pass the security watermark check. ja/ko/ru already use the Chinese markers and need no change.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hongzhi Wen <wenguanjung@aliyun.com>
Co-authored-by: Hongzhi Wen <cartabio.coder1@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 10ba7c0 commit 93f5ac6

26 files changed

Lines changed: 3383 additions & 612 deletions
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
---
name: pytest-asyncio-httpx-mocking
description: When mocking httpx.AsyncClient with unittest.mock in Pytest, AsyncMock must be used instead of MagicMock for async methods like post/get to prevent TypeError when awaited.
---

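# httpx.AsyncClient Mocking with AsyncMock

## Symptoms
- A test uses `patch.object(httpx.AsyncClient, 'post', return_value=mock_response)`.
- At runtime, code containing `response = await client.post(...)` raises `TypeError: object MagicMock can't be used in 'await' expression`.

## Root Cause
### Cause 1: An async function must return a coroutine
- **Problem**: `httpx.AsyncClient.post` is an `async def` method; calling it returns an awaitable coroutine.
- **Why it happens**: When `patch` or `MagicMock` does not infer that the target is async, the mock simply returns `return_value` synchronously, and the `await` on that plain `MagicMock` object fails. (On Python 3.8+, `patch` picks `AsyncMock` automatically only when it can see that the target itself is an `async def`; patching the whole class or passing a `MagicMock` explicitly bypasses that detection.)
- **Fix**: Pass `new=AsyncMock(return_value=...)` or `new_callable=AsyncMock` explicitly in the `patch` arguments.
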
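The failure mode described above can be reproduced without httpx at all. The following is a minimal sketch using only the standard library; `call_and_await`, `demo`, and the example URL are hypothetical names chosen for illustration and do not come from the project.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def call_and_await(post):
    """Await whatever the (mocked) async method returns."""
    return await post("https://example.invalid/api")


def demo():
    response = MagicMock(status_code=200)

    # A plain MagicMock returns `response` synchronously; awaiting it fails.
    sync_post = MagicMock(return_value=response)
    try:
        asyncio.run(call_and_await(sync_post))
        sync_error = None
    except TypeError as exc:
        sync_error = str(exc)

    # An AsyncMock returns a coroutine that resolves to `response`.
    async_post = AsyncMock(return_value=response)
    result = asyncio.run(call_and_await(async_post))
    return sync_error, result.status_code
```

Calling `demo()` returns the captured `TypeError` message for the `MagicMock` case together with the status code obtained through the `AsyncMock` case; the exact wording of the message depends on the Python version.
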
## Code Solution

**❌ Incorrect:**
```python
import httpx
from unittest.mock import patch, MagicMock

async def test_fetch():  # my_crawler is the crawler under test, defined elsewhere
    mock_response = MagicMock(status_code=200)
    # If the patched `post` ends up as a plain MagicMock, awaiting it raises TypeError!
    with patch.object(httpx.AsyncClient, 'post', return_value=mock_response):
        await my_crawler.fetch()
```

**✅ Correct:**
```python
import httpx
from unittest.mock import patch, MagicMock, AsyncMock

async def test_fetch():  # my_crawler is the crawler under test, defined elsewhere
    mock_response = MagicMock(status_code=200)  # the Response object and its methods are usually synchronous
    # Correct: replace the original method so that it behaves as an AsyncMock
    with patch.object(httpx.AsyncClient, 'post', new=AsyncMock(return_value=mock_response)):
        await my_crawler.fetch()
```

Use `side_effect` to simulate several sequential requests:
```python
with patch.object(httpx.AsyncClient, 'get', new=AsyncMock(side_effect=[mock_1, mock_2])):
    ...
```

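How a `side_effect` list is consumed may be easier to see in isolation. This is a small sketch that passes the mock in directly instead of patching httpx; `fetch_two_pages`, `demo_side_effect`, and the URLs are hypothetical and only serve the illustration.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def fetch_two_pages(get):
    """Hypothetical crawler step: request two pages, collect status codes."""
    first = await get("https://example.invalid/page/1")
    second = await get("https://example.invalid/page/2")
    return [first.status_code, second.status_code]


def demo_side_effect():
    mock_1 = MagicMock(status_code=200)
    mock_2 = MagicMock(status_code=404)
    # Each awaited call consumes the next item of the side_effect list.
    get = AsyncMock(side_effect=[mock_1, mock_2])
    codes = asyncio.run(fetch_two_pages(get))
    return codes, get.await_count
```

The first awaited call receives `mock_1` and the second receives `mock_2`, and `await_count` records how many times the mock was actually awaited, which is useful when asserting on retry or pagination logic.
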
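## Key Takeaways
- Any mock that stands in for an `async def` must return an awaitable when it is called.
- Strictly distinguish **asynchronous request methods** from **synchronous response objects**: `httpx.AsyncClient.get` is asynchronous (it needs `AsyncMock`), but `.json()` on the `Response` it returns is synchronous (a `MagicMock` is enough).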
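
The async/sync split in the last takeaway can be sketched with a stand-in client object. `fetch_title`, `demo_split`, and the JSON payload below are hypothetical; a real test would patch `httpx.AsyncClient` as shown earlier.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def fetch_title(client, url):
    """Hypothetical code under test: async request, sync JSON decoding."""
    response = await client.get(url)   # async: must be awaited
    return response.json()["title"]    # sync: plain call


def demo_split():
    response = MagicMock(status_code=200)
    response.json.return_value = {"title": "hello"}  # sync method -> MagicMock

    client = MagicMock()
    client.get = AsyncMock(return_value=response)    # async method -> AsyncMock

    url = "https://example.invalid/item"
    title = asyncio.run(fetch_title(client, url))
    client.get.assert_awaited_once_with(url)
    return title
```

Keeping `response` as a `MagicMock` means `response.json()` stays a normal call; making it an `AsyncMock` as well would turn `.json()` into a coroutine and break the code under test in the opposite direction.
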
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
---
name: ssr-hydration-scraping
description: Best practices for extracting data from modern React/Vue SSR pages (like Next.js or Nuxt.js) by targeting hydration state blocks (__NEXT_DATA__, __NUXT__) using regex and `jmespath`, avoiding brittle DOM selector scraping.
---

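# SSR Hydration Data Scraping

## Symptoms of Brittle DOM Scraping
- The scraper frequently breaks on a large scale because the randomly hashed class names generated by frontend CSS Modules or Styled Components (e.g. `class="sc-fHeRUl"`) change.
- Complex state embedded in the DOM tree (such as scroll-to-load-more lists or image galleries that have not been rendered yet) is hard to traverse accurately.

## Root Cause
Modern frontend frameworks (React, Vue, Solid) that use server-side rendering (SSR) usually serialize the JSON needed for the first screen, often the complete data and sometimes even the next page, and embed it in a `<script>` tag in the HTML so that the client can hydrate.
Extracting this clean JSON structure directly is far more stable and efficient than parsing a DOM that is mixed with presentation logic.
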
## Solution

### 1. Locate the SSR data block
Use a regular expression to search the whole HTML for the target script block and extract its JSON string.
```python
import re
import json

# Candidate hydration blocks, tried in order
SSR_PATTERNS = [
    # Next.js
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    # Nuxt.js / Vue
    r'window\.__NUXT__\s*=\s*({.*?});',
    # Generic initial state
    r'window\.__INITIAL_STATE__\s*=\s*({.*?});',
]

def extract_ssr_data(html: str) -> dict:
    for pattern in SSR_PATTERNS:
        match = re.search(pattern, html, re.DOTALL)
        if not match:
            continue
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            # The matched text is not valid JSON (e.g. a JS expression); try the next pattern
            continue
    return {}
```

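One caveat with the `({.*?});` patterns above: the non-greedy match stops at the first `};`, so a payload whose string values contain that sequence gets truncated and fails to parse. A possible alternative, sketched here under the assumption that the assigned value is plain JSON, is to locate the assignment with a regex and let `json.JSONDecoder.raw_decode` find the end of the object. `extract_assigned_json` and `SAMPLE` are hypothetical names for illustration.

```python
import json
import re


def extract_assigned_json(html: str, variable: str):
    """Return the JSON value assigned to `window.<variable>`, or None."""
    match = re.search(r'window\.' + re.escape(variable) + r'\s*=\s*', html)
    if not match:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (the trailing `;`, more script code).
        obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj


# The string value contains "};", which would cut a non-greedy regex short.
SAMPLE = '<script>window.__INITIAL_STATE__ = {"note": "a};b", "list": [1, 2]};</script>'
```

Calling `extract_assigned_json(SAMPLE, "__INITIAL_STATE__")` parses the whole object, including the `"a};b"` string. Pages where `__NUXT__` is assigned a JavaScript expression rather than JSON are still out of reach for this approach and yield `None`.
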
### 2. Use `jmespath` structured queries to avoid multi-level nested checks
SSR data often has very deeply nested component trees, so chaining dict `.get()` calls or recursing by hand is error-prone and easily misses data. Probing candidate paths with `jmespath` is recommended:
```python
import jmespath

# html: the page source fetched earlier
ssr_data = extract_ssr_data(html)
if ssr_data:
    # Probe the likely mount points of the list with jmespath
    possible_paths = [
        "props.pageProps.data.rows",
        "props.pageProps.list",
        "payload.data[0].list"
    ]
    target_list = []
    for path in possible_paths:
        res = jmespath.search(path, ssr_data)
        if isinstance(res, list) and len(res) > 0:
            target_list = res
            break

    # Iterate over the clean data objects
    for item in target_list:
        print(item.get('url'), item.get('title'))
```

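Where adding `jmespath` as a dependency is not an option, the same "probe several candidate paths" idea can be approximated with the standard library. The sketch below is not the jmespath API; `dig`, `first_nonempty_list`, and the sample data are hypothetical, and paths are written as tuples of keys and list indices rather than as query strings.

```python
from typing import Any, Sequence


def dig(data: Any, path: Sequence) -> Any:
    """Follow keys (str) and list indices (int); return None on any miss."""
    current = data
    for step in path:
        if isinstance(step, int) and isinstance(current, list):
            if not -len(current) <= step < len(current):
                return None
            current = current[step]
        elif isinstance(step, str) and isinstance(current, dict):
            if step not in current:
                return None
            current = current[step]
        else:
            return None
    return current


def first_nonempty_list(data: Any, candidates: Sequence[Sequence]) -> list:
    """Return the first candidate path that resolves to a non-empty list."""
    for path in candidates:
        found = dig(data, path)
        if isinstance(found, list) and found:
            return found
    return []


SAMPLE_SSR = {"payload": {"data": [{"list": [{"url": "/a", "title": "A"}]}]}}
CANDIDATES = [
    ("props", "pageProps", "data", "rows"),
    ("props", "pageProps", "list"),
    ("payload", "data", 0, "list"),
]
```

With the sample data only the third candidate resolves, so `first_nonempty_list(SAMPLE_SSR, CANDIDATES)` returns the single-item list. The important property shared with `jmespath.search` is that a missing key or index yields `None` instead of raising.
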
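## Key Takeaways
1. **Stop scavenging in the DOM tree**: when scraping a modern site, the first thing to do after opening DevTools (F12) is a global search for the target text, to check whether it already sits in the JSON assigned in a `<script>` tag or a `window.xxx` variable.
2. **Fault tolerance**: a `jmespath` path that does not match returns `None` instead of raising, so several candidate paths can be probed safely, which makes the scraper much more robust against unknown nested structures.
3. **Fallback**: if the SSR payload contains no data, do not jump straight to DOM scraping. First capture the network traffic to see whether the page calls an XHR API directly during initial rendering; using that XHR API ("structured freeloading") is likewise far better than DOM parsing.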
