fix(core): allow web_fetch JSON fallback#5660
Conversation
|
@qwen-code /triage |
|
Thanks for the PR, @tt-a1i! Template looks good ✓ — all required sections present including Reviewer Test Plan, Risk & Scope, Linked Issues, and AI Assistance Disclosure. On direction: this is a clear, real bug — On approach: minimal and focused — four Accept header string changes in one Moving on to code review. 🔍 中文说明感谢贡献,@tt-a1i! 模板完整 ✓ — 所有必需章节齐全,包括 Reviewer Test Plan、Risk & Scope、Linked Issues 和 AI Assistance Disclosure。 方向:这是一个明确的真实 bug —— 方案:最小化且聚焦 —— 一个 进入代码审查 🔍 — Qwen Code · qwen3.7-max |
Code ReviewIndependent proposal (before reading the diff): I'd add The PR does exactly this. Implementation is clean and correct — four string changes in one No critical issues found. One non-blocking observation: the HTML path changes from truncate-after-convert ( Reuse check: the change modifies existing logic in-place, no new utilities or parallel code introduced. ✓ Automated Checks
Real-Scenario Testing (tmux)Drove the exact content-negotiation scenario from #5611 against both a local mock JSON-only server and the real Unit tests (from worktree, PR source): 中文说明代码审查独立方案(读 diff 之前):我会在 PR 完全做到了这些。实现干净正确 —— 一个 未发现关键问题。 一条非阻塞观察:HTML 路径从「先转换再截断」( 复用检查: 改动在原位修改现有逻辑,没有引入新工具或并行代码。✓ 自动化检查
真实场景测试(tmux)针对 #5611 的确切内容协商场景,同时测试了本地 mock JSON-only 服务器和真实的 — Qwen Code · qwen3.7-max |
|
This is a clean, focused bug fix that does exactly what it says. The motivation is real (JSON APIs returning 415), the approach is standard HTTP content negotiation (quality-weighted The mutation test is the strongest evidence I've seen in a while: reverting only the source file while keeping the PR's test file causes exactly 5 tests to fail — the 4 header assertion tests (each diffing old vs new Accept strings) and the JSON passthrough test (which catches that the old catch-all The tmux test reproduces the exact bug from #5611 against the real GitHub API: old header → HTTP 415, new header → HTTP 200 with My independent proposal (add One non-blocking observation carried from Stage 2: the HTML truncation order change (truncate-before-convert) is a reasonable tradeoff but a behavior change for pages >100 KB. Worth knowing about, not worth blocking. Verdict: approve. ✅ 中文说明这是一个干净、聚焦的 bug 修复,完全做到了所描述的内容。动机真实(JSON API 返回 415),方案是标准 HTTP 内容协商(带权重的 变异测试是我近期见过的最有力的证据:只还原源文件、保留 PR 测试文件,恰好 5 个测试失败 —— 4 个 header 断言测试(每个都 diff 了旧 vs 新的 Accept 字符串)和 JSON 直通测试(捕获了旧的兜底 tmux 测试在真实 GitHub API 上复现了 #5611 的确切 bug:旧头 → HTTP 415,新头 → HTTP 200 并拿到 我的独立方案(加 一条非阻塞观察:HTML 截断顺序变化(先截断再转换)是合理取舍,但对 >100 KB 的页面是行为变化。值得了解,不值得阻塞。 结论:批准。 ✅ — Qwen Code · qwen3.7-max |
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
LGTM, looks ready to ship. ✅
✅ Local verification — real tmux E2E (Linux)Verified as a maintainer on a real Linux box, including a full Environment: Linux 6.12 x86_64 · Node v22.22.2 · npm 10.9.7 · tmux 3.5a. PR branch is 1 commit behind 1. Real-world reproduction against the live GitHub API (the actual bug)
Confirms the diagnosis: it is purely the 2. Unit tests + non-vacuity
3. Static checks
4. Real E2E #1 — the compiled shipped tool vs a 415-on-text-only serverA local origin server emulating GitHub's content-negotiation (
5. Real E2E #2 — the full
|
| Arm | Accept origin received | HTTP | web_fetch → model? | CLI final output |
|---|---|---|---|---|
| NEW | …text/plain;q=0.8, */*;q=0.1 |
200 | yes | "published on 2026-01-27T11:50:52Z" ✅ |
| OLD | text/markdown, text/html, text/plain |
415 | no | "FETCH_FAILED — web_fetch could not retrieve the URL" ❌ |
Also ran the NEW arm against the real GitHub API end-to-end: the CLI fetched the live JSON and the side-query request to the model carried the real published_at — i.e. #5611's exact scenario now works through the shipped CLI path.
Notes (all non-blocking)
- Test precision: consider asserting the
Acceptheader inside the new JSON test so it guards the fix directly (today only the 4 dedicated header tests do). autonow uses graded q-values. Before,text/markdown, text/html, text/plainwere all implicitq=1.0(equal preference); now markdown1.0> html0.9> text0.8>*/*0.1. This strengthens the markdown preference (good for the token-saving goal), so it's an improvement — just noting the relative ordering changed, not only the catch-all addition.*/*;q=0.1admits truly binary responses. A server with no text representation can now return e.g.application/octet-stream; the tool routes non-markdown/non-plain types throughhtml-to-text, so binary becomes garbled text rather than a hard error. This matches the PR's stated risk and the "everything is normalized to text" contract — no crash, and strictly better than the previous hard-fail.
Verdict
Behavior is correct and minimal, docs match the code, no other code path hard-codes the old header, and both the real GitHub API and a full CLI run confirm the before/after. Approve / ready to merge.
中文版(点击展开)
✅ 本地验证 —— 真实 tmux 端到端测试(Linux)
作为维护者在真实 Linux 环境完成验证,包含在 tmux 中驱动的完整 qwen CLI 运行。结论:修复正确,解决 #5611,建议合并。
环境: Linux 6.12 x86_64 · Node v22.22.2 · npm 10.9.7 · tmux 3.5a。PR 落后 main 1 个提交,但所改的 3 个文件自分叉点起在 main 上均无改动,且 GitHub 显示 MERGEABLE,因此单独测试该分支可代表合并后的行为。
1. 针对真实 GitHub API 复现(真正的 bug)
用 curl 访问 issue 中的确切端点,对工具能发送的每一种 Accept 做前后对比:
format |
Accept(修复前 → 修复后) | 前 | 后 |
|---|---|---|---|
auto |
text/markdown, text/html, text/plain → …;q=0.9, …;q=0.8, */*;q=0.1 |
415 | 200 |
markdown |
text/markdown → text/markdown, */*;q=0.1 |
415 | 200 |
html |
text/html → text/html, */*;q=0.1 |
415 | 200 |
text |
text/plain → text/plain, */*;q=0.1 |
415 | 200 |
证实问题完全出在 Accept 头,*/*;q=0.1 兜底修复了全部四种格式。
2. 单元测试 + 非空泛性
vitest run src/tools/web-fetch.test.ts→ 19/19 通过。- 差分(证明测试确实守护本修复): 仅把
web-fetch.ts回退到修复前,恰好那 4 个 header 断言测试失败(差异即旧→新 header),其余通过。 - 小提示:新增的
should process JSON content…测试在修复前的代码上也能通过——它直接 mock 抓取结果、只断言 JSON 进入模型,覆盖的是下游处理而非 header 改动。真正锁定回归的是那 4 个 header 测试。(非阻塞,见“说明”。)
3. 静态检查
prettier --check ✅ · eslint ✅ · tsc --noEmit(core)✅ · build(core)✅ · git diff --check ✅
(eslint 起初因缺少开发依赖 eslint-plugin-check-file 报错——属本地安装产物;PR 未改 eslint 配置或清单,补装后通过。)
4. 真实端到端 #1 —— 已编译的发布产物工具 vs 仅接受文本即返回 415 的服务器
本地起一个模拟 GitHub 内容协商的源站(Accept 不含 */* 即 415)+ 来自 packages/core/dist 的真实已编译 WebFetchTool(仅 stub 模型侧,抓取路径 100% 真实)。A/B 仅替换那 4 个已编译的 Accept 字符串:
| 工具 | 目标 | 工具实际发送的 Accept(服务器记录) | 结果 |
|---|---|---|---|
| 新 | 模拟源站 (auto) |
text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1 |
200 · JSON published_at 抵达模型 ✅ |
| 新 | 模拟源站 (markdown/html/text) |
text/<type>, */*;q=0.1 |
各自 200 ✅ |
| 新 | 真实 api.github.com (auto) |
(新 auto 头) | 200 · 真实 published_at:2026-01-27T11:50:52Z 抵达模型 ✅ |
| 旧 | 模拟源站 (auto) |
text/markdown, text/html, text/plain |
415 · web_fetch_fallback_failed · 模型未被调用 ❌ |
| 旧 | 真实 api.github.com (auto) |
(旧头) | 415(复现 #5611)❌ |
5. 真实端到端 #2 —— 在 tmux 中运行完整 qwen CLI
在 tmux 内运行真实 CLI(node packages/cli … -p "…" --approval-mode yolo),后端接一个模拟 OpenAI 兼容服务(无需真实模型 key);模型发出 web_fetch 工具调用,工具真实执行,结果被总结并作答。对已编译工具做 A/B:
| 分支 | 源站收到的 Accept | HTTP | web_fetch→模型? | CLI 最终输出 |
|---|---|---|---|---|
| 新 | …text/plain;q=0.8, */*;q=0.1 |
200 | 是 | "published on 2026-01-27T11:50:52Z" ✅ |
| 旧 | text/markdown, text/html, text/plain |
415 | 否 | "FETCH_FAILED —— web_fetch 无法获取该 URL" ❌ |
并用“新”分支对真实 GitHub API跑了完整链路:CLI 抓到线上 JSON,发给模型的 side-query 携带了真实 published_at——即 #5611 的确切场景已在发布 CLI 路径上跑通。
说明(均为非阻塞)
- 测试精度: 建议在新增的 JSON 测试里也断言
Accept头,使其直接守护本修复(目前只有那 4 个专门的 header 测试在守护)。 auto现在使用带权重的 q 值。 之前text/markdown, text/html, text/plain都是隐式q=1.0(同等优先,由服务器选择);现在 markdown1.0> html0.9> text0.8>*/*0.1。这其实强化了对 markdown 的偏好(契合省 token 目标),属改进——仅提示相对优先级也变了,不只是加了兜底。*/*;q=0.1会接纳真正的二进制响应。 没有文本表示的服务器现在可能返回如application/octet-stream;工具会把非 markdown/非 plain 的类型走html-to-text,于是二进制会变成乱码文本而非直接报错。这符合 PR 自述的风险与“一切归一化为文本”的约定——不会崩溃,且优于此前的直接失败。
结论
行为正确且改动最小,文档与代码一致,没有其他代码路径硬编码旧 header,真实 GitHub API 与完整 CLI 运行都印证了前后差异。批准 / 可合并。
✅ Maintainer verification — real local E2E test (tmux)I verified this PR locally by driving the real Setup
Test 1 — Ground truth: the real GitHub API (the exact URL from #5611)Confirms the issue: the bug is purely the over-narrow Test 2 — Real
|
format |
BEFORE (main): header → HTTP → tool | AFTER (this PR): header → HTTP → tool |
|---|---|---|
auto |
text/markdown, text/html, text/plain → 415 → ERROR |
text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1 → 200 → SUCCESS |
markdown |
text/markdown → 415 → ERROR |
text/markdown, */*;q=0.1 → 200 → SUCCESS |
html |
text/html → 415 → ERROR |
text/html, */*;q=0.1 → 200 → SUCCESS |
text |
text/plain → 415 → ERROR |
text/plain, */*;q=0.1 → 200 → SUCCESS |
- BEFORE returns
WEB_FETCH_FALLBACK_FAILEDfor every format; against real GitHub it reproduces the issue verbatim:Error during fetch … Request failed with status code 415 Unsupported Media Type. - AFTER, every format succeeds and the JSON content actually reaches the model prompt. Against real GitHub, the tool now feeds
published_at: 2026-01-27T11:50:52Zto the model — the exact value the HTML page couldn't provide.
Test 3 — Unit tests + A/B test-integrity
- On the PR:
web-fetch.test.ts→ 19/19 pass, including the newshould process JSON content returned by fallback content negotiationand the fourshould prefer …header assertions. - Revert only the
getAcceptHeader()change back to thetext/*-only strings → exactly 4 tests fail, each diffing the new header (expected) against the old one (received), e.g.:Restoring the fix turns them green again → the four header tests are real regression guards.- "Accept": "text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1" (expected, PR) + "Accept": "text/markdown, text/html, text/plain" (received, reverted) - Minor note (non-blocking): the new
…JSON content…test still passes under the reverted code, because it mocksfetchWithTimeoutto return JSON regardless of the header — so it guards the JSON→model processing path, while the fourshould prefer/defaulttests guard the header strings. Both aspects are covered, just by different tests.
Test 4 — No regression: the token-saving markdown path is preserved
The one real risk is "could the new */* fallback make a markdown-capable server return something other than markdown?" Drove the real tool (auto) against a server that supports both text/markdown and application/json:
auto vs markdown+json server -> tool SUCCESS; served = text/markdown (token-saving path preserved ✓)
Because */* is q=0.1 (strictly the lowest), a compliant server still returns the most-preferred type it supports; */* only activates when nothing else matches. No change for normal HTML pages, markdown-capable servers, raw text files, or servers that ignore Accept.
Conclusion: the real tool now fetches JSON REST APIs (fixing #5611) across all four format values, the markdown token-saving path is untouched, and the suite genuinely guards the change. 👍
🇨🇳 中文版(合并参考)
✅ 维护者验证 —— 本地真实端到端测试(tmux)
我在本地用真实的 WebFetchTool(基于 PR head 构建)同时针对一个本地 GitHub 风格服务器和 #5611 里真实的 api.github.com 接口进行了验证,不只是看 diff。结论:LGTM,可以合并。 修复解决了 415、新测试能守住它、节省 token 的 markdown 路径被保留、且对非 JSON 目标没有回归。
环境
- 在 PR head
8637730c5上单独开了 git worktree,执行了真实npm ci(没有软链node_modules);所有命令都在真实 tmux 会话里跑,Nodev22.22.2,Linux x64。 - 探针从构建产物
core/dist里 import 线上同款WebFetchTool,跑它真实的execute()路径(真实fetchWithTimeout、真实内容协商)。只有下游 LLM 调用被 fakeConfig打桩——和web-fetch.test.ts完全一致——从而把本 PR 改动的抓取行为单独隔离出来验证。
测试 1 —— 基准事实:真实 GitHub API(#5611 的原始 URL)
Accept: text/markdown, text/html, text/plain -> HTTP 415 (旧 auto)
Accept: text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1 -> HTTP 200 (新 auto)
证实了 issue:bug 完全在于过窄的 Accept 头,*/*;q=0.1 修好了它。
测试 2 —— 真实 WebFetchTool 端到端,修复前 vs 修复后
同一个工具、同一个 JSON-only 服务器(以及同一个真实 GitHub 接口),唯一区别就是本 PR 对 getAcceptHeader() 的一行改动。Accept 列是服务器实际收到的头(服务器侧抓取),证明工具确实发送了我们所说的内容。
format |
修复前(main):header → HTTP → tool | 修复后(本 PR):header → HTTP → tool |
|---|---|---|
auto |
text/markdown, text/html, text/plain → 415 → ERROR |
text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1 → 200 → SUCCESS |
markdown |
text/markdown → 415 → ERROR |
text/markdown, */*;q=0.1 → 200 → SUCCESS |
html |
text/html → 415 → ERROR |
text/html, */*;q=0.1 → 200 → SUCCESS |
text |
text/plain → 415 → ERROR |
text/plain, */*;q=0.1 → 200 → SUCCESS |
- 修复前每种 format 都返回
WEB_FETCH_FALLBACK_FAILED;针对真实 GitHub 复现了 issue 原文:Error during fetch … Request failed with status code 415 Unsupported Media Type。 - 修复后每种 format 都成功,且 JSON 内容真正进入了模型 prompt。针对真实 GitHub,工具现在会把
published_at: 2026-01-27T11:50:52Z喂给模型——正是 HTML 页面无法提供的那个值。
测试 3 —— 单测 + A/B 测试有效性
- 在 PR 上:
web-fetch.test.ts→ 19/19 通过,包含新增的should process JSON content returned by fallback content negotiation和四个should prefer …头断言。 - 只把
getAcceptHeader()还原成text/*-only 字符串 → 恰好 4 个测试失败,每个都把新 header(expected)和旧 header(received)做了 diff,例如:恢复修复后又变绿 → 这四个 header 测试是真正的回归守卫。- "Accept": "text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1" (expected, PR) + "Accept": "text/markdown, text/html, text/plain" (received, 还原后) - 小提示(不阻塞):新增的
…JSON content…测试在还原代码后仍然通过,因为它把fetchWithTimeoutmock 成无论 header 如何都返回 JSON——所以它守的是 JSON→模型的处理路径,而四个should prefer/default测试守的是 header 字符串。两方面都覆盖了,只是由不同测试负责。
测试 4 —— 无回归:节省 token 的 markdown 路径被保留
唯一的真实风险是"新的 */* 兜底会不会让支持 markdown 的服务器返回 markdown 以外的东西?"用真实工具(auto)针对一个同时支持 text/markdown 和 application/json 的服务器:
auto vs markdown+json server -> tool SUCCESS; served = text/markdown(节省 token 的路径被保留 ✓)
因为 */* 的权重是 q=0.1(严格最低),合规服务器仍会返回它能提供的最优先类型;*/* 只在没有其它匹配时才生效。普通 HTML 页面、支持 markdown 的服务器、原始文本文件、以及忽略 Accept 的服务器都不受影响。
结论: 真实工具现在能在全部四种 format 下抓取 JSON REST API(修复 #5611),markdown 节省 token 路径未受影响,且测试套件确实守住了这个改动。👍
| headers: new Headers({ 'content-type': 'application/json' }), | ||
| text: () => | ||
| Promise.resolve( | ||
| JSON.stringify({ published_at: '2026-01-27T11:50:52Z' }), |
There was a problem hiding this comment.
[Critical] The JSON fallback test uses a fixture with no angle brackets or HTML entities, so it doesn't exercise the corruption path. In web-fetch.ts:128-135, the else branch feeds non-markdown, non-plain-text content through html-to-text's convert(). With the new */*;q=0.1 Accept header, application/json responses now hit this branch.
I verified empirically that convert() corrupts JSON in two ways:
- Angle bracket stripping:
convert('{"body":"Fix <script> injection"}')→{"body":"Fix— everything after<script>is consumed as an HTML element and discarded - Entity decoding:
convert('{"desc":"Use & for ampersand"}')→{"desc":"Use & for ampersand"}
This directly undermines the PR's goal of supporting JSON-only APIs. Real-world API responses (GitHub releases, CMS APIs) routinely contain HTML in string values.
Suggested fix — source: Flip the else branch to only convert actual HTML:
} else if (contentType.includes('text/html')) {
this.debugLogger.debug('[WebFetchTool] Converting HTML to text');
textContent = convert(responseText, { ... }).substring(0, MAX_CONTENT_LENGTH);
} else {
this.debugLogger.debug(`[WebFetchTool] Passing through ${contentType || 'unknown'} content as text`);
textContent = responseText.substring(0, MAX_CONTENT_LENGTH);
}Suggested fix — test: Use a fixture with angle brackets and HTML entities:
text: () => Promise.resolve(JSON.stringify({
body: '<p>Release <b>notes</b></p>',
desc: 'Use & for ampersand',
})),Then assert expect(receivedContent).toContain('<p>') and expect(receivedContent).toContain('&').
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
covered this in the latest push. The JSON fixture now includes <p>... and &, and the assertion checks those survive unchanged so this catches the html-to-text corruption path.
| case 'auto': | ||
| default: | ||
| return 'text/markdown, text/html, text/plain'; | ||
| return 'text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1'; |
There was a problem hiding this comment.
[Suggestion] The */* fallback invites large non-HTML responses (JSON API payloads, XML feeds) into the else branch at line 128, which calls convert(responseText, {...}).substring(0, MAX_CONTENT_LENGTH) — running html-to-text on the full response body before truncating. The markdown and plain-text branches truncate first (responseText.substring(0, MAX_CONTENT_LENGTH)). A multi-MB JSON response would be fully parsed by the HTML converter before being truncated to 100K chars, wasting CPU and memory.
| return 'text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1'; | |
| return 'text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1'; |
No source change needed here, but the else branch should truncate before converting for consistency:
textContent = convert(responseText.substring(0, MAX_CONTENT_LENGTH), { ... });— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
updated this path. JSON/unknown content types now skip html-to-text and pass through as text, while text/html is the only branch that converts. I also changed the JSON fixture to include <p> and & so the test catches conversion corruption.
✅ Local verification report (maintainer)Built and tested this PR locally to confirm the fix before merge. Environment
Results
Mutation test (the key evidence). I reverted The other 15 pass. This confirms the header change is genuinely pinned. Correctness review
Minor / non-blocking observation. The added Verdict: LGTM. Correct, minimal, mutation-verified; merges cleanly, type/lint/format clean. Safe to merge from my side. 🇨🇳 中文版✅ 本地验证报告(维护者)合并前在本地构建并测试了本 PR 以确认修复。 环境
结果
变异测试(关键证据)。 我把 其余 15 个通过。说明头部改动被真正锁定。 正确性审查
次要 / 非阻塞观察。 新增的 结论:LGTM。 正确、最小化、经变异测试验证;可干净合并,类型/lint/格式干净。从我这边看可以安全合并。 |
8637730 to
043965f
Compare
doudouOUC
left a comment
There was a problem hiding this comment.
No review findings on this revision. The prior Critical (JSON corrupted by html-to-text) and Suggestion are both correctly resolved. Downgraded from Approve to Comment: CI failing (Test on macos-latest, Node 22.x).
— qwen3.7-max via Qwen Code /review
|
@qwen-code /triage |
Maintainer local verification — real build + wire-level A/B ✅I verified this PR locally with the real Verdict: behaves exactly as described — both the content-negotiation fix and the JSON passthrough work. Looks mergeable. A couple of non-blocking observations below. Environment
Real wire-level A/B (same harness,
|
| Dimension | BASE (pre-PR) | PR 043965f7b |
|---|---|---|
auto Accept header on the wire |
text/markdown, text/html, text/plain |
text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1 |
| JSON API that 415s on text-only | ❌ HTTP 415 → web_fetch_fallback_failed (never reaches the model) |
✅ HTTP 200 → success, model received the body and answered published_at = 2026-01-27T11:50:52Z |
| JSON body handling (forced 200) | ❌ run through the HTML converter → mangled: Notes: Fix & ship; keep spacing. (<b>…</b> stripped, &→&, double space collapsed) |
✅ passed through raw/intact: Notes: <b>Fix</b> & ship; keep spacing. |
So the headline 415 case is fixed by the low-priority */*;q=0.1, and the new else-passthrough branch stops the HTML converter from corrupting non-HTML payloads. Both confirmed on real wire traffic.
Automated checks (on the real PR source)
- ✅ Unit tests —
vitest run src/tools/web-fetch.test.ts: 19/19 passed. - ✅ Prettier —
--checkon all 3 changed files: clean. - ✅ Type-check — 0 type errors in the PR's files (
web-fetch.ts,web-fetch.test.ts). (The full-packagetsc --buildreports errors only in unrelatedsrc/providers/__tests__/*files — a pre-existing artifact of my divergent-dist worktree setup, not caused by this PR.) ⚠️ ESLint — not runnable in my environment (eslint-plugin-check-fileis missing fromnode_modules, so the flat config fails to load anywhere). Formatting is covered by Prettier above; CI lint still applies.
Observations (non-blocking, optional follow-ups)
- Truncate-before-convert for HTML. This PR changes
convert(responseText).substring(0, MAX)→convert(responseText.substring(0, MAX), …). For HTML pages larger thanMAX_CONTENT_LENGTH(100 KB) this yields somewhat less text than before (you get the converted text of the first 100 KB of source HTML, rather than the first 100 KB of fully-converted text). It's a reasonable trade — it boundshtml-to-textwork on huge documents — but it is a behavior change for very large pages and isn't covered by a test. - Markup content types other than
text/htmlnow pass through raw. e.g.application/xhtml+xmlpreviously went through the HTML→text converter and now hits the new passthrough branch (raw markup to the model). Rare in practice, and theq=0.1weight makes servers prefer the text formats, so this is low-impact — just noting it as an edge of the new branch.
Neither blocks merge. Nice, focused fix with a real-world payoff for JSON APIs. 🚀
中文版(点击展开)
维护者本地验证 —— 真实构建 + 线级 A/B ✅
我在本地用真实的 WebFetchTool 源码(没有 mock 工具本身)对一个本地 mock 服务器做了验证:该服务器模拟 GitHub REST 这类纯 JSON API 的行为(当 Accept 头里没有通配符时返回 415,有通配符时返回 200 + application/json)。这样可以确定性地复现 issue 里的前后对比,不依赖 api.github.com。
结论:行为与描述完全一致 —— 内容协商修复和 JSON 直通都生效,可以合并。 下面有两条不阻塞的观察。
环境
- PR head
043965f7b,独立 worktree;base = merge-based350dd8df。Nodev22.22.2,Linux。 - 方法:用
tsx直接跑真实的web-fetch.ts(真实fetchWithTimeout、真实线上Accept头)。唯一被 stub 的是下游模型调用(getBaseLlmClient().generateText),用来捕获工具实际抽取出来的textContent—— 摘要路径本身不在本 PR 改动范围内。
真实线级 A/B(同一套 harness,仅在 base↔PR 之间替换 web-fetch.ts)
| 维度 | BASE(PR 之前) | PR 043965f7b |
|---|---|---|
auto 模式实际发出的 Accept 头 |
text/markdown, text/html, text/plain |
text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1 |
| 对 text-only 返回 415 的 JSON API | ❌ HTTP 415 → web_fetch_fallback_failed(根本到不了模型) |
✅ HTTP 200 → 成功,模型拿到 body 并回答 published_at = 2026-01-27T11:50:52Z |
| JSON body 的处理(强制 200) | ❌ 被丢进 HTML 转换器 → 被破坏:Notes: Fix & ship; keep spacing.(<b>…</b> 被剥掉、&→&、连续空格被合并) |
✅ 原样直通:Notes: <b>Fix</b> & ship; keep spacing. |
所以核心的 415 场景由低优先级的 */*;q=0.1 修复,新增的 else 直通分支也避免了 HTML 转换器破坏非 HTML 内容。两点都在真实线上流量中得到确认。
自动化检查(针对真实 PR 源码)
- ✅ 单元测试 ——
vitest run src/tools/web-fetch.test.ts:19/19 通过。 - ✅ Prettier —— 对 3 个改动文件
--check:干净。 - ✅ 类型检查 —— PR 自身文件(
web-fetch.ts、web-fetch.test.ts)0 类型错误。(整包tsc --build仅在无关的src/providers/__tests__/*报错,这是我 worktree 的 dist 版本不一致导致的既有现象,与本 PR 无关。) ⚠️ ESLint —— 在我的环境里跑不起来(node_modules缺eslint-plugin-check-file,flat config 无法加载)。格式问题已由上面的 Prettier 覆盖;CI 的 lint 仍然有效。
观察(不阻塞,可选的后续)
- HTML 改为「先截断再转换」。 本 PR 把
convert(responseText).substring(0, MAX)改成convert(responseText.substring(0, MAX), …)。对超过MAX_CONTENT_LENGTH(100 KB)的 HTML 页面,这会比之前拿到更少的文本(拿到的是「源 HTML 前 100 KB」转换后的文本,而不是「完整转换后文本」的前 100 KB)。这是个合理的取舍——能限制html-to-text在超大文档上的开销——但对超大页面是一种行为变化,且没有测试覆盖。 - 除
text/html外的标记类内容类型现在会原样直通。 例如application/xhtml+xml之前会走 HTML→文本转换,现在落到新的直通分支(把原始标记交给模型)。实际很少见,而且q=0.1的权重让服务器更倾向返回文本格式,所以影响很小——只是指出这是新分支的一个边界。
两点都不阻塞合并。是个聚焦、对 JSON API 很实用的小修复。🚀
Verified locally by the maintainer with a real source-level wire harness (build + 19/19 unit tests + base↔PR A/B). Posted via Qwen Code.
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
LGTM, looks ready to ship. ✅
What this PR does
This PR loosens
web_fetchcontent negotiation by adding a low-priority*/*;q=0.1fallback to everyAcceptheader the tool sends. The existing preferences stay intact:autostill prefers markdown, then HTML, then plain text, while explicitmarkdown,html, andtextformats still prefer their requested media type first.It also updates the web-fetch developer docs to describe the new headers and adds coverage for JSON content returned through fallback negotiation.
Why it's needed
web_fetchcurrently sends onlytext/*media types. JSON-only APIs such as GitHub REST reject that request with HTTP 415, so the tool cannot fetch API responses even though the downstream processing path already normalizes fetched content to text.Adding
*/*;q=0.1keeps the token-saving markdown path for servers that support it, but lets JSON APIs and other non-text endpoints respond when they cannot satisfy the preferred text formats.Reviewer Test Plan
How to verify
Run the focused unit tests and checks:
Manual content-negotiation check against the GitHub REST example from the issue:
Evidence (Before & After)
Before: the current
autoheader,Accept: text/markdown, text/html, text/plain, returns HTTP415for the GitHub REST release endpoint.After: the new
autoheader,Accept: text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1, returns HTTP200; the response includespublished_at: 2026-01-27T11:50:52Z.Focused test result:
src/tools/web-fetch.test.tspassed with 19 tests.Tested on
Environment (optional)
Node.js 22 local checkout. No sandbox required for the focused unit tests.
Risk & Scope
*/*hasq=0.1, and existing processing already converts fetched content to text before prompting the model.Linked Issues
Fixes #5611
AI Assistance Disclosure
I used Codex to review the changes, sanity-check the implementation against existing patterns, and help spot potential edge cases.
中文说明
What this PR does
这个 PR 放宽了
web_fetch的内容协商逻辑:给工具发送的每一种Accept头都加上低优先级的*/*;q=0.1兜底。原有偏好保持不变:auto仍然优先请求 markdown,然后是 HTML,再然后是纯文本;显式的markdown、html、text格式也仍然优先请求对应媒体类型。同时更新了 web-fetch 开发者文档,说明新的 header,并补了一个通过兜底协商拿到 JSON 内容的测试覆盖。
Why it's needed
当前
web_fetch只发送text/*媒体类型。像 GitHub REST 这样的纯 JSON API 会用 HTTP 415 拒绝这类请求,所以工具无法抓取 API 响应,尽管下游处理流程本来就会把抓取到的内容归一化成文本。添加
*/*;q=0.1后,支持 markdown 的服务器仍然会走节省 token 的 markdown 路径;但当 JSON API 或其他非文本端点无法满足首选文本格式时,也可以正常返回内容。Reviewer Test Plan
How to verify
运行 focused 单测和检查:
针对 issue 里的 GitHub REST 示例做手动内容协商检查:
Evidence (Before & After)
修改前:当前
autoheader,Accept: text/markdown, text/html, text/plain,访问 GitHub REST release endpoint 返回 HTTP415。修改后:新的
autoheader,Accept: text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1,返回 HTTP200;响应里包含published_at: 2026-01-27T11:50:52Z。focused 测试结果:
src/tools/web-fetch.test.ts通过,共 19 个测试。Tested on
Environment (optional)
Node.js 22 本地 checkout。focused 单测不需要 sandbox。
Risk & Scope
*/*的权重是q=0.1,且现有处理流程已经会在把内容交给模型前归一化为文本。Linked Issues
Fixes #5611
AI Assistance Disclosure
I used Codex to review the changes, sanity-check the implementation against existing patterns, and help spot potential edge cases.