fix(triage): strengthen PR gate with batch detection, problem existence check, and red flag patterns#5723
fix(triage): strengthen PR gate with batch detection, problem existence check, and red flag patterns#5723pomelo-nwu wants to merge 8 commits into
Conversation
…ce check, and red flag patterns The PR triage gate was too permissive — it approved 20 validation-noise PRs from an AI bot in a single day without blocking any. Root cause: the gate asked "is the direction correct?" (easy to pass) but never asked "does this problem actually exist?" (the real question). Three new checks added to pr-workflow.md: 1. Gate Philosophy — explicit stance: the gate says no by default, burden of proof is on the author. AI-bot volume does not equal value. 2. Stage 0: Batch Pattern Detection — before individual review, check if the same author has 3+ similar PRs in 7 days. If so, evaluate as a group and close the noise batch together. 3. Stage 1b: Problem Existence Check — mandatory before direction review. Distinguishes observed bugs (with reproduction) from theoretical hardening (without). Includes red flag phrases that signal "no real problem" (e.g. "the runtime validators already enforce..."). Also updated Stage 1 comment template and Stage 3 reflection questions.
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
Hey @pomelo-nwu — the PR body doesn't follow the PR template. The body uses ## Background and ## What changed instead of the required headings: ## What this PR does, ## Why it's needed, ## Reviewer Test Plan, ## Risk & Scope, and ## Linked Issues.
Please update the PR description to match the template. The content is all there — it just needs to be structured under the right headings so reviewers can find what they need.
中文说明
嗨 @pomelo-niju — PR 正文没有按照 PR 模板 填写。当前使用的是 ## Background 和 ## What changed,但模板要求的标题是:## What this PR does、## Why it's needed、## Reviewer Test Plan、## Risk & Scope 和 ## Linked Issues。
请按照模板重新组织 PR 描述。内容都在,只需要放到正确的标题下,方便 reviewer 查阅。
— Qwen Code · qwen3.7-max
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
Re-reviewed the latest head 9fabd3ee08b2483f1ab0b2ea1a2676b1291d4f7d. I did not find additional high-confidence issues beyond the existing open review threads.
Verification: GitHub checks are passing/skipped as expected; deterministic changed-file checks reported no findings; git diff --check passed. I am leaving this as a comment rather than an approval because the current review still has unresolved Critical threads around the batch-detection command and auto-close/escalation behavior.
- GPT-5 Codex via Qwen Code /review
… tighten prompt - Fix jq date comparison: string '7 days ago' always false, use fromdateiso8601 - Fix --state all to --state open to match '3+ open PRs' prose - Add terminal exit list after Stage 1 comment (1b problem-existence was missing) - Fix Stage 3 batch threshold (10+) to reference Stage 0 (3+) - Compress Gate Philosophy, Stage 0, and 1b for conciseness
|
@qwen-code /triage |
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
Hi @pomelo-nwu — thanks for the contribution! The PR body doesn't follow the PR template. Required headings are missing:
## What this PR does## Why it's needed## Reviewer Test Plan(with How to verify, Evidence, Tested on)## Risk & Scope## Linked Issues
The current body uses ## Background and ## What changed instead. Could you restructure to match the template? It helps reviewers (and the triage gate) evaluate the PR consistently.
The motivation and changes are clear from the description — this is just a formatting pass to align with the project's review workflow.
嗨 @pomelo-nwu — 感谢贡献!PR body 没有按照 PR 模板 填写,缺少以下必需标题:
## What this PR does## Why it's needed## Reviewer Test Plan(含 How to verify、Evidence、Tested on)## Risk & Scope## Linked Issues
当前使用的是 ## Background 和 ## What changed。请按模板结构调整一下,方便 reviewer 和 triage gate 统一评估。
动机和改动内容从描述里看得很清楚——只是格式上对齐一下项目的 review 流程。
— Qwen Code · qwen3.7-max
wenshao
left a comment
There was a problem hiding this comment.
Reviewed the latest head c7b69e93. The follow-up commit resolves the earlier review threads: the Stage 0 batch-detection jq filter now parses timestamps with fromdateiso8601 / now - 604800 (verified working end-to-end against gh's gojq engine), --state open matches the "open PRs" prose, the Stage 3 reflection threshold is aligned to 3+, and the terminal-exit list now includes the problem-existence stop. The sub-stage renumbering (1b→1c, 1c→1d) has no dangling references elsewhere. No new blocking issues. The remaining open Suggestion threads (batch-handling phrasing, "no reproduction = no fix" absolutism) are non-blocking doc-wording calls at the author's discretion. LGTM! ✅
— claude-opus-4-8[1m] via Qwen Code /qreview
CI "failure" is a transient checkout race, not a code/test problem — just re-run the jobThe red Test (ubuntu-latest) job did not fail on a test. Looking at the job steps, it failed at step 2, The actual error (from the Checkout step):
GitHub regenerates Why this is infra, not your change:
Action: none on the PR. Just Re-run failed jobs — the merge ref is stable now, so checkout will succeed. No rebase or code change is needed. 中文版这个 CI "失败"是一次短暂的 checkout 竞态,不是代码/测试问题 —— 重跑该 job 即可红的 Test (ubuntu-latest) job 并不是测试挂了。看 job 的步骤,它挂在第 2 步 真正的报错(来自 Checkout 步骤):
GitHub 在目标分支推进时会异步重新生成 为什么这是基础设施问题,而不是你的改动:
该怎么做: PR 这边无需任何改动。直接 Re-run failed jobs 即可 —— merge ref 现在已经稳定,checkout 会成功。不需要 rebase,也不需要改代码。 |
Add Stage 0b that flatly rejects community-contributed refactor PRs touching core infrastructure (packages/core, auth, providers, models, config, tools, services). No exceptions — core refactors must be maintainer-initiated with prior design discussion. Triggered by PR #5089: 75-file refactor across core/auth/providers/models that should never have been accepted from a community contributor.
eed53c7 to
d8397eb
Compare
…approving The meta-principle: 'the direction looks correct' is the most dangerous sentence in triage — it means the gate understood intent but not impact. For core module changes, the gate must name every downstream consumer; if it cannot, escalate to maintainer. 100% confidence or escalate. Updated both pr-workflow.md (Stage 0b) and AGENTS.md (Working Principles).
Previous version asked the gate to 'assess whether it understands the impact' — too soft, AI will always rationalize that it understands enough. New version: non-maintainer + core paths = instant reject. No evaluation, no thinking, no exceptions. The gate is not qualified to judge core refactors. Period.
…dence for small
Large-scope core changes (10+ files / 500+ lines) are a hard block — no
evaluation, no exceptions. Small-scope core changes may proceed only if
the gate is 100% confident; any doubt triggers maintainer escalation.
Previous versions were either too soft ('assess your understanding') or
too blunt ('all non-maintainer core PRs rejected'). Two-tier approach
allows legitimate small bugfixes through while walling off refactors.
…lways returns empty for human contributors Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
|
|
||
| ### Stage 0b: Core Module Protection (two-tier check) | ||
|
|
||
| Core infrastructure: `packages/core/src/**`, auth, providers, models, config, tools, services, cross-package changes. |
There was a problem hiding this comment.
[Critical] The core infrastructure definition uses vague category names — "auth, providers, models, config, tools, services" — without mapping them to concrete path patterns. An LLM agent may interpret "auth" narrowly (only packages/core/src/auth/) and miss packages/core/src/providers/oauth.ts or packages/cli/src/auth-handler.ts, leading to inconsistent classification across triage runs.
Additionally, "cross-package changes" is listed here but omitted from the AGENTS.md definition — the two files are already inconsistent in the same PR.
| Core infrastructure: `packages/core/src/**`, auth, providers, models, config, tools, services, cross-package changes. | |
| Core infrastructure: files matching any of `packages/core/src/**`, `packages/*/src/auth/**`, `packages/*/src/providers/**`, `packages/*/src/models/**`, `packages/*/src/config/**`, `packages/*/src/tools/**`, `packages/*/src/services/**`. |
— qwen3.7-max via Qwen Code /review
|
|
||
| Core infrastructure: `packages/core/src/**`, auth, providers, models, config, tools, services, cross-package changes. | ||
|
|
||
| **Tier 1 — Large-scope changes to core → HARD BLOCK.** If the PR touches 10+ files or 500+ lines in core paths → reject immediately. No evaluation, no Stage 1. Reply: "This change touches core infrastructure at scale. Core refactors must be maintainer-initiated — please open an issue to discuss the design first." Then **stop**. This is a wall, not a guideline. |
There was a problem hiding this comment.
[Critical] Two issues with this Tier 1 threshold:
-
"500+ lines" is unspecified — additions only, deletions only, or combined? A 300-line refactor (300 deletions + 300 additions) is 300 lines of real change but 600 by
git diff --stat. -
No
ghcommand to deliver the hard-block reply. Stage 1a providesgh pr review --request-changes --body-file, but Stage 0b only specifies the reply text with no delivery mechanism. The agent must guess whether to usegh pr review --request-changes,gh pr close, orgh api.
| **Tier 1 — Large-scope changes to core → HARD BLOCK.** If the PR touches 10+ files or 500+ lines in core paths → reject immediately. No evaluation, no Stage 1. Reply: "This change touches core infrastructure at scale. Core refactors must be maintainer-initiated — please open an issue to discuss the design first." Then **stop**. This is a wall, not a guideline. | |
| **Tier 1 — Large-scope changes to core → HARD BLOCK.** If the PR touches 10+ files or 500+ added/deleted lines (combined diff) in core paths → reject immediately. No evaluation, no Stage 1. | |
| ```bash | |
| gh pr review "$PR_NUMBER" --repo "$REPO" --request-changes --body "This change touches core infrastructure at scale. Core refactors must be maintainer-initiated — please open an issue to discuss the design first." |
Then stop. This is a wall, not a guideline.
_— qwen3.7-max via Qwen Code /review_
|
|
||
| **Tier 1 — Large-scope changes to core → HARD BLOCK.** If the PR touches 10+ files or 500+ lines in core paths → reject immediately. No evaluation, no Stage 1. Reply: "This change touches core infrastructure at scale. Core refactors must be maintainer-initiated — please open an issue to discuss the design first." Then **stop**. This is a wall, not a guideline. | ||
|
|
||
| **Tier 2 — Small-scope changes to core → evaluate with 100% confidence.** If the PR touches fewer files but still hits core paths, you MAY proceed to Stage 1 — but only if you are **100% confident** the change is correct and safe. If there is any doubt at all — "the direction looks correct" is NOT 100% confidence — escalate to maintainer before proceeding. You must be able to name every downstream consumer affected; if you cannot, escalate. |
There was a problem hiding this comment.
[Critical] Tier 2 requires the agent to "name every downstream consumer affected" and be "100% confident" — but Stage 0b runs before Stage 1 (worktree creation) and Stage 2a (code reading). At this point, the agent has only seen gh pr diff metadata (file list), not the code itself. Naming downstream consumers requires reading import graphs, which hasn't happened yet.
This makes Tier 2 functionally identical to Tier 1 — every small-scope core change escalates because the confidence condition is structurally impossible to satisfy pre-code-review. The "two-tier" framing becomes misleading.
Consider either: (a) moving the downstream-consumer check into Stage 2a after code review, or (b) replacing "100% confident" with achievable pre-code-read criteria (e.g., "single-file change, existing tests cover the path, no new exports or type changes").
— qwen3.7-max via Qwen Code /review
|
|
||
| Save this comment's ID. If direction is escalated → stop here. Template | ||
| failures already stopped in Stage 1a. | ||
| Save this comment's ID. Terminal exits — stop here if any applies: |
There was a problem hiding this comment.
[Critical] The terminal exits list is missing Stage 0b (core module hard block). If the agent evaluates stages out of order (common with LLM agents), there is no entry telling it to stop after a Stage 0b rejection. The agent could post a Stage 0b hard-block reply and then proceed to Stage 2 code review, producing contradictory output.
| Save this comment's ID. Terminal exits — stop here if any applies: | |
| Save this comment's ID. Terminal exits — stop here if any applies: | |
| - Core module hard block (Stage 0b) → rejected, do not proceed. | |
| - Template failure → stopped in 1a. | |
| - Problem does not exist → request changes in 1b, do not proceed to Stage 2. | |
| - Direction escalated → stop here. |
— qwen3.7-max via Qwen Code /review
| Save this comment's ID. Terminal exits — stop here if any applies: | ||
|
|
||
| - Template failure → stopped in 1a. | ||
| - Problem does not exist → request changes in 1b, do not proceed to Stage 2. |
There was a problem hiding this comment.
[Critical] Stage 1b says the terminal action for "no reproduction" is: first request changes (ask for reproduction), then close if the author cannot provide one. But this terminal exits list says only "request changes in 1b" — a different gh operation from gh pr close.
Clarify the two-step flow: e.g., "Problem does not exist → request changes in 1b (first run) or close (re-run with no reproduction provided), do not proceed to Stage 2."
— qwen3.7-max via Qwen Code /review
|
|
||
| - **Stop individual review.** Evaluate the batch as a group first. | ||
| - Mostly noise → close the batch with a single explanation. | ||
| - A few have value → extract those, close the rest. |
There was a problem hiding this comment.
[Suggestion] "Extract those, close the rest" provides no procedure for extraction. The triage agent has gh commands for closing and commenting, but no workflow for extracting code from one PR into another. The agent will improvise or skip extraction entirely.
Consider replacing with concrete instructions, e.g., "Comment on the valuable PRs asking the author to resubmit them individually with focused scope. Close the remaining batch PRs."
— qwen3.7-max via Qwen Code /review
|
|
||
| _Adapted from Andrej Karpathy's [CLAUDE.md](https://github.com/multica-ai/andrej-karpathy-skills/blob/main/CLAUDE.md)._ | ||
|
|
||
| ### Core Infrastructure Is Maintainer-Only (two-tier rule) |
There was a problem hiding this comment.
[Critical] This triage-gate-specific rule is placed under "Working Principles" in AGENTS.md, which is loaded into every agent session — including feat-dev, bugfix, and qwen-code-claw. The heading says "Maintainer-Only" and the body says "hard block, no exceptions." A coding agent that reads this may refuse to touch core files during feature development or bug fixing, interpreting a triage-gate policy as a development-time prohibition.
Additionally, the core definition here (packages/core/src/**, auth, providers, models, config, tools, services) omits "cross-package changes" that pr-workflow.md Stage 0b includes — the two files are inconsistent in the same PR.
Consider either: (a) scoping the heading to triage context ("PRs from external contributors touching core..."), (b) adding an explicit note that this applies only to the triage gate, or (c) keeping the full rule in pr-workflow.md only and replacing this section with a one-liner pointer.
— qwen3.7-max via Qwen Code /review
What this PR does
Adds two new checks to the PR triage gate and fixes several control flow bugs in the gate prompt (
pr-workflow.md):Bug fixes from review feedback:
"7 days ago"string comparison was always false) and--state all/--state openmismatch.Why it's needed
On 2026-06-22, a high-volume automated contributor submitted 20 PRs in a single day. 11 of them were validation noise — "reject fractional maxConnections", "require integer read_file ranges", "validate sessions list limit" — each individually plausible, each collectively a waste of maintainer attention. Our PR triage gate approved all of them.
The root cause: the gate asked "is the direction correct?" — a question every one of these PRs could pass, because "rejecting invalid values" sounds reasonable. It never asked the harder question: "does this problem actually exist?" Nobody ever passed
maxConnections: 1.5. The runtime already coerced or rejected these values. The PRs were schema hygiene dressed up as bug fixes, and the gate waved them through.This PR closes that gap: the gate now checks problem existence before direction, and detects batch patterns before individual review.
Reviewer Test Plan
This is a prompt-only change (no code paths affected). Verification is by reading the modified
pr-workflow.mdand confirming the gate logic is correct.How to verify
.qwen/skills/triage/references/pr-workflow.mdand confirm the Stage 0 jq command returns correct results (the old"7 days ago"string comparison was always false).Evidence (Before & After)
N/A — prompt-only change, no user-visible behavior.
Tested on
Environment (optional)
N/A
Risk & Scope
Linked Issues
Related to the 2026-06-22 batch incident (no specific issue filed).
中文说明
这个 PR 做了什么
在 PR triage gate 中新增两项检查,并修复 gate prompt(
pr-workflow.md)中的多个控制流 bug:来自 review 反馈的 bug 修复:
"7 days ago"字符串比较永远返回 false)以及--state all/--state open不一致。为什么需要
2026-06-22,一个高频自动贡献者在一天内提交了 20 个 PR。其中 11 个是验证噪音——"拒绝小数 maxConnections"、"要求整数 read_file 范围"、"验证 sessions list limit"——每个单独看都合理,合在一起是浪费 maintainer 时间。我们的 PR triage gate 全部放行。
根因:gate 问的是"方向对不对"——这些 PR 都能过,因为"拒绝非法值"听起来没毛病。但从未问更难的问题:"这个问题实际存在吗?" 没人传过
maxConnections: 1.5,运行时已经 coerce 或拒绝了这些值。这些 PR 是把 schema 洁癖包装成 bug fix,gate 全部放行。这个 PR 补上了这个缺口:gate 现在在方向 review 之前检查问题存在性,在逐个 review 之前检测批量模式。
Reviewer 测试计划
纯 prompt 改动(不影响代码路径)。验证方式是阅读修改后的
pr-workflow.md确认 gate 逻辑正确。如何验证
.qwen/skills/triage/references/pr-workflow.md,确认 Stage 0 的 jq 命令能返回正确结果(旧的"7 days ago"字符串比较永远返回 false)。证据(改进前 vs 改进后)
N/A — 纯 prompt 改动,无用户可见行为变化。
测试环境
环境(可选)
N/A
风险与范围
关联 Issue
关联 2026-06-22 批量事件(未单独建 issue)。