Self-Improvement: convergence, consistency, and signal preservation in review loops #491

## Background

A recent large security-focused PR ([nexu-io/open-design#1704](https://github.com/nexu-io/open-design/pull/1704), 72 files, +8,529/-1,362 lines, 26 days of review, ~175 inline comments) inadvertently served as a comprehensive natural test case for Looper's review loop. The PR touched nearly every web security domain — authentication, network routing, CSRF, session management, middleware chaining, and tunnel proxy handling — exposing Looper to a wide variety of review scenarios in a single run.

This proved valuable: Looper successfully identified 8 genuine bugs, including a critical XFF spoofing bypass and a subtle IPv6 BigInt mask error. At the same time, the breadth of cases revealed recurring patterns in how the review loop itself behaves — specifically around convergence, consistency, and signal preservation.

This report documents those patterns and proposes targeted improvements. Each proposal is grounded in the observed behavior and supported by relevant research on automated code review and iterative AI systems.

### References

The patterns observed align with findings from prior work on automated review systems and iterative AI refinement:

- **Wang et al., 2023**: "Self-Consistency Improves Chain of Thought Reasoning in Language Models" [[arXiv](https://arxiv.org/abs/2203.11171)], published at ICLR 2023 (6,800+ citations). LLMs can produce contradictory outputs for the same input across iterations. Consistency mechanisms are needed when using LLMs in multi-round workflows.

- **Jin & Chen, 2025**: "Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement" [[arXiv](https://arxiv.org/abs/2603.00539)]. LLM reviewers systematically overcorrect: they flag correct code as defective and propose unnecessary changes. This directly relates to review signal quality and the risk of fix suggestions introducing regressions.

- **Adams et al., 2025**: "Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency" [[arXiv](https://arxiv.org/abs/2605.30208)]. Industry case study at Meta scale. Key lesson: risk-calibrated review allocation is essential to maintain signal-to-noise ratio. Low-risk automated comments erode reviewer trust when not properly filtered.

- **Al-Maamari, 2025**: "Why LLMs Fail: A Failure Analysis and Partial Success Measurement for Automated Security Patch Generation" [[arXiv](https://arxiv.org/abs/2603.10072)]. Systematic failure analysis of LLM-generated security patches. Documents measurable failure modes, including incorrect fix logic and introduced regressions, which is directly relevant to validating AI review suggestions.

---

## 1. Bugs Found by Looper (Value Delivered)

Looper demonstrated strong bug-finding capability on a TypeScript/Express codebase:

| # | Bug | Severity | Description |
|---|-----|----------|-------------|
| 1 | IPv6 /128 BigInt mask error | HIGH | `0xffffffff_ffffffffn` is a 64-bit mask, not 128-bit. Combined with BigInt sign issues, all IPv6 /128 CIDRs match. |
| 2 | XFF spoofing → loopback impersonation | CRITICAL | With `OD_TRUST_PROXY` enabled, XFF is trusted without TCP peer verification. `X-Forwarded-For: 127.0.0.1` completely bypasses auth. |
| 3 | `OD_API_TOKEN` missing from `authEnabledRef` | HIGH | Deployments using only an API token have auth treated as disabled. |
| 4 | SPA catch-all intercepts `/mcp` route | HIGH | Express route ordering bug completely blocks remote MCP endpoint. |
| 5 | Missing `next()` call | HIGH | Requests with valid auth tokens hang until timeout. |
| 6 | Session reset on bootstrap | HIGH | Session resets immediately after first key generation, blocking the browser that just created the key. |
| 7 | Tunnel environment loopback bypass | HIGH | Cloudflare Tunnel forwards to 127.0.0.1, which invalidates the assumption behind the auth loopback bypass. |
| 8 | CSRF on `/api/auth/reset-keys` | HIGH | Destructive endpoint with no Origin verification. |
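
Bug 1 comes down to mask width. The following is a minimal sketch of that one aspect, not the PR's code: `parseIpv6`, `ipv6PrefixMask`, and `ipv6InCidr` are hypothetical helpers, the parser only accepts plain hex groups with at most one `::`, and the sign issue mentioned in the table is not modeled.

```typescript
// Hypothetical helpers for illustration; not taken from the PR under review.
const IPV6_BITS = 128n;
const ALL_ONES_128 = (1n << IPV6_BITS) - 1n;

// The literal from the bug table: 16 hex digits, so only 64 bits wide.
const SIXTY_FOUR_BIT_MASK = 0xffffffff_ffffffffn;

/** Parse a textual IPv6 address (hex groups, at most one "::") into a bigint. */
function parseIpv6(addr: string): bigint {
  const halves = addr.split("::");
  if (halves.length > 2) throw new Error(`unsupported IPv6 address: ${addr}`);
  const hasGap = halves.length === 2;
  const [head = "", tail = ""] = halves;
  const headParts = head === "" ? [] : head.split(":");
  const tailParts = tail === "" ? [] : tail.split(":");
  const missing = 8 - headParts.length - tailParts.length;
  if (hasGap ? missing < 1 : missing !== 0) {
    throw new Error(`unsupported IPv6 address: ${addr}`);
  }
  const zeros = new Array<string>(hasGap ? missing : 0).fill("0");
  const groups = [...headParts, ...zeros, ...tailParts];
  if (!groups.every((g) => /^[0-9a-fA-F]{1,4}$/.test(g))) {
    throw new Error(`unsupported IPv6 address: ${addr}`);
  }
  // Shift in 16 bits per group, most significant group first.
  return groups.reduce((acc, g) => (acc << 16n) | BigInt(parseInt(g, 16)), 0n);
}

/** Build a 128-bit mask with the top `prefixLen` bits set. */
function ipv6PrefixMask(prefixLen: number): bigint {
  if (!Number.isInteger(prefixLen) || prefixLen < 0 || prefixLen > 128) {
    throw new RangeError(`invalid IPv6 prefix length: ${prefixLen}`);
  }
  const hostBits = IPV6_BITS - BigInt(prefixLen);
  return (ALL_ONES_128 >> hostBits) << hostBits;
}

function ipv6InCidr(addr: string, network: string, prefixLen: number): boolean {
  const mask = ipv6PrefixMask(prefixLen);
  return (parseIpv6(addr) & mask) === (parseIpv6(network) & mask);
}

// Two addresses that differ only in their upper 64 bits.
const addrA = parseIpv6("2001:db8::1");
const addrB = parseIpv6("fe80::1");

// The 64-bit mask discards the upper half, so the addresses look identical.
const narrowMatch = (addrA & SIXTY_FOUR_BIT_MASK) === (addrB & SIXTY_FOUR_BIT_MASK);
// A true /128 mask keeps all 128 bits, so they do not match.
const fullMatch = ipv6InCidr("2001:db8::1", "fe80::1", 128);
```

Deriving the mask from the prefix length, rather than hard-coding a literal, keeps the width correct for every prefix from /0 to /128.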

**Takeaway:** Looper's core value proposition — finding subtle security bugs that humans miss — is validated. The issues below are about improving the *process* around that capability.

---

## 2. Self-Improvement Problems

### 2.1 Contradictory Review Rounds

On the same code path (`OD_TRUST_PROXY=1` + loopback + no XFF header), one reviewer produced **8 round trips over 12 days** with contradictory guidance:

1. "Add fail-closed behavior" → Author implements fail-closed
2. "Now direct loopback is blocked" → Reviewer criticizes their own requirement
3. Author points out contradiction, proposes Option 3
4. Human coordinator intervenes: "Clarify the invariant"
5. Reviewer responds "fail-open" → Author implements header-presence check
6. Reviewer: "Still fail-open" (×3 more rounds)

**Root cause:** No invariant specification before review started. The reviewer had no stable reference point, so each round applied a different implicit policy.

### 2.2 Cyclic Re-review with No Dedup

| Metric | Reviewer A | Reviewer B |
|--------|-----------|-----------|
| Total rounds | 7 | 15+ |
| Productive rounds | ~60% | ~35% |
| Duplicate comments | ~30% | ~50% |

Of the ~175 inline comments, roughly half duplicated previously raised issues. The system has no mechanism for tracking what has already been found, so each round rediscovers the same problems.

**Root cause:** Every push triggers a full re-review with no memory of previous rounds. The diff + context is re-analyzed from scratch each time.

### 2.3 Proposed Fixes Introducing New Bugs

| Proposal | Consequence |
|----------|-------------|
| "Enforce fail-closed" | Completely blocks local CLI + browser access |
| "Apply `isLocalManagementRequest`" | Removes Origin verification → weakens CSRF → same reviewer discovers the CSRF regression next round |

**Root cause:** Fix suggestions are not validated against impact scenarios before being proposed.

### 2.4 Late Discovery of Critical Bugs

The missing `next()` call (bug 5 in Section 1, rated HIGH) is a **critical runtime bug**: every request with a valid auth token hangs until timeout. Yet it was not discovered until round 15+. Before that, review capacity was consumed by the XFF/loopback policy debate.
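
For readers unfamiliar with this failure mode, here is a minimal sketch of how a middleware can leave a request hanging. It is an illustration under assumptions, not the PR's code: `Req`, `Res`, and `Next` are small stand-ins for Express's types so the example stays self-contained, and `EXPECTED_TOKEN`, `buggyAuth`, `fixedAuth`, and `settles` are hypothetical names.

```typescript
// Minimal stand-ins for Express's request/response/next (assumptions for this sketch).
interface Req { headers: Record<string, string | undefined>; }
interface Res { statusCode: number; ended: boolean; end(): void; }
type Next = () => void;
type Middleware = (req: Req, res: Res, next: Next) => void;

const EXPECTED_TOKEN = "example-token"; // hypothetical value

// Buggy shape: bad tokens are rejected, but a valid token neither
// sends a response nor calls next(), so the request is never completed.
const buggyAuth: Middleware = (req, res, _next) => {
  if (req.headers["authorization"] !== `Bearer ${EXPECTED_TOKEN}`) {
    res.statusCode = 401;
    res.end();
  }
};

// Fixed shape: every path either ends the response or hands the request on.
const fixedAuth: Middleware = (req, res, next) => {
  if (req.headers["authorization"] !== `Bearer ${EXPECTED_TOKEN}`) {
    res.statusCode = 401;
    res.end();
    return;
  }
  next();
};

/** Runs a middleware once and reports whether the request was answered or passed on. */
function settles(mw: Middleware, token: string): boolean {
  let passedOn = false;
  const res: Res = {
    statusCode: 200,
    ended: false,
    end() { this.ended = true; },
  };
  mw({ headers: { authorization: `Bearer ${token}` } }, res, () => { passedOn = true; });
  return passedOn || res.ended;
}
```

A check like `settles` is cheap to run for each middleware, which is one reason this class of bug is a good candidate for the first review pass.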

**Root cause:** No priority ordering. Policy and design debates receive the same attention as runtime correctness bugs.

### 2.5 Importance Loss After Request-Changes Cap

Once a reviewer reaches the request-changes cap, all subsequent comments are posted with `COMMENT` status. This makes it impossible to distinguish "I noticed a style issue" from "this is a blocking security vulnerability."

**Root cause:** Cap mechanism changes comment semantics without preserving severity signal.

---

## 3. Improvement Proposals

### P-001: Require Invariant Specification Before Review

**Problem:** Security reviews cycle when reviewers apply different implicit policies each round.

**Proposal:** For security-related PRs, require an invariant document before review begins. Example invariants for the XFF/auth case:

1. Direct loopback connections always get management access
2. Absent XFF → treat as direct connection
3. Present but empty XFF → reject

**Implementation:** Add a `LOOPER_INVARIANTS` field to the PR body or a linked document. Reviewers must reference invariants when raising issues.
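
One way to make such invariants checkable is to write them as test cases that any candidate policy can be run against. The sketch below assumes a simplified request model: `RequestFacts`, `Decision`, `decide`, and `violatedInvariants` are hypothetical names, and the `"authenticate"` outcome for non-loopback and forwarded requests is an assumption that the three invariants above do not state.

```typescript
// Hypothetical model of the three example invariants; not Looper's or the PR's API.
type Decision = "management" | "authenticate" | "reject";

interface RequestFacts {
  /** True when the TCP peer address is a loopback address. */
  peerIsLoopback: boolean;
  /** Raw X-Forwarded-For value, or undefined when the header is absent. */
  xff: string | undefined;
}

function decide(facts: RequestFacts): Decision {
  if (facts.xff === undefined) {
    // Invariant 2: absent XFF is treated as a direct connection.
    // Invariant 1: direct loopback connections always get management access.
    return facts.peerIsLoopback ? "management" : "authenticate";
  }
  if (facts.xff.trim() === "") {
    // Invariant 3: present but empty XFF is rejected.
    return "reject";
  }
  // Assumption, not a listed invariant: forwarded requests go through normal auth.
  return "authenticate";
}

interface InvariantCase { invariant: string; facts: RequestFacts; expected: Decision; }

const invariantCases: InvariantCase[] = [
  { invariant: "1: direct loopback gets management access",
    facts: { peerIsLoopback: true, xff: undefined }, expected: "management" },
  { invariant: "2: absent XFF is treated as a direct connection",
    facts: { peerIsLoopback: false, xff: undefined }, expected: "authenticate" },
  { invariant: "3: present but empty XFF is rejected",
    facts: { peerIsLoopback: true, xff: "" }, expected: "reject" },
];

/** Returns the invariants a candidate policy breaks, so a review comment can cite them. */
function violatedInvariants(policy: (facts: RequestFacts) => Decision): string[] {
  return invariantCases
    .filter((c) => policy(c.facts) !== c.expected)
    .map((c) => c.invariant);
}

// A toy rendering of the round-1 "fail-closed" request: reject whenever XFF is absent.
const failClosed = (facts: RequestFacts): Decision =>
  facts.xff === undefined ? "reject" : decide(facts);
```

Under these assumptions, `violatedInvariants(failClosed)` reports invariants 1 and 2, which is the contradiction that took several rounds to surface in Section 2.1.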

**Expected impact:** Should prevent contradiction cycles like the 12-day one observed in this case study, because every round is checked against the same written invariants.

**Related:** Wang et al. (2023) sample multiple reasoning paths for the same input and find that the resulting answers can disagree, which is why multi-round use needs a stable anchor. Invariant specs provide that anchor. Jin & Chen (2025) confirmed that LLM reviewers systematically overcorrect when no ground truth is specified.

---

### P-002: Dedup + Convergence Detection

**Problem:** ~50% of comments are duplicates. No termination condition exists.

**Proposal:**

1. **Dedup:** Same file + line + issue → suppress new comment, update existing
2. **Convergence:** 3 consecutive rounds with 0 new findings → auto-terminate
3. **Priority tracking:** CRITICAL/HIGH re-verified each round; MEDIUM/LOW raised once only
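
The three rules above can be sketched as a small ledger that persists across rounds. This is one possible reading, not Looper's implementation: `Finding`, `ReviewLedger`, and `DRY_ROUNDS_TO_STOP` are hypothetical names, the `issue` field is assumed to be a short normalized label, and keying on the raw line number is a simplification, since lines shift between pushes.

```typescript
// Hypothetical data model; Looper's real comment schema is not shown in this report.
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface Finding {
  file: string;
  line: number;
  issue: string; // short normalized label, e.g. "xff-spoofing"
  severity: Severity;
}

interface RoundResult {
  post: Finding[];    // new findings: create a comment
  update: Finding[];  // known CRITICAL/HIGH findings: refresh the existing comment
  converged: boolean; // true after enough consecutive rounds with nothing new
}

const DRY_ROUNDS_TO_STOP = 3;

class ReviewLedger {
  private readonly seen = new Set<string>();
  private dryRounds = 0;

  // Rule 1: same file + line + issue counts as the same finding.
  private static key(f: Finding): string {
    return `${f.file}:${f.line}:${f.issue.trim().toLowerCase()}`;
  }

  recordRound(findings: Finding[]): RoundResult {
    const post: Finding[] = [];
    const update: Finding[] = [];
    for (const f of findings) {
      const key = ReviewLedger.key(f);
      if (!this.seen.has(key)) {
        this.seen.add(key);
        post.push(f);
      } else if (f.severity === "CRITICAL" || f.severity === "HIGH") {
        // Rule 3: high-severity findings are re-verified on the existing comment.
        update.push(f);
      }
      // Rule 3: known MEDIUM/LOW findings are dropped (raised once only).
    }
    // Rule 2: count consecutive rounds without new findings.
    this.dryRounds = post.length === 0 ? this.dryRounds + 1 : 0;
    return { post, update, converged: this.dryRounds >= DRY_ROUNDS_TO_STOP };
  }
}
```

Calling `recordRound` once per push gives the loop both a memory of earlier rounds and an explicit stop signal.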

**Expected impact:** ~50% comment volume reduction, clear termination signal.

**Related:** Adams et al. (2025) found at Meta scale that risk-calibrated review allocation is essential — low-risk automated comments erode reviewer trust. Jin & Chen (2025) confirmed systematic overcorrection increases comment volume without proportional value.

---

### P-003: Impact Scenarios for Fix Proposals

**Problem:** AI-proposed fixes can introduce new bugs (e.g., fail-closed blocking all local access).

**Proposal:** Require every fix suggestion to include an "affected user scenarios" section:

```
## Suggested Fix
Change XFF handling to fail-closed.

## Affected Scenarios
- ✅ Remote attacker with spoofed XFF → blocked (intended)
- ❌ Local CLI user without proxy → blocked (side-effect)
- ❌ Browser on localhost → blocked (side-effect)
```
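
The "Affected Scenarios" section could be generated rather than written by hand, by running the proposed rule against a fixed list of scenarios. The sketch below is a toy model under assumptions: `Scenario`, `Policy`, `failClosed`, and `affectedScenarios` are hypothetical names, and the fail-closed rule shown is a simplified stand-in for whatever the reviewer actually proposes.

```typescript
// Hypothetical scenario model; the fields are assumptions for this sketch.
type Outcome = "allowed" | "blocked";

interface Scenario {
  name: string;
  peerIsLoopback: boolean;
  xff: string | undefined; // X-Forwarded-For value, undefined when absent
  expected: Outcome;       // what the author intends to happen
}

type Policy = (s: Scenario) => Outcome;

// Toy "fail-closed" rule: block when XFF is missing or claims loopback.
const failClosed: Policy = (s) =>
  s.xff === undefined || s.xff.trim() === "127.0.0.1" ? "blocked" : "allowed";

const scenarios: Scenario[] = [
  { name: "Remote attacker with spoofed XFF", peerIsLoopback: false, xff: "127.0.0.1", expected: "blocked" },
  { name: "Local CLI user without proxy", peerIsLoopback: true, xff: undefined, expected: "allowed" },
  { name: "Browser on localhost", peerIsLoopback: true, xff: undefined, expected: "allowed" },
];

/** Renders one Markdown bullet per scenario, marking unintended outcomes as side-effects. */
function affectedScenarios(policy: Policy, cases: Scenario[]): string[] {
  return cases.map((s) => {
    const actual = policy(s);
    const intended = actual === s.expected;
    return `- ${intended ? "✅" : "❌"} ${s.name} → ${actual} (${intended ? "intended" : "side-effect"})`;
  });
}

const report = affectedScenarios(failClosed, scenarios);
```

With these inputs, `report` reproduces the three bullets shown in the example section, so the side-effects are visible before the author starts implementing.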

**Expected impact:** Side-effects surfaced before author implements, reducing round-trip waste.

**Related:** Al-Maamari (2025) documented systematic failure modes in LLM-generated security patches, including incorrect fix logic and regressions. Explicit impact validation before proposal reduces error propagation.

---

### P-004: Runtime Bug Priority Ordering

**Problem:** Critical runtime bugs (`next()` omission) go undiscovered while policy debates consume review capacity.

**Proposal:** Enforce review ordering:

1. **Runtime correctness** (bugs, crashes, hangs)
2. **Security vulnerabilities** (injection, auth bypass, CSRF)
3. **Policy / design discussions** (fail-open vs fail-closed)

Runtime bugs must be fully catalogued before moving to security review, and security before policy.
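
A minimal sketch of how this ordering might be enforced, assuming each finding is already tagged with one of the three categories. `Category`, `orderFindings`, `currentPhase`, and `mayRaise` are hypothetical names, and how a phase gets marked as fully catalogued is left open.

```typescript
// Hypothetical phase model for the proposed ordering.
type Category = "runtime" | "security" | "policy";

// Review phases in the order the proposal requires.
const PHASES: readonly Category[] = ["runtime", "security", "policy"];

interface Finding { title: string; category: Category; }

/** Sorts findings so runtime bugs come first, then security, then policy. */
function orderFindings(findings: Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) => PHASES.indexOf(a.category) - PHASES.indexOf(b.category),
  );
}

/** Returns the earliest phase not yet marked as fully catalogued, or "done". */
function currentPhase(catalogued: ReadonlySet<Category>): Category | "done" {
  return PHASES.find((p) => !catalogued.has(p)) ?? "done";
}

/** A finding may be raised only if its category is at or before the current phase. */
function mayRaise(finding: Finding, catalogued: ReadonlySet<Category>): boolean {
  const phase = currentPhase(catalogued);
  if (phase === "done") return true;
  return PHASES.indexOf(finding.category) <= PHASES.indexOf(phase);
}
```

Under this gate, a fail-open versus fail-closed comment is held back until both the runtime and security phases are marked complete, while a runtime bug found late can still be raised at any time.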

**Expected impact:** Critical bugs found in early rounds, not round 15+.

**Related:** Jin & Chen (2025) observed that LLM reviewers systematically overcorrect, focusing on policy issues while missing critical runtime bugs. Automated reviewers need explicit priority ordering to replicate human intuition.

---

### P-005: Post-Cap Summary Reports

**Problem:** After the request-changes cap, all comments have `COMMENT` status, making severity indistinguishable.

**Proposal:** When a reviewer reaches the cap, switch from individual comments to a single summary report per round:

```
## Round N Summary
- 🔴 BLOCKING: <issue> (line X)
- 🟡 SUGGESTED: <issue> (line Y)
- ✅ RESOLVED: <previously raised issue>
```
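
A minimal sketch of rendering that summary from structured findings, so the severity label never depends on the review status. `SummaryItem` and `renderRoundSummary` are hypothetical names; only the output format follows the template above.

```typescript
// Hypothetical item model; the output format follows the summary template.
type Status = "BLOCKING" | "SUGGESTED" | "RESOLVED";

interface SummaryItem {
  status: Status;
  issue: string;
  line?: number; // omitted for resolved items without a current location
}

const ICON: Record<Status, string> = { BLOCKING: "🔴", SUGGESTED: "🟡", RESOLVED: "✅" };
const STATUS_ORDER: readonly Status[] = ["BLOCKING", "SUGGESTED", "RESOLVED"];

/** Renders one Markdown summary for a round, blocking items first. */
function renderRoundSummary(round: number, items: SummaryItem[]): string {
  const lines = [...items]
    .sort((a, b) => STATUS_ORDER.indexOf(a.status) - STATUS_ORDER.indexOf(b.status))
    .map((item) => {
      const where = item.line === undefined ? "" : ` (line ${item.line})`;
      return `- ${ICON[item.status]} ${item.status}: ${item.issue}${where}`;
    });
  return [`## Round ${round} Summary`, ...lines].join("\n");
}

const summary = renderRoundSummary(3, [
  { status: "SUGGESTED", issue: "Rename helper", line: 7 },
  { status: "RESOLVED", issue: "CSRF on /api/auth/reset-keys" },
  { status: "BLOCKING", issue: "Missing next() call", line: 42 },
]);
```

Sorting by status puts blocking items at the top of every summary, regardless of the order in which they were found.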

**Expected impact:** Severity signal preserved post-cap; author can still prioritize effectively.

**Related:** Adams et al. (2025) emphasized at Meta scale that risk-calibrated signals are essential. That calibration is exactly what is lost when all post-cap comments carry the same `COMMENT` status and severity becomes indistinguishable.

---

## 4. Summary

| Aspect | Current State | Proposed Improvement |
|--------|--------------|---------------------|
| Bug finding | ✅ Strong (8 real bugs found) | Maintain |
| Review convergence | ❌ No termination condition | P-002: Auto-terminate on 3 dry rounds |
| Consistency | ❌ Self-contradiction undetected | P-001: Invariant spec upfront |
| Fix quality | ❌ Fixes introduce new bugs | P-003: Impact scenarios required |
| Prioritization | ❌ Runtime bugs found late | P-004: Strict ordering |
| Post-cap signal | ❌ Severity lost | P-005: Summary reports |

**Core message:** Looper's bug-finding capability is genuinely valuable. These proposals aim to reduce process overhead (~50% duplicate comments, a 12-day contradiction cycle, late discovery of a critical bug) so that value is delivered faster and with a higher signal-to-noise ratio.

