Commit f7a4f3f

Merge pull request #2106 from qodo-ai/of/gpt-5-1
docs: add GPT-5.1 benchmark results to PR benchmark documentation
2 parents: 4c5d3d6 + 0bbad14

File tree: 1 file changed (+23 −55 lines)


docs/docs/pr_benchmark/index.md

Lines changed: 23 additions & 55 deletions
```diff
@@ -82,6 +82,12 @@ A list of the models used for generating the baseline suggestions, and example r
 <td style="text-align:left;">4096</td>
 <td style="text-align:center;"><b>48.8</b></td>
 </tr>
+<tr>
+<td style="text-align:left;">GPT-5.1</td>
+<td style="text-align:left;">2025-11-13</td>
+<td style="text-align:left;">medium</td>
+<td style="text-align:center;"><b>44.9</b></td>
+</tr>
 <tr>
 <td style="text-align:left;">Gemini-2.5-pro</td>
 <td style="text-align:left;">2025-06-05</td>
@@ -157,7 +163,7 @@ A list of the models used for generating the baseline suggestions, and example r
 </tbody>
 </table>
 
-## Results Analysis
+## Results Analysis (Latest Additions)
 
 ### GPT-5-pro
 
@@ -247,6 +253,22 @@ Weaknesses:
 - **Inconsistent output robustness:** Several cases show truncated or malformed responses, reducing value despite correct analysis elsewhere.
 - **Frequent false negatives:** The model sometimes returns an empty list even when clear regressions exist, indicating conservative behaviour that misses mandatory fixes.
 
+### GPT-5.1 ('medium' thinking budget)
+
+Final score: **44.9**
+
+Strengths:
+
+- **High precision & guideline compliance:** When the model does emit suggestions they are almost always technically sound, respect the "new-lines-only / ≤3 suggestions / no-imports" rules, and are formatted correctly. It rarely introduces harmful changes and often provides clear, runnable patches.
+- **Ability to spot subtle or unique defects:** In several cases the model caught a critical issue that most or all baselines missed, showing good deep-code reasoning when it does engage.
+- **Good judgment on noise-free diffs:** On purely data or documentation changes the model frequently (and correctly) returns an empty list, avoiding false-positive "nit" feedback.
+
+Weaknesses:
+
+- **Very low recall / over-conservatism:** In a large fraction of examples (well over 50% of cases) it outputs an empty suggestion list while clear critical bugs exist, making it inferior to almost every baseline answer that offered any fix.
+- **Narrow coverage when it speaks:** Even when it flags one bug, it often stops there and ignores other equally critical problems present in the same diff, leaving reviewers with partial insight.
+- **Occasional misdiagnosis or harmful fix:** A minority of suggestions are wrong or counter-productive, showing that precision, while good, is not perfect.
+
 ### Claude-sonnet-4.5 (4096 thinking tokens)
 
 Final score: **44.2**
@@ -298,43 +320,6 @@ Weaknesses:
 - **Guideline slips:** In several examples it edits unchanged lines, adds forbidden imports/version bumps, mis-labels severities, or supplies non-critical stylistic advice.
 - **Inconsistent diligence:** Roughly a quarter of the cases return an empty list despite real problems, while others duplicate existing PR changes, indicating weak diff comprehension.
 
-### Claude-4 Sonnet (4096 thinking tokens)
-
-Final score: **39.7**
-
-Strengths:
-
-- **High guideline & format compliance:** Almost always returns valid YAML, keeps ≤ 3 suggestions, avoids forbidden import/boiler-plate changes and provides clear before/after snippets.
-- **Good pinpoint accuracy on single issues:** Frequently spots at least one real critical bug and proposes a concise, technically correct fix that compiles/runs.
-- **Clarity & brevity of patches:** Explanations are short, actionable, and focused on changed lines, making the advice easy for reviewers to apply.
-
-Weaknesses:
-
-- **Low coverage / recall:** Regularly surfaces only one minor issue (or none) while missing other, often more severe, problems caught by peer models.
-- **High "empty-list" rate:** In many diffs the model returns no suggestions even when clear critical bugs exist, offering zero reviewer value.
-- **Occasional incorrect or harmful fixes:** A non-trivial number of suggestions are speculative, contradict code intent, or would break compilation/runtime; sometimes duplicates or contradicts itself.
-- **Inconsistent severity labelling & duplication:** Repeats the same point in multiple slots, marks cosmetic edits as "critical", or leaves `improved_code` identical to original.
-
-
-### Claude-4 Sonnet
-
-Final score: **39.0**
-
-Strengths:
-
-- **Consistently well-formatted & rule-compliant output:** Almost every answer follows the required YAML schema, keeps within the 3-suggestion limit, and returns an empty list when no issues are found, showing good instruction following.
-
-- **Actionable, code-level patches:** When it does spot a defect the model usually supplies clear, minimal diffs or replacement snippets that compile / run, making the fix easy to apply.
-
-- **Decent hit-rate on “obvious” bugs:** The model reliably catches the most blatant syntax errors, null-checks, enum / cast problems, and other first-order issues, so it often ties or slightly beats weaker baseline replies.
-
-Weaknesses:
-
-- **Shallow coverage:** It frequently stops after one easy bug and overlooks additional, equally-critical problems that stronger reviewers find, leaving significant risks unaddressed.
-
-- **False positives & harmful fixes:** In a noticeable minority of cases it misdiagnoses code, suggests changes that break compilation or behaviour, or flags non-issues, sometimes making its output worse than doing nothing.
-
-- **Drifts into non-critical or out-of-scope advice:** The model regularly proposes style tweaks, documentation edits, or changes to unchanged lines, violating the "critical new-code only" requirement.
 
 ### OpenAI codex-mini
 
@@ -403,23 +388,6 @@ Weaknesses:
 - **Limited breadth:** Even when it finds a real defect it rarely reports additional related problems that peers catch, leading to partial reviews.
 - **Occasional guideline slips:** A few replies modify unchanged lines, suggest new imports, or duplicate suggestions, showing imperfect compliance with instructions.
 
-### GPT-4.1
-
-Final score: **26.5**
-
-Strengths:
-
-- **Consistent format & guideline obedience:** Output is almost always valid YAML, within the 3-suggestion limit, and rarely touches lines not prefixed with "+".
-- **Low false-positive rate:** When no real defect exists, the model correctly returns an empty list instead of inventing speculative fixes, avoiding the "noise" many baseline answers add.
-- **Clear, concise patches when it does act:** In the minority of cases where it detects a bug, the fix is usually correct, minimal, and easy to apply.
-
-Weaknesses:
-
-- **Very low recall / coverage:** In a large majority of examples it outputs an empty list or only 1 trivial suggestion while obvious critical issues remain unfixed; it systematically misses circular bugs, null-checks, schema errors, etc.
-- **Shallow analysis:** Even when it finds one problem it seldom looks deeper, so more severe or additional bugs in the same diff are left unaddressed.
-- **Occasional technical inaccuracies:** A noticeable subset of suggestions are wrong (mis-ordered assertions, harmful Bash `set` change, false dangling-reference claims) or carry metadata errors (mis-labeling files as "python").
-- **Repetitive / derivative fixes:** Many outputs duplicate earlier simplistic ideas (e.g., single null-check) without new insight, showing limited reasoning breadth.
-
 ## Appendix - Example Results
 
 Some examples of benchmarked PRs and their results:
```
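
The model analyses above repeatedly score compliance with the benchmark's output contract: valid YAML, at most three suggestions, an `improved_code` patch per suggestion, and an empty list when no critical issue exists. As a purely illustrative sketch of that contract, a compliant response might look like the fragment below; every field name except `improved_code` (which the analyses mention explicitly) is an assumption, not taken from this page:

```yaml
# Hypothetical sketch of a benchmark-compliant reviewer response.
# Field names other than `improved_code` are assumed for illustration.
code_suggestions:
  - relevant_file: "src/parser.py"        # assumed field name
    suggestion_content: "Guard against an empty lookup result before indexing."
    existing_code: |
      value = lookup(key)[0]
    improved_code: |
      result = lookup(key)
      value = result[0] if result else None
    label: "critical bug"                 # assumed severity label

# When the diff contains no critical issue, returning an empty list is the
# correct behaviour (and is what the "empty-list" remarks above refer to):
# code_suggestions: []
```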
