Commit 3ce4780

Merge pull request #2111 from qodo-ai/of/doc-Gemini-3-pro-review-2025-11-18-ranking
docs: add Gemini-3-pro-review benchmark results
2 parents e661147 + edd9ef9 commit 3ce4780

1 file changed, +46 -0 lines changed

docs/docs/pr_benchmark/index.md

Lines changed: 46 additions & 0 deletions
@@ -70,12 +70,24 @@ A list of the models used for generating the baseline suggestions, and example r
 <td style="text-align:left;">'medium' (<a href="https://ai.google.dev/gemini-api/docs/openai">8000</a>)</td>
 <td style="text-align:center;"><b>57.7</b></td>
 </tr>
+<tr>
+<td style="text-align:left;">Gemini-3-pro-review</td>
+<td style="text-align:left;">2025-11-18</td>
+<td style="text-align:left;">high</td>
+<td style="text-align:center;"><b>57.3</b></td>
+</tr>
 <tr>
 <td style="text-align:left;">Gemini-2.5-pro</td>
 <td style="text-align:left;">2025-06-05</td>
 <td style="text-align:left;">4096</td>
 <td style="text-align:center;"><b>56.3</b></td>
 </tr>
+<tr>
+<td style="text-align:left;">Gemini-3-pro-review</td>
+<td style="text-align:left;">2025-11-18</td>
+<td style="text-align:left;">low</td>
+<td style="text-align:center;"><b>55.6</b></td>
+</tr>
 <tr>
 <td style="text-align:left;">Claude-haiku-4.5</td>
 <td style="text-align:left;">2025-10-01</td>
@@ -218,6 +230,23 @@ Weaknesses:
 - **False or harmful fixes:** Several answers introduce new compilation errors, propose out-of-scope changes, or violate explicit rules (e.g., adding imports, version bumps, touching untouched lines), reducing trustworthiness.
 - **Shallow coverage:** Even when it identifies one real issue it often stops there, missing additional critical problems found by stronger peers; breadth and depth are inconsistent.
 
+### Gemini-3-pro-review (high thinking budget)
+
+Final score: **57.3**
+
+Strengths:
+
+- **Good schema & format discipline:** Consistently returns well-formed YAML with correct fields and respects the 3-suggestion limit; rarely breaks the required output structure.
+- **Reasonable guideline awareness:** Often recognises when a diff contains only data / translations and properly emits an empty list, avoiding over-reporting.
+- **Clear, actionable patches when correct:** When it does find a bug it usually supplies minimal-diff, compilable code snippets with concise explanations, and occasionally surfaces issues no other model spotted.
+
+Weaknesses:
+
+- **Spot-coverage gaps on critical defects:** In a large share of cases it overlooks the principal regression the tests were written for, while fixating on minor style or performance nits.
+- **False or speculative fixes:** A noticeable number of answers invent non-existent problems or propose changes that would not compile or would re-introduce removed behaviour.
+- **Guideline violations creep in:** Sometimes touches unchanged lines, adds forbidden imports / labels, or supplies more than "critical" advice, showing imperfect rule adherence.
+- **High variance / inconsistency:** Quality swings from best-in-class to harmful within consecutive examples, indicating unstable defect-prioritisation and review depth.
+
 ### Gemini-2.5 Pro (4096 thinking tokens)
 
 Final score: **56.3**
@@ -236,6 +265,23 @@ Weaknesses:
 - **False positives / speculative fixes:** In several cases it flags non-issues (style, performance, redundant code) or supplies debatable “improvements”, lowering precision and sometimes breaching the “critical bugs only” rule.
 - **Inconsistent error coverage:** For certain domains (build scripts, schema files, test code) it either returns an empty list when real regressions exist or proposes cosmetic edits, indicating gaps in specialised knowledge.
 
+### Gemini-3-pro-review (low thinking budget)
+
+Final score: **55.6**
+
+Strengths:
+
+- **Concise, well-structured patches:** Suggestions are usually expressed in short, self-contained YAML items with clear before/after code blocks and just enough rationale, making them easy for reviewers to apply.
+- **Good eye for crash-level defects:** When the model does spot a problem it often focuses on high-impact issues such as compile-time errors, NPEs, nil-pointer races, buffer overflows, etc., and supplies a minimal, correct fix.
+- **High guideline compliance (format & scope):** In most cases it respects the 1-3-item limit and the "new lines only" rule, avoids changing imports, and keeps snippets syntactically valid.
+
+Weaknesses:
+
+- **Coverage inconsistency:** Many answers miss other obvious or even more critical regressions spotted by peers; breadth fluctuates from excellent to empty, leaving reviewers with partial insight.
+- **False positives & speculative advice:** A noticeable share of suggestions target stylistic or non-critical tweaks, or even introduce wrong changes, betraying occasional mis-reading of the diff and hurting trust.
+- **Rule violations still occur:** There are repeated instances of touching unchanged code, recommending version bumps/imports, mis-labelling severities, or outputting malformed snippets—showing lapses in instruction adherence.
+- **Quality variance / empty outputs:** Some responses provide no suggestions despite real bugs, while others supply harmful fixes; this volatility lowers overall reliability.
+
 ### Claude-haiku-4.5 (4096 thinking tokens)
 
 Final score: **48.8**
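
The two new table rows appear to be placed so that the results table stays sorted by score, highest first. A minimal Python sketch of that check follows, using only the scores visible in this diff. The `rows` list and the `is_sorted_descending` helper are illustrative names, not part of the repository, and the first row's model name falls outside the diff context, so it is left unnamed.

```python
# Scores copied from the table rows visible in the first hunk, top to bottom.
# The labels are informal; the first row's model name is not shown in the diff.
rows = [
    ("(preceding row, 'medium' budget)", 57.7),
    ("Gemini-3-pro-review (high)", 57.3),
    ("Gemini-2.5-pro (4096)", 56.3),
    ("Gemini-3-pro-review (low)", 55.6),
]


def is_sorted_descending(rows):
    """Return True if every score is >= the score of the row below it."""
    scores = [score for _, score in rows]
    return all(a >= b for a, b in zip(scores, scores[1:]))


print(is_sorted_descending(rows))  # prints True
```

The same helper could be run over the full table if the rows were parsed from `docs/docs/pr_benchmark/index.md`, though the parsing step is not sketched here.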
