Add benchmark for Gemini 2.5 Pro Preview as architect model and Claude Sonnet 3.7 (32K thinking token) as editor model #3800

tuannh99 · 2025-04-14T06:31:16Z

I've been using Gemini 2.5 Pro Preview as my Architect (Plan) model paired with Claude 3.7 Sonnet (with 32K thinking tokens) as the Editor (Act) model for a while now, and this combination has been working exceptionally well for me. I was curious how this setup would compare with other configurations on the Aider leaderboard, so I decided to run the benchmarks locally.

Result: 75.1% pass rate in round 2 with 100% well-formed responses. This validates my personal experience with this combination.

I burned a lot of tokens and cash too (it took several iterations until I figured out the right config for thinking_tokens). That's precisely why I'm so grateful for this leaderboard. I check it regularly, and it saves the community from having to experiment individually with expensive configurations.

Appreciate y'all. Keep shipping.

CLAassistant · 2025-04-15T01:06:59Z

All committers have signed the CLA.

tuannh99 · 2025-04-17T02:50:35Z

Hi @paul-gauthier 👋, wondering if you've had a chance to look at this one.
I know you've been busy with everything, especially with providers shipping release after release.

Appreciate you.

will-rest · 2025-04-17T12:40:35Z

Hey @tuannh99, thanks for opening this PR. Quick question: have you also tried Sonnet 3.7 (thinking) as the main (architect) model with Gemini as the code editor? Sonnet 3.7's reasoning capabilities in thinking mode seem well suited to the architect role.

Also, your combo would place 2nd with 75% (price $30). My question is: would users pay $30 for a 3 percentage point improvement over Gemini 2.5 Pro (72% at $6) in diff-fenced mode? I'm also a big fan of the aider LLM leaderboard and love exploring different combos.
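To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the rounded figures quoted above (75% at $30 vs 72% at $6); the function names are made up for illustration and are not part of aider or its benchmark tooling.

```python
# Rough cost/benefit comparison using the rounded numbers quoted in this
# thread. The helper names are hypothetical, not part of aider.

def cost_per_point(total_cost_usd: float, pass_rate_pct: float) -> float:
    """Average dollars spent per percentage point of pass rate."""
    return total_cost_usd / pass_rate_pct


def marginal_cost_per_point(
    cost_a: float, rate_a: float, cost_b: float, rate_b: float
) -> float:
    """Extra dollars per extra percentage point when moving from config B to config A."""
    return (cost_a - cost_b) / (rate_a - rate_b)


# Architect/editor combo from this PR (rounded): 75% for about $30.
combo_cost, combo_rate = 30.0, 75.0
# Gemini 2.5 Pro alone in diff-fenced mode (rounded): 72% for about $6.
solo_cost, solo_rate = 6.0, 72.0

print(f"combo: ${cost_per_point(combo_cost, combo_rate):.2f} per point")
print(f"solo:  ${cost_per_point(solo_cost, solo_rate):.2f} per point")
print(
    "marginal: $"
    f"{marginal_cost_per_point(combo_cost, combo_rate, solo_cost, solo_rate):.2f}"
    " per extra point"
)
```

With these rounded inputs the extra 3 points cost about $8 each for a full benchmark run, so whether it is worth it probably depends on how much those last few points matter for a given workload.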

This article shows that o1 as the architect with DeepSeek as the code editor has a pass rate of 85%! I wonder what the pass rate would be for a model like o3 (high) as the architect and Gemini 2.5 Pro as the code editing model. It would be interesting to see the results.

tuannh99 added 2 commits April 14, 2025 13:16

Add benchmark for Gemini 2.5 Pro Preview as architect model and Claud…

bd1be94

…e Sonnet 3.7 (32K thinking token) as editor model

Merge branch 'main' into gemini-2.5-pro-preview-architect-sonnet-3.7-…

90cead0

…editor-32k-thinking

Merge branch 'main' into gemini-2.5-pro-preview-architect-sonnet-3.7-…

1482154

…editor-32k-thinking
