Add benchmark for Gemini 2.5 Pro Preview as architect model and Claude Sonnet 3.7 (32K thinking token) as editor model #3800

I've been using Gemini 2.5 Pro Preview as my Architect (Plan) model paired with Claude 3.7 Sonnet (with 32K thinking tokens) as my Editor (Act) model for a while now, and this combination has been working exceptionally well for me. I was curious how this setup would compare to other configurations on the Aider leaderboard, so I decided to run the benchmarks locally.
Result: 75.1% pass rate in round 2 with 100% well-formed responses. This validates my personal experience with this combination.
It also burned a lot of tokens and cash, mostly from iterating until I figured out the right config for thinking_tokens. That's precisely why I'm so grateful for this leaderboard. I check it often, and it saves the community from having to individually experiment with expensive configurations.
Appreciate y'all. Keep shipping.