This benchmark evaluates six leading coding LLMs on real-world tasks (frontend development, game design, and similar) using the 胜算云「AI群聊」 ("AI Group Chat") platform. Tested models:
Qwen3-Coder-Plus, Kimi K2, GLM-4.5, Claude Sonnet 4, Gemini 2.5 Pro, and OpenAI o4-mini-high.
- High Completion Rates: All models delivered functional code for the basic tasks, the sole exception being GLM-4.5 on the Snake game's logic.
- Design Superiority: The Chinese models (Qwen, GLM, Kimi) outperformed the others on UI/UX tasks (e.g., weather cards, restaurant homepages), with better visuals and dynamic interactions such as countdowns and quote generators (a countdown sketch follows this list).
- Basic Playability Achieved: All models implemented core mechanics for Snake and Gomoku (Five-in-a-Row).
- Weak AI Strategy: Computer-opponent logic (e.g., for Gomoku) was often simplistic (see the sketch after this list for what that typically looks like).
- Interaction Highlights:
  - Kimi/Claude: Added pause functionality.
  - Qwen: Offered undo moves for better usability (a sketch of both pause and undo follows this list).
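The "dynamic interactions" noted above usually come down to a small amount of client-side state. As a rough illustration, here is a minimal TypeScript countdown of the kind a restaurant homepage might show; the element id, target date, and function name are illustrative assumptions, not taken from any model's actual output.

```typescript
// Minimal countdown widget: renders the remaining time into an element once per second.
// The "countdown" element id and the target date below are illustrative placeholders.
function startCountdown(targetMs: number, el: HTMLElement): number {
  const render = () => {
    const remaining = Math.max(0, targetMs - Date.now());
    const totalSeconds = Math.floor(remaining / 1000);
    const h = Math.floor(totalSeconds / 3600);
    const m = Math.floor((totalSeconds % 3600) / 60);
    const s = totalSeconds % 60;
    el.textContent = `${h}h ${m}m ${s}s`;
    if (remaining === 0) window.clearInterval(timer); // stop ticking once we hit zero
  };
  const timer = window.setInterval(render, 1000);
  render(); // draw immediately instead of waiting one second
  return timer;
}

// Usage: count down to an assumed opening time.
const el = document.getElementById("countdown");
if (el) startCountdown(Date.parse("2025-01-01T18:00:00"), el);
```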
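To make "simplistic" concrete: a weak Gomoku opponent often amounts to playing any legal cell near the human's last move, with no look-ahead and no threat blocking. The sketch below shows that pattern; the board encoding and function name are assumptions for illustration, not any model's actual code.

```typescript
type Cell = 0 | 1 | 2;            // 0 = empty, 1 = human, 2 = AI
type Board = Cell[][];            // e.g. a 15 x 15 Gomoku board

// Naive opponent: take any empty cell adjacent to the human's last move,
// otherwise fall back to the first empty cell found. No look-ahead, no
// blocking, no scoring of open threes/fours -- which is why it is easy to beat.
function naiveMove(board: Board, lastHuman: [number, number]): [number, number] {
  const [r, c] = lastHuman;
  for (let dr = -1; dr <= 1; dr++) {
    for (let dc = -1; dc <= 1; dc++) {
      const nr = r + dr, nc = c + dc;
      if (board[nr]?.[nc] === 0) return [nr, nc]; // optional chaining skips off-board cells
    }
  }
  for (let i = 0; i < board.length; i++) {
    for (let j = 0; j < board[i].length; j++) {
      if (board[i][j] === 0) return [i, j];
    }
  }
  throw new Error("board is full");
}
```

A more competitive opponent would at least score candidate cells for the lines they extend or block, or run a shallow search.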
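The pause and undo features are likewise small pieces of state. Below is a minimal sketch assuming a requestAnimationFrame game loop and a snapshot stack; the names (`loop`, `pushSnapshot`, `undo`) are illustrative, not the models' actual output.

```typescript
// Pause: gate the update step on a flag toggled by the space bar.
let paused = false;
document.addEventListener("keydown", (e) => {
  if (e.code === "Space") paused = !paused;
});

// Game loop: skip game logic while paused, but keep drawing so the board stays visible.
function loop(update: () => void, draw: () => void): void {
  if (!paused) update();
  draw();
  requestAnimationFrame(() => loop(update, draw));
}

// Undo: snapshot the grid before each move and pop a snapshot to go back one move.
type Grid = number[][];           // e.g. a 15 x 15 Gomoku board, 0 = empty
const history: Grid[] = [];
function pushSnapshot(grid: Grid): void {
  history.push(grid.map((row) => [...row])); // copy each row so later moves don't mutate it
}
function undo(): Grid | undefined {
  return history.pop();           // previous board state, if any
}
```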
| Use Case | Top Picks |
|---|---|
| All-Rounder | Gemini 2.5 Pro, Qwen3-Coder-Plus |
| UI/UX Focus | Qwen3-Coder-Plus, GLM-4.5 |
| Game Dev | Claude Sonnet 4, Kimi K2 |
- Task Diversity: Covers frontend design, game logic, and interactive elements.
- Actionable Insights: Direct model recommendations for different needs.
- Transparent Platform: Tests run via 胜算云「AI群聊」.