# Programming LLM Benchmark: Code Generation & Game Development

This benchmark evaluates six leading programming LLMs on real-world tasks (frontend development, game design, and more) via the 胜算云「AI群聊」 (Shengsuan Cloud "AI Group Chat") platform. Models tested: Qwen3-Coder-Plus, Kimi K2, GLM-4.5, Claude Sonnet 4, Gemini 2.5 Pro, and OpenAI o4-mini-high.

## Key Findings

### 🚀 Code Generation: Domestic Models Shine

- **High completion rates:** All models delivered functional code for basic tasks, with the sole exception of GLM-4.5's Snake game logic.
- **Design superiority:** The Chinese models (Qwen, GLM, Kimi) outperformed on UI/UX tasks such as weather cards and restaurant homepages, producing better visuals and richer dynamic interactions (e.g., countdowns, quote generators).

### 🎮 Game Development: Functional but Limited Strategy

- **Basic playability achieved:** All models implemented the core mechanics of Snake and Gomoku (Five-in-a-Row).
- **Weak AI strategy:** Computer-opponent logic (e.g., in Gomoku) was often simplistic.
- **Interaction highlights:**
  - Kimi and Claude added pause functionality.
  - Qwen offered undo moves for better usability.
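To illustrate what "simplistic" opponent logic typically looks like, here is a minimal greedy Gomoku heuristic (our own sketch for illustration, not code produced by any of the tested models): it picks the empty cell that maximizes its own longest line and never considers blocking the opponent's threats.

```python
# Illustrative sketch only: a greedy Gomoku "AI" of the kind the benchmark
# calls simplistic. Board: 15x15 grid, 0 = empty, 1 = AI, 2 = human.
SIZE = 15
DIRECTIONS = [(1, 0), (0, 1), (1, 1), (1, -1)]  # vertical, horizontal, two diagonals

def line_length(board, row, col, player):
    """Longest line through (row, col) if `player` moved there."""
    best = 1
    for dr, dc in DIRECTIONS:
        count = 1
        for sign in (1, -1):  # walk both ways along the direction
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < SIZE and 0 <= c < SIZE and board[r][c] == player:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        best = max(best, count)
    return best

def greedy_move(board, player=1):
    """Pick the empty cell that maximizes the mover's own longest line."""
    empties = [(r, c) for r in range(SIZE) for c in range(SIZE) if board[r][c] == 0]
    return max(empties, key=lambda rc: line_length(board, rc[0], rc[1], player))

board = [[0] * SIZE for _ in range(SIZE)]
board[7][7] = board[7][8] = 1   # AI already has two in a row
board[8][7] = board[8][8] = 2   # human also has two in a row
print(greedy_move(board))       # extends its own line, ignoring the human's threat
```

A stronger opponent would at minimum also score each cell for the *opponent's* resulting line and block the larger threat, which is exactly the kind of defensive logic the benchmark found missing.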

## Model Recommendations

| Use Case | Top Picks |
| --- | --- |
| All-rounder | Gemini 2.5 Pro, Qwen3-Coder-Plus |
| UI/UX focus | Qwen3-Coder-Plus, GLM-4.5 |
| Game dev | Claude Sonnet 4, Kimi K2 |

## Key Features

- **Task diversity:** Covers frontend design, game logic, and interactive elements.
- **Actionable insights:** Direct model recommendations for different needs.
- **Transparent platform:** All tests were run via 胜算云「AI群聊」.

## About

This repository presents a comprehensive evaluation of mainstream LLMs for programming tasks.
