- Table of Contents
- Overview
- New features in Beta v0.7
- Project Structure
- Installation
- Model Data Setup
- Usage
- Results Data Format
- Testing Strategy & Quality Assurance
- License
This project provides tools for scoring and comparing large language models. Originally developed within the AIenhancedWork repository to evaluate models in the LLMs section, it has now been migrated to this dedicated project for improved organization, scalability, and focus.

Models are scored on the following criteria:
- Entity benchmarks (30 points max)
- Dev benchmarks (30 points max)
- Community score (20 points max)
- Technical specifications (20 points max)
The final score is calculated out of 100 points. For a detailed breakdown of the scoring framework, please refer to the scoring_framework_development_notes.md file.
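In other words, the four category scores are added together, each capped at its maximum. The sketch below is illustrative only (the actual calculation lives in `model_scoring/scoring/models_scoring.py` and may differ); it assumes the per-category scores have already been computed:

```python
# Illustrative only: the real weighting lives in model_scoring/scoring/.
CATEGORY_MAX = {
    "entity_score": 30.0,     # Entity benchmarks
    "dev_score": 30.0,        # Dev benchmarks
    "community_score": 20.0,  # Community score
    "technical_score": 20.0,  # Technical specifications
}

def final_score(scores: dict[str, float]) -> float:
    """Sum the four category scores, clamping each to its maximum."""
    total = sum(min(scores[name], cap) for name, cap in CATEGORY_MAX.items())
    return round(total, 2)

print(final_score({"entity_score": 18.84, "dev_score": 23.06,
                   "community_score": 16.76, "technical_score": 16.96}))
```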
Please note that this is a beta version and the scoring system is subject to change.
To help us refine and improve LLMScoreEngine during this beta phase, we actively encourage user feedback, bug reports, and contributions. Please feel free to open an issue or contribute to the project, and make sure to respect the Code of Conduct.
- Automated Benchmark Pipeline: A powerful new tool (`fill-benchmark-pipeline`) to automatically populate model benchmarks from external APIs.
- Improved Reporting: Updated HTML graph generation with enhanced visualization capabilities, better data handling, and improved interactivity.
- Scoring Engine Update: Enhanced technical score calculation to include model input/output pricing.
- Added Command-Line Option: Enhanced flexibility with a new CLI argument for custom configurations, allowing more granular control.
- Quality Assurance: Introduced a comprehensive testing strategy and CI pipeline to enforce performance budgets and prevent regressions.
- Optimized Architecture: Streamlined dependencies and updated documentation for a better developer experience.
- Interactive Shell Overhaul (Preview): A completely redesigned `llmscore shell` is in active development, featuring a multi-pane layout and smart dock.
```
LLMScoreEngine/
├── config/                       # Configuration files
│   └── scoring_config.py         # Scoring parameters and thresholds
├── model_scoring/                # Main package
│   ├── core/                     # Core functionality (exceptions, types, constants)
│   ├── data/                     # Data handling (loaders, validators)
│   │   ├── loaders.py
│   │   └── validators.py
│   ├── scoring/                  # Scoring logic
│   │   ├── hf_score.py
│   │   └── models_scoring.py
│   ├── utils/                    # Utility functions
│   │   ├── config_loader.py
│   │   ├── csv_reporter.py
│   │   ├── graph_reporter.py
│   │   └── logging.py
│   ├── __init__.py
│   └── run_scoring.py            # Script for running scoring programmatically
├── tools/                        # Additional tools and utilities
│   └── fill-benchmark-pipeline/  # Automated pipeline to fill model benchmark JSONs
│       ├── llm_benchmark_pipeline.py
│       ├── README.md
│       ├── config_example.yaml
│       └── requirements.txt
├── Models/                       # Model data directory (Create this manually)
├── Results/                      # Results directory (Created automatically)
├── tests/                        # Unit and integration tests
│   ├── config/
│   │   └── test_scoring_config.py
│   ├── data/
│   │   └── test_validators.py
│   ├── scoring/
│   │   ├── test_hf_score.py
│   │   └── test_models_scoring.py
│   ├── utils/
│   │   ├── test_config_loader.py
│   │   ├── test_csv_reporter.py
│   │   └── test_graph_reporter.py
│   ├── __init__.py
│   └── test_run_scoring.py
├── LICENSE                       # Project license file
├── README.md                     # This file
├── pyproject.toml                # Project configuration (for build system, linters, etc.)
├── requirements.txt              # Project dependencies
└── score_models.py               # Main command-line scoring script
```
Prerequisites:
- Python >=3.11 installed
- uv installed (recommended for dependency management)
Step 1: Clone the repository:

```shell
git clone https://github.com/LSeu-Open/LLMScoreEngine.git
```

Step 2 (recommended): Run the automated setup script
- Unix/macOS:

```shell
cd LLMScoreEngine
chmod +x setup.sh  # first time only
./setup.sh
```

- Windows (PowerShell or Command Prompt):

```shell
cd LLMScoreEngine
setup_windows.bat
```
These scripts will verify Python/uv, create .venv, install requirements.txt, and create the Models/ and filled_models/ folders. After they finish, activate the environment with:

- Unix/macOS:

```shell
source .venv/bin/activate
```

- Windows:

```shell
call .venv\Scripts\activate.bat
```
Manual setup (alternative):
- Create and activate a virtual environment using uv:

```shell
uv venv
# On Windows: .venv\Scripts\activate
# On Unix or macOS: source .venv/bin/activate
```

- Install dependencies:

  - Standard usage:

    ```shell
    uv pip install -e .
    ```

  - Development/testing:

    ```shell
    uv pip install -e ".[dev]"
    ```

  - Or with pip:

    ```shell
    pip install -r requirements.txt
    pip install -e ".[dev]"
    ```
Step 1: Ensure the Models directory exists.
- If you used `setup.sh` or `setup_windows.bat`, this directory is already created for you (along with `filled_models`).
- Otherwise, create it manually:

```shell
mkdir Models
```
Step 2: Add Model Data:
- Inside the `Models` directory, create a JSON file for each model you want to score (e.g., `Deepseek-R1.json`).
- The filename (without the `.json` extension) should precisely match the model identifier you plan to use. Avoid any blank spaces in the model name if you want to score it from the command line.
- Populate each JSON file according to the Models Data Format.
Models data should be stored as JSON files in the Models directory, with the following structure:
```json
{
  "entity_benchmarks": {
    "artificial_analysis": null,
    "OpenCompass": null,
    "LLM Explorer": null,
    "Livebench": null,
    "open_llm": null,
    "UGI Leaderboard": null,
    "big_code_bench": null,
    "EvalPlus Leaderboard": null,
    "Dubesord_LLM": null,
    "Open VLM": null
  },
  "dev_benchmarks": {
    "MMLU": null,
    "MMLU Pro": null,
    "BigBenchHard": null,
    "GPQA diamond": null,
    "DROP": null,
    "HellaSwag": null,
    "Humanity's Last Exam": null,
    "ARC-C": null,
    "Wild Bench": null,
    "MT-bench": null,
    "IFEval": null,
    "Arena-Hard": null,
    "MATH": null,
    "GSM-8K": null,
    "AIME": null,
    "HumanEval": null,
    "MBPP": null,
    "LiveCodeBench": null,
    "Aider Polyglot": null,
    "SWE-Bench": null,
    "SciCode": null,
    "MGSM": null,
    "MMMLU": null,
    "C-Eval or CMMLU": null,
    "AraMMLu": null,
    "LongBench": null,
    "RULER 128K": null,
    "RULER 32K": null,
    "MTOB": null,
    "BFCL": null,
    "AgentBench": null,
    "Gorilla Benchmark": null,
    "ToolBench": null,
    "MINT": null,
    "MMMU": null,
    "Mathvista": null,
    "ChartQA": null,
    "DocVQA": null,
    "AI2D": null
  },
  "community_score": {
    "lm_sys_arena_score": null,
    "hf_score": null
  },
  "model_specs": {
    "input_price": null,
    "output_price": null,
    "context_window": null,
    "param_count": null,
    "architecture": null
  }
}
```

Fill the null values with the actual data. While you don't need to fill all values, the following fields are mandatory:
- `model_specs` (all subfields: input_price, output_price, context_window, param_count, architecture)
- `community_score` (at least one subfield: lm_sys_arena_score, hf_score)
- At least one benchmark score in `entity_benchmarks`
- At least one benchmark score in `dev_benchmarks`
All other fields are optional and can remain null if data is not available.
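As an illustrative sketch of these rules (a hypothetical helper, not the package's actual `validators.py`), a minimal check over a loaded model JSON might look like:

```python
def check_mandatory_fields(model: dict) -> list[str]:
    """Return a list of problems; an empty list means the model file is scoreable."""
    problems = []
    # model_specs: every subfield must be filled in.
    specs = model.get("model_specs", {})
    for field in ("input_price", "output_price", "context_window",
                  "param_count", "architecture"):
        if specs.get(field) is None:
            problems.append(f"model_specs.{field} is missing")
    # community_score: at least one of the two subfields.
    community = model.get("community_score", {})
    if all(community.get(k) is None for k in ("lm_sys_arena_score", "hf_score")):
        problems.append("community_score needs lm_sys_arena_score or hf_score")
    # At least one benchmark score in each benchmark section.
    for section in ("entity_benchmarks", "dev_benchmarks"):
        if not any(v is not None for v in model.get(section, {}).values()):
            problems.append(f"{section} needs at least one score")
    return problems
```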
The recommended workflow is to run the Fill Benchmark Pipeline (tools/fill-benchmark-pipeline/llm_benchmark_pipeline.py launch). It queries the APIs listed below, hydrates entity_benchmarks, dev_benchmarks, community_score, and model_specs, and flags any remaining gaps for manual follow-up. Use the manual sources underneath only when the pipeline cannot retrieve a particular metric or when you are working completely offline.
entity_benchmarks:
- Artificial Analysis
- OpenCompass
- LLM Explorer
- Livebench
- Open LLM
- UGI Leaderboard
- Big Code Bench
- EvalPlus Leaderboard
- Dubesord_LLM
- Open VLM
- `dev_benchmarks`: The pipeline pulls from provider metadata and Hugging Face, but you can also read the model's provider or Hugging Face page directly for any scores still missing.
- `community_score`: LMSYS ELO is sourced from the LM-SYS Arena Leaderboard. The Fill Benchmark Pipeline automatically fetches the Hugging Face community score plus telemetry whenever `hf_id` is provided, so manual runs of `hf_score.py` are only necessary for offline workflows or custom experiments.
If you need to call the script manually, it lives in `model_scoring/scoring` and requires the `huggingface_hub` dependency:

```shell
pip install huggingface_hub
python model_scoring/scoring/hf_score.py deepseek-ai/DeepSeek-R1
```

- `model_specs`: The pipeline will attempt to infer pricing, context, parameters, and architecture via provider APIs; otherwise, collect the details from the model provider or Hugging Face pages (Artificial Analysis is another good source).
You can run the scoring script from your terminal.
Score specific models:

Provide the names of the models (without the .json extension) as arguments:

```shell
python score_models.py ModelName1 ModelName2
```

Score all models:

Use the --all flag to score all models present in the Models directory.

```shell
python score_models.py --all
```

To read models from a different folder, pass --models-dir alongside --all (or any explicit model list):

```shell
python score_models.py --all --models-dir CustomModels/
```

The tools/fill-benchmark-pipeline/ directory contains a powerful new automated pipeline for filling model benchmark JSON files with data from multiple API sources. This is the recommended tool for preparing model data.
Features:
- Interactive CLI with guided prompts and automatic model detection
- Multi-API Integration (Artificial Analysis, Hugging Face)
- Input Validation using Pydantic models
- Rate Limiting & Retry Logic with exponential backoff
- Rich Progress Reporting and coverage statistics
Installation:
```shell
pip install -e .[fill-benchmark-pipeline]
```

Usage:

```shell
# Interactive mode (recommended)
python tools/fill-benchmark-pipeline/llm_benchmark_pipeline.py launch

# Process with config file
python tools/fill-benchmark-pipeline/llm_benchmark_pipeline.py --config config.yaml
```

For detailed usage instructions, see the pipeline README.
Discover everything you need to evaluate performance and efficiency at a glance:
- Interactive Leaderboard: Rank all your models with smart filters for quick comparisons.
- Insightful Visualizations: Explore key metrics including:
- Performance vs. Parameter Count
- Score Composition
- Cost Analysis
- Architecture Distribution
- Cost-Efficiency Leaderboard: Identify the best-performing models relative to their cost.
- Model Comparison Tool: Easily compare multiple models side by side.
All insights in one unified, actionable report: no more scattered data.
Create this comprehensive report from your models in just two commands:
- Run a silent, CSV-exported model evaluation:

```shell
python score_models.py --all --quiet --csv
```

- Generate visualizations and the final report:

```shell
python score_models.py --graph
```

You can customize the scoring process with the following optional flags:
| Flag | Description | Example |
|---|---|---|
| `--all` | Score all models found in the Models/ directory (or the folder set via `--models-dir`). | `python score_models.py --all` |
| `--models-dir PATH` | Directory containing model JSON files (defaults to ./Models). | `python score_models.py --all --models-dir CustomModels/` |
| `--quiet` | Suppress all informational output and print only the final scores to the console. Useful for scripting. | `python score_models.py --all --quiet` |
| `--config <path>` | Path to a custom Python configuration file to override the default scoring parameters. | `python score_models.py ModelName --config my_config.py` |
| `--csv` | Generate a CSV report from existing results. | `python score_models.py --csv` |
| `--graph` | Generate a graph report from the existing CSV report. | `python score_models.py --graph` |
| `--skip-hf-score` | (Fill pipeline) Skip Hugging Face telemetry calls if you need to conserve quota or run offline. | `python tools/fill-benchmark-pipeline/llm_benchmark_pipeline.py --skip-hf-score ...` |
You can also call the scoring functions directly from your Python code. Import the necessary functions from model_scoring.scoring.score_models and use them programmatically.
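For downstream analysis, the Results files can also be consumed directly. The helper below is an illustrative sketch (a hypothetical function, not part of the package API), assuming the Results Data Format described in this README:

```python
import json
from pathlib import Path

def load_leaderboard(results_dir: str = "Results") -> list[tuple[str, float]]:
    """Read every result JSON in results_dir and return
    (model_name, final_score) pairs, highest score first."""
    rows = []
    for path in Path(results_dir).glob("*.json"):
        data = json.loads(path.read_text(encoding="utf-8"))
        rows.append((data["model_name"], data["scores"]["final_score"]))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```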
β οΈ Note: The Interactive Shell is currently in active development and is considered an experimental feature. APIs and UI elements may change frequently.
The LLMScore Shell (llmscore shell) has undergone a comprehensive UI/UX overhaul to support daily evaluation workflows with a modern, multi-pane console experience. This unified interface brings together execution, monitoring, and context management into a single ergonomic workspace.
- Multi-Pane Workspace: A responsive 3-column layout featuring:
- Command Stream: The central hub for execution and logs.
- Timeline Pane: A chronological feed of events, actions, and system state.
- Context Pane: Persistent view of active configurations, pinned metrics, and scratchpad notes.
- Smart Dock: A fixed, always-visible input area that anchors the bottom of the screen, hosting the prompt, autocomplete suggestions, and quick-action status chips.
- Command Palette 2.0: A keyboard-centric (`Ctrl+K`) palette for quick navigation, command execution, and workspace management, featuring fuzzy search and grouped actions.
- Structured Output Cards: Rich, interactive result cards that replace raw text logs, offering organized summaries, copy-to-clipboard buttons, and export options.
We are committed to a shell that works for everyone. Beta v0.7 introduces:
- Deterministic Focus Order: Cycle seamlessly between panes (Timeline → Context → Command → Dock) using `Ctrl+P`, with screen-reader announcements.
- Reduced Motion Mode: A dedicated configuration flag to disable spinners and sliding animations for a static, distraction-free experience.
- Color-Blind Support: High-contrast palette tokens ensuring readability and distinction for status indicators and charts.
| Shortcut | Action |
|---|---|
| `Ctrl+K` | Open Command Palette |
| `Ctrl+P` | Cycle Pane Focus (Timeline → Context → Stream → Dock) |
| `Ctrl+T` | Toggle Timeline Pane |
| `Ctrl+C` | Toggle Context Pane |
| `Ctrl+Shift+L` | Clear Screen |
For a complete guide on commands, layout configuration, and troubleshooting, see the Shell Documentation.
Results will be stored as JSON files in the Results directory, with the following structure (example for Deepseek-R1):
```json
{
  "model_name": "Deepseek-R1",
  "scores": {
    "entity_score": 18.84,
    "dev_score": 23.06,
    "community_score": 16.76,
    "technical_score": 16.96,
    "final_score": 75.63,
    "avg_performance": 73.21
  },
  "entity_benchmarks": {
    "artificial_analysis": 60.22,
    "OpenCompass": 86.7,
    "LLM Explorer": 59.0,
    "Livebench": 72.49,
    "open_llm": null,
    "UGI Leaderboard": 55.65,
    "big_code_bench": 35.1,
    "EvalPlus Leaderboard": null,
    "Dubesord_LLM": 70.5,
    "Open VLM": null
  },
  "dev_benchmarks": {
    "MMLU": 90.8,
    "MMLU Pro": 84.0,
    "GPQA diamond": 71.5,
    "DROP": 92.2,
    "IFEval": 83.3,
    "Arena-Hard": 92.3,
    "MATH": 97.3,
    "AIME": 79.8,
    "LiveCodeBench": 65.9,
    "Aider Polyglot": 53.3,
    "SWE-Bench": 49.2,
    "C-Eval or CMMLU": 91.8
  },
  "community_score": {
    "lm_sys_arena_score": 1389,
    "hf_score": 8.79
  },
  "model_specs": {
    "input_price": 0.55,
    "output_price": 2.19,
    "context_window": 128000,
    "param_count": 685,
    "architecture": "moe"
  }
}
```

We employ a multi-layer testing strategy to ensure stability, accuracy, and performance across LLMScoreEngine.
- Unit & Contract: Validates business logic, schemas, and action handlers in isolation.
- Integration: Verifies interactions with the file system, session store, and external APIs.
- End-to-End (CLI): Validates full user flows, including `shell`, `run`, and `exec` modes.
- UI Snapshots: Regression testing for the Shell UI using `pytest-regressions` to ensure layout stability and visual consistency.
- Performance: Smoke tests for critical paths (`score.batch`, `results.leaderboard`) and concurrency stress tests for automation.
Our GitHub Actions workflow (perf-accessibility.yml) enforces quality gates on every push:
- Static Analysis: `ruff`, `mypy`.
- Test Suite: Runs the full `pytest` suite, including legacy regressions and new shell UI tests.
- Performance Gates: Checks for runtime regressions (>20%) and memory leaks.
- Accessibility: Verifies focus order and reduced-motion compliance.
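The actual thresholds are enforced in perf-accessibility.yml; purely as a hedged illustration (the budget value and the `score_batch_stub` function here are hypothetical), a runtime-budget test inside the suite could look like:

```python
import time

# Hypothetical budget; the real thresholds live in perf-accessibility.yml.
RUNTIME_BUDGET_S = 0.5

def score_batch_stub(n: int) -> list[float]:
    """Stand-in for a critical path such as score.batch."""
    return [i * 0.1 for i in range(n)]

def test_score_batch_within_budget() -> None:
    start = time.perf_counter()
    score_batch_stub(10_000)
    elapsed = time.perf_counter() - start
    # Performance gate: fail if the critical path regresses past its budget.
    assert elapsed < RUNTIME_BUDGET_S
```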
```shell
# Run fast unit tests
pytest tests/scoring tests/data tests/config

# Run Shell UI regression tests
pytest tests/shell/test_layout.py tests/shell/test_output_cards.py --regression-fail-under=100

# Run full suite
pytest
```

For a deep dive into our testing methodology, please refer to dev_docs/TESTING.md.
This project is licensed under the MIT License - see the LICENSE.md file for details.
