A command-line tool for comparing different LLM models hosted on the Hyperbolic API using MMLU benchmarks.
- Compare two models on MMLU benchmarks with customizable test parameters
- Measure comprehensive metrics:
  - Speed: Time to first token, total latency, tokens per second
  - Accuracy: MMLU score based on correct answers
  - Quality: Consistency and text similarity scores
  - Cost: Token usage costs and cost-performance ratio
- Test model consistency with efficient text similarity metrics (see the similarity sketch after this list)
- Generate detailed side-by-side comparison reports
- Robust error handling with exponential backoff for API rate limits
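The consistency check reruns the same prompt several times and scores how similar the outputs are. Below is a minimal sketch of such a score using difflib; the exact scoring in hypercompare.py may differ.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(outputs: list[str]) -> float:
    """Average pairwise similarity of repeated completions for one prompt.

    Illustrative sketch only; the scoring used by hypercompare.py may differ.
    """
    if len(outputs) < 2:
        return 1.0
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(outputs, 2)]
    return sum(ratios) / len(ratios)

# Three runs of the same prompt: a value close to 1.0 means highly consistent output
print(consistency_score([
    "Paris is the capital of France.",
    "Paris is the capital of France.",
    "The capital of France is Paris.",
]))
```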
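The exponential backoff mentioned above retries a failed request with exponentially growing delays before giving up. A minimal sketch of the pattern, assuming any callable that performs one API request (the actual retry logic in hypercompare.py may differ):

```python
import time

def call_with_backoff(request, max_retries: int = 3, base_delay: float = 1.0):
    """Retry request() with exponential backoff. Illustrative sketch only."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception as exc:  # in practice, catch the client's rate-limit error specifically
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ... for base_delay=1.0
            print(f"Request failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```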
Install the tool:

pip install -e .

Basic usage:

python hypercompare.py "deepseek-ai/DeepSeek-V3-0324" "Qwen/QwQ-32B"

Options:

- -s, --subjects: Number of subjects to test (default: 2)
- -q, --questions: Questions per subject (default: 3)
- -p, --prompts: Number of prompts for consistency testing (default: 2)
- -r, --runs: Number of runs per prompt for consistency (default: 3)
- --rate-limit-delay: Delay between API calls to avoid rate limiting (default: 1.0)
- --max-retries: Maximum number of retries for API calls (default: 3)
- -v, --verbose: Enable verbose output
Example of a quick run with minimal settings:

python hypercompare.py "deepseek-ai/DeepSeek-V3-0324" "Qwen/QwQ-32B" -s 1 -q 1 -p 1 -r 1 --rate-limit-delay 2.0

Requirements:

- Python 3.10+
- Dependencies (install via pip):
  - openai
  - python-dotenv
  - numpy
  - difflib (part of the Python standard library; no pip install needed)
Setup:

- Clone this repository
- Create a .env file in the project root with your Hyperbolic API key:
  HYPERBOLIC_API_KEY=your_api_key_here
- Install dependencies:
  pip install -r requirements.txt
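Under the hood the tool reads this key with python-dotenv and calls Hyperbolic through the OpenAI-compatible client. A minimal sketch of that connection is shown below; the base URL is an assumption here, so confirm it against Hyperbolic's documentation.

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # loads HYPERBOLIC_API_KEY from the .env file

client = OpenAI(
    api_key=os.environ["HYPERBOLIC_API_KEY"],
    base_url="https://api.hyperbolic.xyz/v1",  # assumed endpoint; verify in Hyperbolic's docs
)

# Single test request against one of the models being compared
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Reply with the single letter of the correct answer: ..."}],
)
print(response.choices[0].message.content)
```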
The tool provides a detailed comparison summary including:
- Speed metrics (time to first token, total latency, tokens/sec)
- Accuracy metrics (MMLU score, quality assessment)
- Cost analysis (token costs, cost-performance ratio)
Results are also saved to a JSON file for further analysis.
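As a starting point for further analysis, the saved results can be loaded like this (the file name below is a placeholder; use the path written by the tool):

```python
import json

# Placeholder path; substitute the JSON file written by hypercompare.py
with open("comparison_results.json") as f:
    results = json.load(f)

print(json.dumps(results, indent=2))
```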