
Commit 0b03d69

reasoning support, unit tests, bug fixes

1 parent 8b3f01e commit 0b03d69

12 files changed: +916 -55 lines changed

CHANGELOG.md

Lines changed: 1 addition & 13 deletions

@@ -9,21 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 - Comprehensive unit test suite with 31 tests covering all major components
-- Test configuration in `pytest.ini` with markers for unit and integration tests
-- Shared test fixtures in `conftest.py` for mocking external dependencies
-- Tests for configuration module (`test_config.py`)
-- Tests for AI/benchmarking module (`test_ai.py`)
-- Tests for display/metrics module (`test_display.py`)
-- Tests for CLI commands (`test_cli.py`)
-- Development dependencies for testing (pytest, pytest-asyncio, pytest-mock)
 
 ### Changed
-- Project structure now includes a dedicated `tests/` directory
-- All external API calls are properly mocked in tests for reliability
+- Renamed `--lim` to `--tokens`
 
-### Developer Experience
-- Run tests with `uv run pytest` or `uv run pytest -v` for verbose output
-- Tests achieve high coverage of core functionality without requiring API keys
 
 ## [0.5.0] - 2025-01-27
 

@@ -85,7 +74,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 - `--lim` parameter to control maximum tokens per response (default: 2000)
 - Full async/parallel execution for all benchmarks
-- Dynamic model testing with `test-models` command
 
 ### Changed
 - Tokens/second is now the primary metric instead of raw time

README.md

Lines changed: 2 additions & 2 deletions

@@ -63,13 +63,13 @@ pip install tacho
 tacho gpt-4.1-nano gemini/gemini-2.0-flash
 
 # Custom settings
-tacho gpt-4.1-nano gemini/gemini-2.0-flash --runs 3 --lim 1000
+tacho gpt-4.1-nano gemini/gemini-2.0-flash --runs 3 --tokens 1000
 ```
 
 ### Command options
 
 - `--runs, -r`: Number of inference runs per model (default: 5)
-- `--lim, -l`: Maximum tokens to generate per response (default: 500)
+- `--tokens, -t`: Maximum tokens to generate per response (default: 500)
 - `--prompt, -p`: Custom prompt for benchmarking
 
 ## Output

ROADMAP.md

Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+## Features
+
+- [ ] Nicely display different types of LiteLLM errors
+  - [ ] Not authorized
+  - [ ] Model not found
+  - [ ] Provider not found (?)
+  - [ ] Rate Limit
+  - [ ] Others
+
+
+## Bugs

docs/index.html

Lines changed: 14 additions & 13 deletions

@@ -357,16 +357,17 @@ <h1>tacho - LLM Speed Test</h1>
 <span class="terminal-dot"></span>
 <span class="terminal-dot"></span>
 </div>
-<div class="terminal-content"><span class="dim">$</span> tacho gpt-4.1-nano gemini/gemini-2.0-flash
-
-<span class="success"></span> gemini/gemini-2.0-flash
-<span class="success"></span> gpt-4.1-nano
-┌─────────────────────────┬───────────┬───────────┬───────────┬──────────┐
-│ Model                   │ Avg tok/s │ Min tok/s │ Max tok/s │ Avg Time │
-├─────────────────────────┼───────────┼───────────┼───────────┼──────────┤
-│ gemini/gemini-2.0-flash │ 124.0     │ 110.5     │ 136.6     │ 4.0s     │
-│ gpt-4.1-nano            │ 116.9     │ 105.4     │ 129.5     │ 4.3s     │
-└─────────────────────────┴───────────┴───────────┴───────────┴──────────┘</div>
+<div class="terminal-content"><span class="dim">$</span> tacho gpt-4.1 gemini/gemini-2.5-pro vertex_ai/claude-sonnet-4@20250514
+<span class="success"></span> gpt-4.1
+<span class="success"></span> vertex_ai/claude-sonnet-4@20250514
+<span class="success"></span> gemini/gemini-2.5-pro
+┌────────────────────────────────────┬───────────┬───────────┬───────────┬──────────┐
+│ Model                              │ Avg tok/s │ Min tok/s │ Max tok/s │ Avg Time │
+├────────────────────────────────────┼───────────┼───────────┼───────────┼──────────┤
+│ gemini/gemini-2.5-pro              │ 84.6      │ 50.3      │ 133.8     │ 13.44s   │
+│ gpt-4.1                            │ 49.7      │ 35.1      │ 66.6      │ 10.75s   │
+│ vertex_ai/claude-sonnet-4@20250514 │ 48.7      │ 47.3      │ 50.9      │ 10.27s   │
+└────────────────────────────────────┴───────────┴───────────┴───────────┴──────────┘</div>
 </div>
 </section>
 

@@ -409,8 +410,8 @@ <h3>🔒 100% Private</h3>
 <p>No telemetry or data sent to our servers</p>
 </div>
 <div class="feature">
-<h3>🏓 Ping Models</h3>
-<p>Quickly verify API keys and model access</p>
+<h3>🧠 Reasoning Support</h3>
+<p>Accurately takes into account thinking tokens</p>
 </div>
 </div>
 </section>

@@ -425,7 +426,7 @@ <h2>Usage</h2>
 
 <p>Run benchmarks with custom settings:</p>
 <div class="code-block">
-<code>tacho gpt-4.1-nano claude-3.5-haiku --runs 3 --lim 1000</code>
+<code>tacho gpt-4.1-nano claude-3.5-haiku --runs 3 --tokens 1000</code>
 </div>
 </section>
 </main>

pyproject.toml

Lines changed: 5 additions & 1 deletion

@@ -21,10 +21,14 @@ classifiers = [
     "Topic :: Scientific/Engineering :: Artificial Intelligence",
 ]
 dependencies = [
-    "litellm[bedrock,vertex_ai]>=1.73.1",
+    "boto3>=1.38.46",
+    "google-auth>=2.40.3",
+    "google>=3.0.0",
+    "litellm>=1.73.1",
     "python-dotenv>=1.1.1",
     "rich>=14.0.0",
     "typer>=0.16.0",
+    "google-cloud-aiplatform>=1.100.0",
 ]
 
 [project.urls]

tacho/ai.py

Lines changed: 7 additions & 1 deletion

@@ -26,5 +26,11 @@ async def bench_model(model: str, max_tokens: int) -> tuple[float, int]:
     start_time = time.time()
     response = await llm(model, BENCHMARK_PROMPT, max_tokens)
     duration = time.time() - start_time
-    tokens = response.usage.completion_tokens if response.usage else 0
+
+
+    tokens = response.usage.completion_tokens
+    if hasattr(response.usage, 'completion_tokens_details') and response.usage.completion_tokens_details:
+        if hasattr(response.usage.completion_tokens_details, 'reasoning_tokens'):
+            tokens += response.usage.completion_tokens_details.reasoning_tokens
+
     return duration, tokens
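The tuple returned here drives the tokens-per-second figures in the results table, so counting reasoning tokens means thinking models are no longer under-reported. A minimal usage sketch, assuming an API key for the chosen model is configured; `o1-mini` is only an illustrative model name:

    import asyncio

    from tacho.ai import bench_model

    async def main() -> None:
        # Single benchmark run; `tokens` now includes reasoning tokens when the
        # provider reports completion_tokens_details, so tok/s covers all output.
        duration, tokens = await bench_model("o1-mini", 500)
        print(f"{tokens} tokens in {duration:.2f}s -> {tokens / duration:.1f} tok/s")

    asyncio.run(main())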

tacho/cli.py

Lines changed: 5 additions & 5 deletions

@@ -26,14 +26,14 @@ def cli_main(
     ctx: typer.Context,
     models: list[str] | None = typer.Argument(None),
     runs: int = typer.Option(5, "--runs", "-r"),
-    lim: int = typer.Option(500, "--lim", "-l"),
+    tokens: int = typer.Option(500, "--tokens", "-t"),
     version: bool | None = typer.Option(
         None, "--version", callback=version_callback, is_eager=True
     ),
 ):
     """Default command when models are provided directly"""
     if ctx.invoked_subcommand is None and models:
-        bench(models, runs, lim)
+        bench(models, runs, tokens)
 
 
 @app.command()

@@ -43,16 +43,16 @@ def bench(
         help="List of models to benchmark using LiteLLM names",
     ),
     runs: int = typer.Option(5, "--runs", "-r", help="Number of runs per model"),
-    lim: int = typer.Option(
-        500, "--lim", "-l", help="Maximum tokens to generate per response"
+    tokens: int = typer.Option(
+        500, "--tokens", "-t", help="Maximum tokens to generate per response"
     ),
 ):
     """Benchmark inference speed of different LLM models"""
     res = asyncio.run(run_pings(models))
     valid_models = [models[i] for i in range(len(models)) if res[i]]
     if not valid_models:
         raise typer.Exit(1)
-    res = asyncio.run(run_benchmarks(valid_models, runs, lim))
+    res = asyncio.run(run_benchmarks(valid_models, runs, tokens))
     display_results(valid_models, runs, res)
 
 
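Because the old flag is removed outright rather than kept as an alias, invocations that pass `--lim` will now fail. A quick smoke check of the rename at the CLI boundary, sketched with Typer's test runner under the assumption that `app` is the Typer instance defined in `tacho/cli.py` (implied by the `@app.command()` decorator above) and that `--help` triggers no model calls:

    from typer.testing import CliRunner

    from tacho.cli import app

    runner = CliRunner()
    result = runner.invoke(app, ["--help"])

    # The renamed option should be advertised; the removed one should not appear.
    assert "--tokens" in result.output
    assert "--lim" not in result.output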

tacho/config.py

Lines changed: 1 addition & 12 deletions

@@ -24,30 +24,19 @@ def ensure_env_file():
     if not env_path.exists():
         template = """# Tacho Configuration File
 # Add your API keys here to avoid exporting them each time
-#
-# === BASIC PROVIDERS ===
-# OpenAI
 # OPENAI_API_KEY=sk-...
 #
-# Anthropic
 # ANTHROPIC_API_KEY=sk-ant-...
 #
-# Google Gemini
 # GEMINI_API_KEY=...
 #
-# Groq
-# GROQ_API_KEY=gsk_...
-#
-# === CLOUD PROVIDERS ===
-# AWS Bedrock (requires all three)
 # AWS_ACCESS_KEY_ID=...
 # AWS_SECRET_ACCESS_KEY=...
 # AWS_REGION_NAME=us-east-1
 #
-# Google Vertex AI (requires both)
 # GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
 # VERTEXAI_PROJECT=your-gcp-project-id
-# VERTEXAI_LOCATION=us-central1  # optional, defaults to us-central1
+# VERTEXAI_LOCATION=us-east5  # optional, defaults to us-east5
 #
 # For more providers, see: https://docs.litellm.ai/docs/providers

tests/test_ai.py

Lines changed: 58 additions & 3 deletions

@@ -78,9 +78,13 @@ async def test_bench_model_success(self, mock_litellm, mocker):
         mock_time = mocker.patch('tacho.ai.time.time')
         mock_time.side_effect = [100.0, 102.5]  # 2.5 second duration
 
-        # Configure mock response with usage data
+        # Configure mock response with usage data (no reasoning tokens)
         mock_response = MagicMock()
-        mock_response.usage.completion_tokens = 150
+        mock_usage = MagicMock()
+        mock_usage.completion_tokens = 150
+        # Explicitly configure to not have completion_tokens_details
+        mock_usage.completion_tokens_details = None
+        mock_response.usage = mock_usage
         mock_litellm.return_value = mock_response
 
         duration, tokens = await bench_model("gpt-4", 500)

@@ -120,4 +124,55 @@ async def test_bench_model_exception_handling(self, mock_litellm):
         mock_litellm.side_effect = Exception("Network error")
 
         with pytest.raises(Exception, match="Network error"):
-            await bench_model("gpt-4", 500)
+            await bench_model("gpt-4", 500)
+
+    @pytest.mark.asyncio
+    async def test_bench_model_with_reasoning_tokens(self, mock_litellm, mocker):
+        """Test benchmark with reasoning models that have completion_tokens_details"""
+        # Mock time
+        mock_time = mocker.patch('tacho.ai.time.time')
+        mock_time.side_effect = [100.0, 103.0]  # 3 second duration
+
+        # Configure mock response with reasoning tokens
+        mock_response = MagicMock()
+        mock_response.usage.completion_tokens = 50  # Regular completion tokens
+
+        # Mock completion_tokens_details with reasoning_tokens
+        mock_details = MagicMock()
+        mock_details.reasoning_tokens = 200  # Reasoning tokens
+        mock_response.usage.completion_tokens_details = mock_details
+
+        mock_litellm.return_value = mock_response
+
+        duration, tokens = await bench_model("o1-mini", 500)
+
+        # Verify results - should include both completion and reasoning tokens
+        assert duration == 3.0
+        assert tokens == 250  # 50 completion + 200 reasoning
+
+        # Verify LLM was called correctly
+        mock_litellm.assert_called_once_with(
+            "o1-mini",
+            [{"role": "user", "content": BENCHMARK_PROMPT}],
+            max_tokens=500
+        )
+
+    @pytest.mark.asyncio
+    async def test_bench_model_with_empty_completion_details(self, mock_litellm, mocker):
+        """Test benchmark when completion_tokens_details exists but has no reasoning_tokens"""
+        # Mock time
+        mock_time = mocker.patch('tacho.ai.time.time')
+        mock_time.side_effect = [100.0, 102.0]
+
+        # Configure mock response with completion_tokens_details but no reasoning_tokens
+        mock_response = MagicMock()
+        mock_response.usage.completion_tokens = 100
+        mock_response.usage.completion_tokens_details = MagicMock(spec=[])  # No reasoning_tokens attribute
+
+        mock_litellm.return_value = mock_response
+
+        duration, tokens = await bench_model("gpt-4", 500)
+
+        # Should only count regular completion tokens
+        assert duration == 2.0
+        assert tokens == 100

tests/test_cli.py

Lines changed: 3 additions & 3 deletions

@@ -45,7 +45,7 @@ def test_bench_command_success(self, runner, mocker):
         mock_display = mocker.patch('tacho.cli.display_results')
 
         # Call bench directly
-        bench(["gpt-4", "claude-3"], runs=5, lim=500)
+        bench(["gpt-4", "claude-3"], runs=5, tokens=500)
 
         # Verify display was called with correct arguments
         mock_display.assert_called_once()

@@ -76,7 +76,7 @@ def test_bench_command_with_options(self, runner, mocker):
         mock_display = mocker.patch('tacho.cli.display_results')
 
         # Call bench directly with options
-        bench(["gpt-4"], runs=10, lim=1000)
+        bench(["gpt-4"], runs=10, tokens=1000)
 
         # Verify display was called
         mock_display.assert_called_once()

@@ -142,7 +142,7 @@ def test_bench_function_partial_valid_models(self, mocker):
         mock_display = mocker.patch('tacho.cli.display_results')
 
         # Call bench with mixed valid/invalid models
-        bench(["gpt-4", "invalid", "claude-3"], runs=3, lim=250)
+        bench(["gpt-4", "invalid", "claude-3"], runs=3, tokens=250)
 
         # Verify only valid models were benchmarked
         mock_display.assert_called_once()
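Together with the new reasoning-token cases in tests/test_ai.py, these CLI tests run without API keys because all LiteLLM calls are mocked; they can be executed with `uv run pytest -v`, or narrowed to a single file, e.g. `uv run pytest tests/test_ai.py -v`.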
