Commit 23f7026

kumquat committed

Add autocompare tracing, UI model add, and CI tests

Implements OpenAI/Anthropic autocompare instrumentation, trace-first logging and dashboard views, monthly model presets, and secondary metrics callbacks. Adds UI + model replay flow and a minimal CI workflow with pytest + Playwright coverage.

Made-with: Cursor

1 parent 4d60c0b commit 23f7026

25 files changed

Lines changed: 1327 additions & 565 deletions

.github/workflows/ci.yml

Lines changed: 35 additions & 0 deletions

```yaml
name: ci

on:
  pull_request:
  push:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-22.04
    steps:
      - name: Checkout repo
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -e ".[dev]"

      - name: Lint
        run: |
          ruff check .
          ruff format --check .

      - name: Install Playwright browsers
        run: python -m playwright install --with-deps chromium

      - name: Unit and UI tests
        run: pytest -q
```
README.md

Lines changed: 81 additions & 5 deletions

````diff
@@ -20,7 +20,7 @@ pip install smollest[all] # both
 
 ## Usage
 
-Install `openai` from `smollest` and then write your code as normal!
+Install `openai` from `smollest` and write your code as normal:
 
 ```python
 from smollest import openai
@@ -66,6 +66,33 @@ result = client.messages.create(
 )
 ```
 
+Or instrument existing SDK usage with `autocompare()`:
+
+```python
+import openai
+from smollest.openai import autocompare
+
+autocompare(project="my-project")
+client = openai.OpenAI()
+client.chat.completions.create(
+    model="gpt-4.1-mini",
+    messages=[{"role": "user", "content": "Return JSON sentiment"}],
+)
+```
+
+```python
+import anthropic
+from smollest.anthropic import autocompare
+
+autocompare(project="my-project")
+client = anthropic.Anthropic()
+client.messages.create(
+    model="claude-sonnet-4-20250514",
+    max_tokens=120,
+    messages=[{"role": "user", "content": "Return JSON sentiment"}],
+)
+```
+
 ## How it works
 
 1. Your API call goes to the baseline model as normal
@@ -75,6 +102,31 @@ result = client.messages.create(
 
 Remote candidates run in parallel; local candidates run sequentially.
 
+## Model presets
+
+Default comparison candidates come from a date-indexed preset list and resolve to the latest month automatically. You can inspect or pin a month:
+
+```python
+from smollest import get_default_candidates
+
+latest = get_default_candidates()
+march = get_default_candidates("2026-03")
+```
+
+## Secondary metrics
+
+You can register callbacks to compute arbitrary metrics for baseline and candidate runs:
+
+```python
+from smollest import register_secondary_metric
+
+def co2_metric(payload: dict) -> dict[str, float]:
+    tokens = payload.get("input_tokens", 0) + payload.get("output_tokens", 0)
+    return {"co2_g": tokens * 0.00009}
+
+register_secondary_metric(co2_metric)
+```
+
 ## Dashboard
 
 ```bash
@@ -83,12 +135,36 @@ smollest show
 
 Opens a web dashboard with projects in the sidebar, a results table with truncation for long outputs, latency and cost per model, and aggregate match rates. The image above shows the UI, which you can reproduce by cloning this repo and running: `python examples/demo_dashboard.py`
 
+The dashboard now includes:
 
-## Roadmap
+- Trace view with input/output inspection
+- Model size badges
+- Secondary metrics display
+- A `+` column action to add another model and replay saved traces against it
 
-- Allow adding additional models directly through the UI
-- Add LLM as judge to score outputs that are not structured
-- Let developers eaisly fine tune models on outputs
+## Examples
+
+Two runnable example groups are provided:
+
+- `examples/mock/` for quick local seeding to inspect UI states
+  - `seed_basic.py`
+  - `seed_traces.py`
+  - `seed_secondary_metrics.py`
+- `examples/real/` for real SDK usage patterns (requires API keys)
+  - `openai_wrapper_basic.py`
+  - `openai_autocompare_chat.py`
+  - `openai_autocompare_responses.py`
+  - `anthropic_wrapper_basic.py`
+  - `anthropic_autocompare_messages.py`
+  - `openai_secondary_metrics.py`
+
+Run one:
+
+```bash
+python examples/mock/seed_traces.py
+smollest show
+```
 
 ## License
 
````

examples/demo_dashboard.py

Lines changed: 0 additions & 1 deletion

```diff
@@ -8,7 +8,6 @@
 import json
 import random
 from datetime import datetime, timedelta, timezone
-from pathlib import Path
 
 from smollest.results import DATA_DIR
 from smollest.web import show
```

examples/mock/seed_basic.py

Lines changed: 58 additions & 0 deletions

```python
from __future__ import annotations

import uuid

from smollest.compare import ComparisonResult
from smollest.results import log_result


def main() -> None:
    project = "mock-basic"
    baseline_messages = [
        {"role": "system", "content": "Return compact JSON."},
        {"role": "user", "content": "Classify: I loved the service."},
    ]
    for i in range(3):
        trace_id = str(uuid.uuid4())
        baseline_content = '{"label":"positive","confidence":0.9}'
        for candidate_name, score in [
            ("Qwen/Qwen3.5-3B-Instruct", 1.0),
            ("meta-llama/Llama-3.1-8B-Instruct", 0.5),
        ]:
            comparison = ComparisonResult(
                candidate=candidate_name,
                score=score,
                total_fields=2,
                matching_fields=["label"] if score < 1.0 else ["label", "confidence"],
                mismatched_fields=[]
                if score == 1.0
                else [{"field": "confidence", "baseline": 0.9, "candidate": 0.7}],
            )
            log_result(
                project=project,
                provider="openai",
                baseline_model="gpt-4.1-mini",
                baseline_model_size="small",
                baseline_messages=baseline_messages,
                baseline_content=baseline_content,
                baseline_latency_ms=180 + i * 11,
                baseline_input_tokens=34,
                baseline_output_tokens=12,
                baseline_cost=0.00004,
                baseline_secondary_metrics={},
                comparison=comparison,
                candidate_content='{"label":"positive","confidence":0.7}',
                candidate_model_size="small",
                candidate_latency_ms=120 + i * 8,
                candidate_input_tokens=34,
                candidate_output_tokens=12,
                candidate_cost=0.0,
                candidate_secondary_metrics={},
                trace_id=trace_id,
                parent_span_id=str(uuid.uuid4()),
                input_payload={"model": "gpt-4.1-mini"},
            )


if __name__ == "__main__":
    main()
```
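The seeded scores above (1.0 for two matching fields, 0.5 for one of two) follow a field-match pattern that can be sketched as follows. `field_match_score` is a hypothetical illustration of that scoring, not the `smollest.compare` implementation:

```python
import json


def field_match_score(baseline: str, candidate: str) -> float:
    # Fraction of baseline JSON fields the candidate reproduces exactly.
    base = json.loads(baseline)
    cand = json.loads(candidate)
    if not base:
        return 1.0
    matches = sum(1 for key, value in base.items() if cand.get(key) == value)
    return matches / len(base)
```

Against the seed data, a candidate matching `label` but not `confidence` scores 0.5, as logged above.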
examples/mock/seed_secondary_metrics.py

Lines changed: 46 additions & 0 deletions

```python
from __future__ import annotations

import uuid

from smollest.compare import ComparisonResult
from smollest.results import log_result


def main() -> None:
    project = "mock-secondary-metrics"
    messages = [{"role": "user", "content": "Extract topic from this paragraph."}]
    trace_id = str(uuid.uuid4())
    baseline = "topic=mlops"
    models = [
        ("mistralai/Mistral-Small-24B-Instruct-2501", "large", 240.0, 0.00011, 0.024),
        ("Qwen/Qwen3.5-3B-Instruct", "small", 90.0, 0.0, 0.007),
    ]
    for model, size, latency_ms, cost, co2 in models:
        log_result(
            project=project,
            provider="anthropic",
            baseline_model="claude-sonnet-4-20250514",
            baseline_model_size="large",
            baseline_messages=messages,
            baseline_content=baseline,
            baseline_latency_ms=330.0,
            baseline_input_tokens=98,
            baseline_output_tokens=15,
            baseline_cost=0.0005,
            baseline_secondary_metrics={"co2_g": 0.041},
            comparison=ComparisonResult(candidate=model, score=1.0, total_fields=1),
            candidate_content=baseline,
            candidate_model_size=size,
            candidate_latency_ms=latency_ms,
            candidate_input_tokens=98,
            candidate_output_tokens=15,
            candidate_cost=cost,
            candidate_secondary_metrics={"co2_g": co2},
            trace_id=trace_id,
            parent_span_id=str(uuid.uuid4()),
            input_payload={"model": "claude-sonnet-4-20250514"},
        )


if __name__ == "__main__":
    main()
```
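The metric registry this seed data exercises could, in outline, look like the sketch below. The internal names (`_metrics`, `compute_secondary_metrics`) are assumptions for illustration, not smollest's actual internals; only `register_secondary_metric` appears in the README:

```python
from typing import Callable

SecondaryMetric = Callable[[dict], dict[str, float]]
_metrics: list[SecondaryMetric] = []


def register_secondary_metric(metric: SecondaryMetric) -> None:
    _metrics.append(metric)


def compute_secondary_metrics(payload: dict) -> dict[str, float]:
    # Merge the outputs of every registered callback into one dict,
    # which is what lands in the *_secondary_metrics columns above.
    combined: dict[str, float] = {}
    for metric in _metrics:
        combined.update(metric(payload))
    return combined


# The README's co2 example, registered against this sketch.
register_secondary_metric(
    lambda p: {"co2_g": (p.get("input_tokens", 0) + p.get("output_tokens", 0)) * 0.00009}
)
```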

examples/mock/seed_traces.py

Lines changed: 59 additions & 0 deletions

```python
from __future__ import annotations

import json
import uuid

from smollest.compare import ComparisonResult
from smollest.results import log_result


def seed_trace(trace_input: str, baseline_output: str, project: str) -> None:
    trace_id = str(uuid.uuid4())
    base_messages = [{"role": "user", "content": trace_input}]
    candidates = [
        ("Qwen/Qwen3.5-3B-Instruct", baseline_output, 1.0),
        # Perturb the answer digit so the mismatch is real for every trace;
        # a bare .replace("2", "3") would leave '{"answer":"4"}' unchanged
        # while still logging it with score 0.0.
        (
            "meta-llama/Llama-3.1-8B-Instruct",
            baseline_output.replace("2", "3").replace("4", "5"),
            0.0,
        ),
    ]
    for model, output, score in candidates:
        comparison = ComparisonResult(
            candidate=model,
            score=score,
            total_fields=1,
            matching_fields=["answer"] if score == 1.0 else [],
            mismatched_fields=[]
            if score == 1.0
            else [
                {
                    "field": "answer",
                    # Derive the mismatch from the actual payloads rather than
                    # hardcoding "2" vs "3", which was wrong for the 2+2 trace.
                    "baseline": json.loads(baseline_output)["answer"],
                    "candidate": json.loads(output)["answer"],
                }
            ],
        )
        log_result(
            project=project,
            provider="openai",
            baseline_model="gpt-4.1",
            baseline_model_size="large",
            baseline_messages=base_messages,
            baseline_content=baseline_output,
            baseline_latency_ms=250.0,
            baseline_input_tokens=20,
            baseline_output_tokens=5,
            baseline_cost=0.00009,
            baseline_secondary_metrics={},
            comparison=comparison,
            candidate_content=output,
            candidate_model_size="small",
            candidate_latency_ms=110.0,
            candidate_input_tokens=20,
            candidate_output_tokens=5,
            candidate_cost=0.0,
            candidate_secondary_metrics={},
            trace_id=trace_id,
            parent_span_id=str(uuid.uuid4()),
            input_payload={"model": "gpt-4.1"},
        )


def main() -> None:
    project = "mock-traces"
    seed_trace('{"task":"math","question":"1+1?"}', '{"answer":"2"}', project)
    seed_trace('{"task":"math","question":"2+2?"}', '{"answer":"4"}', project)


if __name__ == "__main__":
    main()
```
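The trace view that consumes this seed data needs to bucket logged rows by their shared `trace_id`. A minimal sketch of that grouping step (a hypothetical helper, not smollest's implementation):

```python
from collections import defaultdict


def group_by_trace(rows: list[dict]) -> dict[str, list[dict]]:
    # Bucket result rows by trace_id so all candidate runs for one
    # input appear together in a single trace view.
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        grouped[row["trace_id"]].append(row)
    return dict(grouped)
```

Each `seed_trace` call above logs two rows under one `trace_id`, so grouping yields one bucket of two candidates per trace.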
examples/real/anthropic_autocompare_messages.py

Lines changed: 32 additions & 0 deletions

```python
from __future__ import annotations

import anthropic

from smollest.anthropic import autocompare


def main() -> None:
    autocompare(
        project="real-anthropic-autocompare",
        candidates=[
            "Qwen/Qwen3.5-3B-Instruct",
            "mistralai/Mistral-Small-24B-Instruct-2501",
        ],
    )
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=120,
        system="Return compact JSON only.",
        messages=[
            {
                "role": "user",
                "content": 'Extract entities from: "Apple acquired a startup in Paris."',
            }
        ],
    )
    print(response.content[0].text)


if __name__ == "__main__":
    main()
```
examples/real/anthropic_wrapper_basic.py

Lines changed: 22 additions & 0 deletions

```python
from __future__ import annotations

from smollest import anthropic


def main() -> None:
    client = anthropic.Anthropic(project="real-anthropic-wrapper-basic")
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=120,
        messages=[
            {
                "role": "user",
                "content": 'Return JSON with fields "intent" and "priority" for: "Server is down in eu-west-1"',
            }
        ],
    )
    print(response.content[0].text)


if __name__ == "__main__":
    main()
```
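In outline, a project-scoped wrapper like the `smollest.anthropic.Anthropic` used above could delegate to the real SDK while recording each request. Everything below (`RecordingMessages`, `StubMessages`) is invented for illustration — not smollest's actual implementation — and runs without API keys:

```python
class RecordingMessages:
    # Hypothetical sketch: forward create() to an inner messages client
    # while recording each request under a project name.
    def __init__(self, inner, project: str):
        self._inner = inner
        self.project = project
        self.calls: list[dict] = []

    def create(self, **kwargs):
        self.calls.append({"project": self.project, "request": kwargs})
        return self._inner.create(**kwargs)


class StubMessages:
    # Stand-in for the real SDK so the sketch runs offline.
    def create(self, **kwargs):
        return {"model": kwargs.get("model"), "content": "ok"}


messages = RecordingMessages(StubMessages(), project="real-anthropic-wrapper-basic")
response = messages.create(model="claude-sonnet-4-20250514", max_tokens=120)
```

The recorded `calls` list is what a comparison layer could later replay against candidate models.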
