9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@
# Harbor eval output
jobs/

# Python
evals/.venv/
__pycache__/

# Generated
evals/dashboard.html
3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,3 +1,6 @@
[submodule "plugins/temporal-developer/skills/temporal-developer"]
path = plugins/temporal-developer/skills/temporal-developer
url = https://github.com/temporalio/skill-temporal-developer
[submodule "evals/harbor"]
path = evals/harbor
url = https://github.com/donald-pinckney/harbor.git
122 changes: 122 additions & 0 deletions evals/README.md
@@ -0,0 +1,122 @@
# Temporal Developer Skill Evals

Benchmarks for measuring the impact of the `temporal-developer` skill on code generation quality.

## Prerequisites

- [uv](https://docs.astral.sh/uv/) installed
- `ANTHROPIC_API_KEY` set in your environment

## Quick Start

```bash
# Easiest: dev mode (starts Temporal server + worker automatically)
uv run --project evals eval-run skills --dev-mode
uv run --project evals eval-run baseline --dev-mode
```

Or with an external Temporal server:

```bash
# Terminal 1: Start the Temporal dev server
temporal server start-dev

# Terminal 2: Start the eval worker
uv run --project evals eval-worker

# Terminal 3: Run evals
uv run --project evals eval-run skills # with skill (before submitting a PR)
uv run --project evals eval-run baseline # without skill (when tasks or Harbor change)
```

## Extending Agent Configurations

Edit the `AGENTS` list in `temporal_evals/run_evals.py`:

```python
AGENTS = [
    AgentConfig(name="claude-code", model="anthropic/claude-sonnet-4-6"),
    AgentConfig(name="claude-code", model="anthropic/claude-opus-4-1"),
    AgentConfig(name="aider", model="anthropic/claude-sonnet-4-6"),
]
```

Each agent config runs across all datasets in parallel via the Temporal workflow.
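For orientation, `AgentConfig` can be pictured as a small frozen dataclass. This is a hypothetical sketch; the real definition lives in `temporal_evals/models.py` and may carry additional fields:

```python
from dataclasses import dataclass

# Hypothetical sketch of the config assumed by the AGENTS list above;
# the real definition in temporal_evals/models.py may differ.
@dataclass(frozen=True)
class AgentConfig:
    name: str   # Harbor agent name, e.g. "claude-code"
    model: str  # provider-qualified model id, e.g. "anthropic/claude-sonnet-4-6"
```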

## Directory Structure

```
evals/
├── harbor/ # Harbor framework (git submodule)
├── pyproject.toml # Python project (temporalio dependency)
├── results.yaml # Eval results with skills (keyed by commit SHA)
├── baseline.yaml # Baseline results without skills (flat array)
├── temporal_evals/ # Temporal workflow package
│ ├── models.py # Shared dataclasses
│ ├── activities/
│ │ ├── harbor.py # Harbor job execution activity
│ │ └── record.py # Result recording activity
│ ├── workflows/
│ │ └── eval_workflow.py # EvalWorkflow (fans out jobs)
│ ├── worker.py # Worker entry point
│ └── run_evals.py # CLI starter (baseline or skills)
├── datasets/
│ ├── temporal-python/
│ │ └── greeting-workflow/ # Python greeting workflow task
│ ├── temporal-typescript/
│ │ └── greeting-workflow/ # TypeScript greeting workflow task
│ └── temporal-questions/
│ └── what-is-workflow-determinism/ # Q/A: workflow determinism
└── templates/ # Copyable templates for new tasks
├── temporal-python/
├── temporal-typescript/
└── temporal-questions/
```

## Adding a New Task

### Code generation tasks (temporal-python, temporal-typescript)

1. Copy the template: `cp -r evals/templates/temporal-python evals/datasets/temporal-python/my-task`
2. Edit `instruction.md` with the task prompt
3. Edit `tests/test.py` with validation checks (write rewards to `/logs/verifier/reward.json`)
4. Edit `solution/solve.sh` with a reference solution
5. Adjust `task.toml` timeouts and resources as needed
6. Optionally customize `environment/Dockerfile`

### Q/A tasks (temporal-questions)

1. Copy the template: `cp -r evals/templates/temporal-questions evals/datasets/temporal-questions/my-question`
2. Edit `instruction.md` with your question (must tell the agent to write to `answer.md`)
3. Edit the rubric in `tests/rubric.md` — define what key concepts a correct answer should cover
4. Edit `solution/solve.sh` with a reference answer

Q/A tasks use LLM-as-judge (Claude Haiku) to grade the answer against the rubric. The verifier gets `ANTHROPIC_API_KEY` via the `[verifier.env]` section in `task.toml`.
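The `[verifier.env]` wiring might look like the following in `task.toml`. The passthrough syntax shown is an assumption about Harbor's config format; check the Harbor docs for the exact form:

```toml
# Assumed syntax: forward the host's ANTHROPIC_API_KEY into the verifier
# container so the LLM judge can call the Anthropic API.
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```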

## Tracking Results

Two result files, both version-controlled:

- **`results.yaml`** — with-skill results, keyed by commit SHA. Updated on every skill change.
- **`baseline.yaml`** — no-skill results, flat array. Updated rarely (when tasks or Harbor change).

**Before submitting a PR:**

```bash
uv run --project evals eval-run skills --dev-mode
# Commit both the skill changes and the updated results.yaml
```

## Scoring

Tests write per-check rewards to `/logs/verifier/reward.json` as a dict of named scores (0.0 or 1.0 each).
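A minimal sketch of that reward-writing step, assuming only that the verifier reads a JSON dict of named scores from that path (the helper name and path parameter are illustrative, not part of Harbor's API):

```python
import json
from pathlib import Path

# Illustrative helper: collect named per-check scores (0.0 or 1.0 each) and
# write them where the verifier expects them. Tasks only need the file
# written; the helper itself is our own convenience, not a Harbor API.
def write_rewards(rewards: dict[str, float],
                  path: Path = Path("/logs/verifier/reward.json")) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rewards))
```

For example, `write_rewards({"syntax_valid": 1.0, "runs_successfully": 0.0})` records one passing and one failing check.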

**Q/A tasks** use LLM-as-judge with a rubric:

| Score | Meaning |
|-------|---------|
| 1.0 | Covers all rubric areas accurately with good depth |
| 0.75 | Covers most areas well, or all with minor gaps |
| 0.5 | Covers about half, or has some inaccuracies |
| 0.25 | Covers little, or is mostly vague |
| 0.0 | Missing, wrong, or off-topic |
3 changes: 3 additions & 0 deletions evals/baseline.yaml
@@ -0,0 +1,3 @@
# Baseline eval results (no skills enabled).
# Re-run when eval tasks or Harbor change, not on every skill change.
# Generated by: temporal_evals.activities.record
@@ -0,0 +1,27 @@
FROM ubuntu:24.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
curl \
git \
python3.12 \
python3.12-venv \
python3-pip \
&& rm -rf /var/lib/apt/lists/*

# Node.js 20.x
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*

# Temporal CLI
RUN curl -sSf https://temporal.download/cli.sh | sh
ENV PATH="/root/.temporalio/bin:${PATH}"

# uv (for running Python scripts with inline deps)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"

RUN mkdir -p /workspace /logs/verifier
WORKDIR /workspace
@@ -0,0 +1,14 @@
Create a multi-step workflow using Temporal in Python that orchestrates 3 or more
sequential activities with distinct processing steps (e.g., data validation,
processing, and notification). Include comprehensive logging at each step so the
workflow's progress can be traced.

The workflow should accept an input string and process it through the activity chain.

Write all code in a single file called run_workflow.py. It should be runnable with:
`uv run python run_workflow.py "test input"`

The program should print output showing each step's execution and the final result.

The Temporal dev server is available on this system, but you may need to add
dependencies to this project.
@@ -0,0 +1,20 @@
version = "1.0"

[metadata]
author_name = "Temporal Developer Skill"
difficulty = "hard"
category = "programming"
tags = ["temporal", "python", "workflow"]

[agent]
timeout_sec = 300.0

[verifier]
timeout_sec = 120.0

[environment]
build_timeout_sec = 300.0
cpus = 2
memory_mb = 8192
storage_mb = 10240
allow_internet = true
@@ -0,0 +1,81 @@
"""Validates multi-step workflow logging implementation."""

import asyncio
import json
import os
import py_compile
import re
import subprocess
from pathlib import Path

# Install temporalio before importing it (system Python, no venv here).
subprocess.run(
    ["pip", "install", "--break-system-packages", "-q", "temporalio"], check=True
)

from temporalio.testing import WorkflowEnvironment

WORKSPACE = Path("/workspace")
REWARD_PATH = Path("/logs/verifier/reward.json")
rewards: dict[str, float] = {}


def check(key: str, passed: bool, label: str) -> None:
    rewards[key] = 1.0 if passed else 0.0
    status = "PASS" if passed else "FAIL"
    print(f"{status}: {label}")


def run(cmd: list[str], timeout: int = 30) -> tuple[str, str, int]:
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, cwd=WORKSPACE
        )
        return result.stdout, result.stderr, result.returncode
    except subprocess.TimeoutExpired:
        return "", "TIMEOUT", -1


def any_py_contains(pattern: str) -> bool:
    for f in WORKSPACE.rglob("*.py"):
        try:
            if re.search(pattern, f.read_text()):
                return True
        except Exception:
            pass
    return False


os.chdir(WORKSPACE)


async def main() -> None:
    # start_local exposes a Temporal server on port 7233 for the generated
    # code to connect to; the environment object is only needed for lifecycle.
    async with await WorkflowEnvironment.start_local(port=7233):
        _, _, sync_rc = run(["uv", "sync"], timeout=60)
        check("deps_installed", sync_rc == 0, "uv sync succeeds")

        stdout, stderr, rc = run(
            ["uv", "run", "python", "run_workflow.py", "test input"], timeout=15
        )
        check("runs_successfully", rc == 0, "program exits cleanly")
        check("correct_output", len(stdout.strip()) > 0, "produces output")
        check(
            "has_workflow_defn",
            any_py_contains(r"@workflow\.defn"),
            "@workflow.defn decorator",
        )

    py_files = list(WORKSPACE.rglob("*.py"))
    all_compile = True
    for f in py_files:
        try:
            py_compile.compile(str(f), doraise=True)
        except py_compile.PyCompileError:
            all_compile = False
            break
    check("syntax_valid", all_compile, "all .py files compile")

    REWARD_PATH.parent.mkdir(parents=True, exist_ok=True)
    REWARD_PATH.write_text(json.dumps(rewards))


asyncio.run(main())
@@ -0,0 +1,2 @@
#!/bin/bash
exec python3 /tests/test.py
@@ -0,0 +1,27 @@
FROM ubuntu:24.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
curl \
git \
python3.12 \
python3.12-venv \
python3-pip \
&& rm -rf /var/lib/apt/lists/*

# Node.js 20.x
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*

# Temporal CLI
RUN curl -sSf https://temporal.download/cli.sh | sh
ENV PATH="/root/.temporalio/bin:${PATH}"

# uv (for running Python scripts with inline deps)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"

RUN mkdir -p /workspace /logs/verifier
WORKDIR /workspace
@@ -0,0 +1,19 @@
Create an order management system using Temporal in Python with a distributed queue
workflow. The system should support:

- Push: Add an order to the queue (each order has order_id, item_name, quantity)
- Pop: Remove and return the next order from the queue
- GetQueueLength: Return the current number of orders in the queue
- GetAllOrders: Return all orders currently in the queue

The workflow should run indefinitely, allowing multiple clients to interact with it.

Create two files:
- worker.py: Contains the workflow definition and starts the worker
- example_client.py: Demonstrates pushing orders, querying the queue, and popping orders

Running `uv run python example_client.py` should demonstrate the full lifecycle
(push some orders, check length, get all, pop one).

The Temporal dev server is available on this system, but you may need to add
dependencies to this project.
@@ -0,0 +1,20 @@
version = "1.0"

[metadata]
author_name = "Temporal Developer Skill"
difficulty = "hard"
category = "programming"
tags = ["temporal", "python", "workflow"]

[agent]
timeout_sec = 300.0

[verifier]
timeout_sec = 120.0

[environment]
build_timeout_sec = 300.0
cpus = 2
memory_mb = 8192
storage_mb = 10240
allow_internet = true