9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@
# Harbor eval output
jobs/

# Python
evals/.venv/
__pycache__/

# Generated
evals/dashboard.html
3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,3 +1,6 @@
[submodule "plugins/temporal-developer/skills/temporal-developer"]
path = plugins/temporal-developer/skills/temporal-developer
url = https://github.com/temporalio/skill-temporal-developer
[submodule "evals/harbor"]
path = evals/harbor
url = https://github.com/donald-pinckney/harbor.git
122 changes: 122 additions & 0 deletions evals/README.md
@@ -0,0 +1,122 @@
# Temporal Developer Skill Evals

Benchmarks for measuring the impact of the `temporal-developer` skill on code generation quality.

## Prerequisites

- [uv](https://docs.astral.sh/uv/) installed
- `ANTHROPIC_API_KEY` set in your environment

## Quick Start

```bash
# Easiest: dev mode (starts Temporal server + worker automatically)
uv run --project evals eval-run skills --dev-mode
uv run --project evals eval-run baseline --dev-mode
```

Or with an external Temporal server:

```bash
# Terminal 1: Start the Temporal dev server
temporal server start-dev

# Terminal 2: Start the eval worker
uv run --project evals eval-worker

# Terminal 3: Run evals
uv run --project evals eval-run skills # with skill (before submitting a PR)
uv run --project evals eval-run baseline # without skill (when tasks or Harbor change)
```

## Extending Agent Configurations

Edit the `AGENTS` list in `temporal_evals/run_evals.py`:

```python
AGENTS = [
    AgentConfig(name="claude-code", model="anthropic/claude-sonnet-4-6"),
    AgentConfig(name="claude-code", model="anthropic/claude-opus-4-1"),
    AgentConfig(name="aider", model="anthropic/claude-sonnet-4-6"),
]
```

Each agent config runs across all datasets in parallel via the Temporal workflow.
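For orientation, `AgentConfig` can be pictured as a small frozen dataclass. This is a hypothetical sketch; the real definition lives in `temporal_evals/models.py` and may carry additional fields:

```python
from dataclasses import dataclass

# Hypothetical sketch of the config assumed by the AGENTS list above;
# the real definition in temporal_evals/models.py may differ.
@dataclass(frozen=True)
class AgentConfig:
    name: str   # Harbor agent name, e.g. "claude-code"
    model: str  # provider-qualified model id, e.g. "anthropic/claude-sonnet-4-6"
```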

## Directory Structure

```
evals/
├── harbor/ # Harbor framework (git submodule)
├── pyproject.toml # Python project (temporalio dependency)
├── results.yaml # Eval results with skills (keyed by commit SHA)
├── baseline.yaml # Baseline results without skills (flat array)
├── temporal_evals/ # Temporal workflow package
│ ├── models.py # Shared dataclasses
│ ├── activities/
│ │ ├── harbor.py # Harbor job execution activity
│ │ └── record.py # Result recording activity
│ ├── workflows/
│ │ └── eval_workflow.py # EvalWorkflow (fans out jobs)
│ ├── worker.py # Worker entry point
│ └── run_evals.py # CLI starter (baseline or skills)
├── datasets/
│ ├── temporal-python/
│ │ └── greeting-workflow/ # Python greeting workflow task
│ ├── temporal-typescript/
│ │ └── greeting-workflow/ # TypeScript greeting workflow task
│ └── temporal-questions/
│ └── what-is-workflow-determinism/ # Q/A: workflow determinism
└── templates/ # Copyable templates for new tasks
├── temporal-python/
├── temporal-typescript/
└── temporal-questions/
```

## Adding a New Task

### Code generation tasks (temporal-python, temporal-typescript)

1. Copy the template: `cp -r evals/templates/temporal-python evals/datasets/temporal-python/my-task`
2. Edit `instruction.md` with the task prompt
3. Edit `tests/test.py` with validation checks (write rewards to `/logs/verifier/reward.json`)
4. Edit `solution/solve.sh` with a reference solution
5. Adjust `task.toml` timeouts and resources as needed
6. Optionally customize `environment/Dockerfile`

### Q/A tasks (temporal-questions)

1. Copy the template: `cp -r evals/templates/temporal-questions evals/datasets/temporal-questions/my-question`
2. Edit `instruction.md` with your question (must tell the agent to write to `answer.md`)
3. Edit the rubric in `tests/rubric.md` — define what key concepts a correct answer should cover
4. Edit `solution/solve.sh` with a reference answer

Q/A tasks use LLM-as-judge (Claude Haiku) to grade the answer against the rubric. The verifier gets `ANTHROPIC_API_KEY` via the `[verifier.env]` section in `task.toml`.
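The `[verifier.env]` wiring might look like the following in `task.toml`. The passthrough syntax shown is an assumption about Harbor's config format; check the Harbor docs for the exact form:

```toml
# Assumed syntax: forward the host's ANTHROPIC_API_KEY into the verifier
# container so the LLM judge can call the Anthropic API.
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```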

## Tracking Results

Two result files, both version-controlled:

- **`results.yaml`** — with-skill results, keyed by commit SHA. Updated on every skill change.
- **`baseline.yaml`** — no-skill results, flat array. Updated rarely (when tasks or Harbor change).

**Before submitting a PR:**

```bash
uv run --project evals eval-run skills --dev-mode
# Commit both the skill changes and the updated results.yaml
```

## Scoring

Tests write per-check rewards to `/logs/verifier/reward.json` as a dict of named scores (0.0 or 1.0 each).
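A minimal sketch of that reward-writing step, assuming only that the verifier reads a JSON dict of named scores from that path (the helper name and path parameter are illustrative, not part of Harbor's API):

```python
import json
from pathlib import Path

# Illustrative helper: collect named per-check scores (0.0 or 1.0 each) and
# write them where the verifier expects them. Tasks only need the file
# written; the helper itself is our own convenience, not a Harbor API.
def write_rewards(rewards: dict[str, float],
                  path: Path = Path("/logs/verifier/reward.json")) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rewards))
```

For example, `write_rewards({"syntax_valid": 1.0, "runs_successfully": 0.0})` records one passing and one failing check.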

**Q/A tasks** use LLM-as-judge with a rubric:

| Score | Meaning |
|-------|---------|
| 1.0 | Covers all rubric areas accurately with good depth |
| 0.75 | Covers most areas well, or all with minor gaps |
| 0.5 | Covers about half, or has some inaccuracies |
| 0.25 | Covers little, or is mostly vague |
| 0.0 | Missing, wrong, or off-topic |
3 changes: 3 additions & 0 deletions evals/baseline.yaml
@@ -0,0 +1,3 @@
# Baseline eval results (no skills enabled).
# Re-run when eval tasks or Harbor change, not on every skill change.
# Generated by: temporal_evals.activities.record
@@ -0,0 +1,27 @@
FROM ubuntu:24.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
curl \
git \
python3.12 \
python3.12-venv \
python3-pip \
&& rm -rf /var/lib/apt/lists/*

# Node.js 20.x
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*

# Temporal CLI
RUN curl -sSf https://temporal.download/cli.sh | sh
ENV PATH="/root/.temporalio/bin:${PATH}"

# uv (for running Python scripts with inline deps)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"

RUN mkdir -p /workspace /logs/verifier
WORKDIR /workspace
@@ -0,0 +1,14 @@
Create a multi-step workflow using Temporal in Python that orchestrates 3 or more
sequential activities with distinct processing steps (e.g., data validation,
processing, and notification). Include comprehensive logging at each step so the
workflow's progress can be traced.

The workflow should accept an input string and process it through the activity chain.

Write all code in a single file called run_workflow.py. It should be runnable with:
`uv run python run_workflow.py "test input"`

The program should print output showing each step's execution and the final result.

The Temporal dev server is available on this system, but you may need to add
dependencies to this project.
@@ -0,0 +1,20 @@
version = "1.0"

[metadata]
author_name = "Temporal Developer Skill"
difficulty = "hard"
category = "programming"
tags = ["temporal", "python", "workflow"]

[agent]
timeout_sec = 300.0

[verifier]
timeout_sec = 120.0

[environment]
build_timeout_sec = 300.0
cpus = 2
memory_mb = 8192
storage_mb = 10240
allow_internet = true
@@ -0,0 +1,81 @@
"""Validates multi-step workflow logging implementation."""

import asyncio
import json
import os
import py_compile
import re
import subprocess
from pathlib import Path

# Install temporalio before importing it (system Python, no venv here).
subprocess.run(
    ["pip", "install", "--break-system-packages", "-q", "temporalio"], check=True
)

from temporalio.testing import WorkflowEnvironment

WORKSPACE = Path("/workspace")
REWARD_PATH = Path("/logs/verifier/reward.json")
rewards: dict[str, float] = {}


def check(key: str, passed: bool, label: str) -> None:
    rewards[key] = 1.0 if passed else 0.0
    status = "PASS" if passed else "FAIL"
    print(f"{status}: {label}")


def run(cmd: list[str], timeout: int = 30) -> tuple[str, str, int]:
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, cwd=WORKSPACE
        )
        return result.stdout, result.stderr, result.returncode
    except subprocess.TimeoutExpired:
        return "", "TIMEOUT", -1


def any_py_contains(pattern: str) -> bool:
    for f in WORKSPACE.rglob("*.py"):
        try:
            if re.search(pattern, f.read_text()):
                return True
        except Exception:
            pass
    return False


os.chdir(WORKSPACE)


async def main() -> None:
    # start_local exposes a Temporal server on port 7233 for the generated
    # code to connect to; the environment object is only needed for lifecycle.
    async with await WorkflowEnvironment.start_local(port=7233):
        _, _, sync_rc = run(["uv", "sync"], timeout=60)
        check("deps_installed", sync_rc == 0, "uv sync succeeds")

        stdout, stderr, rc = run(
            ["uv", "run", "python", "run_workflow.py", "test input"], timeout=15
        )
        check("runs_successfully", rc == 0, "program exits cleanly")
        check("correct_output", len(stdout.strip()) > 0, "produces output")
        check(
            "has_workflow_defn",
            any_py_contains(r"@workflow\.defn"),
            "@workflow.defn decorator",
        )

    py_files = list(WORKSPACE.rglob("*.py"))
    all_compile = True
    for f in py_files:
        try:
            py_compile.compile(str(f), doraise=True)
        except py_compile.PyCompileError:
            all_compile = False
            break
    check("syntax_valid", all_compile, "all .py files compile")

    REWARD_PATH.parent.mkdir(parents=True, exist_ok=True)
    REWARD_PATH.write_text(json.dumps(rewards))


asyncio.run(main())
@@ -0,0 +1,2 @@
#!/bin/bash
exec python3 /tests/test.py
@@ -0,0 +1,27 @@
FROM ubuntu:24.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
curl \
git \
python3.12 \
python3.12-venv \
python3-pip \
&& rm -rf /var/lib/apt/lists/*

# Node.js 20.x
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*

# Temporal CLI
RUN curl -sSf https://temporal.download/cli.sh | sh
ENV PATH="/root/.temporalio/bin:${PATH}"

# uv (for running Python scripts with inline deps)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"

RUN mkdir -p /workspace /logs/verifier
WORKDIR /workspace
@@ -0,0 +1,19 @@
Create an order management system using Temporal in Python with a distributed queue
workflow. The system should support:

- Push: Add an order to the queue (each order has order_id, item_name, quantity)
- Pop: Remove and return the next order from the queue
- GetQueueLength: Return the current number of orders in the queue
- GetAllOrders: Return all orders currently in the queue

The workflow should run indefinitely, allowing multiple clients to interact with it.

Create two files:
- worker.py: Contains the workflow definition and starts the worker
- example_client.py: Demonstrates pushing orders, querying the queue, and popping orders

Running `uv run python example_client.py` should demonstrate the full lifecycle
(push some orders, check length, get all, pop one).

The Temporal dev server is available on this system, but you may need to add
dependencies to this project.
@@ -0,0 +1,20 @@
version = "1.0"

[metadata]
author_name = "Temporal Developer Skill"
difficulty = "hard"
category = "programming"
tags = ["temporal", "python", "workflow"]

[agent]
timeout_sec = 300.0

[verifier]
timeout_sec = 120.0

[environment]
build_timeout_sec = 300.0
cpus = 2
memory_mb = 8192
storage_mb = 10240
allow_internet = true