132 changes: 132 additions & 0 deletions .claude/commands/recipe-ify.md
@@ -0,0 +1,132 @@
# Recipe-ify

Generate a complete, PR-ready recipe for the AI Cookbook from a pattern description.

Usage: `/project:recipe-ify <pattern description or proposal card>`

The input can be:
- A proposal card from `/project:recipe-scout`
- A freeform description ("generate a RAG pipeline recipe using OpenAI")
- Anything in between

---

## What you do

You are an expert at writing reference-quality AI Cookbook recipes. Generate ALL files for the recipe described in `$ARGUMENTS`, following the conventions below exactly. Produce complete, runnable files — not stubs or placeholders.

**Audience reminder:** Recipes target AI Engineers who are comfortable with LLMs and agents but are new to Temporal. The AI pattern is the hero; Temporal is the invisible durability layer underneath. Don't over-explain Temporal mechanics — focus on making the AI concept clear.

---

## Cookbook conventions

**Directory:** `{category}/{recipe-name}_python/`
Categories: `foundations` (single LLM call or simple pattern), `agents` (agentic loops, tool use), `deep_research` (multi-agent), `mcp` (MCP servers)

**Naming:**
- Task queue: `{recipe-name}-task-queue`
- Workflow class: `PascalCaseWorkflow`
- Activity functions: `snake_case`
- Request/response models: `ActivityNameRequest`, `ActivityNameResponse`
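As a rough illustration of the naming rules above, the sketch below derives the conventional identifiers from a recipe name and an activity name. The helper `recipe_names` and its return keys are hypothetical, introduced only for this example; they are not part of the cookbook or of any library.

```python
import re


def recipe_names(recipe_name: str, activity_name: str) -> dict[str, str]:
    """Derive conventional identifiers (hypothetical helper, for illustration).

    recipe_name is assumed to be kebab-case or snake_case, e.g. "rag-pipeline".
    activity_name is assumed to be snake_case, e.g. "retrieve_context".
    """
    recipe_words = [w for w in re.split(r"[-_]", recipe_name) if w]
    activity_words = [w for w in activity_name.split("_") if w]

    workflow_prefix = "".join(w.capitalize() for w in recipe_words)
    activity_prefix = "".join(w.capitalize() for w in activity_words)

    return {
        "task_queue": f"{recipe_name}-task-queue",
        "workflow_class": f"{workflow_prefix}Workflow",
        "activity_function": activity_name,
        "request_model": f"{activity_prefix}Request",
        "response_model": f"{activity_prefix}Response",
    }
```

For a recipe called `rag-pipeline` with an activity `retrieve_context`, this yields the task queue `rag-pipeline-task-queue`, the workflow class `RagPipelineWorkflow`, and the models `RetrieveContextRequest` and `RetrieveContextResponse`.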

**Always:**
- LLM clients: `max_retries=0` — Temporal handles retries, not the client
- Data converter: `pydantic_data_converter` everywhere — in `Client.connect()`, in `WorkflowEnvironment.start_time_skipping()`
- Activity timeouts: always specify `start_to_close_timeout` (30s default; increase for research/LLM tasks)
- Non-retryable errors: catch them and raise `ApplicationError(..., non_retryable=True)`
- Python: `>=3.10,<3.14`
- Temporalio: `>=1.15.0,<2`

---

## Files to generate

### `README.md`

Must open with this exact front matter block:
```
<!--
description: One-sentence description of what the recipe demonstrates.
tags: [category, python, provider]
priority: 500
-->
```
Then: title, 1–2 paragraph overview of what the recipe teaches, prerequisites, how to run:
```
uv sync
uv run python -m worker # terminal 1
uv run python -m start_workflow # terminal 2
```
End with what to expect in the output.
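One way to check that a generated README opens with the front matter block described above is sketched here. The function `parse_front_matter` and the regular expression are assumptions made for illustration; the cookbook's own tooling may validate this block differently.

```python
import re

# Matches the comment block only when it is the very first thing in the file.
_FRONT_MATTER = re.compile(
    r"\A<!--\n"
    r"description: (?P<description>.+)\n"
    r"tags: \[(?P<tags>[^\]]*)\]\n"
    r"priority: (?P<priority>\d+)\n"
    r"-->\n"
)


def parse_front_matter(readme_text: str) -> dict:
    """Return the front matter fields, or raise ValueError if the block is missing."""
    match = _FRONT_MATTER.match(readme_text)
    if match is None:
        raise ValueError("README must open with the front matter comment block")
    return {
        "description": match["description"].strip(),
        "tags": [t.strip() for t in match["tags"].split(",") if t.strip()],
        "priority": int(match["priority"]),
    }
```

The pattern is deliberately strict about field order so that a README with reordered or missing fields fails fast.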

### `pyproject.toml`
```toml
[project]
name = "cookbook-{recipe-name}-python"
version = "0.1"
description = "..."
authors = [{ name = "Temporal Technologies Inc", email = "sdk@temporal.io" }]
requires-python = ">=3.10,<3.14"
readme = "README.md"
license = "MIT"
dependencies = [
    "temporalio>=1.15.0,<2",
    # LLM provider SDK
]

[dependency-groups]
dev = [
    "pytest>=9.0.3",
    "pytest-timeout>=2.4.0",
    "pytest-asyncio>=0.26.0",
]

[tool.pytest.ini_options]
pythonpath = ["."]
```

### `worker.py`
```python
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker
from temporalio.contrib.pydantic import pydantic_data_converter

async def main():
    client = await Client.connect("localhost:7233", data_converter=pydantic_data_converter)
    worker = Worker(client, task_queue="...-task-queue", workflows=[...], activities=[...])
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
```

### `start_workflow.py`
Connect with `pydantic_data_converter`, call `execute_workflow`, print result.

### `workflows/{name}.py`
- `@workflow.defn` class with `@workflow.run` method
- Calls activities via `workflow.execute_activity(fn, request, start_to_close_timeout=...)`
- Pure orchestration — no LLM calls, no I/O

### `activities/{name}.py`
- `@activity.defn` functions
- LLM client initialized with `max_retries=0`
- Request model defined at top of file (dataclass or Pydantic `BaseModel`)
- Catch non-retryable API errors → `ApplicationError(..., non_retryable=True)`

### `tests/test_{name}.py`
- `@pytest.mark.asyncio` and `@pytest.mark.timeout(30)` on every test
- Use `WorkflowEnvironment.start_time_skipping(data_converter=pydantic_data_converter)`
- Register mock activities in the Worker to avoid real API calls
- Cover at minimum: happy path, and the key edge case the recipe is about

---

## After generating files

1. Report what was created and the directory path
2. Show how to run: `cd {dir} && uv sync && uv run pytest tests/`
3. List any env vars needed (API keys, etc.) and where to set them
4. Note any deliberate simplifications made to keep the recipe bite-sized
90 changes: 90 additions & 0 deletions .claude/commands/recipe-scout.md
@@ -0,0 +1,90 @@
# Recipe Scout

Analyze an external project and identify which parts would make good AI Cookbook recipes.

Usage: `/project:recipe-scout <github-url>`

## What you do

You are an expert at spotting teachable, self-contained AI patterns in real-world projects. Your job is to produce proposal cards that a reviewer — who may never have seen the source project — can use to decide what's worth building into a recipe.

**Audience reminder:** The AI Cookbook targets AI Engineers who are comfortable with LLMs and agents but are new to Temporal. Recipes should teach *AI building blocks* — patterns for how agents think, decide, call tools, and coordinate — with Temporal providing durability underneath. Do NOT propose patterns that are primarily about Temporal orchestration, distributed systems, or infrastructure; those belong in Temporal's own documentation, not here.

---

### Step 1 — Fetch and analyze the project

Fetch the repository at `$ARGUMENTS`. Collect:
- The README (for intent and architecture overview)
- The full file tree (via GitHub API: `https://api.github.com/repos/{owner}/{repo}/git/trees/main?recursive=1`; substitute the repository's default branch if it is not `main`)
- Key source files: LLM integration code, agent/tool patterns, prompt construction, workflow definitions
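A minimal sketch of turning the repository URL in `$ARGUMENTS` into the tree endpoint from the list above. The function name `tree_api_url` and its `branch` parameter are hypothetical; the branch defaults to `main` as in the URL template, which is an assumption that will not hold for every repository. The sketch only builds the URL and makes no network call.

```python
from urllib.parse import urlparse


def tree_api_url(repo_url: str, branch: str = "main") -> str:
    """Build the recursive git-tree API URL for a GitHub repository URL."""
    parsed = urlparse(repo_url)
    if parsed.netloc not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {repo_url}")

    # Path looks like /owner/repo, optionally followed by more segments.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected https://github.com/<owner>/<repo>, got: {repo_url}")

    owner = parts[0]
    repo = parts[1].removesuffix(".git")
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}?recursive=1"
```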

Look specifically for these **AI building block** patterns, which make strong recipes:
- **Agentic loop** — LLM called in a loop until a stop condition (tool use, stop sequence, empty tool calls)
- **Forced completion** — On the final loop iteration, `tool_choice` is constrained to a specific tool so the agent must commit to a decision rather than looping forever
- **Tool calling** — LLM invokes structured tools; results fed back into the conversation
- **Parallel tool calls** — LLM requests multiple tools simultaneously; all results must be collected before the next turn
- **Multi-agent coordination / agent supervisor** — One agent spawns or delegates to sub-agents; results are aggregated
- **Structured output** — LLM output is parsed and validated against a Pydantic schema
- **Human-in-the-loop** — Workflow pauses and waits for a human decision before continuing
- **Streaming output** — Activity emits incremental tokens/chunks rather than waiting for full completion
- **RAG (retrieval-augmented generation)** — Retrieved context injected into the prompt before calling the LLM
- **Short-term memory** — Conversation history carried across turns within a single workflow run
- **Long-term memory** — Facts or summaries persisted across workflow runs and retrieved on demand
- **Context summarization** — Long conversation history compressed (e.g., via `continue_as_new`) to stay within context limits
- **Guardrails** — LLM output checked against a policy before being acted on; rejected outputs are blocked or re-requested
- **Chain-of-thought / tree-of-thought** — LLM explicitly reasons through steps before producing a final answer
- **Prompt injection prevention** — Untrusted external data is isolated from control instructions (e.g., XML tags, separate message turns)
- **Dynamic system prompts** — System instructions constructed at runtime from context (user prefs, retrieved docs, current state)
- **Cost/token tracking** — Token usage recorded per workflow run for budgeting or rate-limiting
- **Multi-provider LLM abstraction** — Single interface that dispatches to Anthropic, OpenAI, LiteLLM, or local models

Ignore patterns that are primarily about Temporal internals (workflow ID policies, heartbeats, signal/query handlers, replay determinism) unless they are a natural, invisible part of an AI pattern above.
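To make the first two building blocks in the list above concrete (agentic loop and forced completion), here is a provider-agnostic sketch. Every name in it (`ToolCall`, `LLMReply`, `run_agent`, the `submit_answer` tool) is hypothetical, and the `llm` callable stands in for a real provider SDK call; in a real recipe that call would live in an activity.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class LLMReply:
    text: str = ""
    tool_calls: list = field(default_factory=list)


# Assumed LLM interface: (messages, forced_tool_name_or_None) -> LLMReply
LLMFn = Callable[[list, Optional[str]], LLMReply]


def run_agent(
    llm: LLMFn,
    tools: dict,
    question: str,
    max_turns: int = 3,
    final_tool: str = "submit_answer",
) -> dict:
    """Loop until the LLM stops calling tools or commits via the final tool."""
    messages = [{"role": "user", "content": question}]

    for turn in range(max_turns):
        # Forced completion: on the last turn, constrain the LLM to the final tool.
        is_last_turn = turn == max_turns - 1
        reply = llm(messages, final_tool if is_last_turn else None)

        # Stop condition: no tool calls means the LLM answered directly.
        if not reply.tool_calls:
            return {"answer": reply.text}

        for call in reply.tool_calls:
            if call.name == final_tool:
                return call.arguments
            # Run the tool and feed its result back into the conversation.
            result = tools[call.name](**call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": str(result)})

    raise RuntimeError("LLM did not call the forced tool on the final turn")
```

The turn budget is what guarantees termination: without the forced final tool, an agent that keeps requesting tools would exhaust the loop with no decision.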

---

### Step 2 — Produce proposal cards

The cookbook has a wishlist of use cases not yet covered. Patterns that fill one of these gaps should be ranked higher:
- RAG pipeline
- Streaming output
- Short-term or long-term memory
- Context summarization (ContinueAsNew)
- Agent supervisor / multi-agent swarm
- Guardrails
- Chain-of-thought / tree-of-thought
- Cost/token tracking
- Trigger-based AI (event-driven or timer-based)
- Web crawler

For each candidate pattern you find, evaluate:
1. **Is it an AI building block?** Would an AI engineer recognize this as a useful pattern for their LLM/agent work, independent of what orchestrator they use?
2. **Is it well-engineered, not a demo?** The cookbook publishes reference-quality code, not flashy one-offs.
3. **Is it self-contained?** Can it stand alone as a 200–400 line recipe without pulling in the entire project?
4. **Is it teachable?** Does it demonstrate a single clear concept a developer can learn from?
5. **Is it novel vs. existing recipes?** Check existing recipes in this repo (foundations/, agents/, deep_research/, mcp/).
6. **Does it fill a wishlist gap?** Cross-reference against the coverage wishlist above.

Rank the top 2–4 patterns. For each, write a proposal card with the following sections — written so a reviewer who has never seen the source project can evaluate it:

**Proposed recipe:** `{category}/{recipe-name}_python`

**One-line description:** _(the README front matter `description` field)_

**The problem it solves:** In 2–3 sentences: what goes wrong if a developer doesn't know this pattern? What mistake do they typically make, and what does that cost them?

**The pattern in the source:** A short code excerpt (10–25 lines) from the source project that shows the pattern at its clearest. If the source isn't Python or doesn't translate directly, show equivalent pseudocode. This is the "exhibit A" that justifies the recipe.

**How the recipe would be structured:** A brief outline — what the workflow does, what the key activity does, what tool or API is involved. Not full code, 5–10 bullet points.

**Closest existing recipe and what's different:** Name the most similar recipe already in the cookbook and state specifically what this adds or changes. If there's no close match, say so.

**Wishlist gap filled:** Which item from the coverage wishlist does this address, if any? If none, say so.

**Estimated size:** Rough line count for the finished recipe (all files combined). Flag anything over 400 lines as potentially too complex for a single recipe.

---

After the proposal cards, add an **Excluded patterns** section listing any patterns that were interesting but filtered out, with a one-line reason for each.

To generate a recipe from one of these proposals, use `/project:recipe-ify` and paste in the proposal card.
52 changes: 52 additions & 0 deletions agents/guardrails_hard_rules_python/README.md
@@ -0,0 +1,52 @@
<!--
description: Demonstrates a post-LLM guardrail layer that uses deterministic hard rules to override an LLM's verdict, ensuring policy-critical decisions can never be bypassed by hallucination or prompt injection.
tags: [agents, python, anthropic]
priority: 500
-->

# Guardrails: Hard Rules

This recipe shows how to combine an LLM classifier with a deterministic guardrail layer. The LLM provides nuanced judgment for ambiguous cases; hard rules act as a safety net for unambiguous policy violations, overriding the LLM's verdict regardless of what it concluded.

The pattern answers a real problem: LLMs can be manipulated via prompt injection or simply hallucinate. For any decision with real consequences — content moderation, access control, transaction approval — you shouldn't rely on the LLM alone. Hard rules catch clear-cut cases deterministically; the LLM handles everything in the grey zone. Critically, when a hard rule fires, the LLM's original reasoning is preserved inside the override so every decision remains auditable.

The recipe uses a content moderation scenario: user-submitted text is classified as `safe`, `review`, or `block`. Hard rules override to `block` when contact information or banned keywords are detected, regardless of what the LLM concluded.

## Prerequisites

- Python 3.10 or newer (the cookbook convention pins `>=3.10,<3.14`)
- [uv](https://docs.astral.sh/uv/)
- A running Temporal server: `temporal server start-dev`
- `ANTHROPIC_API_KEY` environment variable set

## Run it

```bash
uv sync

# Terminal 1 — start the worker
uv run python -m worker

# Terminal 2 — submit two example workflows
uv run python -m start_workflow
```

## Expected output

```
--- Example 1: Hard rule override ---
Input: 'Great product! Contact me at john.doe@example.com for a special deal.'
Classification: block
Overridden by hard rule: True
Reasoning: Hard rule: contains email address (privacy policy violation).

[LLM classified as 'safe' — reasoning: The message is promotional but does not appear harmful.]

--- Example 2: LLM verdict stands ---
Input: 'I really enjoyed the hiking trail last weekend. The views were amazing!'
Classification: safe
Overridden by hard rule: False
Reasoning: Positive personal experience with no policy concerns.
```

In Example 1, the LLM's classification and reasoning are preserved inside brackets — the override is fully auditable.
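Because the LLM's verdict is embedded in a fixed bracketed format, an audit script could recover it from the `Reasoning` text. The sketch below assumes the exact format shown in the expected output; the function `split_override` is hypothetical and not part of the recipe.

```python
import re
from typing import Optional

# \u2014 is the dash character that separates the classification from the
# reasoning in the bracketed note.
_OVERRIDE_NOTE = re.compile(
    r"\[LLM classified as '(?P<classification>[^']+)' \u2014 reasoning: (?P<reasoning>.*)\]\s*\Z",
    re.DOTALL,
)


def split_override(reasoning: str) -> tuple[str, Optional[dict]]:
    """Split a verdict's reasoning into (rule text, original LLM verdict or None)."""
    match = _OVERRIDE_NOTE.search(reasoning)
    if match is None:
        # No bracketed note: the LLM verdict stood and nothing was overridden.
        return reasoning, None

    rule_text = reasoning[: match.start()].strip()
    llm_part = {
        "classification": match["classification"],
        "reasoning": match["reasoning"],
    }
    return rule_text, llm_part
```

Parsing free text is brittle; if audit tooling matters, storing the LLM verdict in its own field would be a sturdier design than this string format.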
52 changes: 52 additions & 0 deletions agents/guardrails_hard_rules_python/activities/classify.py
@@ -0,0 +1,52 @@
import anthropic
from temporalio import activity
from temporalio.exceptions import ApplicationError
from pydantic import BaseModel

from models.signals import ContentSignals
from models.verdict import LLMVerdict, Verdict
from guardrails.hard_rules import apply_hard_rules

_SYSTEM = """You are a content moderation assistant. Classify the submitted text as:
- safe: acceptable content with no policy concerns
- review: borderline content that a human should check
- block: clear policy violation (hate speech, harassment, explicit content, obvious spam)

When uncertain, use 'review' — it's better to flag for human review than to miss a violation."""

_SUBMIT_VERDICT_TOOL = {
    "name": "submit_verdict",
    "description": "Submit your content moderation classification.",
    "input_schema": LLMVerdict.model_json_schema(),
}


class ClassifyRequest(BaseModel):
    signals: ContentSignals
    model: str = "claude-sonnet-4-6"


@activity.defn
async def classify(request: ClassifyRequest) -> Verdict:
    client = anthropic.AsyncAnthropic(max_retries=0)

    try:
        response = await client.messages.create(
            model=request.model,
            max_tokens=512,
            system=_SYSTEM,
            messages=[
                {"role": "user", "content": f"Classify this content:\n\n{request.signals.text}"}
            ],
            tools=[_SUBMIT_VERDICT_TOOL],
            tool_choice={"type": "tool", "name": "submit_verdict"},
        )
    except anthropic.AuthenticationError as exc:
        raise ApplicationError(str(exc), type="AuthenticationError", non_retryable=True) from exc
    except anthropic.BadRequestError as exc:
        raise ApplicationError(str(exc), type="BadRequestError", non_retryable=True) from exc

    tool_block = next(b for b in response.content if b.type == "tool_use")
    llm_verdict = Verdict.model_validate(tool_block.input)

    return apply_hard_rules(request.signals, llm_verdict)
62 changes: 62 additions & 0 deletions agents/guardrails_hard_rules_python/guardrails/hard_rules.py
@@ -0,0 +1,62 @@
import re
from models.signals import ContentSignals
from models.verdict import Verdict

_BANNED_KEYWORDS = ["buy now", "click here", "free money", "guaranteed winner"]


def _hard_block(signals: ContentSignals) -> Verdict | None:
    """Return a block Verdict if any hard rule matches, otherwise None."""
    text_lower = signals.text.lower()

    for keyword in _BANNED_KEYWORDS:
        if keyword in text_lower:
            return Verdict(
                classification="block",
                confidence=1.0,
                reasoning=f"Hard rule: contains banned keyword '{keyword}'.",
                overridden_by_hard_rule=True,
            )

    if re.search(r"\b\d{3}[-.()]?\d{3}[-.]?\d{4}\b", signals.text):
        return Verdict(
            classification="block",
            confidence=1.0,
            reasoning="Hard rule: contains phone number (privacy policy violation).",
            overridden_by_hard_rule=True,
        )

    if re.search(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", signals.text):
        return Verdict(
            classification="block",
            confidence=1.0,
            reasoning="Hard rule: contains email address (privacy policy violation).",
            overridden_by_hard_rule=True,
        )

    return None


def apply_hard_rules(signals: ContentSignals, llm_verdict: Verdict) -> Verdict:
    """Post-filter: override the LLM verdict if a hard rule matches.

    When a rule fires, the LLM's original reasoning is embedded in the
    returned verdict so the override is auditable.
    """
    if llm_verdict.classification == "block":
        return llm_verdict

    hard = _hard_block(signals)
    if hard is None:
        return llm_verdict

    return Verdict(
        classification=hard.classification,
        confidence=hard.confidence,
        overridden_by_hard_rule=True,
        reasoning=(
            f"{hard.reasoning}\n\n"
            f"[LLM classified as '{llm_verdict.classification}' — "
            f"reasoning: {llm_verdict.reasoning}]"
        ),
    )