name	tdd-golden-examples
description	Testing methodology skill. Use when defining and validating any function in a governed architecture — especially code+LLM combinations. Derives testable golden examples from step specifications, applies the Red-Green-Refactor cycle adapted for semantic outputs, and ensures every function is independently verifiable before integration. Product, architecture, and technology agnostic.

TDD with Golden Examples

Test-Driven Development adapted for governed AI applications. Every function — whether pure code, code+LLM, or sub-agent — gets a test before it gets an implementation. For LLM-involving functions, golden examples replace exact assertions with evaluation criteria that validate output quality without requiring deterministic string matching.

The principle: if you can't write the test, you haven't defined the function.

How This Skill Works

1. Start from Step Specifications

This skill consumes the artifacts produced by the Value Chain Mapping skill. Each step specification already defines:

Intention — what the step achieves
Input — what it receives
Output — what it produces
Completion criteria — what "done" looks like

These four elements are everything you need to write a test.

2. Derive Golden Examples Before Writing Code

For each function linked to a step:

Take a real or realistic input (from the step specification or stakeholder interviews)
Define what must be true about the output (from completion criteria)
Write the evaluation criteria (how to judge pass/fail)
Include edge cases (missing data, ambiguous input, boundary conditions)

This is the test. Write it first. It will fail — there's no implementation yet. That's the point.
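As a rough illustration, a golden example can be written down as plain data before any implementation exists. This is a minimal sketch in Python; `Criterion`, `GoldenExample`, and the CV-matching input are hypothetical names chosen for illustration, not something this skill prescribes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Criterion:
    """One pass/fail check on a function's output (hypothetical structure)."""
    name: str
    kind: str                              # e.g. "completeness", "structure"
    check: Callable[[Any, Any], bool]      # (input, output) -> met or not met
    required: bool = True


@dataclass
class GoldenExample:
    """A concrete input plus the criteria its output must satisfy."""
    name: str
    input: Any
    criteria: list[Criterion] = field(default_factory=list)


# Assumed example: a step that matches CV text against job requirements.
example = GoldenExample(
    name="complete input, clear match",
    input={"requirements": ["Python", "SQL"], "cv": "Five years of Python and SQL."},
    criteria=[
        Criterion(
            name="every requirement addressed",
            kind="completeness",
            # The output is expected to map each requirement to its evidence.
            check=lambda inp, out: set(inp["requirements"]) <= set(out["evidence"]),
        ),
    ],
)
```

Nothing here calls the function under test. The example only records the input and what must be true about the output.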

3. Red-Green-Refactor Cycle (Adapted)

The classic TDD cycle, adapted for functions that may include LLM calls:

Phase	Traditional TDD	Governed Architecture TDD
Red	Write a test that fails	Write golden examples with evaluation criteria — no implementation exists yet
Green	Write the minimum code to pass	Write the code + prompt + skill config that produces output meeting the criteria
Refactor	Clean up without breaking tests	Refine the prompt, adjust the skill config, optimise the code — golden examples must still pass

Key difference: In traditional TDD, "Green" means the assertion passes exactly. In governed architecture TDD, "Green" means the evaluation criteria are satisfied — the output has the right structure, contains the required elements, and is faithful to the input.
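The Red and Green phases can be sketched as follows. This is an assumption-laden illustration: `summarise`, `summarise_v1`, and `is_green` are hypothetical names, and a deterministic stand-in replaces the real LLM call so the snippet runs on its own.

```python
def summarise(text: str) -> dict:
    """Red phase: the function is declared but not implemented yet."""
    raise NotImplementedError


def is_green(func, golden_input, criteria) -> bool:
    """Green means every criterion holds; a missing implementation is Red."""
    try:
        output = func(golden_input)
    except NotImplementedError:
        return False
    return all(check(golden_input, output) for check in criteria)


# Criteria describe the output instead of pinning its exact text.
criteria = [
    lambda inp, out: isinstance(out, dict) and "summary" in out,   # structure
    lambda inp, out: len(out["summary"]) <= len(inp),              # validity
]

print(is_green(summarise, "A long source paragraph.", criteria))     # False (Red)


def summarise_v1(text: str) -> dict:
    """Green phase stand-in: a real version would call a model here."""
    return {"summary": text[:10]}


print(is_green(summarise_v1, "A long source paragraph.", criteria))  # True (Green)
```

During Refactor, the same `criteria` list is re-run unchanged after each prompt or code adjustment.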

4. Run, Evaluate, Iterate

For pure code functions: run the function, assert the output. Standard.

For code+LLM functions:

Run the function with the golden example input
Capture the output
Evaluate against the criteria (automated where possible, human review where necessary)
If criteria met → Green. If not → adjust code, prompt, or skill config → re-run.

For sub-agent functions: run each sub-agent independently with its golden input, evaluate each output separately, then test the merge step.
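The run, capture, evaluate loop for a code+LLM function might look roughly like this. `evaluate`, `extract_skills`, and the criteria names are hypothetical; the stand-in function uses plain string matching so the sketch runs without a model.

```python
from typing import Any, Callable

Check = Callable[[Any, Any], bool]


def evaluate(func, golden_input, criteria: dict[str, tuple[Check, bool]]) -> dict:
    """Run func once, judge each criterion, and report Green or Red."""
    output = func(golden_input)
    results = {name: check(golden_input, output) for name, (check, _) in criteria.items()}
    failed_required = [
        name for name, (_, required) in criteria.items()
        if required and not results[name]
    ]
    return {
        "output": output,                 # captured so a human can review it
        "results": results,
        "green": not failed_required,     # optional criteria never block Green
        "failed_required": failed_required,
    }


def extract_skills(cv_text: str) -> dict:
    """Stand-in for a code+LLM function; a real one would call a model."""
    return {"skills": [s for s in ("Python", "SQL") if s in cv_text]}


criteria = {
    "has skills list": (lambda inp, out: isinstance(out.get("skills"), list), True),
    "skills appear in input": (lambda inp, out: all(s in inp for s in out["skills"]), True),
    "at least two skills": (lambda inp, out: len(out["skills"]) >= 2, False),
}

report = evaluate(extract_skills, "Built ETL jobs in Python.", criteria)
print(report["green"], report["failed_required"])
```

When `green` is false, `failed_required` names the criteria to address before re-running.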

Golden Example Structure

One golden example set per function. Each set contains multiple examples covering the happy path, edge cases, and failure modes.

Example Entry

# Golden Example: [Function Name]
# Step: [Step # from value chain map]
# Function Type: [pure code | code+LLM | sub-agent]
# Version: [N] | Last validated: [date]

## Example 1: [descriptive name — e.g., "complete input, clear match"]

### Input
[The concrete input data — real or realistic, sourced from stakeholder interviews or domain knowledge]

### Expected Output Characteristics
- [What must be TRUE about the output]
- [What must be PRESENT]
- [What must NOT be present]
- [Structural requirements — format, fields, completeness]

### Evaluation Criteria
| Criterion | Type | Required |
|-----------|------|----------|
| [e.g., "Every requirement addressed"] | completeness | yes |
| [e.g., "Evidence quoted from source, not fabricated"] | faithfulness | yes |
| [e.g., "Output is valid JSON matching schema"] | structure | yes |
| [e.g., "Confidence scores within 0-1 range"] | validity | yes |
| [e.g., "Processing completes within 10 seconds"] | performance | no |

### Edge Cases
| Variant | Input Modification | Expected Behaviour |
|---------|-------------------|-------------------|
| Missing data | [remove a key field] | [graceful handling — report gap, don't fabricate] |
| Ambiguous input | [add conflicting information] | [flag ambiguity, don't silently pick one] |
| Empty input | [provide nothing] | [clear error or "insufficient data" response] |
| Oversized input | [exceed expected volume] | [process or truncate with notice — don't crash] |
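One way to keep edge cases tied to the happy-path input is to derive them from it programmatically. The sketch below is hypothetical (`without_field`, `handle`, and the field names are assumptions) and covers three of the four variants in the table; the ambiguous-input variant is left out because what counts as conflicting information is domain-specific.

```python
import copy

base_input = {"requirements": ["Python", "SQL"], "cv": "Five years of Python."}


def without_field(data: dict, key: str) -> dict:
    """Return a copy of the input with one field removed."""
    variant = copy.deepcopy(data)
    variant.pop(key, None)
    return variant


edge_cases = {
    "missing data": without_field(base_input, "cv"),
    "empty input": {},
    "oversized input": {**base_input, "cv": base_input["cv"] * 10_000},
}


def handle(data: dict) -> dict:
    """Hypothetical function under test: reports gaps instead of fabricating."""
    missing = [key for key in ("requirements", "cv") if not data.get(key)]
    if missing:
        return {"status": "insufficient data", "missing": missing}
    return {"status": "ok"}


for name, variant in edge_cases.items():
    print(name, "->", handle(variant)["status"])
```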

Evaluation Criteria Types

When testing LLM-involving functions, evaluation criteria fall into these categories. Use the minimum set needed — not every function needs all types.

Type	What It Checks	How to Evaluate
Completeness	All required elements present in output	Checklist: is each expected element present?
Faithfulness	Output reflects the input accurately — no fabrication	Compare output claims against input source material
Structure	Output matches the expected format/schema	Schema validation (automated)
Validity	Values are within expected ranges and types	Range checks, type checks (automated)
Relevance	Output addresses the step's intention, no tangents	Human review or secondary LLM evaluation
Consistency	Same input produces compatible outputs across runs	Run N times, compare structural similarity
Absence	Forbidden content is NOT in the output	Negative checks: no hallucination, no PII leakage, no off-topic content

✅ "Every requirement from the job description has a corresponding evidence entry, and each evidence entry quotes text found in the CV" (completeness + faithfulness, verifiable).

❌ "The output should be good and accurate" — vague, untestable. If you can't define what "good" means in concrete criteria, you haven't defined the function.
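Several of the criteria types above can be automated with small checks. The functions below are a sketch under stated assumptions: the names are hypothetical, faithfulness is approximated by verbatim substring matching, and absence is approximated by a simple email pattern standing in for PII detection. Real projects may need stronger checks for both.

```python
import json
import re


def check_structure(raw: str, required_keys: set[str]) -> bool:
    """Structure: output parses as a JSON object with the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def check_validity(scores: list[float]) -> bool:
    """Validity: every confidence score lies in the 0-1 range."""
    return all(0.0 <= score <= 1.0 for score in scores)


def check_completeness(requirements: list[str], evidence: dict[str, str]) -> bool:
    """Completeness: each requirement has an evidence entry."""
    return all(req in evidence for req in requirements)


def check_faithfulness(evidence: dict[str, str], source: str) -> bool:
    """Faithfulness: each quoted snippet appears verbatim in the source."""
    return all(quote in source for quote in evidence.values())


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def check_absence(text: str) -> bool:
    """Absence: no email address appears in the output."""
    return EMAIL.search(text) is None


source_cv = "Five years of Python. Daily SQL reporting."
raw_output = json.dumps({
    "evidence": {"Python": "Five years of Python", "SQL": "Daily SQL reporting"},
    "confidence": [0.9, 0.8],
})
data = json.loads(raw_output)

print(check_structure(raw_output, {"evidence", "confidence"}))
print(check_validity(data["confidence"]))
print(check_completeness(["Python", "SQL"], data["evidence"]))
print(check_faithfulness(data["evidence"], source_cv))
print(check_absence(raw_output))
```

Each check returns a plain boolean, which keeps every criterion binary: met or not met.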

Artifacts

1. Golden Example Set

One per function. Contains all test cases for that function. Format as shown above.

# Golden Example Set: [Function Name]
# Step: [#] | Type: [pure code | code+LLM | sub-agent]
# Examples: [count] | Edge cases: [count] | Last validated: [date]

## Example 1: [happy path]
[input, expected output characteristics, evaluation criteria]

## Example 2: [edge case — missing data]
[input variant, expected behaviour]

## Example 3: [edge case — ambiguous input]
[input variant, expected behaviour]

## Example N: [failure mode]
[input that should produce a graceful error or explicit "cannot process"]

2. Test Specification

One per step. Links the step to its golden examples and defines how to run the tests.

# Test Specification: [Step Name]
# Step: [#] | Function: [name] | Type: [pure code | code+LLM | sub-agent]
# Version: [N]

## What Is Being Tested
[One sentence — the function's intention from the step specification]

## Golden Examples
[Link or reference to the golden example set]

## Pass Criteria
- All "required" evaluation criteria must be met
- [Threshold, if applicable — e.g., "8 of 10 requirements matched"]
- Edge cases must produce graceful handling, not crashes or fabrication

## How to Run
- [Isolated — no dependency on other steps running first]
- [Input: provided directly from golden example, not from live data]
- [Output: captured and stored for evaluation]

## Dependencies
- [Skill config required: name and version]
- [Prompt template required: reference]
- [Test data: source of golden example inputs]
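The pass criteria in a test specification reduce to a single verdict. A minimal sketch, assuming a hypothetical `RunResult` record and a threshold expressed as a minimum match count:

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of running one golden example set (hypothetical shape)."""
    required_met: dict[str, bool]   # required criterion name -> met?
    matched: int                    # e.g. requirements matched
    total: int                      # e.g. requirements in the input
    edge_cases_graceful: bool       # no crashes, no fabrication


def spec_passes(result: RunResult, min_matched: int) -> bool:
    """All required criteria met, threshold reached, edge cases handled."""
    return (
        all(result.required_met.values())
        and result.matched >= min_matched
        and result.edge_cases_graceful
    )


result = RunResult(
    required_met={"completeness": True, "faithfulness": True, "structure": True},
    matched=8,
    total=10,
    edge_cases_graceful=True,
)
print(spec_passes(result, min_matched=8))  # True
```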

3. Evaluation Rubric

For code+LLM functions where automated evaluation isn't sufficient. Defines how a human reviewer, or a secondary LLM evaluation, judges the output.

# Evaluation Rubric: [Function Name]
# Version: [N]

| Criterion | Weight | Pass | Fail |
|-----------|--------|------|------|
| [criterion] | [required/optional] | [what pass looks like] | [what fail looks like] |

The Testing Hierarchy

Different function types need different testing approaches. Match the approach to the type.

Function Type	Test Approach	Evaluation	Deterministic?
Pure code	Standard unit tests — exact input/output assertions	Automated — assertEqual, schema validation	Yes
Code + LLM	Golden examples with evaluation criteria	Criteria-based — completeness, faithfulness, structure	No — criteria are deterministic, output varies
Sub-agents	Test each agent independently, then test the merge	Per-agent evaluation + merge correctness	Per-agent: No. Merge: Yes
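The "criteria are deterministic, output varies" distinction can be made concrete with a consistency check that compares the shape of several runs and ignores their wording. This is a sketch: `shape`, `structurally_consistent`, and `noisy_summary` are hypothetical, and random choices stand in for LLM variability.

```python
import random


def shape(value):
    """Reduce a value to its structure: keys and value types, not content."""
    if isinstance(value, dict):
        return {key: shape(val) for key, val in sorted(value.items())}
    if isinstance(value, list):
        # Only the first element is inspected; lists are assumed homogeneous.
        return [shape(value[0])] if value else []
    return type(value).__name__


def structurally_consistent(func, golden_input, runs: int = 5) -> bool:
    """Run func several times and require the same output shape every time."""
    shapes = [shape(func(golden_input)) for _ in range(runs)]
    return all(s == shapes[0] for s in shapes)


def noisy_summary(text: str) -> dict:
    """Stand-in for an LLM call: wording varies, structure does not."""
    opener = random.choice(["In short,", "Briefly,", "Summary:"])
    return {"summary": f"{opener} {text}", "confidence": random.random()}


print(structurally_consistent(noisy_summary, "Quarterly revenue rose."))  # True
```

An exact-equality assertion on `noisy_summary` would fail intermittently; the shape comparison does not.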

Testing Sub-Agents

When a step uses parallel sub-agents (each with its own skill and dataset):

Test each sub-agent in isolation — its own golden example set, its own evaluation criteria
Test the merge function separately — given known sub-agent outputs, does the merge produce the correct combined result?
Never test sub-agents through the merge — if the merged output is wrong, you can't tell which agent failed

✅ Three sub-agents tested independently, merge tested with known inputs. Any failure points to exactly one component.

❌ One end-to-end test that runs all sub-agents and checks the final output. Failure gives you no information about which component broke.
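Testing the merge with known inputs might look like the sketch below. The `merge` function, the agent names, and the output shape are assumptions made for illustration. Because the merge is pure code, exact assertions are appropriate here.

```python
def merge(agent_outputs: dict[str, dict]) -> dict:
    """Hypothetical merge step: pure code combining per-agent findings."""
    findings = []
    for agent, output in sorted(agent_outputs.items()):
        for item in output.get("findings", []):
            findings.append({"agent": agent, "finding": item})
    return {"findings": findings, "agents_reporting": sorted(agent_outputs)}


# Known sub-agent outputs, written by hand; no sub-agent is run.
known_outputs = {
    "skills_agent": {"findings": ["Python"]},
    "education_agent": {"findings": ["BSc"]},
    "gaps_agent": {"findings": []},
}

merged = merge(known_outputs)

assert merged["agents_reporting"] == ["education_agent", "gaps_agent", "skills_agent"]
assert merged["findings"] == [
    {"agent": "education_agent", "finding": "BSc"},
    {"agent": "skills_agent", "finding": "Python"},
]
```

If these assertions fail, the fault lies in the merge, since the sub-agent outputs were fixed in advance.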

Anti-Patterns

Mistake	Why It Fails	Fix
Writing code before golden examples	You don't know what "correct" looks like yet — you'll test against your assumptions, not against requirements	Golden examples first — derived from step specifications, not from implementation
Exact string matching on LLM output	LLM output varies across runs — tests become flaky and meaningless	Use evaluation criteria: structure, completeness, faithfulness — not exact text
Testing multiple steps together	When it fails, you can't isolate the cause — which step broke?	One golden example set per function, tested independently
Golden examples without edge cases	Happy path works, but missing data crashes the function in production	Every golden example set includes: missing data, ambiguous input, empty input
Vague evaluation criteria	"Output should be good" is untestable and undebuggable	Every criterion must be binary: met or not met. If you can't judge it, it's not a criterion
Testing only once, then never again	Prompts, skill configs, and models change, so an old pass says nothing about current behaviour	Run golden examples after every change to code, prompt, skill config, or model version
Inventing golden example inputs without domain grounding	Made-up inputs don't expose real-world edge cases	Use real inputs, or realistic ones grounded in stakeholder interviews and value chain mapping
Skipping sub-agent isolation	Merged output hides individual failures	Test each sub-agent independently, test merge separately

Validation Checklist

Before considering a function tested:

Artifact Governance

When creating artifacts from this skill:

Use the templates in .claude/skills/artifact-template.md — copy the relevant section (golden-example-set, test-spec, or eval-rubric)
Fill in the YAML frontmatter with project, version, status, and source
Register the artifact in DESIGN.md (project root) — add a row to the Artifact Registry table

Referenced Skills

Skill	Relationship
Value Chain Mapping	Upstream — provides the step specifications that golden examples are derived from
Data Evolution	Peer — fields that are transformed need golden examples for the transformation
Governed Architecture	Peer — defines the function types (pure code, code+LLM, sub-agents) and their contracts
Configure Claude Code	Foundation — the development environment where tests are implemented and run

Write the test first. If you can't test it, you haven't defined it. Golden examples are the contract between intention and implementation.
