Commit d1129c7

update tests/agents
1 parent 45075c4 commit d1129c7

1 file changed: +11 -35 lines changed

tests/AGENTS.md

Lines changed: 11 additions & 35 deletions
@@ -135,6 +135,17 @@ async def test_something(model: Model):
 ]
 ```

+## Best Practices
+
+- Test through public APIs, not private methods (prefixed with `_`) or helpers: this validates actual user-facing behavior and prevents brittle tests tied to implementation details
+- Prefer feature-centric parametrized test files (e.g. `test_multimodal_tool_returns.py`) over appending to monolithic `test_<provider>.py` files: the legacy per-provider files are large and hard for agents to navigate, so new features should get their own test file with a `Case` class and parametrized providers
+- Use `snapshot()` for complex structured outputs (objects, message sequences, API responses, nested dicts): it catches unexpected changes more reliably than field-by-field assertions; use `IsStr` and similar matchers for variable values
+- Assert the core aspect of the change being introduced, using whatever means necessary: patching clients to inspect request payloads, tapping into pydantic-ai internals, snapshot comparisons. Snapshots are valuable for catching structural drift in objects and message arrays, but only use `result.all_messages()` or output assertions when the structure demonstrates behavior you care about keeping consistent
+- Test both positive and negative cases for optional capabilities (model features, server features, streaming): this ensures features work when supported AND fail gracefully when absent
+- Ensure test assertions match test names and docstrings: tests without proper assertions or that verify opposite behavior create false positives
+- Test MCP against a real `tests.mcp_server` instance, not mocks: extend the test server with helper tools to expose runtime context (instructions, client info, session state)
+- Remove stale test docstrings, comments, and historical provider bug notes when behavior changes
+
 ## Directory Structure

 ```
@@ -154,38 +165,3 @@ tests/
 │ └── test_*.py # provider initialization tests (unit)
 └── test_*.py # feature tests (prefer VCR + parametrize)
 ```
-<!-- braindump: rules extracted from PR review patterns -->
-
-# tests/ Guidelines
-
-## Testing
-
-<!-- rule:177 -->
-- Test through public APIs, not private methods (prefixed with `_`) or helpers: prevents brittle tests tied to implementation details, reduces maintenance burden when refactoring internals, and validates actual user-facing behavior rather than isolated units
-<!-- rule:173 -->
-- Maintain 1:1 correspondence between test files and source modules (`test_{module}.py`); consolidate related tests instead of splitting by feature, config, or test type. Prevents test suite fragmentation and makes tests easier to locate by matching source structure; use fixtures/markers to distinguish test types within the file
-<!-- rule:86 -->
-- Use `snapshot()` for complex structured outputs (objects, message sequences, API responses, nested dicts, span attributes): prevents brittle field-by-field assertions and improves test maintainability. Snapshot testing catches unexpected changes in complex structures more reliably than manual assertions, and `IsStr` matchers handle variable values gracefully
-<!-- rule:318 -->
-- Use `pytest-vcr` cassettes (not mocks) in `tests/models/`: records real HTTP interactions for deterministic replay and captures both success and error cases. Ensures integration tests validate real API behavior without live calls on every run, making tests faster and preventing flakiness from network issues or rate limits
-<!-- rule:334 -->
-- Assert meaningful behavior in tests, not just code execution or type checks: validates correctness and data flow. Prevents false confidence from tests that pass without verifying actual functionality works as intended
-<!-- rule:194 -->
-- In agent/model/stream tests, assert on final output AND snapshot `result.all_messages()`: validates complete execution trace, not just end result. Catches regressions in tool calls, intermediate steps, and message flow that final output assertions miss
-<!-- rule:363 -->
-- Test through real APIs, not mocks; mock only slow/external dependencies outside your control. Improves refactoring safety, documents real usage patterns, and catches integration issues. Use lightweight local infrastructure (test servers, in-memory DBs) for systems you control (provider APIs, Temporal workflows, frameworks) in files like `test_{provider}.py`; reserve mocks for third-party HTTP APIs and unreliable external services
-<!-- rule:11 -->
-- Parametrize tests across all providers that support the feature (or at minimum OpenAI, Anthropic, Google): catches provider-specific regressions and ensures cross-provider compatibility. Prevents breaking unchanged providers when modifying shared model logic, and surfaces integration issues across different provider APIs before they reach production
-<!-- rule:385 -->
-- Ensure test assertions match test names and docstrings: prevents false confidence in test coverage and catches actual regressions. Tests without proper assertions or that verify opposite behavior create false positives and fail to catch bugs they claim to prevent.
-<!-- rule:89 -->
-- Test both positive and negative cases for optional capabilities (model features, server features, streaming): ensures features work when supported AND fail gracefully when absent. Prevents false confidence from tests that only check unsupported cases, catching both implementation bugs and missing error handling
-<!-- rule:630 -->
-- Test MCP against real `tests.mcp_server` instance, not mocks: extend test server with helper tools to expose runtime context (instructions, client info, session state). Verifies actual data flow and integration behavior rather than just testing mock interfaces, catching real-world issues that mocks would miss
-
-## General
-
-<!-- rule:463 -->
-- Remove stale test docstrings, comments, and historical provider bug notes when behavior changes. Outdated test documentation misleads developers about what's actually being tested and why
-
-<!-- /braindump -->
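
The added guidance on a `Case` class with parametrized providers, combined with testing both positive and negative cases, could look roughly like the sketch below. Everything in it is hypothetical and not taken from the repository: `Case`, `CASES`, `describe_image`, `UnsupportedFeatureError`, and the provider names are illustrative stand-ins, and a plain loop replaces pytest parametrization so the snippet has no dependencies.

```python
from dataclasses import dataclass


class UnsupportedFeatureError(Exception):
    """Hypothetical error raised when a provider lacks a capability."""


@dataclass(frozen=True)
class Case:
    """One parametrized scenario: a provider and whether it supports the feature."""

    provider: str
    supports_images: bool


# Hypothetical provider matrix; a real test file would list actual providers.
CASES = [
    Case(provider="provider_a", supports_images=True),
    Case(provider="provider_b", supports_images=False),
]


def describe_image(case: Case, image_bytes: bytes) -> str:
    """Hypothetical public API under test (a stand-in for an agent run)."""
    if not case.supports_images:
        raise UnsupportedFeatureError(f"{case.provider} does not accept images")
    return f"{case.provider}: image of {len(image_bytes)} bytes"


def check_case(case: Case) -> str:
    """Assert the positive path when supported, and the graceful failure otherwise."""
    if case.supports_images:
        assert describe_image(case, b"\x89PNG") == f"{case.provider}: image of 4 bytes"
        return "supported"
    try:
        describe_image(case, b"\x89PNG")
    except UnsupportedFeatureError as exc:
        # The negative case should fail with a clear, provider-specific message.
        assert case.provider in str(exc)
        return "rejected"
    raise AssertionError("expected UnsupportedFeatureError")


# Run every case through the same public entry point.
results = {case.provider: check_case(case) for case in CASES}
```

In a pytest suite the loop would typically become `@pytest.mark.parametrize('case', CASES, ids=lambda c: c.provider)` on a single test function, so each provider reports as its own test.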

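
To illustrate why whole-structure comparison with matchers is preferred over field-by-field assertions, here is a dependency-free sketch. It does not use the real `snapshot()` or `IsStr`; `AnyStr` is a hypothetical stand-in showing how such a matcher can work, and `run_tool_call` with its message shape is an assumed example, not a pydantic-ai API.

```python
import uuid


class AnyStr:
    """Hypothetical stand-in for an IsStr-style matcher: equal to any string."""

    def __eq__(self, other: object) -> bool:
        return isinstance(other, str)

    def __repr__(self) -> str:
        return "AnyStr()"


def run_tool_call(city: str) -> list[dict[str, object]]:
    """Hypothetical public entry point that returns a message trace."""
    call_id = uuid.uuid4().hex  # changes on every run
    return [
        {"kind": "tool-call", "tool": "get_weather", "args": {"city": city}, "id": call_id},
        {"kind": "tool-return", "content": "sunny", "id": call_id},
    ]


messages = run_tool_call("Paris")

# One structural comparison covers every field; only the variable id is relaxed.
expected = [
    {"kind": "tool-call", "tool": "get_weather", "args": {"city": "Paris"}, "id": AnyStr()},
    {"kind": "tool-return", "content": "sunny", "id": AnyStr()},
]
matches_expected = messages == expected

# Structural drift (here, a renamed tool) makes the same comparison fail.
drifted = [dict(expected[0], tool="get_forecast"), expected[1]]
detects_drift = messages != drifted
```

The real snapshot tooling additionally records and updates the expected value for you; this sketch only shows the comparison semantics.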