Skip to content

Add experimental compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename CompactionSummarizingCompaction#191

Merged
dsfaccini merged 8 commits into
mainfrom
capability/compaction
Jun 5, 2026
Merged

Add experimental compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename CompactionSummarizingCompaction#191
dsfaccini merged 8 commits into
mainfrom
capability/compaction

Conversation

@DouweM

@DouweM DouweM commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Adds a menu of context-compaction capabilities, offered as experimental under a new pydantic_ai_harness.experimental namespace.

Experimental gating (new convention)

There was no experimental convention in the harness yet, so this establishes one, mirroring pydantic (which has pydantic.experimental + PydanticExperimentalWarning):

  • The package physically lives at pydantic_ai_harness/experimental/compaction/ — the import path itself signals experimental, and there is no top-level export.
  • Importing a capability emits a HarnessExperimentalWarning. The message hands the user a single, category-wide silence filter — one line mutes every experimental capability, not one per capability:
import warnings
from pydantic_ai_harness.experimental import HarnessExperimentalWarning

warnings.filterwarnings('ignore', category=HarnessExperimentalWarning)

from pydantic_ai_harness.experimental.compaction import TieredCompaction

Importing the experimental package on its own does not warn, so the warning class can be imported to silence first.

Capabilities

Class Cost What it does
SlidingWindow zero-LLM Drop oldest whole messages down to a tail
ClearToolResults zero-LLM Blank old tool results in place, keep last keep_pairs (Anthropic clear_tool_uses)
DeduplicateFileReads zero-LLM Keep only the latest read per file (via a required file_key seam)
SummarizingCompaction one LLM call Structured-section summary of older messages (renamed from Compaction)
TieredCompaction escalates Cheap passes first; summarize only if still over target_tokens
LimitWarner zero-LLM Inject URGENT/CRITICAL warnings as limits approach

Shared CompactionStrategy.compact(messages, ctx) seam (exported) lets TieredCompaction drive tiers directly and re-measure between them — true escalation. All strategies preserve tool-call/return pairing.

Layout

One module per capability under experimental/compaction/, with cross-capability utilities (token estimation, the CompactionStrategy protocol, tool-pair-safe cutoffs, in-place clearing) in _shared.py.

Notes

  • SummarizingCompaction threads ctx.usage into the summary call (tokens and the request fold into the run — honest, consistent, and a runaway summary can't evade a request limit).
  • Per-result truncation was intentionally dropped in favour of a future overflow-to-file capability.

Quality

100% branch coverage; ruff + pyright strict clean; tests use TestModel/mocks only. Capability README.md carries an experimental banner with the import path and silence snippet.

🤖 Generated with Claude Code

DouweM and others added 3 commits April 2, 2026 05:28
Implements three compaction-related capabilities for managing conversation
context in long-running agents:

- SlidingWindow: zero-cost message trimming that preserves tool-call pairs
- LimitWarner: injects warnings when approaching iteration/token limits
- Compaction: LLM-powered summarization of older messages

All three use the before_model_request hook to modify request_context.messages
transparently. The safe cutoff logic ensures tool-call / tool-return pairs are
never orphaned, preventing HTTP 400 errors from LLM providers.

Closes #21

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add explicit `set[str]` type annotations and replace unnecessary
`isinstance` checks with plain `else` branches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Implements three improvements from the audit findings on PR #140:

- Optional `tokenizer: Callable[[str], int] | None` parameter on SlidingWindow,
  Compaction, estimate_token_count, and _find_token_cutoff. When provided,
  enables accurate token counting; the 4-chars/token heuristic stays as fallback.

- `preserve_first_user_message: bool = True` on SlidingWindow and Compaction.
  When True, the first ModelRequest containing a UserPromptPart is always
  retained after trimming/compaction, preserving the original task context.

- `incremental: bool = True` on Compaction. When True and a prior compaction
  summary exists in the message history, it is included in the summarization
  prompt via a <previous_summary> tag so the LLM extends it rather than
  regenerating from scratch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@DouweM

DouweM commented Apr 10, 2026

Copy link
Copy Markdown
Contributor Author

Originally posted by @DouweM in #140 comment (PR was recreated)

Note: This PR implements client-side compaction (LLM summarization + sliding window). Provider-side compaction (OpenAI/Anthropic) additionally requires the core primitive in #141 (CompactionPart message type + compact_messages on Model).

@DouweM

DouweM commented Apr 10, 2026

Copy link
Copy Markdown
Contributor Author

Originally posted by @DouweM in #140 comment (PR was recreated)

Audit vs prior art: Compaction

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment on lines +726 to +734
system_parts = _extract_system_prompts(messages)
to_summarize = messages[:cutoff]
preserved = messages[cutoff:]

previous_summary = _extract_previous_summary(messages) if self.incremental else None
summary = await self._summarize(to_summarize, previous_summary=previous_summary)

summary_part = SystemPromptPart(content=f'{_SUMMARY_PREFIX}{summary}')
summary_message = ModelRequest(parts=[*system_parts, summary_part])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Old compaction summaries accumulate as SystemPromptParts across multiple compaction cycles

After the first compaction, the summary message contains [SystemPromptPart('original sys prompt'), SystemPromptPart('Summary of previous conversation:\n\n...')]. When a second compaction triggers, _extract_system_prompts(messages) at line 726 extracts ALL leading SystemPromptParts from this message — including the old summary part (since it's also a SystemPromptPart). The old summary is then re-included in the new summary message at line 734 alongside the new summary. After N compactions, the summary message contains N stale summary parts plus the new one, growing the context unboundedly and defeating the purpose of compaction.

Trace through two compaction cycles

After first compaction, result.messages[0] = ModelRequest(parts=[SystemPromptPart('sys'), SystemPromptPart('Summary of previous conversation:\n\nfirst summary')]).

When second compaction triggers, _extract_system_prompts (src/pydantic_harness/compaction.py:594-605) sees both parts are SystemPromptPart, extracts both. Then line 734 creates ModelRequest(parts=[SystemPromptPart('sys'), SystemPromptPart('...first summary'), SystemPromptPart('...second summary')]). The old summary is never removed.

Suggested change
system_parts = _extract_system_prompts(messages)
to_summarize = messages[:cutoff]
preserved = messages[cutoff:]
previous_summary = _extract_previous_summary(messages) if self.incremental else None
summary = await self._summarize(to_summarize, previous_summary=previous_summary)
summary_part = SystemPromptPart(content=f'{_SUMMARY_PREFIX}{summary}')
summary_message = ModelRequest(parts=[*system_parts, summary_part])
system_parts = [
p for p in _extract_system_prompts(messages)
if not p.content.startswith(_SUMMARY_PREFIX)
]
to_summarize = messages[:cutoff]
preserved = messages[cutoff:]
previous_summary = _extract_previous_summary(messages) if self.incremental else None
summary = await self._summarize(to_summarize, previous_summary=previous_summary)
summary_part = SystemPromptPart(content=f'{_SUMMARY_PREFIX}{summary}')
summary_message = ModelRequest(parts=[*system_parts, summary_part])
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread pyproject.toml Outdated

[tool.coverage.report]
fail_under = 100
fail_under = 98

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Coverage threshold lowered from 100% to 98%

The fail_under threshold in pyproject.toml:96 was reduced from 100 to 98, with the commit noting 'due to branch coverage of elif chains'. This permanently lowers the bar for the entire project. Consider using # pragma: no branch on specific elif chains instead of lowering the global threshold.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

# Token estimation
# ---------------------------------------------------------------------------

_CHARS_PER_TOKEN = 4

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will underestimate for Anthropic models unfortunately. It works pretty well for OpenAI ones. I settled on 2.5 in Code Puppy to give a lot of slack (to avoid the errors in Vertex).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this, I suggest we make it configurable somehow? Perhaps an environment var (yucky, but there should be a way to override it for power users)

segments.append(str(part.content))
else:
for part in msg.parts:
if isinstance(part, TextPart):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't want to include ThinkingPart?

# Safe cutoff logic — preserves tool-call / tool-return pairs
# ---------------------------------------------------------------------------

_TOOL_PAIR_SEARCH_RANGE = 5

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding this correctly, it could fail if your model performs any number > 5 parallel tool calls.

"""Number of tail messages to preserve after compaction (message-count trigger)."""

keep_tokens: int | None = None
"""Target token budget to preserve after compaction (token-count trigger).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this <3 - I used this strategy in Code Puppy and the agent keeps coherence very nicely. It can get expensive though.

@mpfaffenberger mpfaffenberger left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Hope they're helpful.

Comment thread src/pydantic_harness/compaction.py Outdated
Comment on lines +647 to +648
model: str
"""Model to use for generating summaries (e.g. ``'openai:gpt-4o-mini'``)."""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should likely include KnownModelName and Model, and use infer_model under the hood.

@DouweM DouweM added this to the 2026-05 milestone Apr 23, 2026
@dsfaccini dsfaccini changed the title Add compaction capabilities: SlidingWindow, LimitWarner, Compaction Add compaction capabilities: SlidingWindow, LimitWarner, Compaction Jun 2, 2026
…Compaction → SummarizingCompaction

Extend the compaction menu with the SOTA-aligned strategies missing from the initial
cut, and align naming with the "compaction" umbrella term.

- ClearToolResults: zero-LLM in-place clearing of old tool results (Anthropic
  clear_tool_uses); keep_pairs, exclude_tools, clear_tool_inputs (JSON-valid args),
  min_clear_tokens to protect the prompt cache.
- DeduplicateFileReads: zero-LLM; keep only the latest read per file via a required
  file_key seam.
- TieredCompaction: escalation orchestrator — cheap passes first, summarize only if
  still over target_tokens. CompactionStrategy protocol exported for custom tiers.
- Rename Compaction → SummarizingCompaction; structured-section summary prompt; model
  now optional (inherits the run's model); summary-call usage folded into the parent run.

All strategies preserve tool-call/return pairing. compaction.py at 100% branch
coverage; pyright strict + ruff clean. Adds capability README (compaction.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dsfaccini dsfaccini changed the title Add compaction capabilities: SlidingWindow, LimitWarner, Compaction Add compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename Compaction → SummarizingCompaction Jun 3, 2026
Resolve the repo restructure (package renamed pydantic_harness → pydantic_ai_harness,
moved to per-capability subpackages). Migrate compaction into the new layout:

- pydantic_ai_harness/compaction/{__init__,_capability}.py + README.md
- root __init__.py exposes the 6 compaction capabilities via lazy __getattr__
- tests moved to tests/compaction/; imports split public vs ._capability
- drop now-unused noqa (main's ruff config does not enforce D102/D105)
- widen _call_args test helper for the new ToolCallPart.args union (ToolSearchArgs)
- pyproject coverage config kept identical to main (protocol stub excluded via pragma)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dsfaccini dsfaccini changed the title Add compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename Compaction → SummarizingCompaction Add compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename CompactionSummarizingCompaction Jun 3, 2026
The summary call's tokens and request are folded into the run's usage by design —
consistent, cost-honest, and runaway-safe. Make that explicit in the README.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
David SF and others added 2 commits June 3, 2026 22:20
Split the monolithic _capability.py into a module per capability plus a _shared module
for cross-capability utilities (token estimation, the CompactionStrategy protocol,
tool-pair-safe cutoffs, first-user preservation, in-place tool-result clearing).

- _shared.py exposes its package-internal API without leading underscores (the module
  itself is private); genuinely module-local helpers keep their underscore.
- _sliding_window / _clear_tool_results / _deduplicate_file_reads / _limit_warner /
  _summarizing_compaction / _tiered_compaction each own one capability.
- __init__.py re-exports the public API; tests import privates from their new homes.

No behavior change. Every compaction module at 100% branch coverage; ruff + pyright strict clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ental

Compaction is offered as experimental rather than a top-level capability. Since no
experimental convention existed in the harness, this establishes one, mirroring pydantic:

- Physically moved to pydantic_ai_harness/experimental/compaction/ — the import path itself
  signals experimental; there is no top-level export (reverted root __init__).
- Importing a capability emits HarnessExperimentalWarning; its message hands the user a single
  category-wide filter that silences every experimental capability at once (no per-capability
  suppression). Importing the `experimental` package alone does not warn, so the warning class
  can be pulled in to silence first.
- README gains a prominent experimental banner with the import path + silence snippet.
- pytest filterwarnings ignores the category; a dedicated test asserts the message + that one
  filter mutes all experimental warnings.

100% coverage; ruff + pyright strict clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dsfaccini dsfaccini changed the title Add compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename CompactionSummarizingCompaction Add experimental compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename CompactionSummarizingCompaction Jun 4, 2026
@dsfaccini dsfaccini marked this pull request as ready for review June 5, 2026 00:00
@dsfaccini dsfaccini merged commit 442e9ff into main Jun 5, 2026
18 checks passed
dsfaccini pushed a commit that referenced this pull request Jun 11, 2026
Move both packages under `pydantic_ai_harness.experimental`, matching the
convention introduced for compaction/planning/subagents (#191): the old
`pydantic_ai_harness.step_persistence` / `.media` paths and the top-level
re-exports are gone, so the only import path is now

    from pydantic_ai_harness.experimental.step_persistence import StepPersistence
    from pydantic_ai_harness.experimental.media import S3MediaStore

Both package __init__s call `warn_experimental(...)`, so importing either
emits a `HarnessExperimentalWarning` (silenced category-wide by one
filter). This keeps us from committing to a public surface before the
capability has real usage.

README gains the standard experimental banner; warning tests cover both
new packages. 100% branch coverage retained.
dsfaccini added a commit that referenced this pull request Jun 11, 2026
…egates (#251)

* Add StepPersistence capability for step-event durability across delegates

Supersedes PR #176 (SessionPersistence): orchestrators like pydanty need
visible event trails for delegate runs that may time out before a
"save full session after the run" hook can fire, and need to continue or
fork a delegate's prior investigation without rediscovering context. A
single after-run snapshot is too coarse for that use case.

The capability now records (a) append-only StepEvents at every boundary
(run/model-request/tool-call start, completion, failure), (b) a
ContinuableSnapshot only when message history is provider-valid (every
ToolCallPart has a matching ToolReturnPart / RetryPromptPart) — saved
mid-run after CallToolsNode and at after_run, and (c) a ToolEffectRecord
ledger so a run killed between before_tool_execute and
after_tool_execute leaves an `unknown_after_crash`-style record rather
than a falsely-continuable snapshot. Lineage metadata
(parent_run_id, agent_name) ties delegate runs back to their
orchestrator. `continue_run` / `fork_run` helpers load the latest
continuable snapshot for a run.

Backends: InMemoryStepStore (tests) and FileStepStore (JSONL events +
JSON snapshots, with run_id path-traversal validation and
anyio.to_thread for blocking I/O).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address pydanty review + ergonomics: tighten correctness and identity model

Correctness fixes from pydanty's PR review:

- FileStepStore: snapshot filenames are now a per-run monotonic counter,
  not `ctx.run_step` — `run_step` resets each Agent.run, so re-using a
  `run_id` across calls would let an earlier run's higher step-index
  snapshot mask a later run's lower-step-index one.
- StepStore.get_tool_effect now takes both `run_id` and `tool_call_id`.
  TestModel and other providers can reuse deterministic tool-call ids
  across runs; the previous unscoped lookup let one run's effect leak
  into another's record (including `started_at`).
- is_provider_valid now rejects orphan, duplicate, and out-of-order
  tool returns — the old `set.discard` pattern silently accepted any
  return regardless of whether a matching call was open.

Identity model:

- `run_id` resolution: explicit > `{agent_name}-{8-char-hex}` > UUID.
  Materialised per Agent.run in `for_run`, so reusing one capability
  instance never silently merges runs.
- `parent_run_id` auto-inferred via a module-level ContextVar set in
  `wrap_run`, so an orchestrator's tool that synchronously calls
  `delegate.run(...)` produces a delegate `RunRecord.parent_run_id`
  pointing at the orchestrator's `run_id` with zero threading. Explicit
  `parent_run_id=` still wins.
- `conversation_id` propagated to `StepEvent` and `RunRecord`;
  `store.list_runs(conversation_id=..., parent_run_id=...)` supports
  filtering by either or both. Mirrors pydantic_ai's three-level
  identity (conversation -> run -> step) so "run 1, run 2, run 3" of
  one dialogue is queryable as a group via `conversation_id`.
- `continue_from=` field dropped from the capability. Continuation is
  now only via `continue_run(store, run_id=...)` -> standard
  `Agent.run(message_history=...)`. One way to pass history into
  pydantic_ai, no parallel capability flag.

README rewritten around the final API. New sections: three-level
identity, run lineage with auto-inferred parent, inspecting a run tree,
failure recovery.

Tests: 168 total (up from 64), 100% branch coverage on the package.
New coverage for the snapshot seq counter, cross-run tool-effect
isolation, orphan/duplicate/out-of-order return rejection, ContextVar
parent inference across nested agent.run, conversation_id propagation,
and the agent_name-derived run_id default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address pydanty round-2 review: retry validity, list ordering, effect metadata

Correctness:

- is_provider_valid no longer rejects non-tool RetryPromptParts. Pydantic
  AI emits `RetryPromptPart(tool_name=None)` for output-validation
  failures and providers map those as plain user messages, not tool
  results. The previous check required every RetryPromptPart to resolve
  an open tool call, so a run with one output retry produced no final
  continuable snapshot despite being fully valid.
- StepStore.list_runs now guarantees chronological (started_at ascending)
  ordering across both backends. FileStepStore was previously returning
  directory-name order (lexicographic), so the README's `[-1]` pattern
  for "latest run in conversation" could pick the older run when run ids
  did not sort by recency.
- after_tool_execute and on_tool_execute_error preserve idempotency_key
  and effect_summary from the prior `started` record. Previously the
  terminal record was written without those fields, so any annotation
  the tool body wrote was lost on completion.
- from_spec raises ValueError for unknown backends instead of silently
  falling back to in-memory storage. For a persistence capability,
  turning a typo into accidental non-durability is the wrong failure
  mode.

API:

- New annotate_tool_effect(store, ctx, *, idempotency_key=None,
  effect_summary=None) helper. Tool bodies that write external state
  call it to attach idempotency + effect metadata to the in-flight
  ToolEffectRecord without knowing the (run_id, tool_call_id) plumbing.
  Resolves run_id from a ContextVar set by wrap_run; reads tool_call_id
  / tool_name from RunContext.
- ContextVar moved from `_capability.py` into a new `_context.py` module
  so the helper and the capability can share it without circular imports
  and without crossing the private-name barrier.

Docs: README fixes a non-existent `list_runs(agent_name=None)` call,
documents the chronological-ordering guarantee, and replaces the
hand-wavy "populate fields on the ToolEffectRecord" line with a concrete
`annotate_tool_effect` example.

Tests: 178 total (was 168), 100% branch coverage on the package.
Added coverage for non-tool retry acceptance, chronological list_runs on
both backends, metadata preservation across completed/failed
transitions, annotate_tool_effect under realistic agent.tool, and
from_spec backend validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(step_persistence): align FileStepStore docstring with seq-counter layout

The class docstring still showed snapshots/<step_index>.json from the
pre-fix layout, but both the README and _next_snapshot_seq document the
monotonic counter. Bring the class docstring in line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(step_persistence): fix broken README examples + clarify run_id sharing

Pydanty round-3 review:

- README continuation and lineage examples queried `list_runs(conversation_id=...)`
  on conversations the earlier `.run(...)` calls never set, so the
  examples crashed with IndexError on `[-1]`. Pass the conversation_id to
  the earlier calls so the lookup actually works.
- The capability docstring claimed reusing a `StepPersistence` instance
  across `Agent.run` calls does NOT share the id. That is true only for
  the auto-derived (`agent_name`-prefixed or `ctx.run_id`) cases — an
  explicit `run_id=` is shared across every `.run()` by design, since
  that is the orchestrator pattern where the caller owns one logical
  identity across turns. Rewrite the resolution-order docs to spell out
  which cases share and which don't, and when to pick each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(step_persistence): align run_id semantics with pydantic_ai (per-call, not shared)

Pydanty round-4 review: the prior round documented explicit `run_id` as
shared across `.run()` calls on one capability instance — that framing
caused real correctness gaps. The `ToolEffectRecord` ledger is keyed by
`(run_id, tool_call_id)` and providers reuse deterministic tool-call ids
(e.g. `TestModel` emits `pyd_ai_tool_call_id__{name}`), so a second
`.run()` overwrites the first's effect record under the same key — the
`unknown_after_crash` signal from turn 1 disappears when turn 2 lands.

Realign:

- `run_id` is per-`Agent.run`, matching `pydantic_ai.RunContext.run_id`.
- For multi-turn logical grouping, use `conversation_id=` on
  `Agent.run(...)` — that is the pyai-native primitive. The orchestrator
  pattern is `conversation_id='orch'` with each turn auto-deriving its
  own `run_id`.
- Explicit `run_id=` remains supported but is documented as single-shot
  (testing, replay, debugging). Reusing it across calls is a caller
  contract violation, not an implementation feature.

Code is unchanged — the implementation was already correct under the
right contract. Only the docs were misleading.

Tests:
- `TestRunIdIsPerCall::test_multi_turn_orchestrator_uses_conversation_id`
  exercises the recommended pattern: three turns sharing a
  `conversation_id`, three distinct auto-derived `run_id`s, all
  queryable as a group.
- `TestRunIdIsPerCall::test_explicit_run_id_reuse_collides_ledger` locks
  down the misuse contract: reusing one explicit `run_id` across two
  `.run()` calls produces colliding effect records under the
  `(run_id, tool_call_id)` key. The behavior is documented; the test
  exists so a future refactor cannot silently change it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(step_persistence): explain why for_run uses a local ContextVar

Pyai-aligned review flagged this as a P3 explainer: pydantic_ai already
has three single-slot cross-run signals (RUN_ID_BAGGAGE_KEY, ctx.run_id,
_CURRENT_RUN_CONTEXT). All three get overwritten by the inner
Instrumentation.wrap_run before any nested capability can see the parent
identity. A separate harness-local ContextVar, snapshotted before our
own wrap_run rebinds it, is the only correct mechanism today. Spell
this out so the next reader doesn't try to 'simplify' it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(step_persistence): enforce explicit-run_id reuse with ValueError

Pydanty round-5 review accepted the docs-only contract but flagged that
"documented but not enforced" is a soft spot. Enforce it: `before_run`
calls `store.get_run(run_id=...)` when the user supplied an explicit
`run_id`, and raises `ValueError` if a record with that id already
exists. The auto-derived cases cannot trigger this check (each call
materialises a fresh id in `for_run`).

The check is one extra store read per Agent.run when an explicit run_id
is set, only. The error message points the caller at `conversation_id`
for multi-turn grouping.

Test renamed from `test_explicit_run_id_reuse_collides_ledger` to
`test_explicit_run_id_reuse_raises` — asserts the second `.run()` raises
and the first run's records survive untouched.

README + capability docstring updated: the misuse path is now "raises"
not "caller's contract."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: gitignore branch-context skill artifacts + AGENTS.local.md

Two patterns that match the existing CLAUDE.local.md ignore convention:

- AGENTS.local.md — canonical local-instructions file (CLAUDE.local.md
  is symlinked to it where the worktree follows the same
  AGENTS.md/CLAUDE.md symlink pattern).
- .agents/skills/branch-context/ — per-worktree decisions log
  (`pr-decisions.md`) and the skill's local SKILL.md. Pattern lifted
  from `~/pydantic/ai/base/.claude/skills/branch-context/` where pyai
  uses an identical setup.

Neither is intended to land in PRs — they record cross-iteration design
calls so future AICA sessions in this worktree don't silently undo them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(step_persistence): add SqliteStepStore + MediaStore externalization

Adds a new `pydantic_ai_harness.media` package (MediaStore protocol +
DiskMediaStore / SqliteMediaStore / S3MediaStore) and wires it into the
file/sqlite step-persistence backends so large BinaryContent payloads
get externalized out of snapshot JSON / table rows by default.

Defaults are zero-config: FileStepStore writes blobs under
`<root>/media/<sha256>.bin`; SqliteStepStore writes them to a sibling
`media` table in the same DB. Threshold is 64 KiB and URI scheme is
`media+sha256://<hex>` so blobs are content-addressed across stores.
Pass `media_store=None` to keep bytes inline, or a custom `MediaStore`
to redirect (e.g. `S3MediaStore` for R2 / AWS / MinIO).

S3MediaStore handrolls SigV4 over httpx to avoid a botocore/boto3
dependency. Verified working against Cloudflare R2.

`StepPersistence.from_spec(backend='sqlite', database=...)` now resolves.

180 → 261 tests, 100% branch coverage maintained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(media): VCR cassettes for S3MediaStore against R2

Adds replay-driven tests under `tests/media/test_s3_cassettes.py` that
exercise `S3MediaStore.put/get/exists` against pre-recorded Cloudflare
R2 responses. CI runs them without any S3 creds via the committed
cassettes under `tests/media/cassettes/`.

Sanitisation policy:

- `before_record_request`/`before_record_response` swap the real R2
  account-id subdomain and bucket name for fixed placeholders
  (`account.r2.cloudflarestorage.com`, `harness-test-bucket`)
- `Authorization` and `x-amz-date` filtered to `REDACTED`
- CF-RAY, x-amz-version-id, x-amz-checksum-*, x-amz-request-id headers
  dropped (none load-bearing for tests; some carry identifying info)
- Non-2xx response bodies blanked (R2's gzipped XML error envelope
  leaks the bucket name; our code only checks status code)

The `s3_credentials` fixture uses `os.environ.get(NAME, PLACEHOLDER)`
per field, so real R2 creds are used when recording locally with `.env`
loaded, and the placeholder constants match the scrubbed cassettes
during replay. Because the placeholders are fixed, any scrubber miss
during a future re-record shows up as a replay URL mismatch — built-in
canary against credential / private-data leakage in committed cassettes.

Adds `pytest-recording` (pulls `vcrpy`) to the dev deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(media): public_url resolver — protocol method + per-store callable

Adds `MediaStore.public_url(uri) -> str | None` plus a `public_url=`
constructor parameter on every concrete store. The parameter accepts a
sync or async callable; the store auto-detects and awaits if needed.

This is the bottom-layer primitive for the forthcoming `MediaExternalizer`
capability — that capability will call `store.public_url(...)` per
externalized blob and swap `BinaryContent` for `ImageUrl` / `AudioUrl`
parts before the model sees the message. The callable shape covers both
static URLs (public bucket / CDN — use `make_static_public_url`
helper) and dynamic URLs (presigned, per-request signing — pass any
async callable with TTL captured in its closure).

Why a callable rather than a static config: a public bucket's URL host
is not derivable from the bucket creds (R2 public buckets use
`pub-<hash>.r2.dev`, AWS public buckets use a different scheme than the
path-style endpoint we sign for). The URL is always user-supplied
information, so a callable is the right primitive — same shape for the
static and presigned cases, and `get` stays untouched (it serves the
harness's internal byte fetch, not the model's external HTTP fetch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(media): introduce MediaContext + key_strategy for extensible operations

Adds `MediaContext` (frozen, kw-only dataclass with `media_type`, `filename`,
`metadata`) and threads it through every `MediaStore` method and both user
callables (`PublicUrlResolver`, `KeyStrategy`). New context fields can be
added non-breakingly; existing call sites and resolvers keep working.

Also adds:
- `KeyStrategy = Callable[[str, MediaContext], str]` for per-store layout
  control. Default `default_key_strategy` produces `<sha256>.bin`. Disk store
  validates the result against `..` traversal.
- `metadata` persistence on `SqliteMediaStore` (new JSON column) and
  `S3MediaStore` (signed `x-amz-meta-*` headers, ASCII key validation).
  Disk store explicitly does NOT persist metadata in v1 — sidecar / xattr
  options each have load-bearing drawbacks; we ship nothing rather than a
  half-true persistence promise.
- `make_static_public_url(...)` updated to the new `(uri, ctx)` signature.

The shift is motivated by the same principle as pydantic_ai's
`RunContext`: extension via fields on a context bag rather than via
breaking signature changes. Every new requirement (TTL hints for
presigned URLs, audit ids, response-header overrides, etc.) becomes a
field addition, not an API revision.

Cassettes from the previous commit replay unchanged — match-on does not
include the signed headers and the request URLs are stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(media): close metadata round-trip; drop vestigial sqlite key_strategy

Adds `MediaStore.get_metadata(uri) -> Mapping[str, str]` to the protocol
and implements it on all three concrete stores:

- `DiskMediaStore`: writes a sidecar `<resolved>.meta.json` alongside
  the blob on put (atomic tmp + rename), reads it back on
  `get_metadata`. Returns `{}` when no metadata was supplied. v1 had
  documented this as a deliberate gap — sidecar JSON is straightforward
  and the xattr / ADS drawbacks don't apply.
- `SqliteMediaStore`: `SELECT metadata FROM <table> WHERE sha256=?` +
  `json.loads`. Raises `FileNotFoundError` for unknown URIs.
- `S3MediaStore`: HEAD + collects `x-amz-meta-*` response headers,
  strips the prefix. Reuses the existing 404 / non-2xx error shape.

Drops `key_strategy=` from `SqliteMediaStore`. The digest is the primary
key by content-addressing construction — a user-chosen key would either
break dedup or be a no-op. Kept on Disk + S3 where bucket / directory
layout is a real concern.

README + branch-context entries updated to reflect: all three stores
round-trip metadata; key_strategy is Disk + S3 only.

Coverage stays at 100% branch.

* fix(media): reject absolute paths from DiskMediaStore key_strategy

`Path(root) / absolute_path` returns `absolute_path` — the root is
silently discarded — so a custom `key_strategy` returning `/etc/passwd`
(or similar) escapes the store directory even though the previous check
only blocked `..`. Tighten the validator to reject both shapes.

Caught by pydanty during its #251 integration review.

* fix(step_persistence): dedupe terminal snapshot; document sqlite thread-affinity

The terminal CallToolsNode already saves the final provider-valid snapshot
with the correct step_index. after_run was re-saving the same tail stamped
with step_index=0 (ctx.run_step is reset by then), so latest_snapshot
reported a misleading step and every run wrote a duplicate. Track whether a
node snapshot was taken via a task-local ContextVar and make after_run a
fallback that only fires when the run reached no provider-valid boundary.

Also document that a caller-owned sqlite connection= must set
check_same_thread=False (store SQL runs on anyio worker threads), on both
SqliteStepStore and SqliteMediaStore, and correct the WAL-on-every-connection
claim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(media): SigV4 wire path matches signed path; walker preserves unknown fields

S3MediaStore signed `_canonical_uri(path)` (each segment percent-encoded) but
sent the raw path, letting httpx apply looser encoding. A custom key_prefix /
key_strategy emitting reserved chars (`@`, `(`, `=`, ...) diverged from the
signed path -> SignatureDoesNotMatch. Send the canonical bytes via httpx
`raw_path` so signer and sender agree. Default `<hex>.bin` keys are unaffected.

The externalize/restore walker hardcoded the BinaryContent key set, silently
dropping any field pydantic_ai adds upstream. Copy the node and swap only
`data` <-> marker keys so unknown fields round-trip. Adds tests for reserved-char
path agreement, unknown-field preservation, and restore over a pruned blob.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: gate step_persistence + media behind experimental namespace

Move both packages under `pydantic_ai_harness.experimental`, matching the
convention introduced for compaction/planning/subagents (#191): the old
`pydantic_ai_harness.step_persistence` / `.media` paths and the top-level
re-exports are gone, so the only import path is now

    from pydantic_ai_harness.experimental.step_persistence import StepPersistence
    from pydantic_ai_harness.experimental.media import S3MediaStore

Both package __init__s call `warn_experimental(...)`, so importing either
emits a `HarnessExperimentalWarning` (silenced category-wide by one
filter). This keeps us from committing to a public surface before the
capability has real usage.

README gains the standard experimental banner; warning tests cover both
new packages. 100% branch coverage retained.

* docs(media): fix metadata/key_strategy drift, convert em-dashes

Subagent review of the experimental move surfaced doc drift:

- README "Persistence by store" implied `get_metadata` returns
  `media_type`; it returns only the user `metadata` mapping. Reworded.
- `KeyStrategy` docstring still listed `SqliteMediaStore` / "DB primary
  key" as a user; sqlite has no `key_strategy`. Dropped it.
- README understated `DiskMediaStore` traversal protection (it rejects
  absolute paths as well as `..`).
- README had a stale paragraph that read as `key_strategy` but described
  `public_url`, and omitted S3. Rewritten to name `public_url` and all
  three stores.
- README `MediaContext` method list now includes `get_metadata`.
- Converted em-dashes to `--` across the step_persistence + media trees
  per the writing-style rule (#270). Other experimental packages'
  pre-existing em-dashes are left for a separate sweep.

No code behavior change; 100% branch coverage retained.

* docs(experimental): add warning banner to planning/subagents, finish em-dash sweep

Uniformity pass across the experimental packages:

- Add the standard `[!WARNING]` experimental banner to planning/README.md
  and subagents/README.md (compaction and step_persistence already had
  it; these two lacked it).
- Convert remaining em-dashes to `--` in the compaction package and the
  shared `_warn.py`, per the writing-style rule (#270). The whole
  experimental source tree is now em-dash-free.

Docs only; 100% branch coverage retained.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: David SF <david.sanchez@pydantic.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants