Add experimental compaction menu: `ClearToolResults`, `DeduplicateFileReads`, `TieredCompaction`; rename `Compaction` → `SummarizingCompaction` by DouweM · Pull Request #191 · pydantic/pydantic-ai-harness

DouweM · 2026-04-10T15:13:25Z

Adds a menu of context-compaction capabilities, offered as experimental under a new pydantic_ai_harness.experimental namespace.

Experimental gating (new convention)

There was no experimental convention in the harness yet, so this establishes one, mirroring pydantic (which has pydantic.experimental + PydanticExperimentalWarning):

The package physically lives at pydantic_ai_harness/experimental/compaction/ — the import path itself signals experimental, and there is no top-level export.
Importing a capability emits a HarnessExperimentalWarning. The message hands the user a single, category-wide silence filter — one line mutes every experimental capability, not one per capability:

import warnings
from pydantic_ai_harness.experimental import HarnessExperimentalWarning

warnings.filterwarnings('ignore', category=HarnessExperimentalWarning)

from pydantic_ai_harness.experimental.compaction import TieredCompaction

Importing the experimental package on its own does not warn, so the warning class can be imported to silence first.

Capabilities

Class	Cost	What it does
`SlidingWindow`	zero-LLM	Drop oldest whole messages down to a tail
`ClearToolResults`	zero-LLM	Blank old tool results in place, keep last `keep_pairs` (Anthropic `clear_tool_uses`)
`DeduplicateFileReads`	zero-LLM	Keep only the latest read per file (via a required `file_key` seam)
`SummarizingCompaction`	one LLM call	Structured-section summary of older messages (renamed from `Compaction`)
`TieredCompaction`	escalates	Cheap passes first; summarize only if still over `target_tokens`
`LimitWarner`	zero-LLM	Inject URGENT/CRITICAL warnings as limits approach

Shared CompactionStrategy.compact(messages, ctx) seam (exported) lets TieredCompaction drive tiers directly and re-measure between them — true escalation. All strategies preserve tool-call/return pairing.

Layout

One module per capability under experimental/compaction/, with cross-capability utilities (token estimation, the CompactionStrategy protocol, tool-pair-safe cutoffs, in-place clearing) in _shared.py.

Notes

SummarizingCompaction threads ctx.usage into the summary call (tokens and the request fold into the run — honest, consistent, and a runaway summary can't evade a request limit).
Per-result truncation was intentionally dropped in favour of a future overflow-to-file capability.

Quality

100% branch coverage; ruff + pyright strict clean; tests use TestModel/mocks only. Capability README.md carries an experimental banner with the import path and silence snippet.

🤖 Generated with Claude Code

Implements three compaction-related capabilities for managing conversation context in long-running agents: - SlidingWindow: zero-cost message trimming that preserves tool-call pairs - LimitWarner: injects warnings when approaching iteration/token limits - Compaction: LLM-powered summarization of older messages All three use the before_model_request hook to modify request_context.messages transparently. The safe cutoff logic ensures tool-call / tool-return pairs are never orphaned, preventing HTTP 400 errors from LLM providers. Closes #21 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add explicit `set[str]` type annotations and replace unnecessary `isinstance` checks with plain `else` branches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tion Implements three improvements from the audit findings on PR #140: - Optional `tokenizer: Callable[[str], int] | None` parameter on SlidingWindow, Compaction, estimate_token_count, and _find_token_cutoff. When provided, enables accurate token counting; the 4-chars/token heuristic stays as fallback. - `preserve_first_user_message: bool = True` on SlidingWindow and Compaction. When True, the first ModelRequest containing a UserPromptPart is always retained after trimming/compaction, preserving the original task context. - `incremental: bool = True` on Compaction. When True and a prior compaction summary exists in the message history, it is included in the summarization prompt via a <previous_summary> tag so the LLM extends it rather than regenerating from scratch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

DouweM · 2026-04-10T15:13:45Z

Originally posted by @DouweM in #140 comment (PR was recreated)

Note: This PR implements client-side compaction (LLM summarization + sliding window). Provider-side compaction (OpenAI/Anthropic) additionally requires the core primitive in #141 (CompactionPart message type + compact_messages on Model).

DouweM · 2026-04-10T15:13:45Z

Originally posted by @DouweM in #140 comment (PR was recreated)

Audit vs prior art: Compaction

devin-ai-integration

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.

devin-ai-integration · 2026-04-10T15:20:18Z

+        system_parts = _extract_system_prompts(messages)
+        to_summarize = messages[:cutoff]
+        preserved = messages[cutoff:]
+
+        previous_summary = _extract_previous_summary(messages) if self.incremental else None
+        summary = await self._summarize(to_summarize, previous_summary=previous_summary)
+
+        summary_part = SystemPromptPart(content=f'{_SUMMARY_PREFIX}{summary}')
+        summary_message = ModelRequest(parts=[*system_parts, summary_part])


🔴 Old compaction summaries accumulate as SystemPromptParts across multiple compaction cycles

After the first compaction, the summary message contains [SystemPromptPart('original sys prompt'), SystemPromptPart('Summary of previous conversation:\n\n...')]. When a second compaction triggers, _extract_system_prompts(messages) at line 726 extracts ALL leading SystemPromptParts from this message — including the old summary part (since it's also a SystemPromptPart). The old summary is then re-included in the new summary message at line 734 alongside the new summary. After N compactions, the summary message contains N stale summary parts plus the new one, growing the context unboundedly and defeating the purpose of compaction.

Trace through two compaction cycles

After first compaction, result.messages[0] = ModelRequest(parts=[SystemPromptPart('sys'), SystemPromptPart('Summary of previous conversation:\n\nfirst summary')]).

When second compaction triggers, _extract_system_prompts (src/pydantic_harness/compaction.py:594-605) sees both parts are SystemPromptPart, extracts both. Then line 734 creates ModelRequest(parts=[SystemPromptPart('sys'), SystemPromptPart('...first summary'), SystemPromptPart('...second summary')]). The old summary is never removed.

Suggested change

system_parts = _extract_system_prompts(messages)

to_summarize = messages[:cutoff]

preserved = messages[cutoff:]

previous_summary = _extract_previous_summary(messages) if self.incremental else None

summary = await self._summarize(to_summarize, previous_summary=previous_summary)

summary_part = SystemPromptPart(content=f'{_SUMMARY_PREFIX}{summary}')

summary_message = ModelRequest(parts=[*system_parts, summary_part])

system_parts = [

p for p in _extract_system_prompts(messages)

if not p.content.startswith(_SUMMARY_PREFIX)

]

to_summarize = messages[:cutoff]

preserved = messages[cutoff:]

previous_summary = _extract_previous_summary(messages) if self.incremental else None

summary = await self._summarize(to_summarize, previous_summary=previous_summary)

summary_part = SystemPromptPart(content=f'{_SUMMARY_PREFIX}{summary}')

summary_message = ModelRequest(parts=[*system_parts, summary_part])

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-04-10T15:20:20Z


 [tool.coverage.report]
-fail_under = 100
+fail_under = 98


🚩 Coverage threshold lowered from 100% to 98%

The fail_under threshold in pyproject.toml:96 was reduced from 100 to 98, with the commit noting 'due to branch coverage of elif chains'. This permanently lowers the bar for the entire project. Consider using # pragma: no branch on specific elif chains instead of lowering the global threshold.

Was this helpful? React with 👍 or 👎 to provide feedback.

mpfaffenberger · 2026-04-13T13:09:05Z

+# Token estimation
+# ---------------------------------------------------------------------------
+
+_CHARS_PER_TOKEN = 4


This will underestimate for Anthropic models unfortunately. It works pretty well for OpenAI ones. I settled on 2.5 in Code Puppy to give a lot of slack (to avoid the errors in Vertex).

Coming back to this, I suggest we make it configurable somehow? Perhaps an environment var (yucky, but there should be a way to override it for power users)

mpfaffenberger · 2026-04-13T13:09:45Z

+                    segments.append(str(part.content))
+        else:
+            for part in msg.parts:
+                if isinstance(part, TextPart):


You don't want to include ThinkingPart?

mpfaffenberger · 2026-04-13T13:10:54Z

+# Safe cutoff logic — preserves tool-call / tool-return pairs
+# ---------------------------------------------------------------------------
+
+_TOOL_PAIR_SEARCH_RANGE = 5


If I'm understanding this correctly, it could fail if your model performs any number > 5 parallel tool calls.

mpfaffenberger · 2026-04-13T13:12:02Z

+    """Number of tail messages to preserve after compaction (message-count trigger)."""
+
+    keep_tokens: int | None = None
+    """Target token budget to preserve after compaction (token-count trigger).


Love this <3 - I used this strategy in Code Puppy and the agent keeps coherence very nicely. It can get expensive though.

mpfaffenberger

Left a few comments. Hope they're helpful.

grahamcracker1234 · 2026-04-14T18:35:30Z

+    model: str
+    """Model to use for generating summaries (e.g. ``'openai:gpt-4o-mini'``)."""


This should likely include KnownModelName and Model, and use infer_model under the hood.

…Compaction → SummarizingCompaction Extend the compaction menu with the SOTA-aligned strategies missing from the initial cut, and align naming with the "compaction" umbrella term. - ClearToolResults: zero-LLM in-place clearing of old tool results (Anthropic clear_tool_uses); keep_pairs, exclude_tools, clear_tool_inputs (JSON-valid args), min_clear_tokens to protect the prompt cache. - DeduplicateFileReads: zero-LLM; keep only the latest read per file via a required file_key seam. - TieredCompaction: escalation orchestrator — cheap passes first, summarize only if still over target_tokens. CompactionStrategy protocol exported for custom tiers. - Rename Compaction → SummarizingCompaction; structured-section summary prompt; model now optional (inherits the run's model); summary-call usage folded into the parent run. All strategies preserve tool-call/return pairing. compaction.py at 100% branch coverage; pyright strict + ruff clean. Adds capability README (compaction.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Resolve the repo restructure (package renamed pydantic_harness → pydantic_ai_harness, moved to per-capability subpackages). Migrate compaction into the new layout: - pydantic_ai_harness/compaction/{__init__,_capability}.py + README.md - root __init__.py exposes the 6 compaction capabilities via lazy __getattr__ - tests moved to tests/compaction/; imports split public vs ._capability - drop now-unused noqa (main's ruff config does not enforce D102/D105) - widen _call_args test helper for the new ToolCallPart.args union (ToolSearchArgs) - pyproject coverage config kept identical to main (protocol stub excluded via pragma) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The summary call's tokens and request are folded into the run's usage by design — consistent, cost-honest, and runaway-safe. Make that explicit in the README. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Split the monolithic _capability.py into a module per capability plus a _shared module for cross-capability utilities (token estimation, the CompactionStrategy protocol, tool-pair-safe cutoffs, first-user preservation, in-place tool-result clearing). - _shared.py exposes its package-internal API without leading underscores (the module itself is private); genuinely module-local helpers keep their underscore. - _sliding_window / _clear_tool_results / _deduplicate_file_reads / _limit_warner / _summarizing_compaction / _tiered_compaction each own one capability. - __init__.py re-exports the public API; tests import privates from their new homes. No behavior change. Every compaction module at 100% branch coverage; ruff + pyright strict clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ental Compaction is offered as experimental rather than a top-level capability. Since no experimental convention existed in the harness, this establishes one, mirroring pydantic: - Physically moved to pydantic_ai_harness/experimental/compaction/ — the import path itself signals experimental; there is no top-level export (reverted root __init__). - Importing a capability emits HarnessExperimentalWarning; its message hands the user a single category-wide filter that silences every experimental capability at once (no per-capability suppression). Importing the `experimental` package alone does not warn, so the warning class can be pulled in to silence first. - README gains a prominent experimental banner with the import path + silence snippet. - pytest filterwarnings ignores the category; a dedicated test asserts the message + that one filter mutes all experimental warnings. 100% coverage; ruff + pyright strict clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Move both packages under `pydantic_ai_harness.experimental`, matching the convention introduced for compaction/planning/subagents (#191): the old `pydantic_ai_harness.step_persistence` / `.media` paths and the top-level re-exports are gone, so the only import path is now from pydantic_ai_harness.experimental.step_persistence import StepPersistence from pydantic_ai_harness.experimental.media import S3MediaStore Both package __init__s call `warn_experimental(...)`, so importing either emits a `HarnessExperimentalWarning` (silenced category-wide by one filter). This keeps us from committing to a public surface before the capability has real usage. README gains the standard experimental banner; warning tests cover both new packages. 100% branch coverage retained.

…egates (#251) * Add StepPersistence capability for step-event durability across delegates Supersedes PR #176 (SessionPersistence): orchestrators like pydanty need visible event trails for delegate runs that may time out before a "save full session after the run" hook can fire, and need to continue or fork a delegate's prior investigation without rediscovering context. A single after-run snapshot is too coarse for that use case. The capability now records (a) append-only StepEvents at every boundary (run/model-request/tool-call start, completion, failure), (b) a ContinuableSnapshot only when message history is provider-valid (every ToolCallPart has a matching ToolReturnPart / RetryPromptPart) — saved mid-run after CallToolsNode and at after_run, and (c) a ToolEffectRecord ledger so a run killed between before_tool_execute and after_tool_execute leaves an `unknown_after_crash`-style record rather than a falsely-continuable snapshot. Lineage metadata (parent_run_id, agent_name) ties delegate runs back to their orchestrator. `continue_run` / `fork_run` helpers load the latest continuable snapshot for a run. Backends: InMemoryStepStore (tests) and FileStepStore (JSONL events + JSON snapshots, with run_id path-traversal validation and anyio.to_thread for blocking I/O). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address pydanty review + ergonomics: tighten correctness and identity model Correctness fixes from pydanty's PR review: - FileStepStore: snapshot filenames are now a per-run monotonic counter, not `ctx.run_step` — `run_step` resets each Agent.run, so re-using a `run_id` across calls would let an earlier run's higher step-index snapshot mask a later run's lower-step-index one. - StepStore.get_tool_effect now takes both `run_id` and `tool_call_id`. TestModel and other providers can reuse deterministic tool-call ids across runs; the previous unscoped lookup let one run's effect leak into another's record (including `started_at`). - is_provider_valid now rejects orphan, duplicate, and out-of-order tool returns — the old `set.discard` pattern silently accepted any return regardless of whether a matching call was open. Identity model: - `run_id` resolution: explicit > `{agent_name}-{8-char-hex}` > UUID. Materialised per Agent.run in `for_run`, so reusing one capability instance never silently merges runs. - `parent_run_id` auto-inferred via a module-level ContextVar set in `wrap_run`, so an orchestrator's tool that synchronously calls `delegate.run(...)` produces a delegate `RunRecord.parent_run_id` pointing at the orchestrator's `run_id` with zero threading. Explicit `parent_run_id=` still wins. - `conversation_id` propagated to `StepEvent` and `RunRecord`; `store.list_runs(conversation_id=..., parent_run_id=...)` supports filtering by either or both. Mirrors pydantic_ai's three-level identity (conversation -> run -> step) so "run 1, run 2, run 3" of one dialogue is queryable as a group via `conversation_id`. - `continue_from=` field dropped from the capability. Continuation is now only via `continue_run(store, run_id=...)` -> standard `Agent.run(message_history=...)`. One way to pass history into pydantic_ai, no parallel capability flag. README rewritten around the final API. New sections: three-level identity, run lineage with auto-inferred parent, inspecting a run tree, failure recovery. Tests: 168 total (up from 64), 100% branch coverage on the package. New coverage for the snapshot seq counter, cross-run tool-effect isolation, orphan/duplicate/out-of-order return rejection, ContextVar parent inference across nested agent.run, conversation_id propagation, and the agent_name-derived run_id default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address pydanty round-2 review: retry validity, list ordering, effect metadata Correctness: - is_provider_valid no longer rejects non-tool RetryPromptParts. Pydantic AI emits `RetryPromptPart(tool_name=None)` for output-validation failures and providers map those as plain user messages, not tool results. The previous check required every RetryPromptPart to resolve an open tool call, so a run with one output retry produced no final continuable snapshot despite being fully valid. - StepStore.list_runs now guarantees chronological (started_at ascending) ordering across both backends. FileStepStore was previously returning directory-name order (lexicographic), so the README's `[-1]` pattern for "latest run in conversation" could pick the older run when run ids did not sort by recency. - after_tool_execute and on_tool_execute_error preserve idempotency_key and effect_summary from the prior `started` record. Previously the terminal record was written without those fields, so any annotation the tool body wrote was lost on completion. - from_spec raises ValueError for unknown backends instead of silently falling back to in-memory storage. For a persistence capability, turning a typo into accidental non-durability is the wrong failure mode. API: - New annotate_tool_effect(store, ctx, *, idempotency_key=None, effect_summary=None) helper. Tool bodies that write external state call it to attach idempotency + effect metadata to the in-flight ToolEffectRecord without knowing the (run_id, tool_call_id) plumbing. Resolves run_id from a ContextVar set by wrap_run; reads tool_call_id / tool_name from RunContext. - ContextVar moved from `_capability.py` into a new `_context.py` module so the helper and the capability can share it without circular imports and without crossing the private-name barrier. Docs: README fixes a non-existent `list_runs(agent_name=None)` call, documents the chronological-ordering guarantee, and replaces the hand-wavy "populate fields on the ToolEffectRecord" line with a concrete `annotate_tool_effect` example. Tests: 178 total (was 168), 100% branch coverage on the package. Added coverage for non-tool retry acceptance, chronological list_runs on both backends, metadata preservation across completed/failed transitions, annotate_tool_effect under realistic agent.tool, and from_spec backend validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(step_persistence): align FileStepStore docstring with seq-counter layout The class docstring still showed snapshots/<step_index>.json from the pre-fix layout, but both the README and _next_snapshot_seq document the monotonic counter. Bring the class docstring in line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(step_persistence): fix broken README examples + clarify run_id sharing Pydanty round-3 review: - README continuation and lineage examples queried `list_runs(conversation_id=...)` on conversations the earlier `.run(...)` calls never set, so the examples crashed with IndexError on `[-1]`. Pass the conversation_id to the earlier calls so the lookup actually works. - The capability docstring claimed reusing a `StepPersistence` instance across `Agent.run` calls does NOT share the id. That is true only for the auto-derived (`agent_name`-prefixed or `ctx.run_id`) cases — an explicit `run_id=` is shared across every `.run()` by design, since that is the orchestrator pattern where the caller owns one logical identity across turns. Rewrite the resolution-order docs to spell out which cases share and which don't, and when to pick each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(step_persistence): align run_id semantics with pydantic_ai (per-call, not shared) Pydanty round-4 review: the prior round documented explicit `run_id` as shared across `.run()` calls on one capability instance — that framing caused real correctness gaps. The `ToolEffectRecord` ledger is keyed by `(run_id, tool_call_id)` and providers reuse deterministic tool-call ids (e.g. `TestModel` emits `pyd_ai_tool_call_id__{name}`), so a second `.run()` overwrites the first's effect record under the same key — the `unknown_after_crash` signal from turn 1 disappears when turn 2 lands. Realign: - `run_id` is per-`Agent.run`, matching `pydantic_ai.RunContext.run_id`. - For multi-turn logical grouping, use `conversation_id=` on `Agent.run(...)` — that is the pyai-native primitive. The orchestrator pattern is `conversation_id='orch'` with each turn auto-deriving its own `run_id`. - Explicit `run_id=` remains supported but is documented as single-shot (testing, replay, debugging). Reusing it across calls is a caller contract violation, not an implementation feature. Code is unchanged — the implementation was already correct under the right contract. Only the docs were misleading. Tests: - `TestRunIdIsPerCall::test_multi_turn_orchestrator_uses_conversation_id` exercises the recommended pattern: three turns sharing a `conversation_id`, three distinct auto-derived `run_id`s, all queryable as a group. - `TestRunIdIsPerCall::test_explicit_run_id_reuse_collides_ledger` locks down the misuse contract: reusing one explicit `run_id` across two `.run()` calls produces colliding effect records under the `(run_id, tool_call_id)` key. The behavior is documented; the test exists so a future refactor cannot silently change it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(step_persistence): explain why for_run uses a local ContextVar Pyai-aligned review flagged this as a P3 explainer: pydantic_ai already has three single-slot cross-run signals (RUN_ID_BAGGAGE_KEY, ctx.run_id, _CURRENT_RUN_CONTEXT). All three get overwritten by the inner Instrumentation.wrap_run before any nested capability can see the parent identity. A separate harness-local ContextVar, snapshotted before our own wrap_run rebinds it, is the only correct mechanism today. Spell this out so the next reader doesn't try to 'simplify' it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(step_persistence): enforce explicit-run_id reuse with ValueError Pydanty round-5 review accepted the docs-only contract but flagged that "documented but not enforced" is a soft spot. Enforce it: `before_run` calls `store.get_run(run_id=...)` when the user supplied an explicit `run_id`, and raises `ValueError` if a record with that id already exists. The auto-derived cases cannot trigger this check (each call materialises a fresh id in `for_run`). The check is one extra store read per Agent.run when an explicit run_id is set, only. The error message points the caller at `conversation_id` for multi-turn grouping. Test renamed from `test_explicit_run_id_reuse_collides_ledger` to `test_explicit_run_id_reuse_raises` — asserts the second `.run()` raises and the first run's records survive untouched. README + capability docstring updated: the misuse path is now "raises" not "caller's contract." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: gitignore branch-context skill artifacts + AGENTS.local.md Two patterns that match the existing CLAUDE.local.md ignore convention: - AGENTS.local.md — canonical local-instructions file (CLAUDE.local.md is symlinked to it where the worktree follows the same AGENTS.md/CLAUDE.md symlink pattern). - .agents/skills/branch-context/ — per-worktree decisions log (`pr-decisions.md`) and the skill's local SKILL.md. Pattern lifted from `~/pydantic/ai/base/.claude/skills/branch-context/` where pyai uses an identical setup. Neither is intended to land in PRs — they record cross-iteration design calls so future AICA sessions in this worktree don't silently undo them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(step_persistence): add SqliteStepStore + MediaStore externalization Adds a new `pydantic_ai_harness.media` package (MediaStore protocol + DiskMediaStore / SqliteMediaStore / S3MediaStore) and wires it into the file/sqlite step-persistence backends so large BinaryContent payloads get externalized out of snapshot JSON / table rows by default. Defaults are zero-config: FileStepStore writes blobs under `<root>/media/<sha256>.bin`; SqliteStepStore writes them to a sibling `media` table in the same DB. Threshold is 64 KiB and URI scheme is `media+sha256://<hex>` so blobs are content-addressed across stores. Pass `media_store=None` to keep bytes inline, or a custom `MediaStore` to redirect (e.g. `S3MediaStore` for R2 / AWS / MinIO). S3MediaStore handrolls SigV4 over httpx to avoid a botocore/boto3 dependency. Verified working against Cloudflare R2. `StepPersistence.from_spec(backend='sqlite', database=...)` now resolves. 180 → 261 tests, 100% branch coverage maintained. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(media): VCR cassettes for S3MediaStore against R2 Adds replay-driven tests under `tests/media/test_s3_cassettes.py` that exercise `S3MediaStore.put/get/exists` against pre-recorded Cloudflare R2 responses. CI runs them without any S3 creds via the committed cassettes under `tests/media/cassettes/`. Sanitisation policy: - `before_record_request`/`before_record_response` swap the real R2 account-id subdomain and bucket name for fixed placeholders (`account.r2.cloudflarestorage.com`, `harness-test-bucket`) - `Authorization` and `x-amz-date` filtered to `REDACTED` - CF-RAY, x-amz-version-id, x-amz-checksum-*, x-amz-request-id headers dropped (none load-bearing for tests; some carry identifying info) - Non-2xx response bodies blanked (R2's gzipped XML error envelope leaks the bucket name; our code only checks status code) The `s3_credentials` fixture uses `os.environ.get(NAME, PLACEHOLDER)` per field, so real R2 creds are used when recording locally with `.env` loaded, and the placeholder constants match the scrubbed cassettes during replay. Because the placeholders are fixed, any scrubber miss during a future re-record shows up as a replay URL mismatch — built-in canary against credential / private-data leakage in committed cassettes. Adds `pytest-recording` (pulls `vcrpy`) to the dev deps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(media): public_url resolver — protocol method + per-store callable Adds `MediaStore.public_url(uri) -> str | None` plus a `public_url=` constructor parameter on every concrete store. The parameter accepts a sync or async callable; the store auto-detects and awaits if needed. This is the bottom-layer primitive for the forthcoming `MediaExternalizer` capability — that capability will call `store.public_url(...)` per externalized blob and swap `BinaryContent` for `ImageUrl` / `AudioUrl` parts before the model sees the message. The callable shape covers both static URLs (public bucket / CDN — use `make_static_public_url` helper) and dynamic URLs (presigned, per-request signing — pass any async callable with TTL captured in its closure). Why a callable rather than a static config: a public bucket's URL host is not derivable from the bucket creds (R2 public buckets use `pub-<hash>.r2.dev`, AWS public buckets use a different scheme than the path-style endpoint we sign for). The URL is always user-supplied information, so a callable is the right primitive — same shape for the static and presigned cases, and `get` stays untouched (it serves the harness's internal byte fetch, not the model's external HTTP fetch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(media): introduce MediaContext + key_strategy for extensible operations Adds `MediaContext` (frozen, kw-only dataclass with `media_type`, `filename`, `metadata`) and threads it through every `MediaStore` method and both user callables (`PublicUrlResolver`, `KeyStrategy`). New context fields can be added non-breakingly; existing call sites and resolvers keep working. Also adds: - `KeyStrategy = Callable[[str, MediaContext], str]` for per-store layout control. Default `default_key_strategy` produces `<sha256>.bin`. Disk store validates the result against `..` traversal. - `metadata` persistence on `SqliteMediaStore` (new JSON column) and `S3MediaStore` (signed `x-amz-meta-*` headers, ASCII key validation). Disk store explicitly does NOT persist metadata in v1 — sidecar / xattr options each have load-bearing drawbacks; we ship nothing rather than a half-true persistence promise. - `make_static_public_url(...)` updated to the new `(uri, ctx)` signature. The shift is motivated by the same principle as pydantic_ai's `RunContext`: extension via fields on a context bag rather than via breaking signature changes. Every new requirement (TTL hints for presigned URLs, audit ids, response-header overrides, etc.) becomes a field addition, not an API revision. Cassettes from the previous commit replay unchanged — match-on does not include the signed headers and the request URLs are stable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(media): close metadata round-trip; drop vestigial sqlite key_strategy Adds `MediaStore.get_metadata(uri) -> Mapping[str, str]` to the protocol and implements it on all three concrete stores: - `DiskMediaStore`: writes a sidecar `<resolved>.meta.json` alongside the blob on put (atomic tmp + rename), reads it back on `get_metadata`. Returns `{}` when no metadata was supplied. v1 had documented this as a deliberate gap — sidecar JSON is straightforward and the xattr / ADS drawbacks don't apply. - `SqliteMediaStore`: `SELECT metadata FROM <table> WHERE sha256=?` + `json.loads`. Raises `FileNotFoundError` for unknown URIs. - `S3MediaStore`: HEAD + collects `x-amz-meta-*` response headers, strips the prefix. Reuses the existing 404 / non-2xx error shape. Drops `key_strategy=` from `SqliteMediaStore`. The digest is the primary key by content-addressing construction — a user-chosen key would either break dedup or be a no-op. Kept on Disk + S3 where bucket / directory layout is a real concern. README + branch-context entries updated to reflect: all three stores round-trip metadata; key_strategy is Disk + S3 only. Coverage stays at 100% branch. * fix(media): reject absolute paths from DiskMediaStore key_strategy `Path(root) / absolute_path` returns `absolute_path` — the root is silently discarded — so a custom `key_strategy` returning `/etc/passwd` (or similar) escapes the store directory even though the previous check only blocked `..`. Tighten the validator to reject both shapes. Caught by pydanty during its #251 integration review. * fix(step_persistence): dedupe terminal snapshot; document sqlite thread-affinity The terminal CallToolsNode already saves the final provider-valid snapshot with the correct step_index. after_run was re-saving the same tail stamped with step_index=0 (ctx.run_step is reset by then), so latest_snapshot reported a misleading step and every run wrote a duplicate. Track whether a node snapshot was taken via a task-local ContextVar and make after_run a fallback that only fires when the run reached no provider-valid boundary. Also document that a caller-owned sqlite connection= must set check_same_thread=False (store SQL runs on anyio worker threads), on both SqliteStepStore and SqliteMediaStore, and correct the WAL-on-every-connection claim. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(media): SigV4 wire path matches signed path; walker preserves unknown fields S3MediaStore signed `_canonical_uri(path)` (each segment percent-encoded) but sent the raw path, letting httpx apply looser encoding. A custom key_prefix / key_strategy emitting reserved chars (`@`, `(`, `=`, ...) diverged from the signed path -> SignatureDoesNotMatch. Send the canonical bytes via httpx `raw_path` so signer and sender agree. Default `<hex>.bin` keys are unaffected. The externalize/restore walker hardcoded the BinaryContent key set, silently dropping any field pydantic_ai adds upstream. Copy the node and swap only `data` <-> marker keys so unknown fields round-trip. Adds tests for reserved-char path agreement, unknown-field preservation, and restore over a pruned blob. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor: gate step_persistence + media behind experimental namespace Move both packages under `pydantic_ai_harness.experimental`, matching the convention introduced for compaction/planning/subagents (#191): the old `pydantic_ai_harness.step_persistence` / `.media` paths and the top-level re-exports are gone, so the only import path is now from pydantic_ai_harness.experimental.step_persistence import StepPersistence from pydantic_ai_harness.experimental.media import S3MediaStore Both package __init__s call `warn_experimental(...)`, so importing either emits a `HarnessExperimentalWarning` (silenced category-wide by one filter). This keeps us from committing to a public surface before the capability has real usage. README gains the standard experimental banner; warning tests cover both new packages. 100% branch coverage retained. * docs(media): fix metadata/key_strategy drift, convert em-dashes Subagent review of the experimental move surfaced doc drift: - README "Persistence by store" implied `get_metadata` returns `media_type`; it returns only the user `metadata` mapping. Reworded. - `KeyStrategy` docstring still listed `SqliteMediaStore` / "DB primary key" as a user; sqlite has no `key_strategy`. Dropped it. - README understated `DiskMediaStore` traversal protection (it rejects absolute paths as well as `..`). - README had a stale paragraph that read as `key_strategy` but described `public_url`, and omitted S3. Rewritten to name `public_url` and all three stores. - README `MediaContext` method list now includes `get_metadata`. - Converted em-dashes to `--` across the step_persistence + media trees per the writing-style rule (#270). Other experimental packages' pre-existing em-dashes are left for a separate sweep. No code behavior change; 100% branch coverage retained. * docs(experimental): add warning banner to planning/subagents, finish em-dash sweep Uniformity pass across the experimental packages: - Add the standard `[!WARNING]` experimental banner to planning/README.md and subagents/README.md (compaction and step_persistence already had it; these two lacked it). - Convert remaining em-dashes to `--` in the compaction package and the shared `_warn.py`, per the writing-style rule (#270). The whole experimental source tree is now em-dash-free. Docs only; 100% branch coverage retained. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: David SF <david.sanchez@pydantic.dev>

DouweM and others added 3 commits April 2, 2026 05:28

Fix pyright strict mode errors in test_compaction

af347ff

Add explicit `set[str]` type annotations and replace unnecessary `isinstance` checks with plain `else` branches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

DouweM requested review from Kludex, adtyavrdhn, dmontagu, dsfaccini and samuelcolvin as code owners April 10, 2026 15:13

DouweM removed request for Kludex, adtyavrdhn, dmontagu, dsfaccini and samuelcolvin April 10, 2026 15:13

DouweM marked this pull request as draft April 10, 2026 15:13

devin-ai-integration Bot reviewed Apr 10, 2026

View reviewed changes

mpfaffenberger reviewed Apr 13, 2026

View reviewed changes

grahamcracker1234 reviewed Apr 14, 2026

View reviewed changes

DouweM added this to the 2026-05 milestone Apr 23, 2026

dsfaccini changed the title ~~Add compaction capabilities: SlidingWindow, LimitWarner, Compaction~~ Add compaction capabilities: SlidingWindow, LimitWarner, Compaction Jun 2, 2026

dsfaccini changed the title ~~Add compaction capabilities: SlidingWindow, LimitWarner, Compaction~~ Add compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename Compaction → SummarizingCompaction Jun 3, 2026

dsfaccini changed the title ~~Add compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename Compaction → SummarizingCompaction~~ Add compaction menu: ClearToolResults, DeduplicateFileReads, TieredCompaction; rename Compaction → SummarizingCompaction Jun 3, 2026

docs(compaction): document summary-call usage accounting

fe9ccb6

The summary call's tokens and request are folded into the run's usage by design — consistent, cost-honest, and runaway-safe. Make that explicit in the README. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

David SF and others added 2 commits June 3, 2026 22:20

dsfaccini marked this pull request as ready for review June 5, 2026 00:00

dsfaccini merged commit 442e9ff into main Jun 5, 2026
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add experimental compaction menu: `ClearToolResults`, `DeduplicateFileReads`, `TieredCompaction`; rename `Compaction` → `SummarizingCompaction`#191

Add experimental compaction menu: `ClearToolResults`, `DeduplicateFileReads`, `TieredCompaction`; rename `Compaction` → `SummarizingCompaction`#191
dsfaccini merged 8 commits into
mainfrom
capability/compaction

DouweM commented Apr 10, 2026 •

edited by dsfaccini

Loading

Uh oh!

DouweM commented Apr 10, 2026

Uh oh!

DouweM commented Apr 10, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Apr 10, 2026

Uh oh!

devin-ai-integration Bot Apr 10, 2026

Uh oh!

mpfaffenberger Apr 13, 2026

Uh oh!

mpfaffenberger Apr 13, 2026

Uh oh!

mpfaffenberger Apr 13, 2026

Uh oh!

mpfaffenberger Apr 13, 2026

Uh oh!

mpfaffenberger Apr 13, 2026

Uh oh!

mpfaffenberger left a comment

Uh oh!

grahamcracker1234 Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		model: str
		"""Model to use for generating summaries (e.g. ``'openai:gpt-4o-mini'``)."""

Conversation

DouweM commented Apr 10, 2026 • edited by dsfaccini Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Experimental gating (new convention)

Capabilities

Layout

Notes

Quality

Uh oh!

DouweM commented Apr 10, 2026

Uh oh!

DouweM commented Apr 10, 2026

Audit vs prior art: Compaction

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

mpfaffenberger Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

mpfaffenberger Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

mpfaffenberger Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

mpfaffenberger Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

mpfaffenberger Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

mpfaffenberger left a comment

Choose a reason for hiding this comment

Uh oh!

grahamcracker1234 Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

DouweM commented Apr 10, 2026 •

edited by dsfaccini

Loading