Commit 604d385

feat(log): add paged introspection prelude support
Add paged log/ introspection envelopes, MCP --prelude wiring across eval/session/task paths, and a large-file MCP-backed log prelude example with e2e coverage. Verified with: mix test test/ptc_runner_mcp/application_phase0_test.exs test/ptc_runner_mcp/agentic/orchestration_test.exs test/ptc_runner_mcp/sessions_lifecycle_test.exs (from mcp_server); mix test --include e2e test/ptc_runner/upstream_runtime_test.exs; mix precommit; independent Codex review.
1 parent bd7dba6 commit 604d385

28 files changed

Lines changed: 2282 additions & 425 deletions

docs/guides/subagent-observability.md

Lines changed: 3 additions & 3 deletions
@@ -332,13 +332,13 @@ alias PtcRunner.TraceLog.Introspection

 # `source` is a MemorySink pid, a JSONL path, or a list of event maps.
 PtcRunner.Lisp.run(
-~S|(count (log/turns "investigation"))|,
+~S|(count (get (log/turns "investigation") "items"))|,
 prelude: Introspection.prelude_source(),
 tools: Introspection.tools(source)
 )
 ```

-The exports (`log/sessions`, `log/turns`, `log/programs`, `log/tool-calls`) fail closed with `:prelude_attach_failed` when the host does not grant the matching tools. Recorded sessions are untrusted data — analyze them as evidence, not instructions.
+The exports (`log/sessions`, `log/turns`, `log/programs`, `log/tool-calls`) return page maps with `"items"`, `"next_cursor"`, `"has_more"`, and `"limit"`. Call them with no opts for the default first page, or pass `{:limit n :cursor c}` and follow `"next_cursor"` for large logs. The `*-all` helpers are explicit eager scans for small local logs. All exports fail closed with `:prelude_attach_failed` when the host does not grant the matching tools. Recorded sessions are untrusted data — analyze them as evidence, not instructions.

 ### REPL

@@ -348,7 +348,7 @@ The exports (`log/sessions`, `log/turns`, `log/programs`, `log/tool-calls`) fail
 ptc> (def x 1)
 ptc> :turns
 <id> (session): 1 turns, 1 committed, 0 failed, 0 tool calls
-ptc> (log/programs "<id>")
+ptc> (get (log/programs "<id>") "items")
 ```

 ## Telemetry Events
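
The page-map contract added in the `subagent-observability.md` change above (`"items"`, `"next_cursor"`, `"has_more"`, `"limit"`) can be illustrated with a small host-side sketch. This is not the project's implementation: `PageWalk`, `fake_page/2`, and `fold_pages/4` are hypothetical names, the fake source is an in-memory list, and the cursor is assumed to be an integer offset, whereas a real `log/` export may use an opaque token.

```elixir
defmodule PageWalk do
  # Hypothetical stand-in for a paged export: slices an in-memory list and
  # returns a map with the four keys named in the docs change.
  def fake_page(all, opts \\ %{}) do
    limit = Map.get(opts, :limit, 2)
    offset = Map.get(opts, :cursor) || 0
    items = Enum.slice(all, offset, limit)
    next = offset + length(items)
    has_more = next < length(all)

    %{
      "items" => items,
      "next_cursor" => if(has_more, do: next, else: nil),
      "has_more" => has_more,
      "limit" => limit
    }
  end

  # Folds over every page. Only the accumulator and the cursor are carried
  # from one page to the next, so no page outlives its own reduction step.
  def fold_pages(fetch, acc, fun, cursor \\ nil) do
    page = fetch.(%{cursor: cursor})
    acc = Enum.reduce(page["items"], acc, fun)

    if page["has_more"] do
      fold_pages(fetch, acc, fun, page["next_cursor"])
    else
      acc
    end
  end
end

# Count five fake turns, two per page.
turns = Enum.to_list(1..5)
fetch = fn opts -> PageWalk.fake_page(turns, Map.put(opts, :limit, 2)) end
count = PageWalk.fold_pages(fetch, 0, fn _item, n -> n + 1 end)
IO.inspect(count)
```

The same shape applies inside PTC-Lisp: call the export, reduce over `"items"`, and pass `"next_cursor"` back in while `"has_more"` is true.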

docs/mcp-server-cli.md

Lines changed: 11 additions & 1 deletion
@@ -167,7 +167,11 @@ Wire the coding agent to PtcRunner:
 "start",
 "--sessions",
 "--upstreams-config",
-"/absolute/path/to/upstreams.json"
+"/absolute/path/to/upstreams.json",
+"--prelude",
+"/absolute/path/to/analysis-prelude.clj",
+"--turn-log-dir",
+"/absolute/path/to/turn-log"
 ],
 "env": {
 "API_SERVICE_TOKEN": "..."
@@ -177,6 +181,12 @@ Wire the coding agent to PtcRunner:
 }
 ```

+`--prelude` is optional. When supplied, the file is read once at server boot
+and attached to every `lisp_eval`, `lisp_session_eval`, and agentic `lisp_task`
+run. This is the recommended way to run a verified analysis prelude in
+Codex/Claude Code: start a fresh MCP server process with the prelude attached,
+record the run with `--turn-log-dir`, and analyze those logs in a later session.
+
 Ask the agent, or use the REPL, to smoke-test discovery and one call:

 ```clojure

docs/mcp-server-configuration.md

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ All configuration is read once at boot, either from a CLI flag or the equivalent
 | `--trace-payloads` | `PTC_RUNNER_MCP_TRACE_PAYLOADS` | `summary` | One of `none`, `summary`, `full`. Controls program / context / result inclusion in traces. |
 | `--trace-max-files` | `PTC_RUNNER_MCP_TRACE_MAX_FILES` | `1000` | Rolling-deletion cap on `--trace-dir`. |
 | `--turn-log-dir` | `PTC_RUNNER_MCP_TURN_LOG_DIR` | unset | Directory for the canonical stateful-session turn log. When set, all accepted `lisp_session_eval` attempts write `event: "turn"` records to one JSONL file. |
+| `--prelude` | `PTC_RUNNER_MCP_PRELUDE` | unset | Path to a Capability Prelude source file attached to every `lisp_eval`, `lisp_session_eval`, and agentic `lisp_task` run. The file is read once at boot; attach-time `requires` still fail closed against configured upstreams and granted tools. |
 | `--aggregator-read-only` | `PTC_RUNNER_MCP_AGGREGATOR_READ_ONLY` | `false` | Aggregator-mode annotation override for upstream configs that are read-only by construction. |
 | `--agentic` | `PTC_RUNNER_MCP_AGENTIC` | `false` | Expose the experimental `lisp_task` tool when aggregator mode is active. |
 | `--agentic-model` | `PTC_RUNNER_MCP_AGENTIC_MODEL` | `gemini-flash-lite` | Planner model alias or provider-qualified model id. |

docs/plans/chunked-tool-results-and-data-prelude.md

Lines changed: 78 additions & 63 deletions
@@ -1,5 +1,14 @@
 # Paginated Reads and Data Prelude — Plan

+**Status (2026-06-14): partially implemented / still relevant.** The first
+core blocker is shipped: in-eval tool-call ledger compaction landed in
+`209b4bdf`, with large paged-result coverage in `bd7dba65`. A concrete
+large-file MCP smoke path also exists through
+`examples/large_file_log_introspection/` and e2e tests. What remains is the
+general `data/` prelude, page-source conventions as a reusable API, M2 A/B
+measurement, and a rooted/chunk-capable file source suitable for benchmark
+integrity.
+
 ## Context

 The planted-anomaly pilot (`~/ptc-bench-comparison/notes/planted-pilot-results-2026-06-13.md`)
@@ -14,8 +23,8 @@ killed the design we first reached for:
 - A 540 KB JSONL file parses to ~1.6 MB of maps; eager `json/parse-lines` of the
 whole content blows the 10 MB sandbox max_heap. **You cannot read a large
 result whole.**
-- The first design tried to hold the raw result *off-budget* as a refc binary
-and slice it. **That is false against the code:** the sandbox arms
+- The first design tried to hold a raw result *off-budget* and slice it later.
+**That is false against the code:** the sandbox arms
 `max_heap_size` with `include_shared_binaries: true`
 (`lib/ptc_runner/sandbox.ex:181`), so binaries acquired *during* eval are
 billed to the eval; the rebaseline only exempts data present *before* eval
@@ -24,33 +33,33 @@ killed the design we first reached for:
 binary bytes included. Plus the transport already decodes and caps responses
 at 2 MiB (`lib/ptc_runner/upstream/runtime.ex:14`), and the evaluator records
 every tool result in `eval_ctx.tool_calls` — so hiding it from the return
-value is not enough. Host-side capture + cursor + slicing is a large, risky
-change built on a wrong assumption.
+value is not enough.

-So this plan does **not** add a generic cursor/`tool/next`/result-capture
-mechanism. Pagination is a **tool concern**: read a large source through a
-paginated upstream tool, and fold over the pages in PTC-Lisp.
+Therefore pagination is a **tool concern**: read a large source through a
+paginated upstream tool, and fold over the pages in PTC-Lisp. Page position is
+ordinary program data: an offset, chunk index, or continuation token passed in
+the next `tool/call`.

-A second codex review (round 2) caught the matching retention bug on the
-fold side and is the reason this plan needs **one** core change: `(tool/call
-...)` stores the **full result value** of every call in the in-eval tool ledger
+The matching retention bug on the fold side was caught before implementation:
+`(tool/call ...)` stores the **full result value** of every call in the in-eval tool ledger
 (`tool_call = %{..., result: result}`, `lib/ptc_runner/lisp/eval.ex:1216`,
 appended to `eval_ctx.tool_calls`). So a fold "discards" a page from its
 variables but the ledger keeps it — N pages become O(total bytes) of live eval
 state, billed to max_heap. Paging does not bound memory until the ledger stops
-retaining full values (see "The one core change"). Everything else is pure
-tool-arg threading.
+retaining full values. That core change is now shipped; everything else in this
+plan is pure tool-arg threading plus Prelude V1 library code.

 This is the M2 candidate for
 [`turn-log-and-prelude-derivation.md`](turn-log-and-prelude-derivation.md):
 a human-written prelude should pay for itself before P4 derivation starts.

-## The one core change: bound the in-eval tool ledger
+## Shipped core change: bound the in-eval tool ledger

-The in-eval `eval_ctx.tool_calls` ledger retains every call's full result value
-(`eval.ex:1216`) and is only compacted *after* the eval (for the response
+The in-eval `eval_ctx.tool_calls` ledger used to retain every call's full
+result value (`eval.ex:1216`) and is only compacted *after* the eval (for the response
 envelope). Fatal for a page fold: each page stays live in the ledger. The
-change: **compact the ledger as it grows** — past a per-eval bytes/entries cap,
+change, now shipped in `209b4bdf`: **compact the ledger as it grows** — past
+a per-eval bytes/entries cap,
 keep each call's metadata (name, args hash, outcome, duration) and a bounded
 preview, and drop the full result value **and large `:args`**. The program holds
 the value as the `tool/call` return, so the ledger never needed it.
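
The compaction rule described in the hunk above (keep name, args hash, outcome, duration, and a bounded preview; drop the full result and large `:args` once a byte cap is exceeded) can be sketched as follows. This is an illustration under stated assumptions, not the code shipped in `209b4bdf`: `LedgerSketch`, its field names, the 64-character preview, and the choice to compact every entry at once are all hypothetical. It measures size with `:erlang.external_size/1`, the accounting the plan itself suggests.

```elixir
defmodule LedgerSketch do
  # Hypothetical preview bound, counted in characters.
  @preview_chars 64

  # Appends a call, then compacts the whole ledger if it grew past max_bytes.
  def append(ledger, call, max_bytes) do
    ledger = ledger ++ [call]

    if :erlang.external_size(ledger) > max_bytes do
      Enum.map(ledger, &compact/1)
    else
      ledger
    end
  end

  # Entries that were already compacted pass through unchanged.
  def compact(%{compacted: true} = call), do: call

  # Keep metadata and a bounded preview; drop the full result and raw args.
  def compact(call) do
    %{
      name: call.name,
      args_hash: :erlang.phash2(call.args),
      outcome: call.outcome,
      duration_ms: call.duration_ms,
      preview: call.result |> inspect() |> String.slice(0, @preview_chars),
      compacted: true
    }
  end
end

# Twenty fake page reads of roughly 10 KB each against a 50 KB ledger cap.
page = fn i ->
  %{
    name: "read_page",
    args: %{offset: i},
    outcome: :ok,
    duration_ms: 3,
    result: String.duplicate("x", 10_000)
  }
end

ledger =
  Enum.reduce(1..20, [], fn i, acc ->
    LedgerSketch.append(acc, page.(i), 50_000)
  end)

IO.inspect(:erlang.external_size(ledger))
```

Every call stays visible in the ledger as metadata, while the bytes held there stay under the cap regardless of how many pages the fold reads.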
@@ -74,11 +83,10 @@ not blanket-dropped:
 metadata** (needed for the trace hierarchy) while dropping bulk result bytes.

 This is a **public `Step.tool_calls` contract change**: `call.result` may now be
-a bounded preview, not the full value. Acceptable in 0.x, but update the tests
-and envelope expectations that assume full `result`. Byte accounting must be
+a bounded preview, not the full value. The matching tests and envelope
+expectations were updated with the shipped change. Byte accounting must remain
 real (e.g. `:erlang.external_size/1` over result + args + previews + list
-overhead), not "bounded in name only." Likely site: `EvalContext.append_tool_call`,
-reusing the existing `max_session_tool_call_bytes`/`_entries` budgets in-eval.
+overhead), not "bounded in name only."

 ## Core design: paginate at the tool, fold in Lisp

@@ -94,10 +102,9 @@ whole — read it a page at a time through a paginated tool.**
 3. The whole parsed population never exists. Each page's result is small —
 well under the 2 MiB transport cap and well under max_heap when parsed.

-The "cursor" is **ordinary program state the fold carries** (the next offset, or
-the continuation token from the previous page). There is **no host-side cursor,
-no `tool/next`, no result capture, no off-budget hold.** Each page is a normal
-upstream tool call.
+The page position is ordinary program state the fold carries: the next offset,
+chunk index, or continuation token from the previous page. Each page is a
+normal upstream tool call.

 A program that ignores this and reads the whole source fails closed at max_heap
 (demonstrated) — and that fail-closed teaches the model to use the paginated
@@ -117,21 +124,35 @@ The one gap for the M2 benchmark: the default filesystem MCP server
 (`@modelcontextprotocol/server-filesystem`) has only `head`/`tail`, **no
 offset** — it cannot forward-page. So M2 needs **one** chunk-capable line-read
 tool that returns a bounded page per call (offset/limit, or a chunk-index +
-lines-per-chunk equivalent). Such MCP servers exist off the shelf — a probe of
-one (chunk-index + lines-per-chunk, total-chunks/total-lines in every page so
-the fold bounds its loop exactly and `:done` is trivial) confirmed the
-paginated-read + Lisp-fold design works end to end. It is a bounded tool, not
-host infrastructure, and authority stays behind the normal tool grant.
+lines-per-chunk equivalent). `@willianpinho/large-file-mcp` provides this
+shape through `read_large_file_chunk` (`filePath`, `chunkIndex`,
+`linesPerChunk`) plus file search/navigation helpers. The repo now has an e2e
+smoke path using that server for turn-log introspection, confirming the
+paginated-read + Lisp-fold design works end to end. It is a bounded upstream
+tool, not host infrastructure, and authority stays behind the normal tool
+grant.

 **Hard integrity requirement: the read tool must be rooted to the corpus.**
-The probed server took unrestricted absolute paths (it read `/etc/hosts`),
+`large-file-mcp` takes absolute paths and the probe read outside the corpus,
 which would let an agent read the manifests/scorer by path and defeat the A/B.
 The chosen tool must confine reads to the corpus directory (like
 `server-filesystem`'s allowed-dir), or be wrapped/sandboxed to it.

+## Relationship to P3b large-file `log/` backend
+
+[`turn-log-and-prelude-derivation.md`](turn-log-and-prelude-derivation.md)'s
+P3b is the narrow proving lane for this architecture. It keeps the existing
+semantic `log/` API and swaps only the backend: instead of the host-bound
+`TraceLog.Introspection.tools/1` backend, an example prelude reads turn-log
+JSONL pages through `@willianpinho/large-file-mcp` and projects
+`log/sessions`, `log/programs`, and `log/tool-calls` in PTC-Lisp.
+
+This plan is the generalization: a reusable `data/` prelude over paginated
+sources. There is no conflict. P3b should avoid growing one-off paging helpers
+that cannot later be factored into the `data/` source-spec/fold conventions.
+
 ## Non-Goals

-- No generic cursor / `tool/next` / host-held result buffer.
 - No capture of full tool results off-budget (the sandbox bills them anyway).
 - No change to ordinary `(tool/call ...)` *call* semantics or the 2 MiB
 transport cap. (The one core change is to ledger *retention*, not call
@@ -175,8 +196,8 @@ each call.

 ## Page size — the central tuning knob

-With no host cursor, **each page is a real upstream call**, so page size trades
-two limits against each other:
+Each page is a real upstream call, so page size trades two limits against each
+other:

 - **Too small** → many calls. Two ceilings, and the timeout is the tighter one:
 the per-eval upstream-call cap (default 50, the upstream `RunContext` cap at
@@ -198,14 +219,12 @@ prelude can size pages **adaptively** — `linesPerChunk ≈ ceil(totalLines / N
 for a target page count N under the call cap, capped so a parsed page fits
 max_heap — rather than a fixed default.
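
The adaptive sizing rule quoted in this hunk (`linesPerChunk ≈ ceil(totalLines / N)`, capped so a parsed page fits max_heap) is simple integer arithmetic. A minimal sketch, assuming the heap limit has already been translated into a maximum line count per page; `PageSize`, `lines_per_chunk/3`, and the example numbers are hypothetical and not taken from the plan:

```elixir
defmodule PageSize do
  # lines_per_chunk ≈ ceil(total_lines / target_pages), capped at
  # max_lines (the heap-derived ceiling) and floored at one line.
  def lines_per_chunk(total_lines, target_pages, max_lines) do
    total_lines
    |> div_ceil(target_pages)
    |> min(max_lines)
    |> max(1)
  end

  # Integer ceiling division for non-negative a and positive b.
  defp div_ceil(a, b), do: div(a + b - 1, b)
end

# 10_000 lines over 20 target pages fits under a 2_000-line ceiling.
small = PageSize.lines_per_chunk(10_000, 20, 2_000)

# 100_000 lines would need 5_000 per page, so the heap ceiling wins.
large = PageSize.lines_per_chunk(100_000, 20, 2_000)

IO.inspect({small, large})
```

When the ceiling wins, the fold needs more pages than the target N, which is exactly the case where the call cap and the eval timeout discussed below become the binding limits.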

-This is the honest cost of dropping the host cursor: pagination is N upstream
-round-trips per fold, bounded by the call-cap and (more tightly) the 1 s
-timeout. Lean to **few large pages**, not many small ones. Fine for the
-benchmark sizes if a parsed page fits max_heap; the scaling limit for very large
-sources. Mitigations if needed: raise the per-fold call cap and/or the eval
-timeout for paged reads, or a host-side cached paginated source (deferred — that
-is where a host cursor would re-enter, and it needs the off-budget accounting the
-sandbox does not give mid-eval today).
+Pagination is N upstream round-trips per fold, bounded by the call cap and
+(more tightly) the 1 s timeout. Lean to **few large pages**, not many small
+ones. Fine for the benchmark sizes if a parsed page fits max_heap; this is the
+scaling limit for very large sources. Mitigations if needed: raise the per-fold
+call cap and/or the eval timeout for paged reads, or move a specific workload
+to a specialized upstream that performs more aggregation server-side.

 ## Data Prelude

@@ -367,18 +386,15 @@ rediscover the page-fold pattern from recorded runs.
 ## Sequencing

 1. **Chunk-capable read-lines tool.** Use an existing chunked-read MCP server
-(offset/limit or chunk-index + lines-per-chunk; off-the-shelf ones exist —
-one was probed and works). **Must be rooted to the corpus** (the probed one
-was not — integrity requirement above). Gates everything else (the default
-fileserver cannot page). Note the page envelope may be double-wrapped (MCP
-text block holding a JSON string whose field holds the `\n`-joined lines), so
-the prelude's row extraction is: unwrap → `json/parse-string` → take the
-lines field → `json/parse-lines`.
-2. **Core change: bound the in-eval tool ledger** (drop full result values past
-a bytes/entries cap, keep metadata + preview). Without this, paging does not
-bound memory — the ledger retains every page. Smallest, highest-leverage
-item; also a latent-bug fix for any tool-heavy eval. Test: a fold of many
-page calls stays within max_heap (it does not today).
+(offset/limit or chunk-index + lines-per-chunk). The e2e smoke uses
+`@willianpinho/large-file-mcp`, but the benchmark source must be rooted to
+the corpus or wrapped/sandboxed before M2. Note the page envelope may be
+double-wrapped (MCP text block holding a JSON string whose field holds the
+`\n`-joined lines), so the prelude's row extraction is: unwrap →
+`json/parse-string` → take the lines field → `json/parse-lines`.
+2. **Done: bound the in-eval tool ledger** (drop full result values past a
+bytes/entries cap, keep metadata + preview). This landed in `209b4bdf`, with
+large paged-result coverage in `bd7dba65`.
 3. **`data/` prelude** (fold + offset/token conventions in `:args` + field-first
 helpers), tested against a fake paginated tool. Authority is runtime-enforced
 (call fails closed if the tool is not granted), not attach-proven, for the
@@ -395,7 +411,7 @@ rediscover the page-fold pattern from recorded runs.

 **Verified sound (rounds 2–3):**

-- No host-side cursor/capture is added.
+- Page position is ordinary upstream/tool state threaded through `:args`.
 - The in-eval ledger retains full result values (`eval.ex:1216`); **no in-eval
 code re-reads them** (result returned directly at `eval.ex:1264`; ledger is
 side-effect state at `context.ex:326`) — so the compaction is
@@ -412,23 +428,22 @@ rediscover the page-fold pattern from recorded runs.
 - Dynamic `(tool/call (page-call ...))` is not attach-proven (literal-only
 inference, `compiler.ex:814`).

-**Still unproven (resolve during implementation):**
+**Still unproven / remaining (resolve during implementation):**

 - That a fold of the needed page count fits the 1 s timeout for the benchmark's
 local stdio tool — **measure before relying on it.**
 - That a chosen page size keeps every parsed page under max_heap for the corpus
 (depends on parse-expansion ratio) — measure; it fails closed if wrong.
-- That the in-eval ledger bound, once added, fully bounds a long fold's memory —
-test with a many-page fold over a multi-MB source.
-- That metadata + preview is enough for every `Step.tool_calls` consumer — this
-is a **public contract change** (`call.result` may be a preview); update tests
-and envelope expectations. `tool_cache` and `child_step` retain full data via
-separate paths and are out of scope / preserved respectively.
+- That the in-eval ledger bound fully bounds the intended M2 data-prelude fold
+under realistic page sizes and stdio latency. The core mechanism is covered;
+the benchmark workload still needs measurement.
+- That metadata + preview is enough for every downstream `Step.tool_calls`
+consumer outside the tests already updated. `tool_cache` and `child_step`
+retain full data via separate paths and are out of scope / preserved
+respectively.

 ## Explicitly deferred

-- Host-side cached paginated source (would re-introduce a host cursor and needs
-off-budget mid-eval accounting the sandbox does not give today).
 - Approximate-state structures and host-side accumulator spill for O(n)
 analyses.
 - Raising the per-fold upstream-call cap for very large sources (only if a real
