[fix] inspect ai by tkukurin · Pull Request #10 · tkukurin/llmbda

tkukurin · 2026-04-30T19:10:30Z

Vibe progressing to glory.

- step_check(name, predicate) scores a step via its StepResult (value + metadata) - predicate returns float | bool | Score - complements step_scorer which delegates to an inner Inspect scorer - add gsm8k benchmark example using litellm + inspect ai eval - fix E731 in calendar_booking.py

# Conflicts: # examples/calendar_booking.py # src/tk/llmbda/inspect.py # tests/test_inspect.py

@lm

- skill_solver rebinds @lm steps to use Inspect's model when model != none - sync-async bridge via run_in_executor + run_coroutine_threadsafe - _rebind_skill deep-copies skill tree preserving system_prompt - lazy model resolution (_get_model) avoids errors for deterministic skills - gsm8k/skill.py: update to new StepResult API (meta, keyword args) - gsm8k/scoring.py: use .meta in step_check predicate - 5 new tests: model routing, system prompt preservation, mixed pipelines Known limitation: Inspect transcript doesn't capture individual model request/response pairs (thread context doesn't propagate); trace values and per-step scores ARE logged via StateEvent.

@lm

- arun_skill / aiter_skill / afst_match: async walker, handles mixed sync+async fns - @lm detects async def, produces async wrapper - inspect: _rebind_skill_async + _await_in_context propagate contextvars via create_task(context=) - skill_solver uses arun_skill directly, no ThreadPoolExecutor - ModelEvent now appears in per-sample transcript with input/output/tokens - scoring.py: INSPECT_MODEL env var to route through Inspect model Previously skill_solver ran the skill tree in a ThreadPoolExecutor; @lm steps bridged back via run_coroutine_threadsafe which drops contextvars. Inspect uses those to track which sample/task a model call belongs to, so model calls worked but were invisible in the transcript. Fix: run the walker async on the event loop and schedule model coroutines with create_task(coro, context=captured_ctx) (3.11+).

- passthrough_model(fn) registers any LMCaller as Inspect ModelAPI - _make_async_caller returns message_log; solver appends to state.messages

- scoring.py defaults to passthrough_model(scripted_crag_model) - model events + messages visible in inspect view without API keys

- export call_lm, default INSPECT_MODEL to passthrough_model(call_lm) - log_dir points to repo-root logs/

- log_dir points to <repo>/logs/ like other examples

@lm

* feat(inspect): add step_check for predicate-based step scoring - step_check(name, predicate) scores a step via its StepResult (value + metadata) - predicate returns float | bool | Score - complements step_scorer which delegates to an inner Inspect scorer - add gsm8k benchmark example using litellm + inspect ai eval - fix E731 in calendar_booking.py * fix naming * feat(inspect): route @lm calls through Inspect model, add gsm8k example - skill_solver rebinds @lm steps to use Inspect's model when model != none - sync-async bridge via run_in_executor + run_coroutine_threadsafe - _rebind_skill deep-copies skill tree preserving system_prompt - lazy model resolution (_get_model) avoids errors for deterministic skills - gsm8k/skill.py: update to new StepResult API (meta, keyword args) - gsm8k/scoring.py: use .meta in step_check predicate - 5 new tests: model routing, system prompt preservation, mixed pipelines Known limitation: Inspect transcript doesn't capture individual model request/response pairs (thread context doesn't propagate); trace values and per-step scores ARE logged via StateEvent. * docs(gsm8k): update skill to new API (meta, keyword StepResult args) * feat(core)!: add arun_skill, async @lm, full inspect transcript logging - arun_skill / aiter_skill / afst_match: async walker, handles mixed sync+async fns - @lm detects async def, produces async wrapper - inspect: _rebind_skill_async + _await_in_context propagate contextvars via create_task(context=) - skill_solver uses arun_skill directly, no ThreadPoolExecutor - ModelEvent now appears in per-sample transcript with input/output/tokens - scoring.py: INSPECT_MODEL env var to route through Inspect model Previously skill_solver ran the skill tree in a ThreadPoolExecutor; @lm steps bridged back via run_coroutine_threadsafe which drops contextvars. Inspect uses those to track which sample/task a model call belongs to, so model calls worked but were invisible in the transcript. Fix: run the walker async on the event loop and schedule model coroutines with create_task(coro, context=captured_ctx) (3.11+). * fmt * feat(inspect): add passthrough_model, collect messages for Messages tab - passthrough_model(fn) registers any LMCaller as Inspect ModelAPI - _make_async_caller returns message_log; solver appends to state.messages * feat(crag): add example with full inspect transcript logging - scoring.py defaults to passthrough_model(scripted_crag_model) - model events + messages visible in inspect view without API keys * fix(gsm8k): route through passthrough_model so messages appear in UI - export call_lm, default INSPECT_MODEL to passthrough_model(call_lm) - log_dir points to repo-root logs/ * fix(triage): use consistent log_dir at repo root - log_dir points to <repo>/logs/ like other examples * fmt * cleanup * rm unused

tkukurin added 14 commits April 29, 2026 23:38

Merge branch 'main' into tk/gsm8k

7562864

# Conflicts: # examples/calendar_booking.py # src/tk/llmbda/inspect.py # tests/test_inspect.py

fix naming

172fa1a

docs(gsm8k): update skill to new API (meta, keyword StepResult args)

b0c7a31

fmt

fc6a119

feat(inspect): add passthrough_model, collect messages for Messages tab

a121792

- passthrough_model(fn) registers any LMCaller as Inspect ModelAPI - _make_async_caller returns message_log; solver appends to state.messages

feat(crag): add example with full inspect transcript logging

5fc7f75

- scoring.py defaults to passthrough_model(scripted_crag_model) - model events + messages visible in inspect view without API keys

fix(gsm8k): route through passthrough_model so messages appear in UI

c6fe0ac

- export call_lm, default INSPECT_MODEL to passthrough_model(call_lm) - log_dir points to repo-root logs/

fix(triage): use consistent log_dir at repo root

f489982

- log_dir points to <repo>/logs/ like other examples

fmt

597833a

cleanup

1094d1b

rm unused

93e0a35

Base automatically changed from tk/gsm8k to main May 3, 2026 15:07

Merge branch 'main' of github.com:tkukurin/llmbda into tk/inspectfix

5698889

tkukurin merged commit 1642ac4 into main May 3, 2026
4 checks passed

tkukurin deleted the tk/inspectfix branch May 3, 2026 16:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] inspect ai#10

[fix] inspect ai#10
tkukurin merged 15 commits into
mainfrom
tk/inspectfix

tkukurin commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tkukurin commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant