
[fix] inspect ai #10

Merged
tkukurin merged 15 commits into main from tk/inspectfix on May 3, 2026

Conversation

@tkukurin

Owner

Vibe progressing to glory.

tkukurin added 14 commits April 29, 2026 23:38
- step_check(name, predicate) scores a step via its StepResult (value + metadata)
- predicate returns float | bool | Score
- complements step_scorer which delegates to an inner Inspect scorer
- add gsm8k benchmark example using litellm + inspect ai eval
- fix E731 in calendar_booking.py
# Conflicts:
#	examples/calendar_booking.py
#	src/tk/llmbda/inspect.py
#	tests/test_inspect.py
- skill_solver rebinds @lm steps to use Inspect's model when model != none
- sync-async bridge via run_in_executor + run_coroutine_threadsafe
- _rebind_skill deep-copies skill tree preserving system_prompt
- lazy model resolution (_get_model) avoids errors for deterministic skills
- gsm8k/skill.py: update to new StepResult API (meta, keyword args)
- gsm8k/scoring.py: use .meta in step_check predicate
- 5 new tests: model routing, system prompt preservation, mixed pipelines

Known limitation: Inspect transcript doesn't capture individual model
request/response pairs (thread context doesn't propagate); trace values
and per-step scores ARE logged via StateEvent.
- arun_skill / aiter_skill / afst_match: async walker, handles mixed sync+async fns
- @lm detects async def, produces async wrapper
- inspect: _rebind_skill_async + _await_in_context propagate contextvars via create_task(context=)
- skill_solver uses arun_skill directly, no ThreadPoolExecutor
- ModelEvent now appears in per-sample transcript with input/output/tokens
- scoring.py: INSPECT_MODEL env var to route through Inspect model

Previously skill_solver ran the skill tree in a ThreadPoolExecutor; @lm steps
bridged back via run_coroutine_threadsafe which drops contextvars. Inspect uses
those to track which sample/task a model call belongs to, so model calls worked
but were invisible in the transcript. Fix: run the walker async on the event loop
and schedule model coroutines with create_task(coro, context=captured_ctx) (3.11+).
- passthrough_model(fn) registers any LMCaller as Inspect ModelAPI
- _make_async_caller returns message_log; solver appends to state.messages
- scoring.py defaults to passthrough_model(scripted_crag_model)
- model events + messages visible in inspect view without API keys
- export call_lm, default INSPECT_MODEL to passthrough_model(call_lm)
- log_dir points to repo-root logs/
- log_dir points to <repo>/logs/ like other examples
Base automatically changed from tk/gsm8k to main May 3, 2026 15:07
@tkukurin tkukurin merged commit 1642ac4 into main May 3, 2026
4 checks passed
@tkukurin tkukurin deleted the tk/inspectfix branch May 3, 2026 16:00
tkukurin added a commit that referenced this pull request May 5, 2026
* feat(inspect): add step_check for predicate-based step scoring

- step_check(name, predicate) scores a step via its StepResult (value + metadata)
- predicate returns float | bool | Score
- complements step_scorer which delegates to an inner Inspect scorer
- add gsm8k benchmark example using litellm + inspect ai eval
- fix E731 in calendar_booking.py
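
A rough sketch of how a predicate returning `float | bool | Score` might be coerced into one score. `Score`, `StepResult`, `normalize`, and `check_step` below are hypothetical stand-ins written for illustration, not the library's or Inspect's actual classes; only the "value + metadata" shape and the return-type union come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Union


@dataclass
class Score:
    """Simplified stand-in for an Inspect-style score (hypothetical)."""
    value: float
    explanation: str = ""


@dataclass
class StepResult:
    """Stand-in for a step's result: a value plus metadata (hypothetical)."""
    value: Any
    meta: dict = field(default_factory=dict)


Predicate = Callable[[StepResult], Union[float, bool, Score]]


def normalize(outcome: Union[float, bool, Score]) -> Score:
    """Coerce a predicate's return value into a Score."""
    if isinstance(outcome, Score):
        return outcome
    # bool is a subclass of int, so handle it before the numeric branch.
    if isinstance(outcome, bool):
        return Score(value=1.0 if outcome else 0.0)
    return Score(value=float(outcome))


def check_step(result: StepResult, predicate: Predicate) -> Score:
    """Apply the predicate to one step's result and normalize the outcome."""
    return normalize(predicate(result))
```

Under these assumptions, a predicate such as `lambda s: s.meta["parsed"] == 42` scores 1.0 or 0.0, while one that returns a `Score` passes through unchanged.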

* fix naming

* feat(inspect): route @lm calls through Inspect model, add gsm8k example

- skill_solver rebinds @lm steps to use Inspect's model when model != none
- sync-async bridge via run_in_executor + run_coroutine_threadsafe
- _rebind_skill deep-copies skill tree preserving system_prompt
- lazy model resolution (_get_model) avoids errors for deterministic skills
- gsm8k/skill.py: update to new StepResult API (meta, keyword args)
- gsm8k/scoring.py: use .meta in step_check predicate
- 5 new tests: model routing, system prompt preservation, mixed pipelines

Known limitation: Inspect transcript doesn't capture individual model
request/response pairs (thread context doesn't propagate); trace values
and per-step scores ARE logged via StateEvent.

* docs(gsm8k): update skill to new API (meta, keyword StepResult args)

* feat(core)!: add arun_skill, async @lm, full inspect transcript logging

- arun_skill / aiter_skill / afst_match: async walker, handles mixed sync+async fns
- @lm detects async def, produces async wrapper
- inspect: _rebind_skill_async + _await_in_context propagate contextvars via create_task(context=)
- skill_solver uses arun_skill directly, no ThreadPoolExecutor
- ModelEvent now appears in per-sample transcript with input/output/tokens
- scoring.py: INSPECT_MODEL env var to route through Inspect model
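
For illustration only, a minimal sketch of the two ideas in the list above: a walker that accepts both sync and async step functions, and a decorator that detects `async def` and returns a matching wrapper. The names `call_maybe_async`, `run_steps`, and `step` are hypothetical, and the flat list of steps is a simplification; the real `arun_skill` walks a skill tree and may work differently.

```python
import asyncio
import functools
import inspect
from typing import Any, Callable, Sequence


async def call_maybe_async(fn: Callable[..., Any], *args: Any) -> Any:
    """Call fn and await the result only if it is awaitable."""
    out = fn(*args)
    if inspect.isawaitable(out):
        out = await out
    return out


async def run_steps(steps: Sequence[Callable[[Any], Any]], value: Any) -> Any:
    """Thread a value through a flat list of sync or async steps."""
    for fn in steps:
        value = await call_maybe_async(fn, value)
    return value


def step(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator sketch: async functions get an async wrapper, sync ones a sync wrapper."""
    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args: Any, **kwargs: Any) -> Any:
            return await fn(*args, **kwargs)
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args: Any, **kwargs: Any) -> Any:
        return fn(*args, **kwargs)
    return sync_wrapper


@step
def double(x: int) -> int:
    return x * 2


@step
async def increment(x: int) -> int:
    await asyncio.sleep(0)  # stands in for an awaited model call
    return x + 1
```

With these definitions, `asyncio.run(run_steps([double, increment, double], 3))` runs the sync and async steps in order on one event loop, with no thread pool involved.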

Previously skill_solver ran the skill tree in a ThreadPoolExecutor; @lm steps
bridged back via run_coroutine_threadsafe which drops contextvars. Inspect uses
those to track which sample/task a model call belongs to, so model calls worked
but were invisible in the transcript. Fix: run the walker async on the event loop
and schedule model coroutines with create_task(coro, context=captured_ctx) (3.11+).

* fmt

* feat(inspect): add passthrough_model, collect messages for Messages tab

- passthrough_model(fn) registers any LMCaller as Inspect ModelAPI
- _make_async_caller returns message_log; solver appends to state.messages

* feat(crag): add example with full inspect transcript logging

- scoring.py defaults to passthrough_model(scripted_crag_model)
- model events + messages visible in inspect view without API keys

* fix(gsm8k): route through passthrough_model so messages appear in UI

- export call_lm, default INSPECT_MODEL to passthrough_model(call_lm)
- log_dir points to repo-root logs/

* fix(triage): use consistent log_dir at repo root

- log_dir points to <repo>/logs/ like other examples

* fmt

* cleanup

* rm unused