[fix] inspect ai#10
Merged
Merged
Conversation
- step_check(name, predicate) scores a step via its StepResult (value + metadata) - predicate returns float | bool | Score - complements step_scorer which delegates to an inner Inspect scorer - add gsm8k benchmark example using litellm + inspect ai eval - fix E731 in calendar_booking.py
# Conflicts: # examples/calendar_booking.py # src/tk/llmbda/inspect.py # tests/test_inspect.py
- skill_solver rebinds @lm steps to use Inspect's model when model != none - sync-async bridge via run_in_executor + run_coroutine_threadsafe - _rebind_skill deep-copies skill tree preserving system_prompt - lazy model resolution (_get_model) avoids errors for deterministic skills - gsm8k/skill.py: update to new StepResult API (meta, keyword args) - gsm8k/scoring.py: use .meta in step_check predicate - 5 new tests: model routing, system prompt preservation, mixed pipelines Known limitation: Inspect transcript doesn't capture individual model request/response pairs (thread context doesn't propagate); trace values and per-step scores ARE logged via StateEvent.
- arun_skill / aiter_skill / afst_match: async walker, handles mixed sync+async fns - @lm detects async def, produces async wrapper - inspect: _rebind_skill_async + _await_in_context propagate contextvars via create_task(context=) - skill_solver uses arun_skill directly, no ThreadPoolExecutor - ModelEvent now appears in per-sample transcript with input/output/tokens - scoring.py: INSPECT_MODEL env var to route through Inspect model Previously skill_solver ran the skill tree in a ThreadPoolExecutor; @lm steps bridged back via run_coroutine_threadsafe which drops contextvars. Inspect uses those to track which sample/task a model call belongs to, so model calls worked but were invisible in the transcript. Fix: run the walker async on the event loop and schedule model coroutines with create_task(coro, context=captured_ctx) (3.11+).
- passthrough_model(fn) registers any LMCaller as Inspect ModelAPI - _make_async_caller returns message_log; solver appends to state.messages
- scoring.py defaults to passthrough_model(scripted_crag_model) - model events + messages visible in inspect view without API keys
- export call_lm, default INSPECT_MODEL to passthrough_model(call_lm) - log_dir points to repo-root logs/
- log_dir points to <repo>/logs/ like other examples
tkukurin
added a commit
that referenced
this pull request
May 5, 2026
* feat(inspect): add step_check for predicate-based step scoring - step_check(name, predicate) scores a step via its StepResult (value + metadata) - predicate returns float | bool | Score - complements step_scorer which delegates to an inner Inspect scorer - add gsm8k benchmark example using litellm + inspect ai eval - fix E731 in calendar_booking.py * fix naming * feat(inspect): route @lm calls through Inspect model, add gsm8k example - skill_solver rebinds @lm steps to use Inspect's model when model != none - sync-async bridge via run_in_executor + run_coroutine_threadsafe - _rebind_skill deep-copies skill tree preserving system_prompt - lazy model resolution (_get_model) avoids errors for deterministic skills - gsm8k/skill.py: update to new StepResult API (meta, keyword args) - gsm8k/scoring.py: use .meta in step_check predicate - 5 new tests: model routing, system prompt preservation, mixed pipelines Known limitation: Inspect transcript doesn't capture individual model request/response pairs (thread context doesn't propagate); trace values and per-step scores ARE logged via StateEvent. * docs(gsm8k): update skill to new API (meta, keyword StepResult args) * feat(core)!: add arun_skill, async @lm, full inspect transcript logging - arun_skill / aiter_skill / afst_match: async walker, handles mixed sync+async fns - @lm detects async def, produces async wrapper - inspect: _rebind_skill_async + _await_in_context propagate contextvars via create_task(context=) - skill_solver uses arun_skill directly, no ThreadPoolExecutor - ModelEvent now appears in per-sample transcript with input/output/tokens - scoring.py: INSPECT_MODEL env var to route through Inspect model Previously skill_solver ran the skill tree in a ThreadPoolExecutor; @lm steps bridged back via run_coroutine_threadsafe which drops contextvars. Inspect uses those to track which sample/task a model call belongs to, so model calls worked but were invisible in the transcript. Fix: run the walker async on the event loop and schedule model coroutines with create_task(coro, context=captured_ctx) (3.11+). * fmt * feat(inspect): add passthrough_model, collect messages for Messages tab - passthrough_model(fn) registers any LMCaller as Inspect ModelAPI - _make_async_caller returns message_log; solver appends to state.messages * feat(crag): add example with full inspect transcript logging - scoring.py defaults to passthrough_model(scripted_crag_model) - model events + messages visible in inspect view without API keys * fix(gsm8k): route through passthrough_model so messages appear in UI - export call_lm, default INSPECT_MODEL to passthrough_model(call_lm) - log_dir points to repo-root logs/ * fix(triage): use consistent log_dir at repo root - log_dir points to <repo>/logs/ like other examples * fmt * cleanup * rm unused
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Vibe progressing to glory.