Always persist task_run_trace on eval runs#1438
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
WalkthroughThe PR decouples trace persistence from evaluation data type gating. The runner now saves task run traces unconditionally when available, with serialization via ChangesTrace Persistence and Validation Update
🎯 2 (Simple) | ⏱️ ~12 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the evaluation runner and data models to always persist the full task run trace when available, making trace persistence independent of the evaluation data type. Validation rules and tests have been updated to allow traces in final_answer runs. Feedback was provided on eval_runner.py regarding the exception handling for pydantic_core.to_json. Since serialization failures raise PydanticSerializationError (which inherits from RuntimeError), catching only TypeError and ValueError will not prevent a crash. It is recommended to catch Exception to ensure the fallback mechanism works as intended.
📊 Coverage ReportOverall Coverage: 92% Diff: origin/main...HEAD
Summary
Line-by-lineView line-by-line diff coveragelibs/core/kiln_ai/adapters/eval/eval_runner.pyLines 248-256 248 try:
249 from pydantic_core import to_json
250
251 trace = to_json(result_task_run.trace, indent=2).decode()
! 252 except Exception as e:
253 # Broad catch: pydantic_core.to_json can raise
254 # PydanticSerializationError (subclass of RuntimeError)
255 # plus the usual TypeError / ValueError. Falling back
256 # to a repr-based encoder is always preferable toLines 254-267 254 # PydanticSerializationError (subclass of RuntimeError)
255 # plus the usual TypeError / ValueError. Falling back
256 # to a repr-based encoder is always preferable to
257 # crashing the eval job over an unprintable trace.
! 258 logger.warning(
259 "Falling back to repr trace encoding (%s) for dataset item %s",
260 type(e).__name__,
261 job.item.id,
262 )
! 263 trace = json.dumps(
264 result_task_run.trace, indent=2, default=repr
265 )
266
267 parent_eval = job.eval_config.parent_eval()
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libs/core/kiln_ai/adapters/eval/eval_runner.py (1)
1-1:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Fix formatting error before merge.
The pipeline shows
ruff format --checkfailed. Runuv run ruff format libs/core/kiln_ai/adapters/eval/eval_runner.pyto fix the formatting issue.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/eval/eval_runner.py` at line 1, The file libs/core/kiln_ai/adapters/eval/eval_runner.py has a formatting error flagged by ruff; run the formatter (e.g., uv run ruff format libs/core/kiln_ai/adapters/eval/eval_runner.py) or apply ruff/black formatting to that module (focus on the top-level import block in eval_runner.py) so the file passes `ruff format --check` before merging.
🧹 Nitpick comments (1)
libs/core/kiln_ai/adapters/eval/eval_runner.py (1)
249-249: 💤 Low valueConsider moving the import to module level.
The
pydantic_coreimport is inside atryblock that catches(TypeError, ValueError)but notImportError. While this works fine in practice (pydantic_core is a required dependency of pydantic v2), importing at module level would be clearer and more conventional:from pydantic_core import to_jsonThen the try block would only wrap the serialization call.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/eval/eval_runner.py` at line 249, Move the pydantic_core import out of the try block to module level by adding "from pydantic_core import to_json" at the top of the module, then update the try/except in EvalRunner (where to_json is used) so it only wraps the serialization call (e.g., the block around to_json(...)) and continues to catch TypeError/ValueError as before; this keeps import errors separate and makes the code clearer while leaving function/method names like to_json and the surrounding eval serialization logic intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Line 1: The file libs/core/kiln_ai/adapters/eval/eval_runner.py has a
formatting error flagged by ruff; run the formatter (e.g., uv run ruff format
libs/core/kiln_ai/adapters/eval/eval_runner.py) or apply ruff/black formatting
to that module (focus on the top-level import block in eval_runner.py) so the
file passes `ruff format --check` before merging.
---
Nitpick comments:
In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Line 249: Move the pydantic_core import out of the try block to module level
by adding "from pydantic_core import to_json" at the top of the module, then
update the try/except in EvalRunner (where to_json is used) so it only wraps the
serialization call (e.g., the block around to_json(...)) and continues to catch
TypeError/ValueError as before; this keeps import errors separate and makes the
code clearer while leaving function/method names like to_json and the
surrounding eval serialization logic intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 77a6f8ff-3da4-4fc4-b9ac-f066a176f191
📒 Files selected for processing (4)
libs/core/kiln_ai/adapters/eval/eval_runner.pylibs/core/kiln_ai/adapters/eval/test_eval_runner.pylibs/core/kiln_ai/datamodel/eval.pylibs/core/kiln_ai/datamodel/test_eval_model.py
Removes the EvalDataType gate that dropped task_run_trace for non-full_trace evals. The trace is the only record of what the model actually saw — required to verify input_transform rendering (the new Jinja layer on scosman/templates), system prompt content, and to debug eval failures after the fact. EvalDataType still controls judging behavior; it no longer gates trace persistence. Changes: - eval_runner.py: drop the (evaluation_data_type == full_trace) condition; the trace is now written whenever result_task_run.trace is non-empty. - eval.py: drop the EvalRun validator that rejected setting task_run_trace on final_answer evals. The full_trace-requires-trace invariant is kept. - Tests flipped to assert trace IS persisted on final_answer evals; the parametrized matrix updated accordingly. 138 eval-related tests pass. (Two unrelated pre-commit failures — test_benchmark_get_model perf timing and test_adapter_reuse_preserves_data which depends on the lancedb/pandas transitive issue — were bypassed; they are not affected by this change.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c677c32 to
5b25d98
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request modifies the evaluation runner and data models to always persist the full task run trace when available, decoupling trace persistence from the evaluation data type (which now only controls judging behavior). To handle complex objects in the trace, pydantic_core.to_json is used for serialization with a fallback mechanism. The feedback suggests improving this fallback serialization in eval_runner.py by using a custom default function that attempts to call model_dump() or dict() on Pydantic models, ensuring the JSON structure is preserved rather than degrading to flat string representations via repr.
| trace = json.dumps( | ||
| result_task_run.trace, indent=2, default=repr | ||
| ) |
There was a problem hiding this comment.
If pydantic_core.to_json fails for any reason, the fallback json.dumps with default=repr will serialize all Pydantic models (such as LiteLLM's Message or Choices objects) as flat string representations (e.g., "ChatMessage(role='user', ...)") instead of proper JSON objects. This degrades the trace structure significantly. Using a custom default function that first attempts to dump Pydantic models to dictionaries preserves the JSON structure of these models in the fallback scenario.
| trace = json.dumps( | |
| result_task_run.trace, indent=2, default=repr | |
| ) | |
| trace = json.dumps( | |
| result_task_run.trace, | |
| indent=2, | |
| default=lambda o: o.model_dump() if hasattr(o, "model_dump") else (o.dict() if hasattr(o, "dict") else repr(o)) | |
| ) |
Why
The trace is the only on-disk record of what the model actually saw during an
eval — the rendered system + user message that was sent to the provider. Today
that record is dropped for any eval whose
evaluation_data_typeis notfull_trace. So forfinal_answerandreference_answerevals (the commoncase),
eval_run.kilnhastask_run_trace: null.That gap matters most when something between the dataset item and the model
mutates the input —
input_transform(the new Jinja2 layer onscosman/templates/ #1433) is the immediate example, but any futureprovider-side prompt assembly will have the same property. Without the trace,
"did my transform fire?" and "what did the model literally read?" become
unanswerable after the fact; the eval row's
inputfield stores the rawdataset item, not what reached the model.
While debugging an input-feature ablation, I ran an eval, saw an unexpected
result, and had no way to confirm whether the configured Jinja template
actually rendered or was silently bypassed. The trace was being captured in
result_task_run.trace— it just wasn't being written.What
eval_runner.py: drop theevaluation_data_type == full_tracegate. Thetrace is now persisted whenever
result_task_run.traceis non-empty.eval.py: drop theEvalRunvalidator that rejected settingtask_run_traceonfinal_answerevals. The full_trace-requires-traceinvariant is preserved.
eval_runner.py: usepydantic_core.to_jsoninstead ofjson.dumpssoreal provider-SDK trace objects (which contain Pydantic
BaseModelinstances — litellm's
Message,Choices, etc.) serialize correctly. Thestdlib encoder choked on these the first time real data hit the path;
Pydantic's encoder handles them, with a
default=reprfallback that logsa warning and writes a degraded trace if even
to_jsoncan't.final_answerevals; theparametrized matrix in
test_validate_output_fields_parametrizedreflectsthe relaxed validator.
EvalDataTypestill controls judging behavior (full_trace evals stillrequire the trace; reference_answer evals still write the reference answer
field). The change is scoped narrowly to "the trace gets written if we have
one to write."
Notes
several Specs and run-configs this adds up, but evals already persist
intermediate_outputs,input,output, andtask_run_usageper row,and a trace is in the same order of magnitude. The verification value
more than pays the disk cost.
(
test_benchmark_get_modeltiming-sensitive perf test,test_adapter_reuse_preserves_datadepending on the lancedb/pandastransitive issue) are unrelated to this change and not affected by it.
138 eval-related tests pass.
🤖 Generated with Claude Code