Always persist task_run_trace on eval runs by tawnymanticore · Pull Request #1438 · Kiln-AI/Kiln

tawnymanticore · 2026-06-01T16:38:07Z

Why

The trace is the only on-disk record of what the model actually saw during an
eval — the rendered system + user message that was sent to the provider. Today
that record is dropped for any eval whose evaluation_data_type is not
full_trace. So for final_answer and reference_answer evals (the common
case), eval_run.kiln has task_run_trace: null.

That gap matters most when something between the dataset item and the model
mutates the input — input_transform (the new Jinja2 layer on
scosman/templates / #1433) is the immediate example, but any future
provider-side prompt assembly will have the same property. Without the trace,
"did my transform fire?" and "what did the model literally read?" become
unanswerable after the fact; the eval row's input field stores the raw
dataset item, not what reached the model.

While debugging an input-feature ablation, I ran an eval, saw an unexpected
result, and had no way to confirm whether the configured Jinja template
actually rendered or was silently bypassed. The trace was being captured in
result_task_run.trace — it just wasn't being written.

What

eval_runner.py: drop the evaluation_data_type == full_trace gate. The
trace is now persisted whenever result_task_run.trace is non-empty.
eval.py: drop the EvalRun validator that rejected setting
task_run_trace on final_answer evals. The full_trace-requires-trace
invariant is preserved.
eval_runner.py: use pydantic_core.to_json instead of json.dumps so
real provider-SDK trace objects (which contain Pydantic BaseModel
instances — litellm's Message, Choices, etc.) serialize correctly. The
stdlib encoder choked on these the first time real data hit the path;
Pydantic's encoder handles them, with a default=repr fallback that logs
a warning and writes a degraded trace if even to_json can't.
Tests updated to assert traces are preserved on final_answer evals; the
parametrized matrix in test_validate_output_fields_parametrized reflects
the relaxed validator.

EvalDataType still controls judging behavior (full_trace evals still
require the trace; reference_answer evals still write the reference answer
field). The change is scoped narrowly to "the trace gets written if we have
one to write."

Notes

Storage: traces are typically 5–15 KB each. On a 1000-row dataset across
several Specs and run-configs this adds up, but evals already persist
intermediate_outputs, input, output, and task_run_usage per row,
and a trace is in the same order of magnitude. The verification value
more than pays the disk cost.
The two pre-existing pre-commit failures on this branch
(test_benchmark_get_model timing-sensitive perf test,
test_adapter_reuse_preserves_data depending on the lancedb/pandas
transitive issue) are unrelated to this change and not affected by it.
138 eval-related tests pass.

🤖 Generated with Claude Code

coderabbitai · 2026-06-01T16:38:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 20f252ab-0b02-4c20-ad9b-eb3923a83495

📥 Commits

Reviewing files that changed from the base of the PR and between c677c32 and 5b25d98.

📒 Files selected for processing (4)

libs/core/kiln_ai/adapters/eval/eval_runner.py
libs/core/kiln_ai/adapters/eval/test_eval_runner.py
libs/core/kiln_ai/datamodel/eval.py
libs/core/kiln_ai/datamodel/test_eval_model.py

🚧 Files skipped from review as they are similar to previous changes (4)

libs/core/kiln_ai/adapters/eval/test_eval_runner.py
libs/core/kiln_ai/datamodel/test_eval_model.py
libs/core/kiln_ai/adapters/eval/eval_runner.py
libs/core/kiln_ai/datamodel/eval.py

Walkthrough

The PR decouples trace persistence from evaluation data type gating. The runner now saves task run traces unconditionally when available, with serialization via pydantic_core.to_json() and fallback to json.dumps(). Validation is relaxed to allow traces on final_answer runs while maintaining requirements for full_trace runs.

Changes

Trace Persistence and Validation Update

Layer / File(s)	Summary
EvalRunner trace persistence `libs/core/kiln_ai/adapters/eval/eval_runner.py`, `libs/core/kiln_ai/adapters/eval/test_eval_runner.py`	`EvalRunner.run_job` now persists `result_task_run.trace` whenever available, using `pydantic_core.to_json()` with `json.dumps(default=repr)` fallback. Test updated to document and assert trace persistence for `final_answer` evaluation type.
EvalRun validation update `libs/core/kiln_ai/datamodel/eval.py`, `libs/core/kiln_ai/datamodel/test_eval_model.py`	`EvalRun.validate_output_fields` relaxes `task_run_trace` constraints for `final_answer` runs, permitting optional trace storage while maintaining requirements for `full_trace` runs. Tests updated to validate trace preservation and parametrized expectations adjusted accordingly.

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

leonardmq
scosman
chiang-daniel

🐰
Traces now flow free where they roam and play,
No longer locked by the type of the day,
Debug lines sparkle in JSON light,
Hopping through evals from morning to night,
A rabbit cheers builds that help us find our way.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The PR description is comprehensive and well-structured, covering the why (motivation), what (changes made), and notes (storage implications). However, it lacks the required template sections: no explicit 'Related Issues' link, no CLA confirmation statement, and no checklist validation.	Add the standard PR template sections: explicitly link related issues (e.g., `#1433`), include CLA confirmation statement with username, and complete the test checklist items.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: the PR now persists task_run_trace on all eval runs regardless of evaluation_data_type, moving away from conditional persistence based on full_trace data type.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch tawnymanticore/always-persist-eval-trace

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request updates the evaluation runner and data models to always persist the full task run trace when available, making trace persistence independent of the evaluation data type. Validation rules and tests have been updated to allow traces in final_answer runs. Feedback was provided on eval_runner.py regarding the exception handling for pydantic_core.to_json. Since serialization failures raise PydanticSerializationError (which inherits from RuntimeError), catching only TypeError and ValueError will not prevent a crash. It is recommended to catch Exception to ensure the fallback mechanism works as intended.

github-actions · 2026-06-01T16:41:00Z

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

libs/core/kiln_ai/adapters/eval/eval_runner.py (62.5%): Missing lines 252,258,263

Summary

Total: 8 lines
Missing: 3 lines
Coverage: 62%

Line-by-line

View line-by-line diff coverage

libs/core/kiln_ai/adapters/eval/eval_runner.py

Lines 248-256

  248                     try:
  249                         from pydantic_core import to_json
  250 
  251                         trace = to_json(result_task_run.trace, indent=2).decode()
! 252                     except Exception as e:
  253                         # Broad catch: pydantic_core.to_json can raise
  254                         # PydanticSerializationError (subclass of RuntimeError)
  255                         # plus the usual TypeError / ValueError. Falling back
  256                         # to a repr-based encoder is always preferable to

Lines 254-267

  254                         # PydanticSerializationError (subclass of RuntimeError)
  255                         # plus the usual TypeError / ValueError. Falling back
  256                         # to a repr-based encoder is always preferable to
  257                         # crashing the eval job over an unprintable trace.
! 258                         logger.warning(
  259                             "Falling back to repr trace encoding (%s) for dataset item %s",
  260                             type(e).__name__,
  261                             job.item.id,
  262                         )
! 263                         trace = json.dumps(
  264                             result_task_run.trace, indent=2, default=repr
  265                         )
  266 
  267                 parent_eval = job.eval_config.parent_eval()

📊 HTML Coverage Report - Interactive coverage report
📈 Diff Coverage Report - Detailed diff analysis
Github Actions Run - View the full coverage report

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

libs/core/kiln_ai/adapters/eval/eval_runner.py (1)
1-1: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Fix formatting error before merge.

The pipeline shows ruff format --check failed. Run uv run ruff format libs/core/kiln_ai/adapters/eval/eval_runner.py to fix the formatting issue.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_runner.py` at line 1, The file
libs/core/kiln_ai/adapters/eval/eval_runner.py has a formatting error flagged by
ruff; run the formatter (e.g., uv run ruff format
libs/core/kiln_ai/adapters/eval/eval_runner.py) or apply ruff/black formatting
to that module (focus on the top-level import block in eval_runner.py) so the
file passes `ruff format --check` before merging.

🧹 Nitpick comments (1)

libs/core/kiln_ai/adapters/eval/eval_runner.py (1)
249-249: 💤 Low value

Consider moving the import to module level.

The pydantic_core import is inside a try block that catches (TypeError, ValueError) but not ImportError. While this works fine in practice (pydantic_core is a required dependency of pydantic v2), importing at module level would be clearer and more conventional:
from pydantic_core import to_json
Then the try block would only wrap the serialization call.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_runner.py` at line 249, Move the
pydantic_core import out of the try block to module level by adding "from
pydantic_core import to_json" at the top of the module, then update the
try/except in EvalRunner (where to_json is used) so it only wraps the
serialization call (e.g., the block around to_json(...)) and continues to catch
TypeError/ValueError as before; this keeps import errors separate and makes the
code clearer while leaving function/method names like to_json and the
surrounding eval serialization logic intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Line 1: The file libs/core/kiln_ai/adapters/eval/eval_runner.py has a
formatting error flagged by ruff; run the formatter (e.g., uv run ruff format
libs/core/kiln_ai/adapters/eval/eval_runner.py) or apply ruff/black formatting
to that module (focus on the top-level import block in eval_runner.py) so the
file passes `ruff format --check` before merging.

---

Nitpick comments:
In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Line 249: Move the pydantic_core import out of the try block to module level
by adding "from pydantic_core import to_json" at the top of the module, then
update the try/except in EvalRunner (where to_json is used) so it only wraps the
serialization call (e.g., the block around to_json(...)) and continues to catch
TypeError/ValueError as before; this keeps import errors separate and makes the
code clearer while leaving function/method names like to_json and the
surrounding eval serialization logic intact.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 77a6f8ff-3da4-4fc4-b9ac-f066a176f191

📥 Commits

Reviewing files that changed from the base of the PR and between 89676e0 and c677c32.

📒 Files selected for processing (4)

libs/core/kiln_ai/adapters/eval/eval_runner.py
libs/core/kiln_ai/adapters/eval/test_eval_runner.py
libs/core/kiln_ai/datamodel/eval.py
libs/core/kiln_ai/datamodel/test_eval_model.py

Removes the EvalDataType gate that dropped task_run_trace for non-full_trace evals. The trace is the only record of what the model actually saw — required to verify input_transform rendering (the new Jinja layer on scosman/templates), system prompt content, and to debug eval failures after the fact. EvalDataType still controls judging behavior; it no longer gates trace persistence. Changes: - eval_runner.py: drop the (evaluation_data_type == full_trace) condition; the trace is now written whenever result_task_run.trace is non-empty. - eval.py: drop the EvalRun validator that rejected setting task_run_trace on final_answer evals. The full_trace-requires-trace invariant is kept. - Tests flipped to assert trace IS persisted on final_answer evals; the parametrized matrix updated accordingly. 138 eval-related tests pass. (Two unrelated pre-commit failures — test_benchmark_get_model perf timing and test_adapter_reuse_preserves_data which depends on the lancedb/pandas transitive issue — were bypassed; they are not affected by this change.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tawnymanticore · 2026-06-01T17:26:33Z

/gemini review

gemini-code-assist

Code Review

This pull request modifies the evaluation runner and data models to always persist the full task run trace when available, decoupling trace persistence from the evaluation data type (which now only controls judging behavior). To handle complex objects in the trace, pydantic_core.to_json is used for serialization with a fallback mechanism. The feedback suggests improving this fallback serialization in eval_runner.py by using a custom default function that attempts to call model_dump() or dict() on Pydantic models, ensuring the JSON structure is preserved rather than degrading to flat string representations via repr.

gemini-code-assist · 2026-06-01T17:27:41Z

+                        trace = json.dumps(
+                            result_task_run.trace, indent=2, default=repr
+                        )


If pydantic_core.to_json fails for any reason, the fallback json.dumps with default=repr will serialize all Pydantic models (such as LiteLLM's Message or Choices objects) as flat string representations (e.g., "ChatMessage(role='user', ...)") instead of proper JSON objects. This degrades the trace structure significantly. Using a custom default function that first attempts to dump Pydantic models to dictionaries preserves the JSON structure of these models in the fallback scenario.

Suggested change

trace = json.dumps(

result_task_run.trace, indent=2, default=repr

)

trace = json.dumps(

result_task_run.trace,

indent=2,

default=lambda o: o.model_dump() if hasattr(o, "model_dump") else (o.dict() if hasattr(o, "dict") else repr(o))

)

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread libs/core/kiln_ai/adapters/eval/eval_runner.py Outdated

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

tawnymanticore force-pushed the tawnymanticore/always-persist-eval-trace branch from c677c32 to 5b25d98 Compare June 1, 2026 17:24

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Always persist task_run_trace on eval runs#1438

Always persist task_run_trace on eval runs#1438
tawnymanticore wants to merge 1 commit into
mainfrom
tawnymanticore/always-persist-eval-trace

tawnymanticore commented Jun 1, 2026

Uh oh!

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

libs/core/kiln_ai/adapters/eval/eval_runner.py

Uh oh!

coderabbitai Bot left a comment

Uh oh!

tawnymanticore commented Jun 1, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tawnymanticore commented Jun 1, 2026

Why

What

Notes

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Suggested reviewers

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Coverage Report

Diff: origin/main...HEAD

Summary

Line-by-line

libs/core/kiln_ai/adapters/eval/eval_runner.py

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

tawnymanticore commented Jun 1, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 1, 2026 •

edited

Loading