Skip to content

Always persist task_run_trace on eval runs#1438

Open
tawnymanticore wants to merge 1 commit into
mainfrom
tawnymanticore/always-persist-eval-trace
Open

Always persist task_run_trace on eval runs#1438
tawnymanticore wants to merge 1 commit into
mainfrom
tawnymanticore/always-persist-eval-trace

Conversation

@tawnymanticore
Copy link
Copy Markdown
Collaborator

Why

The trace is the only on-disk record of what the model actually saw during an
eval — the rendered system + user message that was sent to the provider. Today
that record is dropped for any eval whose evaluation_data_type is not
full_trace. So for final_answer and reference_answer evals (the common
case), eval_run.kiln has task_run_trace: null.

That gap matters most when something between the dataset item and the model
mutates the input — input_transform (the new Jinja2 layer on
scosman/templates / #1433) is the immediate example, but any future
provider-side prompt assembly will have the same property. Without the trace,
"did my transform fire?" and "what did the model literally read?" become
unanswerable after the fact; the eval row's input field stores the raw
dataset item, not what reached the model.

While debugging an input-feature ablation, I ran an eval, saw an unexpected
result, and had no way to confirm whether the configured Jinja template
actually rendered or was silently bypassed. The trace was being captured in
result_task_run.trace — it just wasn't being written.

What

  • eval_runner.py: drop the evaluation_data_type == full_trace gate. The
    trace is now persisted whenever result_task_run.trace is non-empty.
  • eval.py: drop the EvalRun validator that rejected setting
    task_run_trace on final_answer evals. The full_trace-requires-trace
    invariant is preserved.
  • eval_runner.py: use pydantic_core.to_json instead of json.dumps so
    real provider-SDK trace objects (which contain Pydantic BaseModel
    instances — litellm's Message, Choices, etc.) serialize correctly. The
    stdlib encoder choked on these the first time real data hit the path;
    Pydantic's encoder handles them, with a default=repr fallback that logs
    a warning and writes a degraded trace if even to_json can't.
  • Tests updated to assert traces are preserved on final_answer evals; the
    parametrized matrix in test_validate_output_fields_parametrized reflects
    the relaxed validator.

EvalDataType still controls judging behavior (full_trace evals still
require the trace; reference_answer evals still write the reference answer
field). The change is scoped narrowly to "the trace gets written if we have
one to write."

Notes

  • Storage: traces are typically 5–15 KB each. On a 1000-row dataset across
    several Specs and run-configs this adds up, but evals already persist
    intermediate_outputs, input, output, and task_run_usage per row,
    and a trace is in the same order of magnitude. The verification value
    more than pays the disk cost.
  • The two pre-existing pre-commit failures on this branch
    (test_benchmark_get_model timing-sensitive perf test,
    test_adapter_reuse_preserves_data depending on the lancedb/pandas
    transitive issue) are unrelated to this change and not affected by it.
    138 eval-related tests pass.

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 20f252ab-0b02-4c20-ad9b-eb3923a83495

📥 Commits

Reviewing files that changed from the base of the PR and between c677c32 and 5b25d98.

📒 Files selected for processing (4)
  • libs/core/kiln_ai/adapters/eval/eval_runner.py
  • libs/core/kiln_ai/adapters/eval/test_eval_runner.py
  • libs/core/kiln_ai/datamodel/eval.py
  • libs/core/kiln_ai/datamodel/test_eval_model.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • libs/core/kiln_ai/adapters/eval/test_eval_runner.py
  • libs/core/kiln_ai/datamodel/test_eval_model.py
  • libs/core/kiln_ai/adapters/eval/eval_runner.py
  • libs/core/kiln_ai/datamodel/eval.py

Walkthrough

The PR decouples trace persistence from evaluation data type gating. The runner now saves task run traces unconditionally when available, with serialization via pydantic_core.to_json() and fallback to json.dumps(). Validation is relaxed to allow traces on final_answer runs while maintaining requirements for full_trace runs.

Changes

Trace Persistence and Validation Update

Layer / File(s) Summary
EvalRunner trace persistence
libs/core/kiln_ai/adapters/eval/eval_runner.py, libs/core/kiln_ai/adapters/eval/test_eval_runner.py
EvalRunner.run_job now persists result_task_run.trace whenever available, using pydantic_core.to_json() with json.dumps(default=repr) fallback. Test updated to document and assert trace persistence for final_answer evaluation type.
EvalRun validation update
libs/core/kiln_ai/datamodel/eval.py, libs/core/kiln_ai/datamodel/test_eval_model.py
EvalRun.validate_output_fields relaxes task_run_trace constraints for final_answer runs, permitting optional trace storage while maintaining requirements for full_trace runs. Tests updated to validate trace preservation and parametrized expectations adjusted accordingly.

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • leonardmq
  • scosman
  • chiang-daniel

🐰
Traces now flow free where they roam and play,
No longer locked by the type of the day,
Debug lines sparkle in JSON light,
Hopping through evals from morning to night,
A rabbit cheers builds that help us find our way.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description is comprehensive and well-structured, covering the why (motivation), what (changes made), and notes (storage implications). However, it lacks the required template sections: no explicit 'Related Issues' link, no CLA confirmation statement, and no checklist validation. Add the standard PR template sections: explicitly link related issues (e.g., #1433), include CLA confirmation statement with username, and complete the test checklist items.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: the PR now persists task_run_trace on all eval runs regardless of evaluation_data_type, moving away from conditional persistence based on full_trace data type.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch tawnymanticore/always-persist-eval-trace

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the evaluation runner and data models to always persist the full task run trace when available, making trace persistence independent of the evaluation data type. Validation rules and tests have been updated to allow traces in final_answer runs. Feedback was provided on eval_runner.py regarding the exception handling for pydantic_core.to_json. Since serialization failures raise PydanticSerializationError (which inherits from RuntimeError), catching only TypeError and ValueError will not prevent a crash. It is recommended to catch Exception to ensure the fallback mechanism works as intended.

Comment thread libs/core/kiln_ai/adapters/eval/eval_runner.py Outdated
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

  • libs/core/kiln_ai/adapters/eval/eval_runner.py (62.5%): Missing lines 252,258,263

Summary

  • Total: 8 lines
  • Missing: 3 lines
  • Coverage: 62%

Line-by-line

View line-by-line diff coverage

libs/core/kiln_ai/adapters/eval/eval_runner.py

Lines 248-256

  248                     try:
  249                         from pydantic_core import to_json
  250 
  251                         trace = to_json(result_task_run.trace, indent=2).decode()
! 252                     except Exception as e:
  253                         # Broad catch: pydantic_core.to_json can raise
  254                         # PydanticSerializationError (subclass of RuntimeError)
  255                         # plus the usual TypeError / ValueError. Falling back
  256                         # to a repr-based encoder is always preferable to

Lines 254-267

  254                         # PydanticSerializationError (subclass of RuntimeError)
  255                         # plus the usual TypeError / ValueError. Falling back
  256                         # to a repr-based encoder is always preferable to
  257                         # crashing the eval job over an unprintable trace.
! 258                         logger.warning(
  259                             "Falling back to repr trace encoding (%s) for dataset item %s",
  260                             type(e).__name__,
  261                             job.item.id,
  262                         )
! 263                         trace = json.dumps(
  264                             result_task_run.trace, indent=2, default=repr
  265                         )
  266 
  267                 parent_eval = job.eval_config.parent_eval()


Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libs/core/kiln_ai/adapters/eval/eval_runner.py (1)

1-1: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Fix formatting error before merge.

The pipeline shows ruff format --check failed. Run uv run ruff format libs/core/kiln_ai/adapters/eval/eval_runner.py to fix the formatting issue.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_runner.py` at line 1, The file
libs/core/kiln_ai/adapters/eval/eval_runner.py has a formatting error flagged by
ruff; run the formatter (e.g., uv run ruff format
libs/core/kiln_ai/adapters/eval/eval_runner.py) or apply ruff/black formatting
to that module (focus on the top-level import block in eval_runner.py) so the
file passes `ruff format --check` before merging.
🧹 Nitpick comments (1)
libs/core/kiln_ai/adapters/eval/eval_runner.py (1)

249-249: 💤 Low value

Consider moving the import to module level.

The pydantic_core import is inside a try block that catches (TypeError, ValueError) but not ImportError. While this works fine in practice (pydantic_core is a required dependency of pydantic v2), importing at module level would be clearer and more conventional:

from pydantic_core import to_json

Then the try block would only wrap the serialization call.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/adapters/eval/eval_runner.py` at line 249, Move the
pydantic_core import out of the try block to module level by adding "from
pydantic_core import to_json" at the top of the module, then update the
try/except in EvalRunner (where to_json is used) so it only wraps the
serialization call (e.g., the block around to_json(...)) and continues to catch
TypeError/ValueError as before; this keeps import errors separate and makes the
code clearer while leaving function/method names like to_json and the
surrounding eval serialization logic intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Line 1: The file libs/core/kiln_ai/adapters/eval/eval_runner.py has a
formatting error flagged by ruff; run the formatter (e.g., uv run ruff format
libs/core/kiln_ai/adapters/eval/eval_runner.py) or apply ruff/black formatting
to that module (focus on the top-level import block in eval_runner.py) so the
file passes `ruff format --check` before merging.

---

Nitpick comments:
In `@libs/core/kiln_ai/adapters/eval/eval_runner.py`:
- Line 249: Move the pydantic_core import out of the try block to module level
by adding "from pydantic_core import to_json" at the top of the module, then
update the try/except in EvalRunner (where to_json is used) so it only wraps the
serialization call (e.g., the block around to_json(...)) and continues to catch
TypeError/ValueError as before; this keeps import errors separate and makes the
code clearer while leaving function/method names like to_json and the
surrounding eval serialization logic intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 77a6f8ff-3da4-4fc4-b9ac-f066a176f191

📥 Commits

Reviewing files that changed from the base of the PR and between 89676e0 and c677c32.

📒 Files selected for processing (4)
  • libs/core/kiln_ai/adapters/eval/eval_runner.py
  • libs/core/kiln_ai/adapters/eval/test_eval_runner.py
  • libs/core/kiln_ai/datamodel/eval.py
  • libs/core/kiln_ai/datamodel/test_eval_model.py

Removes the EvalDataType gate that dropped task_run_trace for non-full_trace
evals. The trace is the only record of what the model actually saw — required
to verify input_transform rendering (the new Jinja layer on
scosman/templates), system prompt content, and to debug eval failures after
the fact. EvalDataType still controls judging behavior; it no longer gates
trace persistence.

Changes:
- eval_runner.py: drop the (evaluation_data_type == full_trace) condition;
  the trace is now written whenever result_task_run.trace is non-empty.
- eval.py: drop the EvalRun validator that rejected setting task_run_trace
  on final_answer evals. The full_trace-requires-trace invariant is kept.
- Tests flipped to assert trace IS persisted on final_answer evals; the
  parametrized matrix updated accordingly.

138 eval-related tests pass. (Two unrelated pre-commit failures —
test_benchmark_get_model perf timing and test_adapter_reuse_preserves_data
which depends on the lancedb/pandas transitive issue — were bypassed; they
are not affected by this change.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tawnymanticore tawnymanticore force-pushed the tawnymanticore/always-persist-eval-trace branch from c677c32 to 5b25d98 Compare June 1, 2026 17:24
@tawnymanticore
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the evaluation runner and data models to always persist the full task run trace when available, decoupling trace persistence from the evaluation data type (which now only controls judging behavior). To handle complex objects in the trace, pydantic_core.to_json is used for serialization with a fallback mechanism. The feedback suggests improving this fallback serialization in eval_runner.py by using a custom default function that attempts to call model_dump() or dict() on Pydantic models, ensuring the JSON structure is preserved rather than degrading to flat string representations via repr.

Comment on lines +263 to +265
trace = json.dumps(
result_task_run.trace, indent=2, default=repr
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If pydantic_core.to_json fails for any reason, the fallback json.dumps with default=repr will serialize all Pydantic models (such as LiteLLM's Message or Choices objects) as flat string representations (e.g., "ChatMessage(role='user', ...)") instead of proper JSON objects. This degrades the trace structure significantly. Using a custom default function that first attempts to dump Pydantic models to dictionaries preserves the JSON structure of these models in the fallback scenario.

Suggested change
trace = json.dumps(
result_task_run.trace, indent=2, default=repr
)
trace = json.dumps(
result_task_run.trace,
indent=2,
default=lambda o: o.model_dump() if hasattr(o, "model_dump") else (o.dict() if hasattr(o, "dict") else repr(o))
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant