Bug description
There appears to be an issue with how LLM traces are captured and processed. Currently, tool call responses (tool outputs) are missing from the trace logs.
This data gap is propagating directly to the Evaluation suite. Because the evaluations (e.g. for Helpfulness and Hallucination) do not see the tool responses in the context history, they incorrectly flag the final LLM output as a hallucination.
For example, if the LLM provides specific figures or data points retrieved via a tool, the evaluator marks it as a hallucination because it cannot find the source of that data in the incomplete trace. This makes the automated evaluation metrics unreliable for any flows involving tool usage.
How to reproduce
- Run a trace for an agent or LLM call that utilizes native tool calling (Function Calling).
- Ensure the tool returns a successful response and the LLM uses that data to formulate its final answer.
- Open the specific trace in the dashboard and observe that while the tool call is logged, the corresponding tool response is absent from the timeline/context.
- Optional: Run an evaluation (e.g., Hallucination or Helpfulness) on this trace. This step obviously is only failing because the tool response is missing before so this is not the issue itself.
Related sub-libraries
I am using the posthog 7.13.0 python package (maybe you should adjust the following list to also include that)
Additional context
If you want an example trace from my posthog project check this out:
"trace_id": "e0b46db1-b5b0-4ff3-873d-d486538a1156",
"timestamp": "2026-05-14T09:39:22.804000+02:00",
Thank you for your bug report – we love squashing them!
Bug description
There appears to be an issue with how LLM traces are captured and processed. Currently, tool call responses (tool outputs) are missing from the trace logs.
This data gap is propagating directly to the Evaluation suite. Because the evaluations (e.g. for Helpfulness and Hallucination) do not see the tool responses in the context history, they incorrectly flag the final LLM output as a hallucination.
For example, if the LLM provides specific figures or data points retrieved via a tool, the evaluator marks it as a hallucination because it cannot find the source of that data in the incomplete trace. This makes the automated evaluation metrics unreliable for any flows involving tool usage.
How to reproduce
Related sub-libraries
I am using the posthog 7.13.0 python package (maybe you should adjust the following list to also include that)
Additional context
If you want an example trace from my posthog project check this out:
Thank you for your bug report – we love squashing them!