#101
I have developed some scripts for this, but it may be worth integrating it into the eval pipeline.
Currently, we lack observability into agent behavior:
- Traces are long, and nobody bothers to read them.
- Traces alone are not enough; we also need the evaluation and fault-injection logic to understand a run.
- Often the agent did the right thing, but the benchmark itself has bugs.
For summary and analysis, we can feed the full context to the most capable LLM available (strong reasoning, large context window) at low cost, since it is a single call per run.
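A minimal sketch of what that single-call analysis step could look like. The function and section names here are hypothetical, not part of the existing scripts; the idea is just to bundle the trace with the evaluation and fault-injection logs into one prompt so the model can distinguish agent mistakes from benchmark bugs.

```python
def build_analysis_prompt(trace: str, eval_log: str, fault_log: str) -> str:
    """Assemble the full run context for a one-shot LLM analysis call.

    Hypothetical helper: argument names and section headers are
    illustrative assumptions, not an existing pipeline API.
    """
    sections = [
        ("Agent trace", trace),
        ("Evaluation log", eval_log),
        ("Fault-injection log", fault_log),
    ]
    body = "\n\n".join(f"## {title}\n{text}" for title, text in sections)
    instructions = (
        "Summarize the agent's behavior, judge whether any failure was "
        "caused by the agent or by a bug in the benchmark itself, and "
        "cite the relevant trace steps."
    )
    return f"{instructions}\n\n{body}"
```

The returned string would then be sent as a single message to whatever large-context model the pipeline settles on.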