This project is at a very early stage.
There is still a lot to clean up and fix.
The goal is to develop a unified framework for evaluating LLMs, agents, and RAG systems across well-known and custom benchmarks, while providing users with statistical tools to understand and improve their systems.