A killer feature would be an eval harness that runs different models and shows how they compare to each other across the most popular evals available.
Pre-checks
What problem are you trying to solve?
What would you like NexaSDK to do?
- Add an eval harness that lets the user select from the most popular evals available, as well as custom evals supplied via a .json file.
Alternatives you've considered
Who does this help, and how much?
Additional context
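To make the request more concrete, here is a minimal sketch of how a custom eval defined in .json could be loaded and run against several models. Everything in it is an assumption for illustration: the JSON layout (`name`, `cases`, `prompt`, `expected`), the function names `load_eval` and `run_eval`, and the exact-match scoring are all hypothetical, and the two "models" are plain Python stand-ins rather than anything from NexaSDK.

```python
import json
from typing import Callable, Dict

# Hypothetical custom-eval format (an assumption, not an existing NexaSDK schema):
# a name plus a list of prompt/expected pairs.
CUSTOM_EVAL_JSON = """
{
  "name": "capital-cities",
  "cases": [
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "Capital of Japan?", "expected": "Tokyo"}
  ]
}
"""

# A "model" here is any callable that maps a prompt to a completion.
Model = Callable[[str], str]


def load_eval(text: str) -> dict:
    """Parse a custom eval from JSON text and check the required keys."""
    spec = json.loads(text)
    if "name" not in spec or not spec.get("cases"):
        raise ValueError("eval needs a 'name' and a non-empty 'cases' list")
    return spec


def run_eval(spec: dict, models: Dict[str, Model]) -> Dict[str, float]:
    """Return exact-match accuracy (case-insensitive) per model name."""
    scores = {}
    for model_name, generate in models.items():
        correct = sum(
            generate(case["prompt"]).strip().lower()
            == case["expected"].strip().lower()
            for case in spec["cases"]
        )
        scores[model_name] = correct / len(spec["cases"])
    return scores


# Stand-in models so the sketch runs without any SDK installed.
def model_a(prompt: str) -> str:
    answers = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    return "Paris"  # always gives the same answer


if __name__ == "__main__":
    spec = load_eval(CUSTOM_EVAL_JSON)
    results = run_eval(spec, {"model_a": model_a, "model_b": model_b})
    # Print a simple leaderboard, best model first.
    for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"{spec['name']} | {name}: {acc:.0%}")
```

In a real harness the stand-in callables would be replaced by whatever inference call the SDK exposes, and the built-in "popular evals" could ship as bundled files in the same format as user-supplied ones.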