v0.2.0
0.2.0 (2025-08-11)
Features
- add DROP (simple-evals) (#20) (f85bf19)
- add Humanity's Last Exam (HLE) benchmark (#23) (6f10fb7)
- add MATH and MATH-500 benchmarks for mathematical problem solving (#22) (9c6843b)
- add MGSM (#18) (bec1a7c)
- add openai MRCR benchmark for long context recall (#24) (1b09ebd)
- HealthBench (#16) (2caa47d)
Documentation
- update CLAUDE.md with pre-commit and dependency pinning requirements (f33730e)
Chores
- GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (1a00342)