To draw attention to the benchmark's utility, we should regularly test whatever the "model of the day" is (DeepSeek, Grok 3, GPT-5, etc.).