Open a Claude session (Cowork or Claude.ai) with access to this folder, then say:
"Update the LLM dashboard with the latest benchmark scores. Use the research in `ai_benchmark_report_2026.md` as your source list, verify scores against the live leaderboards, and rewrite the DATA section of `dashboard.html`."
Claude will:
- Visit the Tier 1 benchmark sources (Artificial Analysis, LMSYS Arena, Epoch AI, SWE-bench, etc.)
- Pull current scores for each tracked model
- Rewrite the `DATA` block inside `dashboard.html`
- Append an entry to the `history` array so the change is logged
- Model scores for all tracked benchmarks
- Any new models that have appeared on Tier 1 leaderboards
- The `meta.last_updated` field and `meta.version`
- A new entry in `history[]` with the date and a brief note
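The bookkeeping part of an update (the `meta` fields and the `history` entry) can be sketched as below. This is a minimal illustration, not the dashboard's actual code: it assumes the `DATA` block can be represented as a dictionary with a `meta` object and a `history` list, that history entries use `date` and `note` keys, and that the version is an opaque string chosen by whoever performs the update. The function name `record_update` and all sample values are hypothetical placeholders.

```python
import copy
from datetime import date


def record_update(data, new_version, note, today=None):
    """Return a copy of `data` with meta refreshed and a history entry appended.

    Assumed shape (not confirmed by the dashboard itself):
    data = {"meta": {"last_updated": str, "version": str}, "history": [ ... ]}
    """
    today = today or date.today()
    updated = copy.deepcopy(data)  # leave the caller's object untouched
    updated["meta"]["last_updated"] = today.isoformat()
    updated["meta"]["version"] = new_version
    # Each history entry carries the date and a brief note.
    updated["history"].append({"date": today.isoformat(), "note": note})
    return updated


# Placeholder values for illustration only.
example = {
    "meta": {"last_updated": "2026-01-05", "version": "1.0"},
    "history": [{"date": "2026-01-05", "note": "Initial load"}],
}
refreshed = record_update(
    example, "1.1", "Monthly refresh", today=date(2026, 2, 2)
)
```

Returning a copy rather than mutating in place makes it easy to compare the old and new `DATA` contents before writing anything back to `dashboard.html`.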
Tell Claude:
"Add [model name] by [provider] to the dashboard. Research its scores on our tracked benchmarks and add it to the DATA section of `dashboard.html`."
- First add it to `ai_benchmark_report_2026.md` (so it's documented)
- Then tell Claude:
"Add [benchmark name] to the dashboard as a tracked benchmark. Populate current model scores where available."
Just open `dashboard.html` in any browser; no server is needed. It works offline.
To share it with your team, share the HTML file directly (email, Slack, shared drive). Anyone can open it locally.
| Frequency | Rationale |
|---|---|
| Monthly | Sufficient for most enterprise decision-making |
| After a major model release | GPT-5.x, Gemini, Claude, or Llama releases |
| Before a model procurement decision | Always refresh before committing |
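The three triggers in the table can be expressed as a small check. This is only a sketch under stated assumptions: it assumes `meta.last_updated` is stored as an ISO date string (`YYYY-MM-DD`), interprets "monthly" as a 31-day threshold, and uses a hypothetical function name `needs_refresh` with hypothetical flag parameters.

```python
from datetime import date


def needs_refresh(last_updated, today, max_age_days=31,
                  major_release=False, procurement_pending=False):
    """Return True if any of the update triggers applies.

    last_updated: ISO date string, assumed to mirror meta.last_updated.
    max_age_days: assumed reading of the "monthly" cadence.
    """
    age_days = (today - date.fromisoformat(last_updated)).days
    return major_release or procurement_pending or age_days > max_age_days
```

For example, `needs_refresh("2026-01-05", date(2026, 1, 6), procurement_pending=True)` is true even though the data is one day old, which matches the rule to always refresh before committing to a procurement decision.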
- Scores marked `verified: true` come from official leaderboard pages or primary papers
- `verified: false` means the score is estimated or from a secondary source
- Models with <40% benchmark coverage should not be used for final decisions; trigger an update first
- The benchmark contamination caveat in `ai_benchmark_report_2026.md` applies to all scores here
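The coverage and verification rules above can be checked mechanically. The sketch below assumes each model's scores are a mapping from benchmark name to an object with `score` and `verified` fields, and that coverage means the share of tracked benchmarks for which the model has a score; the real layout inside `dashboard.html` may differ. The function names, benchmark names, and numbers are hypothetical placeholders.

```python
def coverage(model_scores, tracked_benchmarks):
    """Fraction of tracked benchmarks for which the model has a score."""
    if not tracked_benchmarks:
        return 0.0
    covered = sum(1 for name in tracked_benchmarks if name in model_scores)
    return covered / len(tracked_benchmarks)


def is_decision_ready(model_scores, tracked_benchmarks, min_coverage=0.40):
    """Models below 40% coverage should not be used for final decisions."""
    return coverage(model_scores, tracked_benchmarks) >= min_coverage


def unverified_benchmarks(model_scores):
    """Benchmarks whose score is estimated or from a secondary source."""
    return sorted(
        name for name, entry in model_scores.items()
        if not entry.get("verified", False)
    )


# Placeholder benchmark names and scores for illustration only.
tracked = ["bench_a", "bench_b", "bench_c", "bench_d", "bench_e"]
sparse_model = {"bench_a": {"score": 71.0, "verified": True}}
fuller_model = {
    "bench_a": {"score": 68.2, "verified": True},
    "bench_b": {"score": 64.5, "verified": False},
    "bench_c": {"score": 59.9, "verified": True},
}
```

Here `sparse_model` covers one of five tracked benchmarks and would call for an update first, while `fuller_model` passes the coverage check but still has one unverified score worth confirming.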