This report catalogues active benchmarking and evaluation providers relevant to enterprise procurement and deployment of frontier AI models. It covers institutional benchmarks, academic leaderboards, and independent evaluation services active as of early 2026.
A note on benchmark contamination. All benchmarks are subject to Goodhart's Law: once a benchmark becomes a target, models are often fine-tuned or pre-trained on data that overlaps with its test set, eroding its signal. The most credible evaluations in this report either (a) use private, non-public test sets (Scale SEAL, AILuminate GAP), (b) are updated continuously with new test data, or (c) test capabilities that are hard to "train away" (e.g. novel agentic tasks, real-world tool use). When reviewing model scores, always check whether the test set is public and how recently the benchmark was introduced.
All entries in the table below use one or more of the following enterprise relevance categories:
| Category | What it covers |
|---|---|
| Latency / Throughput / Cost | Speed (tokens/sec, TTFT), cost per million tokens, throughput under load |
| Reasoning & Math | Logic, multi-step problem solving, graduate-level science, quantitative tasks |
| Coding & Software Engineering | Code generation, debugging, repo-level tasks, DevOps |
| Agentic & Tool Use | Multi-step planning, external tool/API calls, workflow automation |
| Safety & Compliance | Harmful output avoidance, hazard categories, regulatory readiness |
| Security | Prompt injection, jailbreak resistance, adversarial robustness |
| RAG & Retrieval | Long-context retrieval, document Q&A, knowledge grounding |
| Long-Context | Performance at 32k–1M+ token windows |
| Multimodal | Vision-language reasoning, chart/diagram understanding, document images |
| Multilingual | Cross-language reasoning, translation quality, non-English task performance |
| Instruction Following | Adherence to complex, verifiable constraints in prompts |
| Document Understanding | OCR, form extraction, mixed-media documents |
| Research Automation | End-to-end scientific or analytical workflows |
| Player / organisation | Benchmark / evaluation name | Enterprise relevance category | Canonical URL | Evidence of 2026+ activity | Source type | Why credible |
|---|---|---|---|---|---|---|
| Artificial Analysis | LLM Performance Leaderboard | Latency / Throughput / Cost | https://artificialanalysis.ai/leaderboards/models | Active March 2026; 314+ models tracked | Independent leaderboard | The primary industry reference for combining quality, speed (TTFT, tokens/sec), and cost in a single view. Independent of model providers; methodology is transparent and reproducible. Essential for procurement decisions. |
| Scale AI (SEAL) | SEAL LLM Leaderboards (incl. Agentic Tool Use Enterprise) | Coding; Reasoning & Math; Agentic & Tool Use | https://labs.scale.com/leaderboard | Active 2026 | Private evaluation service | Uses private, non-public test sets evaluated by verified domain experts — specifically designed to resist contamination and training-set gaming. Has a dedicated Enterprise Tool Use category. One of the few evaluators fully independent of model developers. |
| MLCommons | MLPerf Inference v6.0 | Latency / Throughput / Cost | https://mlcommons.org/en/news/mlperf-inference-v6/ | Released March 24, 2026 | Official benchmark | Industry-standard hardware/software benchmarking consortium with cross-company governance. Results are independently validated and cover full inference stack. |
| MLCommons | AILuminate v1.0 (+ Global Assurance Program) | Safety & Compliance | https://ailuminate.mlcommons.org/ | AILuminate Global Assurance Program (AIL GAP) announced Feb 2026; backed by Google, Microsoft, KPMG, Qualcomm | Official benchmark + assurance programme | Non-profit with structured methodology across 12 hazard categories. The 2026 GAP adds private benchmarking-as-a-service and a risk label for non-technical decision-makers — directly enterprise-relevant. |
| Stanford CRFM | MedHELM v4.0 | Reasoning & Math; Safety & Compliance | https://crfm.stanford.edu/helm/medhelm/latest/ | Released Jan 19, 2026 | Official benchmark | Stanford research programme with transparent methodology; covers medical summarisation and clinical reasoning; important for healthcare enterprise deployments. |
| LMSYS | Chatbot Arena | Reasoning & Math; Coding; Instruction Following | https://arena.lmsys.org/leaderboard/text | Active March 2026 | Official leaderboard | ELO-based ranking from millions of blind human preference votes. Strong signal for general chat quality; less directly applicable to enterprise task-specific needs. Does not use private test sets. |
| EleutherAI | LM Evaluation Harness | Multi-task (evaluation infrastructure) | https://github.com/EleutherAI/lm-evaluation-harness | Active 2026; underlies majority of published open-model evaluations | Open-source framework | The de facto standard evaluation harness used to run most benchmarks on open-weight models. Not a benchmark itself, but essential context: if a result cites a benchmark without specifying the harness, it may not be reproducible. |
| Hugging Face / OpenEvals | Open LLM Leaderboard | Multi-task evaluation | https://huggingface.co/spaces/OpenEvals/every-leaderboards | Active 2026 | Official leaderboard | Aggregates results across multiple benchmarks; useful for comparing open-weight models at scale, though less relevant for closed frontier models. |
| SWE-bench team | SWE-bench Verified | Coding & Software Engineering | https://www.swebench.com/ | Results Feb 2026 | Official benchmark | Peer-reviewed; tests real GitHub issue resolution rather than synthetic coding tasks. "Verified" subset removes ambiguous test cases. Strong predictor of real-world software engineering capability. |
| TIGER-Lab / community | MMLU-Pro | Reasoning & Math | https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro | Active 2026; standard self-reported metric for frontier model releases | Academic benchmark | Harder successor to MMLU with more challenging, expert-level questions; significantly harder to saturate than original MMLU. Still widely self-reported by all major labs, so necessary context for interpreting model release claims. |
| EvalPlus team | HumanEval+ / EvalPlus Leaderboard | Coding & Software Engineering | https://evalplus.github.io/leaderboard.html | Active 2026 | Academic benchmark | Augments HumanEval with 80× more test cases to catch edge cases that original HumanEval misses. Peer-reviewed; significantly harder to game than the original. Standard coding baseline. |
| David Rein et al. (Princeton) | GPQA Diamond | Reasoning & Math | https://github.com/idavidrein/gpqa | Active 2026; tracked by Artificial Analysis and Epoch AI | Academic benchmark | 198 graduate-level biology, physics, and chemistry questions that non-experts answer incorrectly ~70% of the time. Highly resistant to contamination due to expert-level difficulty. Widely reported by frontier labs as a reasoning ceiling test. |
| Google Research | IFEval | Instruction Following | https://github.com/google-research/google-research/tree/master/instruction_following_eval | Active 2026 | Academic benchmark | Tests verifiable, machine-checkable instruction constraints (e.g. "respond in fewer than 200 words", "do not use the word X"). Directly predictive of enterprise reliability; does not rely on human judgment for scoring. |
| NVIDIA Research | RULER | Long-Context | https://github.com/NVIDIA/RULER | Active 2026; multilingual extension published March 2026 | Academic benchmark | Tests real long-context performance (not just "can the model see a needle") across 32k–128k+ token windows. Exposes models that claim large context windows but degrade at practical lengths. |
| MMMU team | MMMU / MMMU-Pro | Multimodal | https://mmmu-benchmark.github.io/ | Active March 2026; MMMU-Pro published at ACL 2025 | Academic benchmark | 11.5k college-level multimodal questions across 30 subjects and 183 subfields. MMMU-Pro variant is substantially harder (models score 16–27% vs higher on standard MMMU). The standard reference for frontier vision-language capability. |
| Steel.dev | WebVoyager Leaderboard | RAG & Retrieval; Agentic & Tool Use | https://leaderboard.steel.dev/ | Updated 2026 | Official leaderboard | Real-world web navigation and retrieval tasks; relevant for evaluating browser-use and RAG agents in enterprise automation workflows. |
| Terminal-Bench consortium | Terminal-Bench 2.0 | Agentic & Tool Use; Coding & Software Engineering | https://tbench.ai | Published ICLR 2026; leaderboard active | Benchmark site + conference paper | 89 manually verified, real-workflow terminal tasks. Published at ICLR 2026. Frontier models resolve <65% of tasks. Accompanied by the open-source Harbor evaluation framework. |
| ARC Prize Foundation | ARC-AGI-3 | Reasoning & Math; Agentic & Tool Use | https://arcprize.org/arc-agi-3 | Released 2026 | Official benchmark | AGI-focused evaluation designed to resist pattern memorisation; tests novel visual reasoning that cannot be solved by training on similar examples. |
| Vectara | Hallucination Leaderboard | RAG & Retrieval; Safety & Compliance | https://github.com/vectara/hallucination-leaderboard | Updated Mar 2026 | Official repo | Widely cited; tests summarisation faithfulness. Important caveat: methodology is narrow (summarisation of news articles) and does not generalise to all forms of hallucination. Treat as one data point, not a comprehensive factuality measure. |
| GraphRAG-Bench team | GraphRAG-Bench | RAG & Retrieval | https://graphrag-bench.github.io/ | Accepted ICLR 2026 | Benchmark + paper | Academic benchmark for graph-structured retrieval and summarisation; peer-reviewed at ICLR 2026. |
| ETH Zurich / INSAIT | MathArena | Reasoning & Math | https://matharena.ai | Updated Mar 2026 | Official leaderboard | Academic institutions; uses competition-mathematics problems to test advanced quantitative reasoning. Regularly updated with new problem sets to mitigate saturation. |
| OpenReview (ICLR 2026) | BTZSC | Agentic & Tool Use (classification / routing) | https://openreview.net/forum?id=RIb4mwX3tL | Published 2026 | Peer-reviewed paper | Multi-dataset evaluation for model routing and classification; peer-reviewed. |
| OpenReview (ICLR 2026) | FuncBenchGen | Agentic & Tool Use | https://openreview.net/forum?id=al8BtP6WGf | Published 2026 | Peer-reviewed paper | Controlled evaluation framework for function/tool calling; peer-reviewed at ICLR 2026. |
| OpenReview (ICLR 2026) | WildToolBench | Agentic & Tool Use | https://openreview.net/forum?id=HUtw6wXXlP | Published 2026 | Peer-reviewed paper | Sourced from real user behaviour, not synthetic tasks; more representative of in-the-wild tool use. |
| OpenReview (ICLR 2026) | OrchestrationBench | Agentic & Tool Use | https://openreview.net/forum?id=CL6DGxRPK3 | Published 2026 | Peer-reviewed paper | Multi-domain planning and orchestration evaluation; peer-reviewed. |
| OpenReview (ICLR 2026) | TRAJECT-Bench | Agentic & Tool Use | https://openreview.net/forum?id=uLv7oQPeaH | Published 2026 | Peer-reviewed paper | Trajectory-level (not just final-answer) evaluation of tool use — useful for auditing agent behaviour step-by-step rather than just outcomes. |
| OpenReview (ICLR 2026) | Gaia2 | Agentic & Tool Use | https://openreview.net/forum?id=1xIYzBHwPo | Published 2026 | Peer-reviewed paper | Dynamic agentic environments; successor to GAIA. |
| OpenReview (ICLR 2026) | EXP-Bench | Research Automation | https://openreview.net/forum?id=UFIWu3DpeZ | Published 2026 | Peer-reviewed paper | End-to-end research task automation; peer-reviewed. |
| OpenReview (ICLR 2026) | EnConda-bench | Coding & Software Engineering | https://openreview.net/forum?id=NpY5bajFmH | Published 2026 | Peer-reviewed paper | DevOps and configuration task evaluation at the process level, not just code generation. |
| arXiv (2026) | AMA-Bench | Long-Context; Agentic & Tool Use | https://arxiv.org/abs/2602.22769 | Feb–Mar 2026 | Pre-print | Long-horizon agent memory evaluation; not yet peer-reviewed — treat with appropriate caution until published. |
| arXiv (2026) | Real5-OmniDocBench | Document Understanding | https://arxiv.org/abs/2603.04205 | Mar 2026 | Pre-print | Real-world document benchmark covering mixed-media documents; not yet peer-reviewed. |
| OpenReview (ICLR 2026) | VPI-Bench | Security | https://openreview.net/forum?id=C3t28XHpo3 | Published 2026 | Paper | Visual prompt injection evaluation — important for multimodal agent security; peer-reviewed. |
| arXiv (2026) | MPIB | Security; Safety & Compliance | https://arxiv.org/abs/2602.06268 | Feb 2026 | Pre-print | Medical-domain prompt injection evaluation; clinical safety relevance; not yet peer-reviewed. |
The following categories are underserved by the benchmarks above and are areas to watch for new evaluation work:
Multilingual performance is not addressed by any entry above. For enterprises with global operations, multilingual benchmarks (e.g. MGSM for multilingual math reasoning, ONERULER for multilingual long-context, multilingual MMLU variants) are essential and should be tracked separately.
Consistency and reliability — i.e. whether a model gives the same correct answer across re-runs, paraphrasings, or temperature settings — is not covered by any standard benchmark above. This is a significant practical concern for production systems.
Model-reported evaluations — OpenAI, Anthropic, Google, and Meta all publish their own evaluation results alongside model releases. These are not independent, but understanding what metrics they self-report is necessary context for interpreting public claims.
Cost benchmarking beyond speed — Artificial Analysis covers per-token pricing, but total cost of ownership (context caching, batching discounts, fine-tuning costs) is not systematically benchmarked anywhere.
Not all benchmarks are equally actionable. The following tiers reflect practical enterprise procurement and deployment priorities.
Tier 1 — Use these first. These cover the broadest ground and are directly relevant to procurement and deployment decisions.
- Artificial Analysis — quality + speed + cost in one place; start here for any model selection decision
- Scale AI SEAL — private test sets, enterprise-specific tool-use categories; most contamination-resistant
- MLCommons AILuminate — safety compliance baseline; increasingly referenced in procurement and regulatory contexts
- Chatbot Arena — broad quality signal across millions of human preference votes
- SWE-bench Verified — if coding or software engineering is a primary use case
Tier 2 — Use for specific capability assessment. Pull these in when evaluating models for a defined use case.
- GPQA Diamond — advanced reasoning ceiling
- MMMU / MMMU-Pro — multimodal capability
- RULER — long-context reliability
- IFEval — instruction-following precision
- Terminal-Bench 2.0 / WildToolBench / OrchestrationBench — agentic and tool-use workflows
- MedHELM — healthcare-specific deployments
- VPI-Bench / MPIB — security and safety in agentic systems
Tier 3 — Monitor but don't over-index. Academically useful but either narrow, pre-print only, or not yet widely adopted in enterprise evaluation practice.
- arXiv pre-prints (AMA-Bench, Real5-OmniDocBench, MPIB) — await peer review
- BTZSC, FuncBenchGen, TRAJECT-Bench — interesting but specialist
- Vectara Hallucination Leaderboard — useful signal but methodology is narrow; do not generalise
Last reviewed: March 2026. Benchmark landscape changes rapidly — URLs, methodology, and leaderboard standings should be re-verified quarterly.