
Commit 4bef341

Add 4 elite-tier evaluation & benchmarking projects
Category 9: Evaluation, Benchmarks & Datasets

New additions:
- ARC-AGI (4,759 stars) - Abstraction and Reasoning Corpus for measuring general fluid intelligence
- AlpacaEval (1,976 stars) - Automatic evaluator for instruction-following LLMs from Stanford
- BigCode Evaluation Harness (1,040 stars) - Framework for evaluating code generation models
- E2B Code Interpreter (2,298 stars) - Secure sandbox infrastructure for evaluating code LLMs

All projects meet elite-tier criteria:
- 1000+ GitHub stars
- Active development (commits within last 3 months)
- OSI-approved licenses (Apache 2.0)
1 parent bd225c0

1 file changed: README.md (4 additions, 0 deletions)
@@ -808,6 +808,8 @@
 - **[Vectara Hallucination Leaderboard](https://github.com/vectara/hallucination-leaderboard)** ![GitHub stars](https://img.shields.io/github/stars/vectara/hallucination-leaderboard?style=social) - Leaderboard comparing LLM performance at producing hallucinations when summarizing short documents. Systematic evaluation of factual consistency across major models. Apache 2.0 licensed.
 - **[SWE-rebench (Nebius)](https://huggingface.co/datasets/nebius/SWE-rebench)** - Continuously updated benchmark with 21,000+ real-world SWE tasks for evaluating agentic LLMs. Decontaminated, mined from GitHub.
 - **[AgentBench (THUDM)](https://github.com/THUDM/AgentBench)** ![GitHub stars](https://img.shields.io/github/stars/THUDM/AgentBench?style=social) - Comprehensive benchmark to evaluate LLMs as agents across 8 diverse environments including household, web shopping, OS interaction, and database tasks. ICLR 2024. Apache 2.0 licensed.
+- **[ARC-AGI](https://github.com/fchollet/ARC-AGI)** ![GitHub stars](https://img.shields.io/github/stars/fchollet/ARC-AGI?style=social) - The Abstraction and Reasoning Corpus for measuring general fluid intelligence in AI systems. A challenging benchmark targeting human-like reasoning with program synthesis tasks. Created by François Chollet. Apache 2.0 licensed.
+- **[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval)** ![GitHub stars](https://img.shields.io/github/stars/tatsu-lab/alpaca_eval?style=social) - Automatic evaluator for instruction-following LLMs. Provides a leaderboard and benchmark for comparing model performance on open-ended instruction tasks. Stanford's widely used evaluation benchmark. Apache 2.0 licensed.
 
 
 #### Evaluation Frameworks
@@ -824,6 +826,8 @@
 - **[TruLens](https://github.com/truera/trulens)** ![GitHub stars](https://img.shields.io/github/stars/truera/trulens?style=social) - Evaluation and tracking for LLM experiments and AI agents. Provides feedback functions for measuring quality, relevance, and groundedness with LangChain and LlamaIndex integrations. MIT licensed.
 - **[OpenEvals](https://github.com/langchain-ai/openevals)** ![GitHub stars](https://img.shields.io/github/stars/langchain-ai/openevals?style=social) - Open-source evaluation library for LLM and agent applications. Built by LangChain with pre-built evaluators for common use cases including RAG, agents, and structured output validation. MIT licensed.
 - **[AutoRAG](https://github.com/Marker-Inc-Korea/AutoRAG)** ![GitHub stars](https://img.shields.io/github/stars/Marker-Inc-Korea/AutoRAG?style=social) - RAG AutoML tool for automatically finding optimal RAG pipelines. Evaluates and optimizes retrieval-augmented generation with AutoML-style automation for your own data and use case. Apache 2.0 licensed.
+- **[BigCode Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness)** ![GitHub stars](https://img.shields.io/github/stars/bigcode-project/bigcode-evaluation-harness?style=social) - Framework for evaluating autoregressive code generation language models. Supports HumanEval, MBPP, DS-1000, and other code benchmarks with distributed evaluation support. From BigCode Project/Hugging Face. Apache 2.0 licensed.
+- **[E2B Code Interpreter](https://github.com/e2b-dev/code-interpreter)** ![GitHub stars](https://img.shields.io/github/stars/e2b-dev/code-interpreter?style=social) - Python & JS/TS SDK for running AI-generated code in secure isolated sandboxes. Essential infrastructure for evaluating code-generating LLMs with safe execution environments. Apache 2.0 licensed.
 
 #### High-quality Open Datasets & Data Tools
 
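
For readers who want to try the four additions, minimal usage sketches follow; each is hedged against the respective project's README rather than verified end to end. Starting with ARC-AGI: tasks are plain JSON files holding `train` and `test` lists of input/output grids (integers 0-9), so a task loader and scorer fits in a few lines. The `identity_solver` below is a hypothetical placeholder baseline, not part of the project.

```python
import json
from pathlib import Path

def load_task(path):
    # Each ARC task file holds {"train": [...], "test": [...]}, where every
    # entry is {"input": grid, "output": grid} and a grid is a list of rows.
    with open(path) as f:
        return json.load(f)

def identity_solver(grid):
    # Hypothetical baseline: predict the input unchanged (solves almost nothing).
    return grid

def solved(task, solver):
    # A task counts as solved only if every test output is reproduced exactly.
    return all(solver(pair["input"]) == pair["output"] for pair in task["test"])

tasks_dir = Path("ARC-AGI/data/training")  # directory layout as in the repo checkout
tasks = [load_task(p) for p in sorted(tasks_dir.glob("*.json"))]
print(f"solved {sum(solved(t, identity_solver) for t in tasks)} / {len(tasks)}")
```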
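
AlpacaEval is driven by a CLI that reads a JSON list of model outputs. A minimal sketch, assuming the `{instruction, output, generator}` record format described in the project README and an `OPENAI_API_KEY` for the default judge model; the file name and `my-model` label are illustrative:

```python
import json
import subprocess

# One record per instruction; field names follow the AlpacaEval README.
outputs = [
    {
        "instruction": "Name three prime numbers.",
        "output": "2, 3, and 5 are prime numbers.",
        "generator": "my-model",  # label shown in the results table
    },
]
with open("outputs.json", "w") as f:
    json.dump(outputs, f)

# Documented CLI entry point; the default annotator calls the OpenAI API,
# so OPENAI_API_KEY must be set in the environment.
subprocess.run(["alpaca_eval", "--model_outputs", "outputs.json"], check=True)
```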
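
The BigCode Evaluation Harness is likewise launched from the command line inside a clone of the repo. The flags below are taken from its README; the model choice is only illustrative, and HumanEval requires an explicit opt-in because it executes model-generated code:

```python
import subprocess

# Run from a checkout of bigcode-project/bigcode-evaluation-harness.
subprocess.run(
    [
        "accelerate", "launch", "main.py",
        "--model", "bigcode/santacoder",   # any causal Hugging Face code model
        "--tasks", "humaneval",
        "--n_samples", "20",               # generations per problem, for pass@k
        "--allow_code_execution",          # required: HumanEval runs generated code
    ],
    check=True,
)
```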
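
Finally, E2B's code-interpreter SDK runs untrusted, model-generated code in a remote sandbox rather than on the host. A minimal sketch following the SDK quickstart, assuming `pip install e2b-code-interpreter` and an `E2B_API_KEY` in the environment:

```python
from e2b_code_interpreter import Sandbox

# Sandbox() provisions an isolated cloud sandbox; nothing executes locally.
sbx = Sandbox()
try:
    execution = sbx.run_code("print(sum(range(10)))")
    print(execution.logs.stdout)  # captured stdout lines, e.g. ['45\n']
finally:
    sbx.kill()  # release the sandbox when done
```

The isolation is the point for evaluation work: benchmark suites like HumanEval must execute arbitrary generated code, and a throwaway sandbox keeps that off the evaluation host.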
