
Commit 4bef341

Add 4 elite-tier evaluation & benchmarking projects
Category 9: Evaluation, Benchmarks & Datasets

New additions:
- ARC-AGI (4,759 stars) - Abstraction and Reasoning Corpus for measuring general fluid intelligence
- AlpacaEval (1,976 stars) - Automatic evaluator for instruction-following LLMs from Stanford
- BigCode Evaluation Harness (1,040 stars) - Framework for evaluating code generation models
- E2B Code Interpreter (2,298 stars) - Secure sandbox infrastructure for evaluating code LLMs

All projects meet elite-tier criteria:
- 1000+ GitHub stars
- Active development (commits within last 3 months)
- OSI-approved licenses (Apache 2.0)
1 parent bd225c0

1 file changed: README.md (4 additions, 0 deletions)
@@ -808,6 +808,8 @@
 - **[Vectara Hallucination Leaderboard](https://github.com/vectara/hallucination-leaderboard)** ![GitHub stars](https://img.shields.io/github/stars/vectara/hallucination-leaderboard?style=social) - Leaderboard comparing LLM performance at producing hallucinations when summarizing short documents. Systematic evaluation of factual consistency across major models. Apache 2.0 licensed.
 - **[SWE-rebench (Nebius)](https://huggingface.co/datasets/nebius/SWE-rebench)** - Continuously updated benchmark with 21,000+ real-world SWE tasks for evaluating agentic LLMs. Decontaminated, mined from GitHub.
 - **[AgentBench (THUDM)](https://github.com/THUDM/AgentBench)** ![GitHub stars](https://img.shields.io/github/stars/THUDM/AgentBench?style=social) - Comprehensive benchmark to evaluate LLMs as agents across 8 diverse environments including household, web shopping, OS interaction, and database tasks. ICLR 2024. Apache 2.0 licensed.
+- **[ARC-AGI](https://github.com/fchollet/ARC-AGI)** ![GitHub stars](https://img.shields.io/github/stars/fchollet/ARC-AGI?style=social) - The Abstraction and Reasoning Corpus for measuring general fluid intelligence in AI systems. A challenging benchmark targeting human-like reasoning with program synthesis tasks. Created by François Chollet. Apache 2.0 licensed.
+- **[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval)** ![GitHub stars](https://img.shields.io/github/stars/tatsu-lab/alpaca_eval?style=social) - Automatic evaluator for instruction-following LLMs. Provides a leaderboard and benchmark for comparing model performance on open-ended instruction tasks. Stanford's widely used evaluation benchmark. Apache 2.0 licensed.
 
 
 #### Evaluation Frameworks
@@ -824,6 +826,8 @@
 - **[TruLens](https://github.com/truera/trulens)** ![GitHub stars](https://img.shields.io/github/stars/truera/trulens?style=social) - Evaluation and tracking for LLM experiments and AI agents. Provides feedback functions for measuring quality, relevance, and groundedness with LangChain and LlamaIndex integrations. MIT licensed.
 - **[OpenEvals](https://github.com/langchain-ai/openevals)** ![GitHub stars](https://img.shields.io/github/stars/langchain-ai/openevals?style=social) - Open-source evaluation library for LLM and agent applications. Built by LangChain with pre-built evaluators for common use cases including RAG, agents, and structured output validation. MIT licensed.
 - **[AutoRAG](https://github.com/Marker-Inc-Korea/AutoRAG)** ![GitHub stars](https://img.shields.io/github/stars/Marker-Inc-Korea/AutoRAG?style=social) - RAG AutoML tool for automatically finding optimal RAG pipelines. Evaluates and optimizes retrieval-augmented generation with AutoML-style automation for your own data and use case. Apache 2.0 licensed.
+- **[BigCode Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness)** ![GitHub stars](https://img.shields.io/github/stars/bigcode-project/bigcode-evaluation-harness?style=social) - Framework for evaluating autoregressive code generation language models. Supports HumanEval, MBPP, DS-1000, and other code benchmarks with distributed evaluation support. From BigCode Project/Hugging Face. Apache 2.0 licensed.
+- **[E2B Code Interpreter](https://github.com/e2b-dev/code-interpreter)** ![GitHub stars](https://img.shields.io/github/stars/e2b-dev/code-interpreter?style=social) - Python & JS/TS SDK for running AI-generated code in secure isolated sandboxes. Essential infrastructure for evaluating code-generating LLMs with safe execution environments. Apache 2.0 licensed.
 
 #### High-quality Open Datasets & Data Tools
 
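
For readers who want to try the four additions, minimal usage sketches follow; each is hedged against the respective project's README rather than verified end to end. Starting with ARC-AGI: tasks are plain JSON files holding `train` and `test` lists of input/output grids (integers 0-9), so a task loader and scorer fits in a few lines. The `identity_solver` below is a hypothetical placeholder baseline, not part of the project.

```python
import json
from pathlib import Path

def load_task(path):
    # Each ARC task file holds {"train": [...], "test": [...]}, where every
    # entry is {"input": grid, "output": grid} and a grid is a list of rows.
    with open(path) as f:
        return json.load(f)

def identity_solver(grid):
    # Hypothetical baseline: predict the input unchanged (solves almost nothing).
    return grid

def solved(task, solver):
    # A task counts as solved only if every test output is reproduced exactly.
    return all(solver(pair["input"]) == pair["output"] for pair in task["test"])

tasks_dir = Path("ARC-AGI/data/training")  # directory layout as in the repo checkout
tasks = [load_task(p) for p in sorted(tasks_dir.glob("*.json"))]
print(f"solved {sum(solved(t, identity_solver) for t in tasks)} / {len(tasks)}")
```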
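
AlpacaEval is driven by a CLI that reads a JSON list of model outputs. A minimal sketch, assuming the `{instruction, output, generator}` record format described in the project README and an `OPENAI_API_KEY` for the default judge model; the file name and `my-model` label are illustrative:

```python
import json
import subprocess

# One record per instruction; field names follow the AlpacaEval README.
outputs = [
    {
        "instruction": "Name three prime numbers.",
        "output": "2, 3, and 5 are prime numbers.",
        "generator": "my-model",  # label shown in the results table
    },
]
with open("outputs.json", "w") as f:
    json.dump(outputs, f)

# Documented CLI entry point; the default annotator calls the OpenAI API,
# so OPENAI_API_KEY must be set in the environment.
subprocess.run(["alpaca_eval", "--model_outputs", "outputs.json"], check=True)
```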
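
The BigCode Evaluation Harness is likewise launched from the command line inside a clone of the repo. The flags below are taken from its README; the model choice is only illustrative, and HumanEval requires an explicit opt-in because it executes model-generated code:

```python
import subprocess

# Run from a checkout of bigcode-project/bigcode-evaluation-harness.
subprocess.run(
    [
        "accelerate", "launch", "main.py",
        "--model", "bigcode/santacoder",   # any causal Hugging Face code model
        "--tasks", "humaneval",
        "--n_samples", "20",               # generations per problem, for pass@k
        "--allow_code_execution",          # required: HumanEval runs generated code
    ],
    check=True,
)
```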
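
Finally, E2B's code-interpreter SDK runs untrusted, model-generated code in a remote sandbox rather than on the host. A minimal sketch following the SDK quickstart, assuming `pip install e2b-code-interpreter` and an `E2B_API_KEY` in the environment:

```python
from e2b_code_interpreter import Sandbox

# Sandbox() provisions an isolated cloud sandbox; nothing executes locally.
sbx = Sandbox()
try:
    execution = sbx.run_code("print(sum(range(10)))")
    print(execution.logs.stdout)  # captured stdout lines, e.g. ['45\n']
finally:
    sbx.kill()  # release the sandbox when done
```

The isolation is the point for evaluation work: benchmark suites like HumanEval must execute arbitrary generated code, and a throwaway sandbox keeps that off the evaluation host.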
