AI Monitoring & Evaluation
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR team.
A framework for few-shot evaluation of language models (see the usage sketch after this list).
Accelerating the development of large multimodal models (LMMs) with the one-click evaluation module lmms-eval.
Generate ideal question-answer pairs for testing RAG (a generic sketch follows this list).
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama 3, Mistral, InternLM2, GPT-4, Llama 2, Qwen, GLM, Claude, etc.) on 100+ datasets.
LLMPerf is a library for validating and benchmarking LLMs
LLM Serving Performance Evaluation Harness
Supercharge Your LLM Application Evaluations 🚀
Evaluate the accuracy of LLM-generated outputs (a generic sketch follows this list).
Text2SQL-Eval is a Text-to-SQL evaluation component for LLMs, trained on an open-source training dataset.
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
RAGChecker: A Fine-grained Framework For Diagnosing RAG
Laminar - open-source all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics. (A usage sketch follows this list.)
[ICLR 2025] The First Multimodal Search Engine Pipeline and Benchmark for LMMs
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
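
The "framework for few-shot evaluation of language models" above appears to be EleutherAI's lm-evaluation-harness; assuming that, and the v0.4-style `simple_evaluate` entry point, here is a minimal usage sketch (the model and task choices are illustrative, not prescribed by the project):

```python
# Minimal sketch, assuming the lm-evaluation-harness v0.4-style Python API
# (lm_eval.simple_evaluate); model and task choices here are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],                           # any registered task name works
    num_fewshot=5,                                 # number of in-context examples
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, etc.) live under "results".
print(results["results"]["hellaswag"])
```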
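For the "generate ideal question-answer pairs for testing RAG" item, the listed tool's own API is not reproduced here; the sketch below is a generic, hypothetical approach that prompts an LLM through the OpenAI SDK to draft QA pairs from a source passage:

```python
# Hypothetical sketch (not the listed tool's API): generate question-answer
# pairs from source passages so a RAG pipeline can be tested against them.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa_pairs(passage: str, n: int = 3) -> list[dict]:
    """Ask the model for n question-answer pairs grounded in the passage."""
    prompt = (
        f"Write {n} question-answer pairs that can be answered solely from the "
        f"passage below. Return a JSON list of objects with 'question' and "
        f"'answer' keys.\n\nPassage:\n{passage}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # May raise if the model returns non-JSON text; fine for a sketch.
    return json.loads(response.choices[0].message.content)

pairs = generate_qa_pairs("The Eiffel Tower was completed in 1889 in Paris.")
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```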
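The accuracy-evaluation item is likewise illustrated only with a generic exact-match sketch, not the listed project's API; the normalization rules are assumptions:

```python
# Generic sketch (not the listed project's API): exact-match accuracy of
# LLM outputs against reference answers, with light normalization.
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references, strict=True)
    )
    return matches / len(references)

preds = ["Paris.", "42", "Blue whale"]
refs = ["paris", "42", "The blue whale"]
print(f"exact match: {exact_match_accuracy(preds, refs):.2f}")  # 0.67
```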
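Finally, for Evidently, a minimal sketch of its report workflow, assuming the 0.4-era `Report`/`DataDriftPreset` module paths (these have shifted in later releases) and two illustrative DataFrames:

```python
# Minimal sketch, assuming Evidently's 0.4-era Report API (module paths have
# changed in later releases); the toy DataFrames here are illustrative.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference_df = pd.DataFrame({"latency_ms": [120, 135, 110], "tokens": [80, 95, 70]})
current_df = pd.DataFrame({"latency_ms": [180, 210, 190], "tokens": [85, 100, 90]})

# Compare the current window of production data against the reference window.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("data_drift.html")  # shareable HTML dashboard
```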