A large language model (LLM) is a type of artificial intelligence (AI) program that is designed for natural language processing tasks, such as recognizing and generating text.
As a data scientist, you might want to monitor your large language models against a range of metrics to ensure the accuracy and quality of their output. You can assess areas such as summarization quality, language toxicity, and question-answering accuracy to inform and improve your model parameters.
Open Data Hub now offers Language Model Evaluation as a Service (LM-Eval-aaS) in a feature called LM-Eval. LM-Eval provides a unified framework for testing generative language models on a wide range of evaluation tasks.
The following sections show you how to create an LMEvalJob custom resource (CR), which activates an evaluation job and generates an analysis of your model's performance.
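For orientation, here is a minimal sketch of what an LMEvalJob CR might look like. The API version, model arguments, and task name shown are illustrative assumptions based on the TrustyAI operator that backs LM-Eval; consult the LMEvalJob CRD installed on your cluster for the authoritative schema.

```yaml
# A minimal, illustrative LMEvalJob CR. The apiVersion and field names are
# assumptions based on the TrustyAI service; verify against your cluster's CRD.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  # Model loader to use; "hf" loads a Hugging Face model.
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base   # hypothetical example model
  taskList:
    taskNames:
      - "arc_easy"                 # hypothetical example evaluation task
  # Record individual prompt/response samples alongside aggregate scores.
  logSamples: true
```

With a manifest like this saved as `lmevaljob.yaml`, you could apply it with `oc apply -f lmevaljob.yaml` and then inspect the job's progress and results with `oc get lmevaljob evaljob-sample -o yaml`.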