Easily evaluate open source LLMs using Ollama and Hugging Face datasets
This project provides a comprehensive testing suite to evaluate large language models (LLMs) on a wide range of natural language understanding tasks. The suite includes evaluations that cover commonsense reasoning, reading comprehension, textual entailment, and other advanced language tasks.
The suite includes the following evaluations (a sketch of the shared evaluation loop follows the list):
- Task: BoolQ is a yes/no question-answering task where the model is given a passage and a question. The model must answer yes or no based on the passage.
- Dataset Source: SuperGLUE BoolQ
- What it Tests: Reading comprehension and fact-checking abilities.
- Input: Passage and question.
- Output: Yes/No.
- Task: WinoGrande is a commonsense reasoning dataset where the model is given a sentence with an ambiguous pronoun. The model must resolve the pronoun by selecting the most plausible antecedent from two options.
- Dataset Source: WinoGrande
- What it Tests: Commonsense reasoning and pronoun resolution.
- Input: Sentence with two options for an ambiguous pronoun.
- Output: Option 1 or Option 2.
- Task: HellaSwag presents a model with a context and several plausible endings, requiring the model to select the most likely continuation of the context.
- Dataset Source: HellaSwag
- What it Tests: Commonsense reasoning and the ability to predict the next plausible action or event.
- Input: Context and multiple-choice endings.
- Output: The most plausible ending (A, B, C, or D).
- Task: RTE (Recognizing Textual Entailment) gives the model a premise and a hypothesis; the model must determine whether the hypothesis is entailed by the premise.
- Dataset Source: SuperGLUE RTE
- What it Tests: Textual entailment and natural language inference.
- Input: Premise and hypothesis.
- Output: Yes/No for entailment.
- Task: PIQA tests a model’s ability to reason about physical interactions. Given a goal and two possible solutions, the model must select the more plausible solution.
- Dataset Source: PIQA
- What it Tests: Commonsense knowledge of physical interactions.
- Input: Goal and two solutions.
- Output: The more plausible solution (Solution 1 or Solution 2).
- Task: CommonSenseQA is a multiple-choice question-answering task that tests the model’s commonsense reasoning ability.
- Dataset Source: CommonSenseQA
- What it Tests: Commonsense knowledge and reasoning.
- Input: Question with multiple-choice answers.
- Output: The correct choice (A, B, C, D, or E).
- Task: MultiRC is a reading comprehension task where the model must read a passage and answer a question by selecting the most appropriate answer from a set of multiple choices.
- Dataset Source: SuperGLUE MultiRC
- What it Tests: Complex reading comprehension.
- Input: Passage, question, and multiple-choice answers.
- Output: The correct answer.
- Task: ARC is a multiple-choice science question-answering task that requires reasoning and scientific knowledge to answer.
- Dataset Source: ARC
- What it Tests: Reasoning in scientific contexts.
- Input: Science question with multiple-choice answers.
- Output: The correct choice (A, B, C, D, or E).
- Task: CB is a textual entailment task that requires the model to decide whether a hypothesis is entailed by, contradicted by, or neutral with respect to a given premise.
- Dataset Source: SuperGLUE CB
- What it Tests: Natural language inference and understanding.
- Input: Premise and hypothesis.
- Output: Entailed, Contradicted, or Neutral.
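All of these evaluations follow the same basic loop: load the dataset from Hugging Face, prompt the model through the `ollama` client, and score the parsed answer against the label. As a rough sketch of that pattern (this is not the actual `boolq_eval.py`; the dataset identifier, prompt wording, and answer parsing here are assumptions), a BoolQ-style evaluation could look like this:

```python
# Hypothetical sketch of a BoolQ-style evaluation; the real boolq_eval.py may differ.
from datasets import load_dataset
import ollama

def evaluate_boolq(model_name, sample_size=None):
    # SuperGLUE BoolQ validation split: each row has a passage, a question, and a 0/1 label.
    data = load_dataset("super_glue", "boolq", split="validation")
    if sample_size:
        data = data.select(range(min(sample_size, len(data))))

    correct = 0
    for row in data:
        prompt = (
            f"Passage: {row['passage']}\n"
            f"Question: {row['question']}\n"
            "Answer with 'yes' or 'no' only."
        )
        response = ollama.generate(model=model_name, prompt=prompt)
        prediction = response["response"].strip().lower().startswith("yes")
        correct += int(prediction == bool(row["label"]))

    return correct / len(data)
```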
Install with pipenv and activate the virtual environment:

```bash
pipenv install
pipenv shell
```
You need to have Ollama running on your machine or on a remote server.
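If you want to confirm that the Ollama server is reachable before running an evaluation, a quick check with the `ollama` Python package (assuming the default local host and port) is:

```python
import ollama

# Raises a connection error if no Ollama server is reachable on the default host/port.
print(ollama.list())
```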
To evaluate a model on multiple datasets, use the `main.py` script:

```bash
python main.py --model <model-name> --evaluations <evaluation1> <evaluation2> --sample-size <number>
```
- `--model`: The name of the model to evaluate.
- `--evaluations`: The list of evaluations to run. You can include any combination of the following: `boolq`, `winogrande`, `hellaswag`, `rte`, `piqa`, `commonsenseqa`, `multirc`, `arc`, `cb`.
- `--sample-size`: (Optional) Limits the number of samples to evaluate from each dataset.
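For reference, flags like these are typically wired up with `argparse`; a minimal sketch that is consistent with the behavior described above (but not necessarily identical to the actual `main.py`) might look like:

```python
# Hypothetical argument parsing for the CLI described above; main.py may differ in details.
import argparse

parser = argparse.ArgumentParser(description="Evaluate an Ollama model on selected datasets.")
parser.add_argument("--model", required=True, help="Name of the Ollama model to evaluate.")
parser.add_argument(
    "--evaluations",
    nargs="+",
    default=["boolq", "winogrande", "hellaswag", "rte", "piqa",
             "commonsenseqa", "multirc", "arc", "cb"],
    help="Evaluations to run; defaults to all of them.",
)
parser.add_argument("--sample-size", type=int, default=None,
                    help="Optional cap on the number of samples per dataset.")
args = parser.parse_args()
```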
Evaluate the model `mixtral` on BoolQ, PIQA, and WinoGrande with 50 samples from each:

```bash
python main.py --model mixtral --evaluations boolq piqa winogrande --sample-size 50
```
Omitting the `--evaluations` flag runs all available evaluations.
Run all evaluations on their full datasets and save the results:

```bash
python main.py --model mixtral > results/mixtral_eval.log
```
If you are hosting Ollama as a service on another device, you can use a custom client by passing the IP address and port of the service via the `--custom-client-host` flag. For example:

```bash
python main.py --model phi3:14b --evaluations boolq --custom-client-host http://10.200.200.1:11434
```
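Under the hood, a remote host like this maps onto the `ollama` package's `Client(host=...)` constructor. A minimal sketch of talking to a remote server directly (reusing the example address above) is:

```python
from ollama import Client

# Point the client at the remote Ollama service instead of the local default.
client = Client(host="http://10.200.200.1:11434")
response = client.generate(model="phi3:14b", prompt="Say hello in one word.")
print(response["response"])
```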
To add a new evaluation, simply create a new evaluation script in the `evaluations/` folder, following the structure of existing evaluations, and add it to the `evaluation_functions` dictionary in `main.py`.
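For example, a new evaluation script might look roughly like the sketch below; the dataset name, column names, prompt, and function name are placeholders for illustration, not the exact ones used in this repository:

```python
# evaluations/myeval_eval.py -- hypothetical skeleton for a new evaluation.
# Dataset name, column names, and prompt are placeholders.
from datasets import load_dataset
import ollama

def evaluate_myeval(model_name, sample_size=None):
    data = load_dataset("my_dataset", split="validation")
    if sample_size:
        data = data.select(range(min(sample_size, len(data))))

    correct = 0
    for row in data:
        prompt = f"Question: {row['question']}\nAnswer with a single word."
        response = ollama.generate(model=model_name, prompt=prompt)
        correct += int(response["response"].strip().lower() == row["answer"].lower())

    accuracy = correct / len(data)
    print(f"myeval accuracy for {model_name}: {accuracy:.2%}")
    return accuracy

# Then, in main.py, register it so `--evaluations myeval` picks it up:
#     evaluation_functions["myeval"] = evaluate_myeval
```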
We welcome contributions to add new evaluations, improve existing ones, or optimize the framework! Please create a pull request or open an issue to suggest changes.
This project is licensed under the MIT License. See the LICENSE file for more details.
We would like to thank the creators of the datasets used in this project, as well as the contributors to the Hugging Face `datasets` library and the `ollama` client used for model interaction.
```
llm_eval_suite/
├── evaluations/
│   ├── boolq_eval.py
│   ├── hellaswag_eval.py
│   └── ...
├── models/
│   └── model_loader.py
├── main.py
├── config.py
└── README.md
```