Easily evaluate open source LLMs using Ollama and Hugging Face datasets
This project provides a comprehensive testing suite to evaluate large language models (LLMs) on a wide range of natural language understanding tasks. The suite includes evaluations that cover commonsense reasoning, reading comprehension, textual entailment, and other advanced language tasks.
The suite includes the following evaluations (a sketch of the shared evaluation loop follows the list):
- Task: BoolQ is a yes/no question-answering task where the model is given a passage and a question. The model must answer yes or no based on the passage.
- Dataset Source: SuperGLUE BoolQ
- What it Tests: Reading comprehension and fact-checking abilities.
- Input: Passage and question.
- Output: Yes/No.
- Task: WinoGrande is a commonsense reasoning dataset where the model is given a sentence with an ambiguous pronoun. The model must resolve the pronoun by selecting the most plausible antecedent from two options.
- Dataset Source: WinoGrande
- What it Tests: Commonsense reasoning and pronoun resolution.
- Input: Sentence with two options for an ambiguous pronoun.
- Output: Option 1 or Option 2.
- Task: HellaSwag presents a model with a context and several plausible endings, requiring the model to select the most likely continuation of the context.
- Dataset Source: HellaSwag
- What it Tests: Commonsense reasoning and the ability to predict the next plausible action or event.
- Input: Context and multiple-choice endings.
- Output: The most plausible ending (A, B, C, or D).
- Task: RTE (Recognizing Textual Entailment) gives the model a premise and a hypothesis; the model must determine whether the hypothesis is entailed by the premise.
- Dataset Source: SuperGLUE RTE
- What it Tests: Textual entailment and natural language inference.
- Input: Premise and hypothesis.
- Output: Yes/No for entailment.
- Task: PIQA tests a model’s ability to reason about physical interactions. Given a goal and two possible solutions, the model must select the more plausible solution.
- Dataset Source: PIQA
- What it Tests: Commonsense knowledge of physical interactions.
- Input: Goal and two solutions.
- Output: The more plausible solution (Solution 1 or Solution 2).
- Task: CommonSenseQA is a multiple-choice question-answering task that tests the model’s commonsense reasoning ability.
- Dataset Source: CommonSenseQA
- What it Tests: Commonsense knowledge and reasoning.
- Input: Question with multiple-choice answers.
- Output: The correct choice (A, B, C, D, or E).
- Task: MultiRC is a reading comprehension task where the model must read a passage and answer a question by selecting the most appropriate answer from a set of multiple choices.
- Dataset Source: SuperGLUE MultiRC
- What it Tests: Complex reading comprehension.
- Input: Passage, question, and multiple-choice answers.
- Output: The correct answer.
- Task: ARC is a multiple-choice science question-answering task that requires reasoning and scientific knowledge to answer.
- Dataset Source: ARC
- What it Tests: Reasoning in scientific contexts.
- Input: Science question with multiple-choice answers.
- Output: The correct choice (A, B, C, D, or E).
- Task: CB is a textual entailment task that requires the model to decide whether a hypothesis is entailed by, contradicted by, or neutral with respect to a given premise.
- Dataset Source: SuperGLUE CB
- What it Tests: Natural language inference and understanding.
- Input: Premise and hypothesis.
- Output: Entailed, Contradicted, or Neutral.
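All of these evaluations follow the same basic loop: load the dataset from Hugging Face, prompt the model through the `ollama` client, and score the parsed answer against the label. As a rough sketch of that pattern (this is not the actual `boolq_eval.py`; the dataset identifier, prompt wording, and answer parsing here are assumptions), a BoolQ-style evaluation could look like this:

```python
# Hypothetical sketch of a BoolQ-style evaluation; the real boolq_eval.py may differ.
from datasets import load_dataset
import ollama

def evaluate_boolq(model_name, sample_size=None):
    # SuperGLUE BoolQ validation split: each row has a passage, a question, and a 0/1 label.
    data = load_dataset("super_glue", "boolq", split="validation")
    if sample_size:
        data = data.select(range(min(sample_size, len(data))))

    correct = 0
    for row in data:
        prompt = (
            f"Passage: {row['passage']}\n"
            f"Question: {row['question']}\n"
            "Answer with 'yes' or 'no' only."
        )
        response = ollama.generate(model=model_name, prompt=prompt)
        prediction = response["response"].strip().lower().startswith("yes")
        correct += int(prediction == bool(row["label"]))

    return correct / len(data)
```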
Install with pipenv and activate the virtual environment:

```bash
pipenv install
pipenv shell
```
You need to have Ollama running on your machine or on a remote server.
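If you want to confirm that the Ollama server is reachable before running an evaluation, a quick check with the `ollama` Python package (assuming the default local host and port) is:

```python
import ollama

# Raises a connection error if no Ollama server is reachable on the default host/port.
print(ollama.list())
```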
To evaluate a model on multiple datasets, use the `main.py` script:

```bash
python main.py --model <model-name> --evaluations <evaluation1> <evaluation2> --sample-size <number>
```
- `--model`: The name of the model to evaluate.
- `--evaluations`: The list of evaluations to run. You can include any combination of the following: `boolq`, `winogrande`, `hellaswag`, `rte`, `piqa`, `commonsenseqa`, `multirc`, `arc`, `cb`.
- `--sample-size`: (Optional) Limits the number of samples to evaluate from each dataset.
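For reference, flags like these are typically wired up with `argparse`; a minimal sketch that is consistent with the behavior described above (but not necessarily identical to the actual `main.py`) might look like:

```python
# Hypothetical argument parsing for the CLI described above; main.py may differ in details.
import argparse

parser = argparse.ArgumentParser(description="Evaluate an Ollama model on selected datasets.")
parser.add_argument("--model", required=True, help="Name of the Ollama model to evaluate.")
parser.add_argument(
    "--evaluations",
    nargs="+",
    default=["boolq", "winogrande", "hellaswag", "rte", "piqa",
             "commonsenseqa", "multirc", "arc", "cb"],
    help="Evaluations to run; defaults to all of them.",
)
parser.add_argument("--sample-size", type=int, default=None,
                    help="Optional cap on the number of samples per dataset.")
args = parser.parse_args()
```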
Evaluate the model `mixtral` on BoolQ, PIQA, and WinoGrande with 50 samples from each:

```bash
python main.py --model mixtral --evaluations boolq piqa winogrande --sample-size 50
```
Omitting the `--evaluations` flag runs all available evaluations.
Run all evaluations on their full datasets and save the results:

```bash
python main.py --model mixtral > results/mixtral_eval.log
```
If you are hosting Ollama as a service on another device, you can use a custom client by passing the IP address and port of the service via the `--custom-client-host` flag. For example:

```bash
python main.py --model phi3:14b --evaluations boolq --custom-client-host http://10.200.200.1:11434
```
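Under the hood, a remote host like this maps onto the `ollama` package's `Client(host=...)` constructor. A minimal sketch of talking to a remote server directly (reusing the example address above) is:

```python
from ollama import Client

# Point the client at the remote Ollama service instead of the local default.
client = Client(host="http://10.200.200.1:11434")
response = client.generate(model="phi3:14b", prompt="Say hello in one word.")
print(response["response"])
```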
To add a new evaluation, simply create a new evaluation script in the `evaluations/` folder, following the structure of existing evaluations, and add it to the `evaluation_functions` dictionary in `main.py`.
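For example, a new evaluation script might look roughly like the sketch below; the dataset name, column names, prompt, and function name are placeholders for illustration, not the exact ones used in this repository:

```python
# evaluations/myeval_eval.py -- hypothetical skeleton for a new evaluation.
# Dataset name, column names, and prompt are placeholders.
from datasets import load_dataset
import ollama

def evaluate_myeval(model_name, sample_size=None):
    data = load_dataset("my_dataset", split="validation")
    if sample_size:
        data = data.select(range(min(sample_size, len(data))))

    correct = 0
    for row in data:
        prompt = f"Question: {row['question']}\nAnswer with a single word."
        response = ollama.generate(model=model_name, prompt=prompt)
        correct += int(response["response"].strip().lower() == row["answer"].lower())

    accuracy = correct / len(data)
    print(f"myeval accuracy for {model_name}: {accuracy:.2%}")
    return accuracy

# Then, in main.py, register it so `--evaluations myeval` picks it up:
#     evaluation_functions["myeval"] = evaluate_myeval
```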
We welcome contributions to add new evaluations, improve existing ones, or optimize the framework! Please create a pull request or open an issue to suggest changes.
This project is licensed under the MIT License. See the LICENSE file for more details.
We would like to thank the creators of the datasets used in this project, as well as the contributors to the Hugging Face `datasets` library and the `ollama` client used for model interaction.
```
llm_eval_suite/
├── evaluations/
│   ├── boolq_eval.py
│   ├── hellaswag_eval.py
│   └── ...
├── models/
│   └── model_loader.py
├── main.py
├── config.py
└── README.md
```