SloBench evaluation for generative models

This framework supports evaluation of generative (decoder-type) models on SloBench tasks. The framework can be used either for offline evaluation of models on the validation set or to prepare a test set submission for online evaluation. Currently supported tasks:

  • Slovene SuperGLUE
  • SI-NLI
  • Machine Translation (ENG -> SLO)

Currently supported model libraries:

  • Huggingface
  • NeMo
  • vLLM

Requirements

All libraries required to run the framework with Huggingface models are listed in environment.yaml. For evaluation of NeMo models, we recommend running the framework inside the official NeMo container or one of its derivatives, such as dvres/slopt_nemo (which adds support for the GaMS-1B model). NeMo containers already include all necessary libraries (they also support Huggingface models), so no additional installation is required.


Installation

To install the framework, clone the repository and install the required packages (if they are not already installed).

  1. Clone the repo:

    git clone https://github.com/SloLama/slobench_evaluation.git
  2. Navigate to the project directory:

    cd slobench_evaluation
  3. Create a conda environment:

    conda create --name slobench_evaluation
  4. Activate the conda environment:

    conda activate slobench_evaluation
  5. Install dependencies from environment.yaml (if necessary):

    conda env update --file environment.yaml

Usage

Offline evaluation

To run the offline evaluation of a model, use the following command:

cd src
python evaluation_engine.py --config=<path_to_the_config_file> --output_file=<path_to_the_output_file>

The config file should be a JSON file containing all the parameters needed for the experiment. See example_config_hf.json for running Huggingface models and example_config_nemo.json for running NeMo models. A separate JSON file stores the prompt schemes for all prompts used in the experiment; see prompt_schemes_example.json for an example.

The output file is a plain-text (.txt) file where the experiment results will be stored.

Config fields

Here's a detailed description of each field in the provided JSON configuration:

Top-Level Fields
  1. model:

    • Contains information about the model used for the tasks.

    • library: Specifies the library used to load the model. Supported options are huggingface, nemo and vllm.

    • path: The local path to the model or the model ID in HuggingFace's model hub.

    • apply_chat_template (Huggingface and vLLM models only): Indicates whether the model's chat template is applied to the prompts.

    • guided_decoding (vLLM models only): Indicates whether the model's output is restricted to a fixed set of allowed values. This option is not supported for the WSC_generative benchmark.

    • model_kwargs (vLLM models only): Additional arguments passed to the vLLM LLM class on initialization.

  2. prompt_scheme_file:

    • Path to the JSON file that stores the prompt schemes for all prompts used in the evaluation.
  3. benchmarks: An array of benchmark tasks used to evaluate the model. Each benchmark entry evaluates the model on a specific dataset.

Fields for Each Benchmark Object:

  1. dataset:

    • Name of the dataset used for evaluation (e.g., BoolQ, MultiRC, etc.).
  2. human_translated (true/false):

    • Specifies whether the human-translated version of the train and validation sets is loaded.
  3. machine_translated (true/false):

    • Specifies whether the machine-translated version of the train and validation sets is loaded. If human_translated is also set to true, machine-translated examples that also have a human translation are replaced with the human translations.
  4. seed:

    • Random seed value used for reproducibility. Ensures that sampling of few-shot examples is consistent over multiple runs.
  5. evaluation:

    • Defines the evaluation parameters.

    • majority_correlation (true/false): If true, the evaluation will check how the model's output correlates with the majority label among the few-shot examples.

    • last_example_correlation (true/false): If true, the evaluation will check how the model's output correlates with the label of the last few-shot example.

    • ci: Confidence Interval (CI) settings for evaluating the model's performance.

      • type: Specifies how the confidence interval is calculated. Common types include std (standard deviation) and quantile_bootstrap (sampling-based method).
      • alpha: Confidence level, typically 0.95 (95% confidence interval).
      • bootstrap_samples: Number of samples to use when using bootstrapping for CI calculation (e.g., 1000 samples).
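
Putting these fields together, a minimal offline-evaluation config might look like the sketch below. The structure follows the field descriptions above, but all values (model path, dataset, seed, CI settings) are purely illustrative; the bundled example_config_hf.json and example_config_nemo.json remain the authoritative references.

{
    "model": {
        "library": "huggingface",
        "path": "google/gemma-2-9b",
        "apply_chat_template": false
    },
    "prompt_scheme_file": "prompt_schemes_example.json",
    "benchmarks": [
        {
            "dataset": "BoolQ",
            "human_translated": true,
            "machine_translated": false,
            "seed": 42,
            "evaluation": {
                "majority_correlation": false,
                "last_example_correlation": false,
                "ci": {
                    "type": "quantile_bootstrap",
                    "alpha": 0.95,
                    "bootstrap_samples": 1000
                }
            }
        }
    ]
}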

Prompt scheme file fields

Below are the fields required in the JSON file containing the prompt schemes:

  1. k:

    • Refers to the number of examples (shots) used in few-shot learning. k=0 means zero-shot learning, while higher values such as k=1, k=2, etc. refer to one-shot, two-shot, and so on. The experiment is run separately for each value of k in the list.
  2. prompt_template:

    • Template for how input data is formatted before being sent to the model. Needs to contain {instruction} and {input} placeholders for the task's specific instruction and input text.
  3. instruction: The task-specific instruction to be inserted into the {instruction} placeholder in the prompt template. Acts as the first and main part of the prompt that explains the nature of the task to the model.

  4. prefix:

    • To be inserted into the {input} placeholder in the prompt template. Defines how the different parts of a dataset example are prefixed in the input prompt for the model. Each dataset has its own structure, so the prefixes correspond to the dataset's fields; check the example configs for the dataset-specific fields.
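
As an illustration of these four fields, a single prompt scheme for the BoolQ task could look roughly like the sketch below. The top-level layout of the file (e.g., how schemes are keyed per dataset) and the exact prefix field names are assumptions here; prompt_schemes_example.json is the authoritative reference.

{
    "k": [0, 3],
    "prompt_template": "{instruction}\n\n{input}",
    "instruction": "Read the passage and answer the question with Yes or No.",
    "prefix": {
        "passage": "Passage: ",
        "question": "Question: ",
        "answer": "Answer: "
    }
}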

Online test submission

To prepare the submission for online testing, use the following command:

cd src
python prepare_test_submission.py --config=<path_to_the_config_file> --output_dir=<path_to_the_output_dir>

The config file should be a JSON file containing all the parameters needed for the submission. See example_config_submission.json. All fields are the same as in the evaluation config, except that the evaluation field is discarded, k should now be an integer instead of an array of integers, and the human_translated and machine_translated fields now apply only to the train set.

The output directory is where the submission files (one file for each dataset) will be stored.
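
For illustration, a benchmark entry in a submission config might then reduce to the sketch below (note the missing evaluation block), while the referenced prompt scheme file would use a single integer such as "k": 3 instead of a list. The values are illustrative; example_config_submission.json is the authoritative reference.

{
    "dataset": "BoolQ",
    "human_translated": true,
    "machine_translated": false,
    "seed": 42
}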

Model parallelism using vLLM

Currently, model parallelism is supported only for vLLM models. To split the model across multiple GPUs, specify tensor_parallel_size under model_kwargs. Example config for splitting Google's Gemma 2 across 4 GPUs:

"model": {
     "library": "vllm",
     "path": "google/gemma-2-9b",
     "apply_chat_template": false,
     "model_kwargs": {
         "tensor_parallel_size": 4
     }
 }

License

Distributed under the Apache 2.0 License. See LICENSE for more information.


Contact

Domen Vreš
[email protected]


Acknowledgements

The framework was developed within the PoVeJMo research program (Adaptive Natural Language Processing with Large Language Models), particularly within the research project titled SloLLaMai -- Open-access computationally efficient models for Slovenian. The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU. The authors also acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P6-0411 -- Language Resources and Technologies for Slovene).

