We are releasing TREvaL to facilitate further exploration in the field of LLM evaluation.
TREvaL (Reward Model for Reasonable Robustness Evaluation) is a framework that generates word-level perturbations of open questions and uses the perturbed prompts to evaluate the helpfulness and harmlessness robustness of LLMs. This repo provides a codebase for TREvaL and a method for depicting LLMs' loss landscapes. The repository is an implementation of:
TREvaL: Are Large Language Models Really Robust to Word-Level Perturbations?
Accepted at the Socially Responsible Language Modelling Research (SoLaR) Workshop at NeurIPS 2023
Three perturbation forms are available (a minimal sketch follows the list):
- Misspelling
- Swapping
- Synonym
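As a rough illustration of what these perturbations do, here is a minimal, self-contained sketch. The helper names, the toy synonym table, and the reading of `--attack_degree` as a percentage of perturbed words are our assumptions for illustration only; the repo's actual synonym attack uses a BERT-Attack-style masked LM.

```python
# Minimal sketch of the three word-level perturbation forms (illustrative
# only; helper names and the toy synonym table are not from the repo).
import random
import string

def misspell(word: str) -> str:
    """Misspelling: replace one random character with a random letter."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word))
    return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]

def swap(word: str) -> str:
    """Swapping: transpose two adjacent characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

TOY_SYNONYMS = {"big": "large", "quick": "fast", "smart": "clever"}

def synonym(word: str) -> str:
    """Synonym: substitute a synonym when one is known (toy table here;
    the repo uses BERT-Attack-style masked-LM predictions instead)."""
    return TOY_SYNONYMS.get(word.lower(), word)

def perturb(prompt: str, op, attack_degree: int) -> str:
    """Apply `op` to roughly `attack_degree` percent of the words
    (our assumed reading of --attack_degree, for illustration)."""
    words = prompt.split()
    k = max(1, len(words) * attack_degree // 100)
    for i in random.sample(range(len(words)), min(k, len(words))):
        words[i] = op(words[i])
    return " ".join(words)

print(perturb("How big is the quick brown fox population?", misspell, 10))
```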
Our code builds on top of several existing repositories, in particular:
- Beaver series: https://github.com/PKU-Alignment/safe-rlhf
- Llama 2 series: https://github.com/facebookresearch/llama
- Synonym substitution: https://github.com/LinyangLee/BERT-Attack
- Loss landscapes: https://github.com/marcellodebernardi/loss-landscapes
Notably, the tested LLMs and MLMs can be obtained from the respective repositories above.
To run our code, you need Python 3 (>= 3.8), conda, and pip installed on your machine. Installation proceeds as follows:
To clone the repo:
git clone https://github.com/Harry-mic/TREval.git
cd TREval/
To create the environment and install the dependencies:
conda env create -f environment.yaml
conda activate TREvaL
To run an evaluation, here is an example:
# LLM model: llama, alpaca, beavor, llama2, llama2-7b-chat, llama2-13b-chat, llama2-70b-chat
# MLM model: bert-base-uncased
# Reward model: beaver-7b-v1.0-reward, beaver-7b-v1.0-cost
# task: misspelling, swapping, synonym
# attack_degree: 3, 5, 10
# dataset: nq
# The following command combines generating the perturbed text and interacting with the Beaver-series LLMs
python Beavor_models/generate_and_attack.py --model beavor --dataset nq --task misspelling --attack_degree 10
# For interacting with the Llama 2 series
torchrun --nproc_per_node 1 llama2/attack.py --model llama-2-chat-7b --dataset nq --task misspelling --attack_degree 10
# For evaluating the score
python Beavor_models/eval.py --model RewardModel --dataset nq --task misspelling --attack_degree 10
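For intuition, scoring here means comparing reward-model scores of responses to the clean and the perturbed prompts. Below is a hedged sketch following the `AutoModelForScore` interface shown on the PKU-Alignment/safe-rlhf model cards; the conversation template, example strings, and the reward-drop computation are illustrative assumptions, not the repo's eval.py.

```python
# Hedged sketch: compare the reward of a response to a clean prompt against
# a response to its perturbed counterpart. AutoModelForScore / end_scores
# follow the PKU-Alignment model cards; the examples are assumptions.
import torch
from transformers import AutoTokenizer
from safe_rlhf.models import AutoModelForScore

MODEL = "PKU-Alignment/beaver-7b-v1.0-reward"
model = AutoModelForScore.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def reward(question: str, answer: str) -> float:
    """Score a (question, answer) pair with the Beaver conversation template."""
    text = f"BEGINNING OF CONVERSATION: USER: {question} ASSISTANT:{answer}"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return model(**inputs).end_scores.item()

question = "where does the story the great gatsby take place"
clean_answer = "The Great Gatsby is set on Long Island and in New York City."  # response to the clean prompt
perturbed_answer = "I'm not sure what you are asking about."                   # response to the perturbed prompt
print(f"reward drop: {reward(question, clean_answer) - reward(question, perturbed_answer):.3f}")
```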
To draw the loss landscape, here is an example:
python Beavor_models/create_landscape.py --model beavor --task misspelling --attack_degree 10
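As background, drawing a loss landscape generally means evaluating the loss on a 2-D grid spanned by two directions in parameter space (cf. the loss-landscapes repo credited above). The sketch below is a generic PyTorch version of that idea under our own simplifications, not the repo's create_landscape.py; the norm-matched random directions stand in for filter-wise normalization.

```python
# Generic sketch of the loss-landscape technique: perturb the weights along
# two random, norm-matched directions and record the loss on a grid.
import torch

def random_direction(params):
    """Random direction with each tensor scaled to its weight's norm."""
    dirs = [torch.randn_like(p) for p in params]
    return [d * (p.norm() / (d.norm() + 1e-10)) for d, p in zip(dirs, params)]

@torch.no_grad()
def loss_grid(model, loss_fn, span=1.0, steps=21):
    """Return a (steps x steps) tensor of losses around the current weights."""
    base = [p.detach().clone() for p in model.parameters()]
    dx, dy = random_direction(base), random_direction(base)
    alphas = torch.linspace(-span, span, steps)
    grid = torch.zeros(steps, steps)
    for i, a in enumerate(alphas):
        for j, b in enumerate(alphas):
            for p, p0, u, v in zip(model.parameters(), base, dx, dy):
                p.copy_(p0 + a * u + b * v)  # move to this grid point
            grid[i, j] = loss_fn(model)
    for p, p0 in zip(model.parameters(), base):
        p.copy_(p0)  # restore the original weights
    return grid

# Toy usage: MSE landscape of a small linear model on random data.
model = torch.nn.Linear(4, 1)
x, y = torch.randn(64, 4), torch.randn(64, 1)
grid = loss_grid(model, lambda m: torch.nn.functional.mse_loss(m(x), y))
print(grid.shape)  # torch.Size([21, 21])
```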