On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research

Published on the Trustworthy and Reliable Large-Scale Machine Learning Models ICLR 2023 Workshop.

Abstract: Perception of toxicity evolves over time and often differs between geographies and cultural backgrounds. Similarly, black-box commercially available APIs for detecting toxicity, such as the Perspective API, are not static, but frequently retrained to address any unattended weaknesses and biases. We evaluate the implications of these changes on the reproducibility of findings that compare the relative merits of models and methods that aim to curb toxicity. Our findings suggest that research that relied on inherited automatic toxicity scores to compare models and techniques may have resulted in inaccurate findings. Rescoring all models from HELM, a widely respected living benchmark, for toxicity with the recent version of the API led to a different ranking of widely used foundation models. We suggest caution in applying apples-to-apples comparisons between studies and lay recommendations for a more structured approach to evaluating toxicity over time.

Installing

All images, tables and values cited in the paper can be reproduced in the notebooks 01 and 02.

conda env create -f environment.yml
conda init black_box
python -m ipykernel install --user --name=black_box

Download data

Rescored toxicity scores and metrics produced for the paper are available at our HuggingFace datasets repo. Published scores from RTP are also needed to reproduce results.

git lfs install
git clone git@hf.co:datasets/for-ai/black-box-api-challenges data

wget https://ai2-public-datasets.s3.amazonaws.com/realtoxicityprompts/realtoxicityprompts-data.tar.gz
tar -xvzf realtoxicityprompts-data.tar.gz -C data/
rm realtoxicityprompts-data.tar.gz

Run scripts

There are three main scripts: score, collate and evaluate. Below are examples of how to use each for DExperts rescored generation files that accompany this repo.

Score

You can replace the input_path for your desired jsonl file and indicate in which column are the text you want to rescore. The script currently supports text that are contained in dictionaries (text key), list of dictionaries and columns of strings. This outputs files with _perspective.jsonl termination.

Perspective API rate limit is 1 by default. Before running this script, don't forget to export your API key.

export PERSPECTIVE_API_KEY=$YOUR_KEY

python -m scripts.score \
    data/dexperts/generations/toxicity/dapt/prompted_gens_gpt2_gens_rescored.jsonl \
    --column_name generations \
    --output_folder data/example \
    --perspective_rate_limit 1

To rescore DExperts's 10k non-toxic RTP prompts, for example, you can run

python -m scripts.score \
    data/dexperts/prompts/nontoxic_prompts-10k.jsonl \
    --column_name prompt \
    --output_folder data/example \
    --perspective_rate_limit 1

Collate

The collate script joins prompts and generations into a single file. We need all three files: with generated text, with scores corresponding to those texts, and the prompts which generated the continuations, if used. This outputs files with _collated.jsonl termination.

python -m scripts.collate \
    data/dexperts/generations/toxicity/dapt/prompted_gens_gpt2_gens_rescored.jsonl \
    data/example/prompted_gens_gpt2_gens_rescored_perspective.jsonl \
    --prompts_path data/dexperts/prompts/nontoxic_prompts-10k.jsonl

You can collate prompts to their new scores with

python -m scripts.collate_prompts data/dexperts/prompts/nontoxic_prompts-10k.jsonl data/example/nontoxic_prompts-10k_perspective.jsonl

Evaluate

With the evaluate script we can compute toxicity metrics such as Expected Maximum Toxicity, Toxicity Probability and Toxic Fraction This outputs files with _toxicity.csv termination.

python -m scripts.evaluate --prompted_json data/example/prompted_gens_gpt2_gens_rescored_collated.jsonl

Scrape, score and evaluate HELM continuations

We scrape the website for models benchmarked under the real_toxicity_prompts task. Those model names are used to download continuations and published stats.jsonl files from HELM's buckets.

python -m scripts.helm.scrape \
    --task "real_toxicity_prompts" \
    --version "v0.2.2" \
    --output_folder "data/rescored/helm"

Then, we rescore downloaded continuations and collate those scores with the original prompts. You can pass prompts_path in case you rescored prompts as well.

HELM uses the first spanScore, instead of the summaryScores as other RTP benchmarks. Evaluation is performed exclusively on notebook 02 from the collated files and original stats.jsonl files.

python -m scripts.helm.score_and_collate \
    --perspective_rate_limit 1 \
    --base_dir data/rescored/helm/real_toxicity_prompts

Citation

@article{pozzobon2023challenges,
  title={On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research},
  author={Pozzobon, Luiza and Ermis, Beyza and Lewis, Patrick and Hooker, Sara},
  journal={arXiv preprint arXiv:2304.12397},
  year={2023}
}

Useful repos

RealToxicityPrompts: https://github.com/allenai/real-toxicity-prompts
DExperts: https://github.com/stanford-crfm/helm
HELM: https://github.com/alisawuffles/DExperts

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
notebooks		notebooks
scripts		scripts
utils		utils
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
environment.yml		environment.yml
requirements.in		requirements.in
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research

Installing

Download data

Run scripts

Score

Collate

Evaluate

Scrape, score and evaluate HELM continuations

Citation

Useful repos

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

Cohere-Labs-Community/black-box-api-challenges

Folders and files

Latest commit

History

Repository files navigation

On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research

Installing

Download data

Run scripts

Score

Collate

Evaluate

Scrape, score and evaluate HELM continuations

Citation

Useful repos

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages