
Conversation

@yuvalreif

[Bias-amplified Splits]

Our work proposes a novel evaluation framework for assessing model robustness: we amplify dataset biases in the training data and challenge models to generalize beyond them. The framework is defined by a bias-amplified training set and a hard, anti-biased test set, both of which we automatically extract from existing datasets using a novel clustering-based approach for identifying minority examples: examples that defy the common statistical patterns found in the rest of the dataset.
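For intuition, one simple way to operationalize the clustering idea is sketched below; this is a toy illustration rather than the exact procedure from our paper, and the embedding model and cluster count are arbitrary placeholders.

```python
# Toy sketch of clustering-based minority detection (not necessarily the
# paper's exact procedure): cluster example embeddings, then flag examples
# whose label disagrees with their cluster's majority label.
# The embedding model and cluster count are arbitrary placeholders.
from collections import Counter

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer


def find_minority_examples(texts, labels, n_clusters=50):
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

    # Majority label within each cluster.
    majority = {
        c: Counter(l for l, cid in zip(labels, cluster_ids) if cid == c).most_common(1)[0][0]
        for c in set(cluster_ids)
    }

    # Minority examples defy the dominant label pattern of their cluster.
    return [i for i, (l, cid) in enumerate(zip(labels, cluster_ids)) if l != majority[cid]]
```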

Authors

Implementation

The sub-tasks implement the following methods:

  • format_example: formats sentence-pair tasks into a single input, to match the (input, target) format (see the sketch after this list).
  • get_datasets_raw: this re-implementation exists solely to pass the framework's assertions/tests. Our task re-splits datasets, some of which don't originally have 'validation' or 'test' splits (e.g., MultiNLI only has validation_matched/validation_mismatched), while the task must contain splits named 'validation'/'test'. Naming the splits 'validation' in split.jsonnet is not enough: the splits must already exist in the dataset, so we artificially add them when needed (for MultiNLI and WANLI).
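
As a rough illustration, a format_example-style helper for an NLI sentence-pair subtask might look like the sketch below; the field names (premise/hypothesis/label) are assumptions and depend on the underlying dataset:

```python
# Minimal sketch (not the task's actual code) of a format_example-style helper:
# merge the two sentences into a single "input" string and keep the label as
# "target". Field names are assumptions and depend on the dataset.
def format_example(example):
    return {
        "input": f"premise: {example['premise']} hypothesis: {example['hypothesis']}",
        "target": str(example["label"]),
    }
```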

Usage

If your evaluation function should be run in any way other than the default
(task.evaluate_predictions(predictions, gold)), you can describe this here.
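
For reference, here is a minimal sketch of that default flow, assuming get_datasets_raw returns a dict of splits and predictions are dicts with a "target" key (the task ID is a placeholder):

```python
# Minimal sketch of the default evaluation flow. The task ID is a placeholder
# and the exact prediction format depends on the subtask; see its doc.md.
from genbench import load_task

task = load_task("bias_amplified_splits")   # hypothetical task ID
test_set = task.get_datasets_raw()["test"]  # assumed to return a dict of splits

# Dummy predictions: one dict with a "target" key per test example.
predictions = [{"target": "0"} for _ in range(len(test_set))]
print(task.evaluate_predictions(predictions=predictions, gold=test_set))
```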

Checklist:

  • I and my co-authors agree that, if this PR is merged, the code will be available under the same license as the genbench_cbt repository.
  • Prior to submitting, I have run the GenBench CBT test suite using the genbench-cli test-task tool.
  • I have read the description of what should be in the doc.md of my task, and have added the required arguments.
  • I have submitted or will submit an accompanying paper to the GenBench workshop.

@vernadankers
Contributor

Hello!

We are getting quite close to the deadline (September 1, 11:59PM anywhere on earth), so if your PR needs any final changes, please make them now, and don't forget to submit your accompanying paper to OpenReview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1.

Good luck finalising your PR and paper; feel free to tag us if you have questions.
Cheers, Verna
On behalf of the GenBench team

@kazemnejad
Contributor

@yreif We're in the process of merging the tasks into the repo. In order to merge your task, we need the following changes:

  1. Could you please include a single usage_example.py file for each task, showcasing the full pipeline of using the task for finetuning and evaluation, in the way you intend your tasks to be used? Preferably, it should use a pretrained Hugging Face model. Please also include a requirements-usage-example.txt listing the Python dependencies needed to run the example (see the sketch below for the kind of pipeline we mean).
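
For illustration, such a pipeline might look roughly like the sketch below (the task ID, model, field names, and training settings are placeholders, not the submission's actual usage_example.py):

```python
# Rough sketch of a finetune-then-evaluate pipeline on a GenBench task.
# The task ID, model name, field names, and training settings are placeholders.
from genbench import load_task
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

task = load_task("bias_amplified_splits")  # hypothetical task ID
splits = task.get_datasets_raw()           # assumed to return a dict of splits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

def preprocess(example):
    # Assumes examples carry "input"/"target" fields after formatting.
    enc = tokenizer(example["input"], truncation=True)
    enc["labels"] = int(example["target"])
    return enc

train_ds = splits["train"].map(preprocess)
test_ds = splits["test"].map(preprocess)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()

# Convert model outputs into the prediction format the task expects,
# then score them with the task's own evaluation.
logits = trainer.predict(test_ds).predictions
predictions = [{"target": str(p)} for p in logits.argmax(-1)]
print(task.evaluate_predictions(predictions=predictions, gold=splits["test"]))
```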

@kazemnejad
Contributor

Hey @yreif. Are there any updates regarding the usage_example?
Thanks.

@yuvalreif
Author

Hey @kazemnejad, I apologize for the delayed response.
I added usage_example + requirements files for each of the fine-tuning subtasks in the submission.
There are also two evaluation-only, prompt-based subtasks; should I add usage examples for these as well?
Thanks & appreciate your understanding.

@kazemnejad
Contributor

@yreif Thanks for your efforts. Yes, adding an example for the prompt-based tasks would be great. You can create a second usage_example file if you don't want to change the finetuning example :)

@kazemnejad
Contributor

@yreif Any updates on the prompt-based tasks example?

@kazemnejad
Contributor

@yreif A kind reminder.
