
Support for Big Bench Extra Hard (General-purpose reasoning eval) #92

Open: wants to merge 7 commits into main
21 changes: 21 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_boardgame_qa/README.md
@@ -0,0 +1,21 @@
# BBEH BoardgameQA

[BoardgameQA](https://arxiv.org/abs/2306.07934) is a benchmark in which, given a
defeasible theory (a set of input facts, possibly contradictory rules, and
preferences over the rules) and a question about that theory, the task is to
perform multi-hop reasoning and conflict resolution over the input theory to
answer the question. The final answer to the question is either `proved` (if the
statement in the question derives from the theory), `disproved` (if the
negation of the statement in the question derives from the theory), or
`unknown` (if neither the statement in the question nor its negation derives
from the theory). With three labels per question, a random baseline has an
accuracy of ~33.3%. Conflicts may arise when two rules such as:

R1: a implies c
R2: b implies not c

are both activated, leading to conflicting beliefs about the truth value of the
variable c. However, preferences over the rules are provided in the input
question, and in the case of a conflict, the derivation from the rule with the
higher preference must be concluded (e.g., if R1 is preferred over R2 and they
both apply, then we conclude that c is true).
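
For illustration, here is a minimal sketch of this preference-based conflict
resolution applied to the single-hop R1/R2 example above; the data structures
and function names are our own and not part of the benchmark, which additionally
requires chaining rules over multiple hops.

```python
# Illustrative sketch only: resolve a conflict between two activated rules by
# rule preference, as in the R1/R2 example above.

facts = {"a": True, "b": True}

# Each rule: (name, antecedent variable, concluded variable, concluded value).
rules = [
    ("R1", "a", "c", True),    # R1: a implies c
    ("R2", "b", "c", False),   # R2: b implies not c
]

# Lower number = higher preference, so R1 is preferred over R2.
preference = {"R1": 0, "R2": 1}

def derive(query_var, query_value):
    """Return 'proved', 'disproved', or 'unknown' for the claim query_var == query_value."""
    # Conclusions of all activated rules about the queried variable.
    fired = [(name, value) for name, ante, var, value in rules
             if facts.get(ante) and var == query_var]
    if not fired:
        return "unknown"
    # In a conflict, the most preferred activated rule wins.
    _, winner_value = min(fired, key=lambda rule: preference[rule[0]])
    return "proved" if winner_value == query_value else "disproved"

print(derive("c", True))  # 'proved': R1 and R2 both fire, but R1 is preferred
```
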
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_boardgame_qa/task.json

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_boolean_expressions/README.md
@@ -0,0 +1,22 @@
# BBEH Boolean Expressions

This task requires determining the truth value of a statement that is composed
of logical operands such as *True* and *False* as well as other textual or
mathematical statements that evaluate to True or False. To create this task, we
first randomly create expressions containing only True and False operands and
three logical operators: **and**, **or**, and **not**. We do this in a
bottom-up fashion in which we generate smaller sub-expressions and then combine
them with logical operators. Once a large enough expression is created, we
replace some of the True and False operands with statements that evaluate to
True or False. These could be mathematical expressions such as *24 - 2 is
greater than 48 / 2* (which evaluates to False) or textual statements such as
*The capital of Canada is Ottawa* (which evaluates to True). In both cases, we
select these statements from a predefined set. While determining the truth value
of each of these statements in isolation may be easy for many models, including
them prevents models from simply solving the problem by generating a single line
of Python code.

We generate five expressions using the approach outlined above, four of which
evaluate to False and one of which evaluates to True. The job of the model is
then to find the expression that evaluates to True. Since this is a five-way
question, the random-chance accuracy is 20%.
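
The sketch below illustrates the bottom-up generation in rough form; the
recursion depth, the statement pool, and the replacement probability are our own
illustrative choices rather than the benchmark's actual parameters.

```python
import random

# Rough sketch of bottom-up expression generation. For brevity, True/False
# operands are swapped for equivalent statements as the leaves are generated
# rather than in a separate pass over the finished expression.
STATEMENTS = {
    "24 - 2 is greater than 48 / 2": False,
    "The capital of Canada is Ottawa": True,
}

def build_expression(depth):
    """Return (expression_string, truth_value), combining sub-expressions bottom-up."""
    if depth == 0:
        value = random.choice([True, False])
        matching = [s for s, v in STATEMENTS.items() if v == value]
        if matching and random.random() < 0.5:
            return f"({random.choice(matching)})", value
        return str(value), value
    op = random.choice(["and", "or", "not"])
    if op == "not":
        sub, val = build_expression(depth - 1)
        return f"not {sub}", not val
    left, left_val = build_expression(depth - 1)
    right, right_val = build_expression(depth - 1)
    val = (left_val and right_val) if op == "and" else (left_val or right_val)
    return f"({left} {op} {right})", val

expression, value = build_expression(depth=3)
print(expression, "->", value)
```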

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_buggy_tables/README.md
@@ -0,0 +1,20 @@
# BBEH Buggy Tables

This task was constructed synthetically by the authors. The objective in this
task is to respond to conditional queries over tabular data, where the
information in the table is presented in a buggy way but a description of the
bug is also provided so that the model can reconstruct the original table. As an
example, we provide a row-major/column-major format of the table from which the
null values have been mistakenly removed, but we also provide the positions of
the null values in the original table so one can reconstruct the table given the
two pieces of information. As another example, we provide a buggy version of the
table where some random values are appended at the end of each row or each
column, but we also specify how they have been added so one can use this
information to remove them and reconstruct the original table. As yet another
example, we provide a markdown format of the table that merges every two rows of
the table into one row, but we also provide an explanation of how each pair of
rows has been merged so that the original table can be reconstructed from that
information. Examples of conditional queries include computing some statistic
(count, sum, mean, stdev, median) of some columns while only considering rows
where certain columns have specific values.
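
As a concrete illustration of the first bug type (null values dropped from a
row-major dump), here is a minimal reconstruction sketch; the variable names and
the (row, column) position format are assumptions made for illustration.

```python
# Reconstruct a table from a row-major dump whose null cells were dropped,
# given the positions of the nulls in the original table.
n_rows, n_cols = 3, 3
values_without_nulls = [1, 2, 3, 4, 5, 6]     # row-major listing, nulls removed
null_positions = {(0, 1), (1, 2), (2, 0)}     # where the nulls originally were

remaining = iter(values_without_nulls)
table = [[None if (row, col) in null_positions else next(remaining)
          for col in range(n_cols)]
         for row in range(n_rows)]

for row in table:
    print(row)
# [1, None, 2]
# [3, 4, None]
# [None, 5, 6]
```

Once the table is reconstructed, the conditional query (e.g., the mean of one
column restricted to rows where another column has a given value) can be
computed over it as usual.
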
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_buggy_tables/task.json

Large diffs are not rendered by default.

@@ -0,0 +1,9 @@
# Causal Understanding

This dataset includes a subset of the causal stories in [Nie, Allen, et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/f751c6f8bfb52c60f43942896fe65904-Abstract-Conference.html) and improved examples from [Kıcıman, Emre, et al. (2023)](https://arxiv.org/abs/2305.00050). The first set of questions focuses on testing causal judgment, and the second set focuses on testing the ability to reason about necessary and sufficient causes.

The "Causal understanding" task is a modified version of the following:

- The BBEH-MOCA is a modified version of the dataset ‘MOCA’ by authors Allen Nie, Yuhui Zhang, Atharva Amdekar, Chris Piech, Tatsunori Hashimoto, and Tobias Gerstenberg, available at https://github.com/cicl-stanford/moca/tree/main/data.
- The BBEH-Vignettes is a modified version of the dataset ‘Actual Causality Vignettes’, Copyright (c) 2022 Amit Sharma, made available at https://github.com/amit-sharma/chatgpt-causality-pairs/blob/main/actual-causality/data.csv.
- The BBEH-Lab Vignettes is a modified version of the dataset ‘Actual Causality Pairs’, Copyright (c) 2022 Amit Sharma, made available at https://github.com/amit-sharma/chatgpt-causality-pairs/blob/main/actual-causality/lab_data.csv.

Large diffs are not rendered by default.

21 changes: 21 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_disambiguation_qa/README.md
@@ -0,0 +1,21 @@
# BBEH DisambiguationQA

This task introduces a more challenging adaptation of the original
Disambiguation task in BBH. The objective is to accurately determine the
referents of ambiguous pronouns in complex sentences, or to explicitly identify
instances of unresolvable ambiguity by responding 'ambiguous'. To increase the
task's difficulty and complexity, we constructed a dataset of 120 novel examples
that are longer than those in BBH, require more referent disambiguation, and
contain more options per question, so the random-chance performance is lower.
These examples were constructed either by creating entirely new sentences or by
combining existing BBH instances. Ten annotators (all of them authors of the
paper) were tasked with creating these examples, each comprising a potentially
ambiguous sentence, a single correct resolution statement, and several
distractor options for a multiple-choice format. To ensure data quality, each
example underwent a two-stage verification process. First, a separate
annotator independently evaluated the correctness of the resolution.
Discrepancies were then resolved through a third-party adjudicator or
collaborative refinement by all three annotators. In cases where consensus
could not be reached, the annotators jointly revised the example to achieve
clarity and accuracy. This rigorous process resulted in 25 examples requiring
modification.

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_dyck_languages/README.md
@@ -0,0 +1,12 @@
# BBEH Dyck Language

This task comes from [BIG-Bench Mistake](https://arxiv.org/pdf/2311.08516).
It involves finding the first mistake in an existing chain-of-thought (CoT)
sequence used to answer a Dyck Language question from the original BIG-Bench
Hard (BBH) dataset. In each example, the target answer is either the number of
the step at which the first mistake occurred, or that there are no mistakes in
the CoT sequence. These CoT sequences were generated by prompting PaLM 2 Unicorn
on the original BBH dataset at temperature 0. The newline is used as a stop
token so that each intermediate step can be prepended with *Thought 1:*,
*Thought 2:*, etc.
Further information on the prompting and generation process can be found in the
original work.

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_geometric_shapes/README.md
@@ -0,0 +1,17 @@
# BBEH Geometric Shapes

SVG is a language for drawing shapes. We use two basic commands: 1- M (x, y),
corresponding to moving to the (x, y) coordinate, and 2- L (x, y), corresponding
to drawing a line from the current location to (x, y). We take the shape
outlines from [GeomVerse](https://arxiv.org/abs/2312.12241), a dataset of
geometry questions involving multiple shapes that share some elements; the
outlines are specified as TikZ commands, and we convert them to SVG. We then ask
the model to identify what shapes will be drawn if we visualize the SVG.

We consider two extra axes of difficulty: 1- we randomly break some line
segments into multiple collinear line segments, and 2- we add some extra lines
such that they intersect at some points and those intersections form some shapes
(in other cases, shapes are created using the full line segments and not at
their intersection points). We then create four subsets for the task
corresponding to the cross product of few vs. many line breaks and intersect vs.
no intersect.
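
For illustration, a minimal path in this notation (with arbitrarily chosen
coordinates) might be:

```
M (0, 0) L (4, 0) L (8, 0) L (4, 6) L (0, 0)
```

Here the segments from (0, 0) to (4, 0) and from (4, 0) to (8, 0) are collinear,
so the visualized figure is a single triangle with vertices (0, 0), (8, 0), and
(4, 6); identifying it requires merging the broken segments rather than simply
counting the drawing commands.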

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_hyperbaton/README.md
@@ -0,0 +1,9 @@
# Hyperbaton

The BBEH Hyperbaton task assesses a model's ability to inductively reason about
adjective order in a novel English variant, where the standard adjective
ordering is randomized. Models must infer this new order from example sentences
with partial orderings and identify correct sentences from provided options.
This task moves beyond testing standard linguistic knowledge, focusing on
inducing and applying new rules, and challenging strong priors about standard
adjective ordering.
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_hyperbaton/task.json

Large diffs are not rendered by default.

16 changes: 16 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_linguini/README.md
@@ -0,0 +1,16 @@
# BBEH Linguini

This task comes from [Sánchez et al. (2024)](https://arxiv.org/abs/2409.12126),
where the problems are extracted from the International Linguistics Olympiad
(IOL). The original dataset is available
[here](https://github.com/facebookresearch/linguini). According to the original
work that introduced this dataset, the problems are *"linguistic problems
which require meta-linguistic awareness and deductive reasoning capabilities to
be solved instead of pre-existing language proficiency"*.

We created a subset of the Linguini problems by sampling from four of its
categories, namely *translation*, *fill blanks*, *num to text*, and
*text to num*. The original dataset contains questions that require multiple
answers. For example, the *fill blanks* questions have multiple blanks that need
to be filled. We create questions that have a single answer by randomly
selecting one of those blanks and only asking the model to fill that one.
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_linguini/task.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_movie_recommendation/README.md
@@ -0,0 +1,17 @@
# BBEH Movie Recommendation

The original Movie Recommendation task in BIG-Bench Hard was created as
follows. For each question, a set of eight movies from MovieLens was selected
such that a rather large number of people all liked five of them and disliked
three of them. Then, a question was generated by giving four of the five liked
movies and asking models to recommend one of the remaining four movies, where
the correct answer is the one left out of the five liked movies.

We updated this task as follows. We create multiple sets of movies, one of
which contains the five liked movies while the others contain some of the liked
movies and some of the disliked movies. Then, we ask the model to select the set
whose movies are most likely to all be liked by a large group of people. In this
new variant, instead of recommending a single movie given four movies, models
have to examine each set separately, predict its overall likability, and then
decide which option is most likely to have a high likability score under our
specific definition of likability.

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_multistep_arithmetic/README.md
@@ -0,0 +1,25 @@
# BBEH Multi-Step Arithmetic

This task introduces new arithmetic operators. An example of such an operator
is as follows:

a >< b equals (a - b) if a * b > 0; otherwise, it equals a + b

Some of the operations can be defined in terms of other new operations. For
example, we may have:

a ; b equals (a >< b) if a + b > 0; otherwise, it equals a - b

We also define a form of composing multiple operations as follows: a op1 op2 b
denotes (a op1 b) op2 b; for example, 4 +* -5 means (4 + -5) * -5 and 4 *++ 5
means (4 * 5) ++ 5.

Then we sample random arithmetic expressions involving the above operations. An
example expression is:

(1 @*+ 4) <>+[] (-4 *<>* -1)

(although our expressions are longer), with @, <>, and [] being new operations.
The job of the model is to compute the value of the expression. Doing so
requires expanding the expression and correctly carrying out a long sequence of
computations.
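
The sketch below implements the two operators defined above and the composition
rule; the helper function names are ours, and this is only an illustration, not
the benchmark's evaluation code.

```python
# Illustrative implementations of the operators defined above.

def op_angle(a, b):
    """a >< b = (a - b) if a * b > 0, otherwise a + b."""
    return a - b if a * b > 0 else a + b

def op_semicolon(a, b):
    """a ; b = (a >< b) if a + b > 0, otherwise a - b (defined via ><)."""
    return op_angle(a, b) if a + b > 0 else a - b

def compose(a, op1, op2, b):
    """a op1 op2 b denotes (a op1 b) op2 b."""
    return op2(op1(a, b), b)

print(op_angle(4, -5))      # 4 * (-5) is not > 0, so 4 + (-5) = -1
print(op_semicolon(3, 2))   # 3 + 2 > 0, so 3 >< 2 = 3 - 2 = 1
# (4 >< -5) ; -5 = (-1) ; -5 = -1 - (-5) = 4
print(compose(4, op_angle, op_semicolon, -5))
```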

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_nycc/README.md
@@ -0,0 +1,17 @@
# BBEH NYCC

This task builds on the existing benchmarks for the New Yorker Caption Contest
(NYCC) dataset (see [this work](https://arxiv.org/abs/2209.06293) and
[this work](https://arxiv.org/abs/2406.10522)). The NYCC caption dataset
consists of (a) several hundred contests, each comprising a cartoon published
in the New Yorker magazine and several thousand submitted humorous captions,
and (b) crowdsourced ratings for each caption. The ratings are on a scale of
**Unfunny**, **Somewhat Funny**, and **Funny**, and each caption has anywhere
from a few dozen to a few thousand ratings. Past works have focused on pairwise
comparison tasks, where two captions and a textual description of the cartoon
are presented to the model, and the model has to pick the funnier of the two.
To make the task significantly more difficult, for each contest we sample one
query from the top ten rated, and then take captions ranked 1000-1009 and ask
the model to choose the funniest. We use the textual descriptions of the
cartoons generated by GPT-4o that are provided in
[this work](https://arxiv.org/abs/2406.10522).
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_nycc/task.json

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_object_counting/README.md
@@ -0,0 +1,12 @@
# BBEH Object Counting

Given a long list of objects that a person has, the model has to count the
number of items of a certain type. For example, the items might belong to
classes (fruits, cell phones, cars), and the goal may be to count the total
number of cell phones that the person has. We consider two types of questions:
1- counting the total number of items belonging to two different classes, and
2- finding the absolute difference between the numbers of items belonging to two
different classes. To add to the difficulty of the task, some irrelevant
information, including the number of the same items that other people have,
is added to the input context, so the problem becomes one of finding multiple
needles in a haystack.
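
A small sketch of the two question types over an assumed toy inventory (the
items, classes, and counts are illustrative):

```python
# Items the person has; distractor counts about other people would appear in
# the prompt text but must be ignored when answering.
inventory = {
    "apple": 3, "banana": 2,   # fruits
    "iphone": 1, "pixel": 4,   # cell phones
    "sedan": 2,                # cars
}

FRUITS = {"apple", "banana"}
PHONES = {"iphone", "pixel"}

def class_count(class_items):
    return sum(count for item, count in inventory.items() if item in class_items)

print(class_count(FRUITS) + class_count(PHONES))       # type 1: sum = 5 + 5 = 10
print(abs(class_count(FRUITS) - class_count(PHONES)))  # type 2: |5 - 5| = 0
```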

Large diffs are not rendered by default.

23 changes: 23 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_object_properties/README.md
@@ -0,0 +1,23 @@
# BBEH Object Properties

In this task, an initial collection of objects with different properties (color,
size, origin, smell, and material) is provided (e.g., an extra-small blue
Canadian jar made of glass with a smell of rose). Then, the collection goes
through several updates corresponding to adding, removing, or editing some of
the objects. The updates are explained in the prompt, and the model requires a
full grasp of the object properties to identify what changes must be made to the
collection for each update. A simple example of an update is as follows:

My dad threw away all objects of a certain color from my collection.
After this, my collection only had 5 blue objects and 3 white objects.

For the above update, one has to find which color has been removed by comparing
the new colors with the object colors in the previous collection, and then
update the collection accordingly. The set of updates that the collection goes
through in each example is randomly selected from a large set of possible
changes. At the end, a question is asked about the final collection. The
question is either an **either** question, in which we ask how many items in
the final collection have property 1 or property 2, ... (e.g., how many items
are either blue or small), or a **neither** question, in which we ask how many
items have neither property 1 nor property 2, ... (e.g., how many items are not
blue and not small).
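
A minimal sketch, under assumed data structures, of one update followed by the
two question types; this is an illustration, not the benchmark's generation
code.

```python
collection = [
    {"color": "blue",  "size": "small"},
    {"color": "blue",  "size": "large"},
    {"color": "white", "size": "small"},
    {"color": "white", "size": "large"},
    {"color": "red",   "size": "small"},
]

# Update (adapted from the example above): "My dad threw away all objects of a
# certain color. After this, my collection only had blue and white objects."
remaining_colors = {"blue", "white"}
removed_colors = {obj["color"] for obj in collection} - remaining_colors  # {'red'}
collection = [obj for obj in collection if obj["color"] not in removed_colors]

# "either" question: how many items are either blue or small?
either = sum(1 for obj in collection
             if obj["color"] == "blue" or obj["size"] == "small")
# "neither" question: how many items are neither blue nor small?
neither = sum(1 for obj in collection
              if obj["color"] != "blue" and obj["size"] != "small")
print(either, neither)  # 3 1
```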

Large diffs are not rendered by default.

15 changes: 15 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_sarc_triples/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
<!-- mdlint off(LINE_OVER_80) -->

# BBEH SARC Triples

[SARC](https://aclanthology.org/L18-1102.pdf) (Self-Annotated Corpus for Sarcasm) is a large dataset of sarcastic responses mined from the Reddit social media / forum platform. Many Reddit users end a post or reply with the token **/s** when they intend the preceding text to be interpreted sarcastically or satirically. This allowed positive examples of user-intended sarcasm to be mined.

Forking off the SARC dataset, we construct a challenging task for LLMs that requires reading three independent examples from SARC and classifying each with a binary label, where a positive label indicates sarcasm. The SARC authors created a balanced test set with 64,666 examples. Many of these examples can only be understood with an image or an article link that accompanied the original post or reply. On the other hand, some examples, usually those with longer textual content, can be understood on their own. We design our derived benchmark to consist mainly of the latter type. To achieve this, we filter out examples that either (1) have fewer than 100 characters or (2) lack a reply, resulting in 679 examples from the original test set with a 48.4% positive label rate. We sample (uniformly at random) 600 examples from this set, group them (uniformly at random) into groups of three, and pass the text of each triple of (post, reply) pairs to the following prompt (a sketch of this filtering and grouping step follows the prompt):

Here are three (post, reply) pairs from Reddit. Your task is to decide whether each reply is sarcastic. Specifically, label each pair with a "0" or "1", where a "1" indicates that the reply is sarcastic, and a "0" indicates that the reply does not contain sarcasm, and provide your final answer as a comma-separated set of labels (e.g., "1,0,0" or "0,0,0").
POST 1: post1_text
REPLY 1: reply1_text
POST 2: post2_text
REPLY 2: reply2_text
POST 3: post3_text
REPLY 3: reply3_text
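
Below is an illustrative sketch of the filtering and grouping step described
above; the field names and the exact text used for the length check are
assumptions about the data format, not the authors' code.

```python
import random

def build_triples(examples, seed=0):
    """Filter SARC test examples and group them into three-example questions."""
    rng = random.Random(seed)
    # Drop examples that are too short (fewer than 100 characters) or lack a reply.
    kept = [ex for ex in examples
            if len(ex["post"]) >= 100 and ex.get("reply")]
    # Sample 600 examples uniformly at random, then group them into 200 triples.
    sampled = rng.sample(kept, 600)
    return [sampled[i:i + 3] for i in range(0, len(sampled), 3)]
```
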
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_sarc_triples/task.json

Large diffs are not rendered by default.

24 changes: 24 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_shuffled_objects/README.md
@@ -0,0 +1,24 @@
# BBEH Shuffled Objects

The original task in BBH is as follows: there are N people, each assigned an
object/person (e.g., a dance partner, a book, a color, etc.). For example, Alice
has a green book, Bob has a red book, etc. Then, there are multiple switch
operations in which pairs of people swap what they are assigned (e.g., Alice and
Bob switch their books). At the end, one needs to predict the object/person
assigned to one of the N people (e.g., at the end, what color is the book that
Bob has?).

We created two variants of this problem. In the first variant, we keep
everything the same except that we add switch actions that have no effect. For
example, we add *Then, Person1 and Person2 switch their books. Then, Person2 and
Person1 switch their books*. We add many of these no-effect operations so that
the problem becomes a long-context reasoning problem.

The second variant extends the first: we assign names to some of the switch
actions as they occur and use those names later. For example, the first time
*Person1 switches with Person2* occurs, we replace the text with *Person1
switches with Person2 (let's call this Action K)*, and the next time the same
switch happens, with some probability we replace the text with *Action K
repeats*. Given the long-context nature of the problem, the model needs to
remember information from many steps earlier to identify what that action
corresponded to.
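
A minimal simulation sketch of the second variant, with named, repeatable switch
actions (the people, objects, and action names below are illustrative):

```python
assignment = {"Alice": "green book", "Bob": "red book", "Claire": "blue book"}
named_actions = {}  # action name -> (person1, person2)

def switch(person1, person2, name=None):
    """Swap what the two people hold; optionally record the swap under a name."""
    assignment[person1], assignment[person2] = assignment[person2], assignment[person1]
    if name is not None:
        named_actions[name] = (person1, person2)

def repeat(name):
    """'Action K repeats': replay a previously named switch."""
    switch(*named_actions[name])

switch("Alice", "Bob", name="Action 1")  # named the first time it occurs
switch("Bob", "Claire")
repeat("Action 1")                       # same as switching Alice and Bob again
print(assignment["Bob"])                 # 'red book'
```

Two consecutive identical switches, as in the first variant's no-effect
operations, would leave the assignment unchanged.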

Large diffs are not rendered by default.
