
Support for Big Bench Extra Hard (General-purpose reasoning eval) #92

Open: wants to merge 7 commits into main
21 changes: 21 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_boardgame_qa/README.md
@@ -0,0 +1,21 @@
# BBEH BoardgameQA

[BoardgameQA](https://arxiv.org/abs/2306.07934) is a benchmark in which, given a
defeasible theory (a set of input facts, possibly contradictory rules, and
preferences over the rules) and a question about that theory, the task is to
perform multi-hop reasoning and conflict resolution over the input theory to
answer the question. The final answer to the question is either `proved` (if the
statement in the question derives from the theory), `disproved` (if the
negation of the statement in the question derives from the theory), or
`unknown` (if neither the statement in the question nor its negation derives
from the theory). With three labels per question, a random baseline has an
accuracy of ~33.3%. Conflicts may arise when two rules such as:

R1: a implies c
R2: b implies not c

are both activated, leading to conflicting beliefs about the truth value of the
variable c. However, preferences over the rules are provided in the input
question, and in the case of a conflict, the derivation from the rule with the
higher preference must be concluded (e.g., if R1 is preferred over R2 and they
both apply, then we conclude that c is true).
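
For illustration, here is a minimal sketch of this preference-based conflict
resolution applied to the single-hop R1/R2 example above; the data structures
and function names are our own and not part of the benchmark, which additionally
requires chaining rules over multiple hops.

```python
# Illustrative sketch only: resolve a conflict between two activated rules by
# rule preference, as in the R1/R2 example above.

facts = {"a": True, "b": True}

# Each rule: (name, antecedent variable, concluded variable, concluded value).
rules = [
    ("R1", "a", "c", True),    # R1: a implies c
    ("R2", "b", "c", False),   # R2: b implies not c
]

# Lower number = higher preference, so R1 is preferred over R2.
preference = {"R1": 0, "R2": 1}

def derive(query_var, query_value):
    """Return 'proved', 'disproved', or 'unknown' for the claim query_var == query_value."""
    # Conclusions of all activated rules about the queried variable.
    fired = [(name, value) for name, ante, var, value in rules
             if facts.get(ante) and var == query_var]
    if not fired:
        return "unknown"
    # In a conflict, the most preferred activated rule wins.
    _, winner_value = min(fired, key=lambda rule: preference[rule[0]])
    return "proved" if winner_value == query_value else "disproved"

print(derive("c", True))  # 'proved': R1 and R2 both fire, but R1 is preferred
```
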
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_boardgame_qa/task.json

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_boolean_expressions/README.md
@@ -0,0 +1,22 @@
# BBEH Boolean Expressions

This task requires determining the truth value of a statement that is composed
of logical operands such as *True* and *False* as well as other textual or
mathematical statements that evaluate to True or False. To create this task, we
first randomly create expressions containing only True and False operands and
three logical operators: **and**, **or**, and **not**. We do this in a
bottom-up fashion in which we generate smaller sub-expressions and then combine
them with logical operators. Once a large enough expression is created, we
replace some of the True and False operands with statements that evaluate to
True or False. These could be mathematical expressions such as *24 - 2 is
greater than 48 / 2* (which evaluates to False) or textual statements such as
*The capital of Canada is Ottawa* (which evaluates to True). In both cases, we
select these statements from a predefined set. While determining the truth value
of each of these statements in isolation may be easy for many models, including
them prevents models from simply solving the problem by generating a single line
of Python code.

We generate five expressions using the approach outlined above, four of which
evaluate to False and one of which evaluates to True. The job of the model is
then to find the expression that evaluates to True. Since this is a five-way
question, the random-chance accuracy is 20%.
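
The sketch below illustrates the bottom-up generation in rough form; the
recursion depth, the statement pool, and the replacement probability are our own
illustrative choices rather than the benchmark's actual parameters.

```python
import random

# Rough sketch of bottom-up expression generation. For brevity, True/False
# operands are swapped for equivalent statements as the leaves are generated
# rather than in a separate pass over the finished expression.
STATEMENTS = {
    "24 - 2 is greater than 48 / 2": False,
    "The capital of Canada is Ottawa": True,
}

def build_expression(depth):
    """Return (expression_string, truth_value), combining sub-expressions bottom-up."""
    if depth == 0:
        value = random.choice([True, False])
        matching = [s for s, v in STATEMENTS.items() if v == value]
        if matching and random.random() < 0.5:
            return f"({random.choice(matching)})", value
        return str(value), value
    op = random.choice(["and", "or", "not"])
    if op == "not":
        sub, val = build_expression(depth - 1)
        return f"not {sub}", not val
    left, left_val = build_expression(depth - 1)
    right, right_val = build_expression(depth - 1)
    val = (left_val and right_val) if op == "and" else (left_val or right_val)
    return f"({left} {op} {right})", val

expression, value = build_expression(depth=3)
print(expression, "->", value)
```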

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_buggy_tables/README.md
@@ -0,0 +1,20 @@
# BBEH Buggy Tables

This task was constructed synthetically by the authors. The objective in this
task is to respond to conditional queries over tabular data, where the
information in the table is presented in a buggy way but a description of the
bug is also provided so that the model can reconstruct the original table. As an
example, we provide a row-major/column-major format of the table from which the
null values have been mistakenly removed, but we also provide the positions of
the null values in the original table so one can reconstruct the table given the
two pieces of information. As another example, we provide a buggy version of the
table where some random values are appended at the end of each row or each
column, but we also specify how they have been added so one can use this
information to remove them and reconstruct the original table. As yet another
example, we provide a markdown format of the table that merges every two rows of
the table into one row, but we also provide an explanation of how each pair of
rows has been merged so that the original table can be reconstructed from that
information. Examples of conditional queries include computing some statistic
(count, sum, mean, stdev, median) of some columns while only considering rows
where certain columns have specific values.
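
As a concrete illustration of the first bug type (null values dropped from a
row-major dump), here is a minimal reconstruction sketch; the variable names and
the (row, column) position format are assumptions made for illustration.

```python
# Reconstruct a table from a row-major dump whose null cells were dropped,
# given the positions of the nulls in the original table.
n_rows, n_cols = 3, 3
values_without_nulls = [1, 2, 3, 4, 5, 6]     # row-major listing, nulls removed
null_positions = {(0, 1), (1, 2), (2, 0)}     # where the nulls originally were

remaining = iter(values_without_nulls)
table = [[None if (row, col) in null_positions else next(remaining)
          for col in range(n_cols)]
         for row in range(n_rows)]

for row in table:
    print(row)
# [1, None, 2]
# [3, 4, None]
# [None, 5, 6]
```

Once the table is reconstructed, the conditional query (e.g., the mean of one
column restricted to rows where another column has a given value) can be
computed over it as usual.
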
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_buggy_tables/task.json

Large diffs are not rendered by default.

@@ -0,0 +1,9 @@
# Causal Understanding

This dataset includes a subset of the causal stories in [Nie, Allen, et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/f751c6f8bfb52c60f43942896fe65904-Abstract-Conference.html) and improved examples from [Kıcıman, Emre, et al. (2023)](https://arxiv.org/abs/2305.00050). The first set of questions focuses on testing causal judgment, and the second set focuses on testing the ability to reason about necessary and sufficient causes.

The "Causal understanding" task is a modified version of the following:

- The BBEH-MOCA is a modified version of the dataset ‘MOCA’ by authors Allen Nie, Yuhui Zhang, Atharva Amdekar, Chris Piech, Tatsunori Hashimoto, and Tobias Gerstenberg, available at https://github.com/cicl-stanford/moca/tree/main/data.
- The BBEH-Vignettes is a modified version of the dataset ‘Actual Causality Vignettes’, Copyright (c) 2022 Amit Sharma, made available at https://github.com/amit-sharma/chatgpt-causality-pairs/blob/main/actual-causality/data.csv.
- The BBEH-Lab Vignettes is a modified version of the dataset ‘Actual Causality Pairs’, Copyright (c) 2022 Amit Sharma, made available at https://github.com/amit-sharma/chatgpt-causality-pairs/blob/main/actual-causality/lab_data.csv.

Large diffs are not rendered by default.

21 changes: 21 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_disambiguation_qa/README.md
@@ -0,0 +1,21 @@
# BBEH DisambiguationQA

This task introduces a more challenging adaptation of the original
Disambiguation task in BBH. The objective is to accurately determine the
referents of ambiguous pronouns in complex sentences, or to explicitly identify
instances of unresolvable ambiguity by responding 'ambiguous'. To increase the
task's difficulty and complexity, we constructed a dataset of 120 novel examples
that are longer than those in BBH, require more referent disambiguation, and
contain more options per question, so the random-chance performance is lower.
These examples were constructed either by creating entirely new sentences or by
combining existing BBH instances. Ten annotators (all of them authors of the
paper) were tasked with creating these examples, each comprising a potentially
ambiguous sentence, a single correct resolution statement, and several
distractor options for a multiple-choice format. To ensure data quality, each
example underwent a two-stage verification process. First, a separate
annotator independently evaluated the correctness of the resolution.
Discrepancies were then resolved through a third-party adjudicator or
collaborative refinement by all three annotators. In cases where consensus
could not be reached, the annotators jointly revised the example to achieve
clarity and accuracy. This rigorous process resulted in 25 examples requiring
modification.

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_dyck_languages/README.md
@@ -0,0 +1,12 @@
# BBEH Dyck Language

This task comes from [BIG-Bench Mistake](https://arxiv.org/pdf/2311.08516).
It involves finding the first mistake in an existing chain-of-thought (CoT)
sequence used to answer a Dyck Language question from the original BIG-Bench
Hard (BBH) dataset. In each example, the target answer is either the number of
the step at which the first mistake occurred, or that there are no mistakes in
the CoT sequence. These CoT sequences were generated by prompting PaLM 2 Unicorn
on the original BBH dataset at temperature 0. The newline is used as a stop
token so that each intermediate step can be prepended with *Thought 1:*,
*Thought 2:*, etc.
Further information on the prompting and generation process can be found in the
original work.

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_geometric_shapes/README.md
@@ -0,0 +1,17 @@
# BBEH Geometric Shapes

SVG is a language for drawing shapes. We use two basic commands: 1- M (x, y),
corresponding to moving to the (x, y) coordinate, and 2- L (x, y), corresponding
to drawing a line from the current location to (x, y). We take the shape
outlines from [GeomVerse](https://arxiv.org/abs/2312.12241), a dataset of
geometry questions involving multiple shapes that share some elements; the
outlines are specified as TikZ commands, and we convert them to SVG. We then ask
the model to identify what shapes will be drawn if we visualize the SVG.

We consider two extra axes of difficulty: 1- we randomly break some line
segments into multiple collinear line segments, and 2- we add some extra lines
such that they intersect at some points and those intersections form some shapes
(in other cases, shapes are created using the full line segments and not at
their intersection points). We then create four subsets for the task
corresponding to the cross product of few vs. many line breaks and intersect vs.
no intersect.
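
For illustration, a minimal path in this notation (with arbitrarily chosen
coordinates) might be:

```
M (0, 0) L (4, 0) L (8, 0) L (4, 6) L (0, 0)
```

Here the segments from (0, 0) to (4, 0) and from (4, 0) to (8, 0) are collinear,
so the visualized figure is a single triangle with vertices (0, 0), (8, 0), and
(4, 6); identifying it requires merging the broken segments rather than simply
counting the drawing commands.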

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_hyperbaton/README.md
@@ -0,0 +1,9 @@
# Hyperbaton

The BBEH Hyperbaton task assesses a model's ability to inductively reason about
adjective order in a novel English variant, where the standard adjective
ordering is randomized. Models must infer this new order from example sentences
with partial orderings and identify correct sentences from provided options.
This task moves beyond testing standard linguistic knowledge, focusing on
inducing and applying new rules, and challenging strong priors about standard
adjective ordering.
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_hyperbaton/task.json

Large diffs are not rendered by default.

16 changes: 16 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_linguini/README.md
@@ -0,0 +1,16 @@
# BBEH Linguini

This task comes from [Sánchez et al. (2024)](https://arxiv.org/abs/2409.12126),
where the problems are extracted from the International Linguistics Olympiad
(IOL). The original dataset is available
[here](https://github.com/facebookresearch/linguini). According to the original
work that introduced this dataset, the problems are *"linguistic problems
which require meta-linguistic awareness and deductive reasoning capabilities to
be solved instead of pre-existing language proficiency"*.

We created a subset of the Linguini problems by sampling from four of its
categories, namely *translation*, *fill blanks*, *num to text*, and
*text to num*. The original dataset contains questions that require multiple
answers. For example, the *fill blanks* questions have multiple blanks that need
to be filled. We create questions that have a single answer by randomly
selecting one of those blanks and only asking the model to fill that one.
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_linguini/task.json

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_movie_recommendation/README.md
@@ -0,0 +1,17 @@
# BBEH Movie Recommendation

The original Movie Recommendation task in BIG-Bench Hard was created as
follows. For each question, a set of eight movies from MovieLens was selected
such that a rather large number of people all liked five of them and disliked
three of them. Then, a question was generated by giving four of the five liked
movies and asking models to recommend one of the remaining four movies, where
the correct answer is the one left out of the five liked movies.

We updated this task as follows. We create multiple sets of movies, one of
which contains the five liked movies while the others contain some of the liked
movies and some of the disliked movies. Then, we ask the model to select the set
whose movies are most likely to all be liked by a large group of people. In this
new variant, instead of recommending a single movie given four movies, models
have to examine each set separately, predict its overall likability, and then
decide which option is most likely to have a high likability score under our
specific definition of likability.

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_multistep_arithmetic/README.md
@@ -0,0 +1,25 @@
# BBEH Multi-Step Arithmetic

This task introduces new arithmetic operators. An example of such an operator
is as follows:

a >< b equals (a - b) if a * b > 0; otherwise, it equals a + b

Some of the operations can be defined in terms of other new operations. For
example, we may have:

a ; b equals (a >< b) if a + b > 0; otherwise, it equals a - b

We also define a form of composing multiple operations as follows: a op1 op2 b
denotes (a op1 b) op2 b; for example, 4 +* -5 means (4 + -5) * -5 and 4 *++ 5
means (4 * 5) ++ 5.

Then we sample random arithmetic expressions involving the above operations. An
example expression is:

(1 @*+ 4) <>+[] (-4 *<>* -1)

(although our expressions are longer), with @, <>, and [] being new operations.
The job of the model is to compute the value of the expression. Doing so
requires expanding the expression and correctly carrying out a long sequence of
computations.
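
The sketch below implements the two operators defined above and the composition
rule; the helper function names are ours, and this is only an illustration, not
the benchmark's evaluation code.

```python
# Illustrative implementations of the operators defined above.

def op_angle(a, b):
    """a >< b = (a - b) if a * b > 0, otherwise a + b."""
    return a - b if a * b > 0 else a + b

def op_semicolon(a, b):
    """a ; b = (a >< b) if a + b > 0, otherwise a - b (defined via ><)."""
    return op_angle(a, b) if a + b > 0 else a - b

def compose(a, op1, op2, b):
    """a op1 op2 b denotes (a op1 b) op2 b."""
    return op2(op1(a, b), b)

print(op_angle(4, -5))      # 4 * (-5) is not > 0, so 4 + (-5) = -1
print(op_semicolon(3, 2))   # 3 + 2 > 0, so 3 >< 2 = 3 - 2 = 1
# (4 >< -5) ; -5 = (-1) ; -5 = -1 - (-5) = 4
print(compose(4, op_angle, op_semicolon, -5))
```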

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_nycc/README.md
@@ -0,0 +1,17 @@
# BBEH NYCC

This task builds on the existing benchmarks for the New Yorker Caption Contest
(NYCC) dataset (see [this work](https://arxiv.org/abs/2209.06293) and
[this work](https://arxiv.org/abs/2406.10522)). The NYCC caption dataset
consists of (a) several hundred contests, each comprising a cartoon published
in the New Yorker magazine and several thousand submitted humorous captions,
and (b) crowdsourced ratings for each caption. The ratings are on a scale of
**Unfunny**, **Somewhat Funny**, and **Funny**, and each caption has anywhere
from a few dozen to a few thousand ratings. Past works have focused on pairwise
comparison tasks, where two captions and a textual description of the cartoon
are presented to the model, and the model has to pick the funnier of the two.
To make the task significantly more difficult, for each contest we sample one
query from the top ten rated, and then take captions ranked 1000-1009 and ask
the model to choose the funniest. We use the textual descriptions of the
cartoons generated by GPT-4o that are provided in
[this work](https://arxiv.org/abs/2406.10522).
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_nycc/task.json

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_object_counting/README.md
@@ -0,0 +1,12 @@
# BBEH Object Counting

Given a long list of objects that a person has, the model has to count the
number of items of a certain type. For example, the items might belong to
classes (fruits, cell phones, cars), and the goal may be to count the total
number of cell phones that the person has. We consider two types of questions:
1- counting the total number of items belonging to two different classes, and
2- finding the absolute difference between the numbers of items belonging to two
different classes. To add to the difficulty of the task, some irrelevant
information, including the number of the same items that other people have,
is added to the input context, so the problem becomes one of finding multiple
needles in a haystack.
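
A small sketch of the two question types over an assumed toy inventory (the
items, classes, and counts are illustrative):

```python
# Items the person has; distractor counts about other people would appear in
# the prompt text but must be ignored when answering.
inventory = {
    "apple": 3, "banana": 2,   # fruits
    "iphone": 1, "pixel": 4,   # cell phones
    "sedan": 2,                # cars
}

FRUITS = {"apple", "banana"}
PHONES = {"iphone", "pixel"}

def class_count(class_items):
    return sum(count for item, count in inventory.items() if item in class_items)

print(class_count(FRUITS) + class_count(PHONES))       # type 1: sum = 5 + 5 = 10
print(abs(class_count(FRUITS) - class_count(PHONES)))  # type 2: |5 - 5| = 0
```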

Large diffs are not rendered by default.

23 changes: 23 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_object_properties/README.md
@@ -0,0 +1,23 @@
# BBEH Object Properties

In this task, an initial collection of objects with different properties (color,
size, origin, smell, and material) is provided (e.g., an extra-small blue
Canadian jar made of glass with a smell of rose). Then, the collection goes
through several updates corresponding to adding, removing, or editing some of
the objects. The updates are explained in the prompt, and the model requires a
full grasp of the object properties to identify what changes must be made to the
collection for each update. A simple example of an update is as follows:

My dad threw away all objects of a certain color from my collection.
After this, my collection only had 5 blue objects and 3 white objects.

For the above update, one has to find which color has been removed by comparing
the new colors with the object colors in the previous collection, and then
update the collection accordingly. The set of updates that the collection goes
through in each example is randomly selected from a large set of possible
changes. At the end, a question is asked about the final collection. The
question is either an **either** question, in which we ask how many items in
the final collection have property 1 or property 2, ... (e.g., how many items
are either blue or small), or a **neither** question, in which we ask how many
items have neither property 1 nor property 2, ... (e.g., how many items are not
blue and not small).
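
A minimal sketch, under assumed data structures, of one update followed by the
two question types; this is an illustration, not the benchmark's generation
code.

```python
collection = [
    {"color": "blue",  "size": "small"},
    {"color": "blue",  "size": "large"},
    {"color": "white", "size": "small"},
    {"color": "white", "size": "large"},
    {"color": "red",   "size": "small"},
]

# Update (adapted from the example above): "My dad threw away all objects of a
# certain color. After this, my collection only had blue and white objects."
remaining_colors = {"blue", "white"}
removed_colors = {obj["color"] for obj in collection} - remaining_colors  # {'red'}
collection = [obj for obj in collection if obj["color"] not in removed_colors]

# "either" question: how many items are either blue or small?
either = sum(1 for obj in collection
             if obj["color"] == "blue" or obj["size"] == "small")
# "neither" question: how many items are neither blue nor small?
neither = sum(1 for obj in collection
              if obj["color"] != "blue" and obj["size"] != "small")
print(either, neither)  # 3 1
```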

Large diffs are not rendered by default.

15 changes: 15 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_sarc_triples/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
<!-- mdlint off(LINE_OVER_80) -->

# BBEH SARC Triples

[SARC](https://aclanthology.org/L18-1102.pdf) (Self-Annotated Corpus for Sarcasm) is a large dataset of sarcastic responses mined from the Reddit social media / forum platform. Many Reddit users end a post or reply with the token **/s** when they intend the preceding text to be interpreted sarcastically or satirically. This allowed positive examples of user-intended sarcasm to be mined.

Forking off the SARC dataset, we construct a challenging task for LLMs that requires reading three independent examples from SARC and classifying each with a binary label, where a positive label indicates sarcasm. The SARC authors created a balanced test set with 64,666 examples. Many of these examples can only be understood with an image or an article link that accompanied the original post or reply. On the other hand, some examples, usually those with longer textual content, can be understood on their own. We design our derived benchmark to consist mainly of the latter type. To achieve this, we filter out examples that either (1) have fewer than 100 characters or (2) lack a reply, resulting in 679 examples from the original test set with a 48.4% positive label rate. We sample (uniformly at random) 600 examples from this set, group them (uniformly at random) into groups of three, and pass the text of each triple of (post, reply) pairs to the following prompt (a sketch of this filtering and grouping step follows the prompt):

Here are three (post, reply) pairs from Reddit. Your task is to decide whether each reply is sarcastic. Specifically, label each pair with a "0" or "1", where a "1" indicates that the reply is sarcastic, and a "0" indicates that the reply does not contain sarcasm, and provide your final answer as a comma-separated set of labels (e.g., "1,0,0" or "0,0,0").
POST 1: post1_text
REPLY 1: reply1_text
POST 2: post2_text
REPLY 2: reply2_text
POST 3: post3_text
REPLY 3: reply3_text
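
Below is an illustrative sketch of the filtering and grouping step described
above; the field names and the exact text used for the length check are
assumptions about the data format, not the authors' code.

```python
import random

def build_triples(examples, seed=0):
    """Filter SARC test examples and group them into three-example questions."""
    rng = random.Random(seed)
    # Drop examples that are too short (fewer than 100 characters) or lack a reply.
    kept = [ex for ex in examples
            if len(ex["post"]) >= 100 and ex.get("reply")]
    # Sample 600 examples uniformly at random, then group them into 200 triples.
    sampled = rng.sample(kept, 600)
    return [sampled[i:i + 3] for i in range(0, len(sampled), 3)]
```
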
1 change: 1 addition & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_sarc_triples/task.json

Large diffs are not rendered by default.

24 changes: 24 additions & 0 deletions eval/chat_benchmarks/BBEH/data/bbeh_shuffled_objects/README.md
@@ -0,0 +1,24 @@
# BBEH Shuffled Objects

The original task in BBH is as follows: there are N people, each assigned an
object/person (e.g., a dance partner, a book, a color, etc.). For example, Alice
has a green book, Bob has a red book, etc. Then, there are multiple switch
operations in which pairs of people swap what they are assigned (e.g., Alice and
Bob switch their books). At the end, one needs to predict the object/person
assigned to one of the N people (e.g., at the end, what color is the book that
Bob has?).

We created two variants of this problem. In the first variant, we keep
everything the same except that we add switch actions that have no effect. For
example, we add *Then, Person1 and Person2 switch their books. Then, Person2 and
Person1 switch their books*. We add many of these no-effect operations so that
the problem becomes a long-context reasoning problem.

The second variant extends the first: we assign names to some of the switch
actions as they occur and use those names later. For example, the first time
*Person1 switches with Person2* occurs, we replace the text with *Person1
switches with Person2 (let's call this Action K)*, and the next time the same
switch happens, with some probability we replace the text with *Action K
repeats*. Given the long-context nature of the problem, the model needs to
remember information from many steps earlier to identify what that action
corresponded to.
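
A minimal simulation sketch of the second variant, with named, repeatable switch
actions (the people, objects, and action names below are illustrative):

```python
assignment = {"Alice": "green book", "Bob": "red book", "Claire": "blue book"}
named_actions = {}  # action name -> (person1, person2)

def switch(person1, person2, name=None):
    """Swap what the two people hold; optionally record the swap under a name."""
    assignment[person1], assignment[person2] = assignment[person2], assignment[person1]
    if name is not None:
        named_actions[name] = (person1, person2)

def repeat(name):
    """'Action K repeats': replay a previously named switch."""
    switch(*named_actions[name])

switch("Alice", "Bob", name="Action 1")  # named the first time it occurs
switch("Bob", "Claire")
repeat("Action 1")                       # same as switching Alice and Bob again
print(assignment["Bob"])                 # 'red book'
```

Two consecutive identical switches, as in the first variant's no-effect
operations, would leave the assignment unchanged.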

Large diffs are not rendered by default.
