[EMNLP 2024] This is the official implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners" in PyTorch.
10-11-2024 We presented the work at the Wharton AI & Analytics Initiative's Research & Education Symposium.
10-07-2024 We support the findings in Apple's trending paper "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models", which references our work to question the reasoning capabilities of LLMs while generalizing mathematical reasoning problems into symbolic templates. Both are definitely worth checking out!
News! 09-21-2024 A short version of this work has been accepted to the EMNLP 2024 GenBench Workshop.
News! 09-20-2024 The full paper has been accepted to the EMNLP 2024 Main.
09-20-2024 We presented the work at the Penn ASSET & Warren Center research mixer.
07-08-2024 We released a short video on Twitter. Enjoy!
06-17-2024 A short version of this work has been accepted to the ICML 2024 Workshop on LLMs and Cognition.
06-16-2024 We released the paper on ArXiv.
Large language models (LLMs) have achieved remarkable progress in understanding and generating human-like text, but there is ongoing debate about whether LLMs possess genuine reasoning capabilities. This work reconceptualizes the evaluation of LLMs' reasoning capabilities into a general and rigorous testing framework with statistical guarantees.
We say that an LLM is subject to token bias in a reasoning task if systematic changes to some or all tokens in the task descriptions, while keeping the underlying logic intact, allow us to predict the direction of the shift in the model's output. A strong token bias suggests that the LLM is relying on superficial patterns in the input rather than truly understanding the underlying reasoning task, leading to brittle performance that fails to generalize well. Let us look at the following classic "twenty-five horses" problem in graph theory:
You want to find the fastest 3 horses in a group of 25 horses. You can only race 5 horses at a time. You don't have a stopwatch, so you can only know the ranking of each horse within each race. How many races do you need?
GPT-4 and Claude-3-opus answer this question with accuracies of about 98.5% and 40.5%, respectively. However, simply perturbing "horses" to "bunnies", a change that should not affect the logical essence, systematically decreases their accuracy to 85.0% and 30.0%, respectively. Further changing "25" to other values decreases their accuracy to 46.0% and 24.0%. These observations indicate strong token biases toward the frequently used tokens "horses" and "25" in such problems, and suggest that LLMs do not genuinely understand how to solve them.
You want to find the fastest 3 bunnies in a group of 25 bunnies. You can only race 5 bunnies at a time. You don't have a stopwatch, so you can only know the ranking of each bunny within each race. How many races do you need?
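The perturbation described above can be sketched in a few lines of Python. This is only an illustration of a whole-word token swap that leaves the logic of the problem intact; the helper name `perturb_tokens` and the `SWAPS` table are hypothetical and are not taken from the repository.

```python
import re

# Token substitutions from the example above; names here are hypothetical.
SWAPS = {"horses": "bunnies", "horse": "bunny"}


def perturb_tokens(text: str, swaps: dict[str, str]) -> str:
    """Replace whole-word tokens while leaving the rest of the problem untouched."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda match: swaps[match.group(1)], text)


original = (
    "You want to find the fastest 3 horses in a group of 25 horses. "
    "You can only race 5 horses at a time. You don't have a stopwatch, "
    "so you can only know the ranking of each horse within each race. "
    "How many races do you need?"
)

perturbed = perturb_tokens(original, SWAPS)
print(perturbed)
```

Only the surface tokens change: the numbers and the structure of the question stay the same, which is what makes any accuracy shift attributable to the tokens themselves.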
We take the classic Linda Problem in Psychology as another example. Below is the original problem statement.
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
(a) Linda is a bank teller.
(b) Linda is a bank teller and is active in the feminist movement.
Experiments in behavioral psychology reveal that people typically believe the second option is more likely than the first, but this contradicts the conjunction rule of probability. Advanced LLMs like GPT-4 can typically recognize this fallacy, since it is a classic problem that appears frequently in the cognitive science literature. However, altering seemingly irrelevant tokens in the problem statement, such as changing the name "Linda" to "Luna" while maintaining the same logical structure, surprisingly confuses most LLMs. With one-shot learning, the accuracy of GPT-4 and Claude-3-opus decreases from 100.0% to 72.0% and from 95.0% to 32.0%, respectively (see the paper for detailed experimental setups).
Luna is 29 years old, married, deeply passionate about environmental conservation and transgender rights, and volunteers their weekends at local park clean-ups. They studied physics and applied math in college, and held several campaigns to reduce the campus's carbon footprint. Which is more probable?
(a) Luna is an assistant professor in aerospace engineering and is an active member of an environmental advocacy group.
(b) Luna is an assistant professor in aerospace engineering.
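The conjunction rule behind both problems says that the probability of "A and B" can never exceed the probability of "A" alone, so option (a) is correct for Linda and option (b) is correct for Luna. A minimal sketch of that rule follows; the toy population is invented purely for illustration and does not come from the paper.

```python
from fractions import Fraction

# Hypothetical toy population of (is_bank_teller, is_feminist) pairs, invented for illustration.
population = [
    (True, True),
    (True, False),
    (False, True),
    (False, True),
    (False, False),
]


def prob(event) -> Fraction:
    """Empirical probability of an event over the toy population."""
    return Fraction(sum(1 for person in population if event(person)), len(population))


p_teller = prob(lambda person: person[0])
p_teller_and_feminist = prob(lambda person: person[0] and person[1])

# The conjunction can only remove people from the set, never add them.
assert p_teller_and_feminist <= p_teller
print(p_teller, p_teller_and_feminist)
```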
In our paper, we explore many other token biases in logical reasoning, set theory, and mathematical reasoning problems. We reconceptualize the evaluation of reasoning capabilities into a general and rigorous statistical testing framework, moving beyond accuracy. We conclude, with statistical guarantees, that LLMs do not consistently apply genuine reasoning in their decision-making process, but primarily rely on token bias for response generation. Therefore, we raise concerns about the extent to which LLMs truly engage in reasoning. Any robust evaluation of an LLM's generalization should account for the fundamental impact of token bias hidden in the current benchmark problems.
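To give a feel for what "moving beyond accuracy" can mean, here is a sketch of a one-sided two-proportion z-test on accuracy before and after a perturbation. It is a generic textbook test, not necessarily the procedure used in the paper. It treats the two sets of answers as independent samples, and the sample size of 200 per condition is an assumption made only so that the example runs.

```python
import math


def two_proportion_z_test(correct_a: int, n_a: int, correct_b: int, n_b: int) -> tuple[float, float]:
    """One-sided test of H0: accuracy_a <= accuracy_b against H1: accuracy_a > accuracy_b."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Upper-tail probability of a standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Accuracies of 98.5% and 85.0% come from the text; n = 200 per condition is an assumption.
z, p_value = two_proportion_z_test(197, 200, 170, 200)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}")
```

A small p-value here would indicate that the drop in accuracy after the token swap is unlikely to be noise. When the same problems are evaluated before and after perturbation, a paired test may be more appropriate.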
All images are generated by OpenAI GPT-4o. When we requested 'lop-eared bunnies', the model even displayed a visual token bias by generating bunnies with four ears (both lop and erect), suggesting it associated the term 'bunnies' with the presence of two erect ears without genuine logical understanding.
All twenty-five bunnies above will be happy if you cite our work. Thank you!
@article{jiang2024peek,
title={A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners},
author={Jiang, Bowen and Xie, Yangxinyu and Hao, Zhuoqun and Wang, Xiaomeng and Mallick, Tanwi and Su, Weijie J and Taylor, Camillo J and Roth, Dan},
journal={arXiv preprint arXiv:2406.11050},
year={2024}
}
- Add the twenty-five horses problem to the paper
- Evaluate the new GPT-o1 reasoning model
Please check requirements.txt. You can run the following commands to create a virtual environment and install all the requirements:
python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
We provide our synthetic dataset under data/, which contains a comprehensive set of logical-fallacy problems. The dataset file is in JSON format, and each item is a dictionary containing question_id, question, target_answer, and incorrect_answer. You can also follow the instructions below to generate more synthetic data on the fly.
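As a rough illustration of the item format described above, the sketch below parses a JSON document and checks that every item has the four documented keys. It assumes the file holds a list of dictionaries. The sample item and its values are invented, and `load_items` is a hypothetical helper, not a function from the repository.

```python
import json

REQUIRED_KEYS = {"question_id", "question", "target_answer", "incorrect_answer"}

# Hypothetical item that only mirrors the documented keys; the values are invented.
sample_json = """
[
  {
    "question_id": 0,
    "question": "Which is more probable? (a) ... (b) ...",
    "target_answer": "(a)",
    "incorrect_answer": "(b)"
  }
]
"""


def load_items(raw: str) -> list[dict]:
    """Parse a JSON list of problems and check that each item has the documented keys."""
    items = json.loads(raw)
    for item in items:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {item.get('question_id')} is missing keys: {sorted(missing)}")
    return items


items = load_items(sample_json)
print(len(items), items[0]["target_answer"])
```

To inspect a real file, read its contents (for example with `pathlib.Path(...).read_text()`) and pass the string to `load_items`.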
Always set up OpenAI ChatGPT models. Please follow its Developer quickstart to set up your OpenAI API, create a new api_tokens/openai_key.txt file, and copy and paste your API key into it.
To use Google Gemini models with an API for inference, follow the Try Gemini 1.0 Pro (Python) section of the Google Vertex AI instructions. Note that your school's Gmail account may not allow you to make payments.
- Step 1: Following their instructions, first install the Vertex AI client libraries, then create a project with a project ID, enable the Vertex AI API, create a service account, and generate your account key. You don't need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS, since we already do that for you in query_llm.py.
- Step 2: Install or update the Vertex AI SDK for Python.
- Step 3: Authenticate to Vertex AI and set up Application Default Credentials.
  - Follow the "Local development environment - Provide user credentials for your Google Account" section to install and initialize the gcloud CLI. This step will download a folder named google-cloud-sdk to your project's top directory.
  - After installation, run
    gcloud init
    to initialize the gcloud CLI. You will be able to choose your account and project ID. Create a new api_tokens/gemini_project_id.txt file, and copy and paste your project ID into it.
  - To create your credential file, run
    gcloud auth application-default login
    You will see a prompt like
    Credentials saved to file: [/path/to/your/home/.config/gcloud/application_default_credentials.json]
  - Because of the credential file path we set in our config.yaml, run
    mv /path/to/your/home/.config/gcloud/application_default_credentials.json google-cloud-sdk/google_gemini_credential.json
To use Meta Llama models with an API for inference, follow the Running Llama 3 with Python section of Replicate's Run Llama 3 with an API instructions to set up your API tokens, create a new api_tokens/llama_key.txt file, and copy and paste your tokens into it.
To use Anthropic Claude models with an API for inference, follow its Quickstart Guide to install the Anthropic Python SDK, set up an account with API access, get your API key, create a new api_tokens/claude_key.txt file, and copy and paste your key into it. You don't need to set the environment variable ANTHROPIC_API_KEY.
To use Mistral models with an API for inference, follow its Quickstart to install the mistralai library, set up an account with API access, get your API key, create a new api_tokens/mistral_key.txt file, and copy and paste your key into it. You don't need to set the environment variable MISTRAL_API_KEY.
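Each provider above stores its credential in a small text file under api_tokens/. As a sketch of how such a file might be read, the helper below returns the stripped contents and fails loudly on an empty file. `read_api_key` is a hypothetical name, and the repository's own loading logic in query_llm.py may differ.

```python
from pathlib import Path


def read_api_key(path: str) -> str:
    """Return the key stored in a one-line text file, without surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty; paste your API key into it")
    return key


# Example usage (assumes the file exists):
# openai_key = read_api_key("api_tokens/openai_key.txt")
```

Stripping whitespace guards against the trailing newline that most editors add when the key is pasted and saved.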
The following command-line arguments are supported:
- --model to select the LLM for inference. The list below was last updated on 06-29-2024, but our code should be compatible with more recent model names.
  - OpenAI ChatGPT family. Check OpenAI's continuous model upgrades.
    - gpt3.5 or equivalently gpt-3.5-turbo, gpt-3.5-turbo-0125
    - gpt-3.5-turbo-1106
    - gpt-3.5-turbo-0613
    - gpt-4o
    - gpt4 or equivalently gpt-4-turbo, gpt-4-turbo-2024-04-09
    - gpt-4-0125-preview
    - gpt-4-1106-preview
    - gpt-4-0613
  - Google Gemini family. Check Gemini model versions and lifecycle. Note that Google currently imposes a relatively low request-per-minute limit on API usage, so you may encounter related errors when running the inference code.
    - gemini or equivalently gemini-1.0-pro, gemini-1.0-pro-002
    - gemini-1.0-pro-001
    - gemini-1.5-pro-preview-0409
  - Meta Llama family. Check Choosing which model to use for Llama-3 and Llama-2.
    - llama or equivalently llama3-70b, meta-llama-3-70b-instruct
    - llama3-8b or equivalently meta-llama-3-8b-instruct
    - llama-2-70b-chat
    - llama-2-13b-chat
    - llama-2-7b-chat
  - Anthropic Claude family. Check Models overview.
    - claude or equivalently claude-3-opus-20240229
    - claude-3-sonnet-20240229
    - claude-3-haiku-20240307
  - Mistral family. Check API versioning.
    - mistral or equivalently mistral-large-latest, mistral-large-2402
    - mistral-medium-latest or equivalently mistral-medium-2312
    - mistral-small-latest or equivalently mistral-small-2402
    - open-mixtral-8x22b or equivalently open-mixtral-8x22b-2404
    - open-mixtral-8x7b or equivalently mistral-small-2312
    - open-mistral-7b or equivalently mistral-tiny-2312
- --task to specify data to generate synthetic datasets or inference to evaluate the LLM's ability to answer the questions.
- --verbose to print detailed data information and model responses during inference.
- [For Data Generation Only] --fallacy to select the type of logical fallacy. We currently support linda for the Linda Problem and its variants and sets for the syllogistic problems.
- [For Data Generation Only] --gen_mode to select the mode of generating the synthetic dataset when task is data. Options are
  - baseline: simple in-context learning with limited instructions
  - control: step-by-step guidance to generate both gold samples and random samples with irrelevant info
- [For Data Generation Only] --variant to select the variant of the Linda problems, such as the default original, variant_one, variant_two, ..., variant_six. Detailed information about each variant can be found in the linda_problem() function in prompts.py. Include this argument if and only if --fallacy is linda.
- [For Data Generation Only] --conn to select the logical connecting word, such as because, sothat, or to, used to generate new data. Add this argument if and only if --fallacy is linda and --variant is variant_one or variant_two.
- [For Data Generation Only] --n to set the number of synthetic data problems to generate.
- [For Inference Only] --data_file to set the data file path for inference.
- [For Inference Only] --eval_mode to set the evaluation mode for the model to answer questions. Options are
  - baseline for direct prompting
  - zs_cot for zero-shot chain-of-thought (CoT) prompting
  - os for one-shot in-context learning (ICL) prompting with the original Linda Problem (default)
  - os_cot for one-shot ICL plus CoT prompting
  - os_bob for one-shot ICL prompting but with a rephrased Bob Problem
  - os_bob_cot for one-shot ICL plus CoT prompting but with a rephrased Bob Problem
  - os_incorrect for one-shot ICL but with an incorrect answer and a rephrased Bob Problem
  - os_incorrect_cot for one-shot ICL plus CoT but with an incorrect answer and a rephrased Bob Problem
  - fs for few-shot ICL prompting
  - fs_cot for few-shot ICL plus CoT prompting
  - weak_control_zs_cot for weakly controlled zero-shot CoT prompting, leaking the hint that it is a Linda Problem but without detailed instructions
  - weak_control_os_cot for weakly controlled one-shot CoT prompting, leaking the hint that it is a Linda Problem but without detailed instructions
  - control_zs_cot for controlled zero-shot CoT prompting, leaking the hint that it is a Linda Problem with detailed and carefully curated instructions
  - control_os_cot for controlled one-shot CoT prompting, leaking the hint that it is a Linda Problem with detailed and carefully curated instructions
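The arguments listed above could be declared with Python's argparse roughly as follows. This is a sketch that mirrors the documented flags, not the parser in main.py. The default for --model is an assumption. The default for --eval_mode follows the "(default)" note in the list, and the default for --variant follows the "default original" wording.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser for the documented flags (not the repository's own parser)."""
    parser = argparse.ArgumentParser(description="Token bias experiments (illustrative sketch)")
    parser.add_argument("--model", type=str, default="gpt3.5")  # default is an assumption
    parser.add_argument("--task", choices=["data", "inference"], required=True)
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--fallacy", choices=["linda", "sets"])
    # Data generation only
    parser.add_argument("--gen_mode", choices=["baseline", "control"])
    parser.add_argument("--variant", type=str, default="original")
    parser.add_argument("--conn", type=str)
    parser.add_argument("--n", type=int)
    # Inference only
    parser.add_argument("--data_file", type=str)
    parser.add_argument("--eval_mode", type=str, default="os")
    return parser


args = build_parser().parse_args(
    "--model gpt3.5 --task inference --fallacy linda --eval_mode os_cot "
    "--data_file synthetic_dataset_linda_original_gold.json --verbose".split()
)
print(args.task, args.eval_mode, args.verbose)
```

Restricting --task, --fallacy, and --gen_mode with `choices` makes a mistyped value fail immediately instead of partway through a run.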
For example, you can run
python main.py --model gpt3.5 --task data --fallacy linda --gen_mode control --variant original --n 100 --verbose
in the command line and adjust model, fallacy, gen_mode, variant, and n accordingly. All the other hyper-parameters can be set in config.yaml.
Generated files will be saved to the data/ directory.
To start the inference, run
python main.py --model gpt3.5 --task inference --fallacy linda --eval_mode os_cot --data_file synthetic_dataset_linda_original_gold.json --verbose
in the command line and adjust model, eval_mode, and data_file accordingly.
To efficiently run the evaluation with multiple prompting methods, models, and/or data files in parallel, please modify the number of GPU devices available and adjust the code in run.sh. Then run
bash run.sh
All results and final accuracies will be automatically saved to the outputs/ directory.
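For readers who prefer to enumerate runs from Python, the sketch below builds one inference command per combination of model, evaluation mode, and data file. It does not reproduce run.sh, whose contents are not shown here. The grid values are only examples drawn from the options documented above, and `build_commands` is a hypothetical helper.

```python
import itertools
import shlex

# Hypothetical grid; pick any documented --model and --eval_mode values.
models = ["gpt3.5", "claude"]
eval_modes = ["baseline", "os_cot"]
data_files = ["synthetic_dataset_linda_original_gold.json"]


def build_commands(models, eval_modes, data_files):
    """Return one main.py inference command (as an argument list) per combination."""
    return [
        ["python", "main.py", "--model", model, "--task", "inference",
         "--fallacy", "linda", "--eval_mode", eval_mode, "--data_file", data_file]
        for model, eval_mode, data_file in itertools.product(models, eval_modes, data_files)
    ]


commands = build_commands(models, eval_modes, data_files)
for command in commands:
    # Pass `command` to subprocess.run(...) to actually execute it.
    print(shlex.join(command))
```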



