Commit c968c8f

Update README.md
1 parent d5a551e commit c968c8f

1 file changed: +40 -5 lines changed

README.md

Lines changed: 40 additions & 5 deletions
@@ -7,15 +7,50 @@ This project tests Local Large Language Models (LLMs) using **syllogisms** to de
77
**You can test yourself by clicking [here](https://longocris.github.io/Belief-Bias-Questionnaire/) or you can check out the [quiz repository](https://github.com/LongoCris/Belief-Bias-Questionnaire)**
88

99
In particular, the following LLMs are tested:
10-
1. [llama3.2:1b](https://ollama.com/library/llama3.2)
11-
2. [Mistral](https://ollama.com/library/mistral)
12-
3. [qwen3:8b](https://ollama.com/library/qwen3)
10+
1. [llama3.2:1b](https://ollama.com/library/llama3.2): a lightweight baseline model from the Llama 3.2 family
11+
2. [Mistral](https://ollama.com/library/mistral): a 7B model designed for long-context reasoning
12+
3. [qwen3:8b](https://ollama.com/library/qwen3): a recent model optimized for instruction following and logical tasks
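
As a minimal sketch, assuming the models are served by a local Ollama instance at its default address, one of the models above could be queried at the two temperature settings used in the experiments (0 and 0.7) as follows. The helper names `build_payload` and `ask` are hypothetical and are not taken from this repository's code.

```python
import json
import urllib.request

# Model tags as listed in this README.
MODELS = ["llama3.2:1b", "mistral", "qwen3:8b"]
# The two temperature settings used in the experiments.
TEMPERATURES = [0.0, 0.7]
# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, temperature: float) -> dict:
    """Build the JSON body for one non-streaming generation request."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }


def ask(model: str, prompt: str, temperature: float) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_payload(model, prompt, temperature)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Looping over `MODELS` and `TEMPERATURES` yields the six model/temperature configurations; `ask` requires a running Ollama server with the corresponding models pulled.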
1313

14-
The experiments involve the test of each LLM (configured with **temperature** equal to 0 and to 0.7) to see the distribution of accuracy in **conflictual** and **non-conflictual** items. In particular, the conflictual items are defined as the valid-inbelievable and the invalid-believable items, while the non-conflictual ones as the valid-believable and invalid-unbelievable ones. If the errors display correlation between the conflictuality and non-conflictuality of the item, then there is a signal of BB.
14+
Each LLM is tested with **temperature** set to 0 and to 0.7, and its accuracy is compared across **conflictual** and **non-conflictual** items.
15+
16+
In particular, the conflictual items are the valid-unbelievable and invalid-believable ones, where logical validity and believability disagree, while the non-conflictual items are the valid-believable and invalid-unbelievable ones. If errors concentrate on the conflictual items, this is a signal of BB.
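
The conflictual/non-conflictual comparison can be sketched as below. This is an illustration under assumptions, not the analysis code used in the project: the function names are hypothetical, and the accuracy gap is only a simple proxy for the relationship between errors and conflict, not a statistical test.

```python
from collections import defaultdict

# Conflictual items: validity and believability disagree.
CONFLICTUAL = {"VU", "IB"}
# Non-conflictual items: validity and believability agree.
NON_CONFLICTUAL = {"VB", "IU"}


def accuracy_by_conflict(responses):
    """Compute accuracy per group.

    responses: iterable of (category, is_correct) pairs,
    where category is one of "VB", "IB", "VU", "IU".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in responses:
        if category in CONFLICTUAL:
            group = "conflictual"
        elif category in NON_CONFLICTUAL:
            group = "non_conflictual"
        else:
            raise ValueError(f"unknown category: {category}")
        total[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / total[group] for group in total}


def belief_bias_gap(responses):
    """Accuracy on non-conflictual items minus accuracy on conflictual items.

    Assumes both groups are present. A clearly positive gap suggests
    that errors concentrate on the conflictual items.
    """
    acc = accuracy_by_conflict(responses)
    return acc["non_conflictual"] - acc["conflictual"]
```

A gap near zero would suggest no belief bias signal, while a large positive gap would point to errors driven by believability rather than validity.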
1517

1618
![image](https://github.com/user-attachments/assets/f95062ba-f9fc-4a5d-a186-6f7269ca605f)
1719

18-
For a comparison, the test is also experimented on a set of humans.
20+
For comparison, the test was also administered to a group of human participants, as shown in the presentation and in the report.
21+
22+
### Belief Bias Questionnaire
23+
24+
The questionnaire consists of **16 items** balanced across the four categories:
25+
* Valid-Believable (VB): 4 items
26+
* Invalid-Believable (IB): 4 items
27+
* Valid-Unbelievable (VU): 4 items
28+
* Invalid-Unbelievable (IU): 4 items
29+
30+
In each category, three items are of **regular** difficulty (two premises and one conclusion), while one item is **hard** (three premises and one conclusion). For the human participants, this mix is intended to prevent adaptation to the difficulty of the items.
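
The layout described above can be written down as a small data structure. This is a sketch only: the `Item` class and `build_layout` function are hypothetical names, and the actual syllogism texts are not reproduced here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Item:
    valid: bool
    believable: bool
    difficulty: str  # "regular" (two premises) or "hard" (three premises)

    @property
    def category(self) -> str:
        """Two-letter code: VB, VU, IB, or IU."""
        return ("V" if self.valid else "I") + ("B" if self.believable else "U")

    @property
    def n_premises(self) -> int:
        return 3 if self.difficulty == "hard" else 2

    @property
    def conflictual(self) -> bool:
        """True when validity and believability disagree (VU or IB)."""
        return self.valid != self.believable


def build_layout():
    """Return the 16-item layout: 3 regular items and 1 hard item per category."""
    items = []
    for valid, believable in product([True, False], repeat=2):
        for _ in range(3):
            items.append(Item(valid, believable, "regular"))
        items.append(Item(valid, believable, "hard"))
    return items
```

Under this layout, half of the items are conflictual and half are non-conflictual, which is what makes the accuracy comparison between the two groups balanced.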
31+
32+
For the human participants, one neutral item was added to the questionnaire as an **attention** check. Each item required a valid/invalid response within **20 seconds**, to encourage intuitive processing. One item also asked for an open-ended explanation, to examine the reasoning behind the decision and compare it with that of the LLMs.
33+
34+
### Results
35+
36+
| Model | Temperature | Average Accuracy | Belief Bias Signal |
37+
|---------------|-------------|------------------|--------------------|
38+
| Llama3.2:1b | 0.0 | Low | Present |
39+
| Llama3.2:1b | 0.7 | Low–Medium | Present |
40+
| Mistral:7b | 0.0 | Medium | Moderate |
41+
| Mistral:7b | 0.7 | Medium | Moderate |
42+
| **Qwen3:8b** | **0.0** | **High** | **Absent** |
43+
| Qwen3:8b | 0.7 | High | Low |
44+
45+
The precise distribution of the errors can be found in the presentation.
46+
47+
In the human control group (gender-balanced, with at least B2 English and a Bachelor's degree), the results also showed BB, attributed mainly to the time pressure.
48+
49+
### Conclusions
50+
51+
BB is not unique to humans. Some LLMs mimic it under certain configurations. Model complexity and architecture (not temperature alone) determine the expression of BB.
52+
53+
For deeper insights, see our report.
1954

2055
## Setup Instructions
2156
