Skip to content

casperlchristensen/agentic_prisoners_dilemma

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Prisoner's dilemma

Inspired by the classic Prisoner's dilemma game, this pits 2 (or more) agents against one-another in an iterated variant. The agents are both instructed to favour winning, and will exchange messages back-and-forth before choosing whether to defect or not.

Setup

There are two types of actors:

  1. The LLM agents playing the game
  2. Secondary LLMs that will try to predict whether a player is going to defect or not.

The secondary LLMs have access to the messages from the current round and all previous round messages and decisions.

In this PoC, both types are o3-mini-2025-01-31.

The game will play out as N rounds of:

  1. The agents send each other a message
  2. A classifier will try to predict the player actions
  3. The agents pick their action (COOPERATE or DEFECT)
  4. The agents reason about the round and plan their next steps

Environment

This project is managed with https://docs.astral.sh/uv/.

To get started, run: uv sync.

Command

Remember to set OPENAI_API_KEY (depending on model choices).

inspect eval prisoners_dilemma.py@play_game

Eval

After every game, a log file will be created that records:

  1. Win/draw/loss statistics per player for each round
  2. Classifier metrics (confusion matrix)

Examples

See the /runs/ folder for some concrete examples. Note that you will need Inspect to view them properly.

On scheming

The term scheming is used rather loosely here. While the agents are not actually scheming in e.g. https://arxiv.org/pdf/2412.04984, the behaviour of telling a participant that you are going to cooperate and then choose the DEFECT action mirrors scheming behaviour. It does show that models will generally send pro-social messages (i.e. encourage cooperation and mutual trust), but still very much try to win the game when instructed to.

On strategy

Since the number of rounds is known ahead of time, the best strategy is to ALWAYS defect. The agents sometimes figure that out during play, but often start out cooperating regardless. Seems like there is some propensity for models to be cooperative; probably stemming mainly from the RLHF where the models are tuned to be more helpful.

Example metrics

Classification metrics for the auxiliary model predicting whether a participant would defect or not.

Corresponds to run draw_02.eval.

Individual Participant Metrics

Participant Correct Accuracy True Positives False Positives True Negatives False Negatives Precision Recall F1 Score
Participant 0 9 0.9 8 1 1 0 0.88889 1 0.94118
Participant 1 9 0.9 9 1 0 0 0.9 1 0.94737

Overall Metrics

Metric Value
Total Accuracy 0.9
True Positives 17
False Positives 2
True Negatives 1
False Negatives 0
Precision 0.89474
Recall 1
F1 Score 0.94444

About

A PoC that tests the behaviour of LLM agents when playing prisoner's dilemma. The agents get to send messages back and forth before picking whether to cooperate or defect. Additionally uses auxiliary models as scheming detectors.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages