This repository contains a class for an agent that is trained in a simplified Blackjack environment. The environment is described by the dealing class. The agent is trained using three different policy iteration approaches:
- Q-Learning (QL)
- State-action-reward-state-action (SARSA)
- Temporal Difference (TD)
The results of training for 10000 games, split into 5000 games of exploration and 5000 games of exploitation, are presented below for each policy iteration method.
Averaged over the policy search methods and the multiples of the standard deck trained upon, the agent optimised its decision process to achieve win, draw and loss rates of 48.37, 33.21 and 20.91 percent respectively in terms of dealer-agent interactions.
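The tabular updates behind Q-Learning and SARSA can be sketched as follows. This is a minimal illustration, not the code in agent.py: the learning rate, discount factor, the `"hit"` action label and the integer state encoding are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical hyperparameters; the values used in the repository are not stated here.
ALPHA = 0.1  # learning rate
GAMMA = 0.9  # discount factor
ACTIONS = ("hit", "stick")  # assumed action labels


def q_learning_update(q, state, action, reward, next_state, done):
    """Off-policy update: bootstrap from the best action available in next_state."""
    best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])


def sarsa_update(q, state, action, reward, next_state, next_action, done):
    """On-policy update: bootstrap from the action actually taken in next_state."""
    next_value = 0.0 if done else q[(next_state, next_action)]
    q[(state, action)] += ALPHA * (reward + GAMMA * next_value - q[(state, action)])


# Small demonstration on a hand-made Q-table keyed by (state, action).
demo_q = defaultdict(float)
demo_q[(15, "hit")] = 1.0
demo_q[(15, "stick")] = 2.0
q_learning_update(demo_q, 12, "hit", 0.0, 15, done=False)
```

The only difference between the two functions is the bootstrap target: Q-Learning uses the maximum Q value in the next state, while SARSA uses the Q value of the action the agent goes on to take.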
The game play is simplified in the sense that the dealer does not attempt to draw any more cards after
the initial two dealt at the start of each round. The agent is trained to achieve the highest in-round score.
The in-round score takes one value if the agent's collective hand value, the sum of the numerical values of the cards in the agent's hand, is greater than the dealer's, and is set to a different value otherwise.
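As an illustration of the collective hand value, the sketch below sums the numerical values of the cards in a hand and compares the result with the dealer's. The card encoding, the face-card value of 10, the ace counted as 11 and demoted to 1 when the total exceeds 21, and the function names are all assumptions based on the usual Blackjack convention; the dealing class may handle these differently.

```python
def hand_value(cards):
    """Sum the card values, counting aces as 11 and demoting them to 1 while the total exceeds 21."""
    total = 0
    aces = 0
    for rank in cards:
        if rank == "A":
            total += 11
            aces += 1
        elif rank in ("J", "Q", "K"):
            total += 10
        else:
            total += int(rank)
    # Demote aces one at a time until the hand is no longer above 21.
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total


def agent_beats_dealer(agent_cards, dealer_cards):
    """True when the agent's hand value is strictly greater than the dealer's."""
    return hand_value(agent_cards) > hand_value(dealer_cards)
```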
The dynamics of the game play are as follows: two cards are dealt to the agent and two to the passive dealer from a pre-specified multiple of the standard deck of 52 cards. The agent then chooses an action from the action space.
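The initial deal can be sketched as below, assuming the cards are held in a single shuffled list built from `n_decks` copies of the standard deck. The names `build_shoe` and `deal_initial` are hypothetical and are not taken from dealing.py.

```python
import random

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]


def build_shoe(n_decks, rng):
    """Return a shuffled list containing n_decks copies of the 52-card deck."""
    shoe = [(rank, suit) for _ in range(n_decks) for suit in SUITS for rank in RANKS]
    rng.shuffle(shoe)
    return shoe


def deal_initial(shoe):
    """Deal two cards to the agent and two to the passive dealer, removing them from the shoe."""
    agent = [shoe.pop(), shoe.pop()]
    dealer = [shoe.pop(), shoe.pop()]
    return agent, dealer


# Example: a three-deck shoe with a seeded generator for reproducibility.
shoe = build_shoe(3, random.Random(0))
agent_hand, dealer_hand = deal_initial(shoe)
```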
The agent is encouraged to maximise the overall game total score, which is defined in terms of S, the total number of times the agent chose to stick.
The figure below shows the agent's average overall game quadratic score for the different deck multiples and policy iteration methods.

agent.py:
This script contains a class that describes the agent's decision process, i.e. how it chooses its
action and how the Q values are updated for a given policy search method.
dealing.py:
This script contains the class for the Blackjack game play. The actions taken by the agent
interact with an instance of this class in a self-perpetuating manner; you only need to call
the newRound method to begin the game play. All other aspects of the game play (score, ace count, etc.)
are determined from the agent's actions, and the class attributes update as a consequence.
training.py:
This script contains a function that trains the agent for a given deck size, method, and exploration and exploitation
periods. Counts of wins, losses and draws for the exploration and exploitation periods are kept and returned.
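The bookkeeping described above can be sketched as follows. This is not the function in training.py: `play_game` is a hypothetical callable standing in for one full game, and treating exploration as purely random action choice and exploitation as purely greedy choice is an assumption.

```python
import random
from collections import Counter


def choose_action(q_row, actions, explore, rng):
    """Pick a random action while exploring, otherwise the action with the highest Q value."""
    if explore:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_row.get(a, 0.0))


def run_phases(play_game, n_explore, n_exploit):
    """Play n_explore exploration games followed by n_exploit exploitation games.

    play_game(explore) is a hypothetical callable that plays one game and
    returns 'win', 'draw' or 'loss'. Outcome counts are kept per phase.
    """
    counts = {"exploration": Counter(), "exploitation": Counter()}
    for game in range(n_explore + n_exploit):
        explore = game < n_explore
        phase = "exploration" if explore else "exploitation"
        counts[phase][play_game(explore)] += 1
    return counts
```

With the split quoted earlier, the call would be `run_phases(play_game, 5000, 5000)`, and the win, draw and loss rates follow by dividing each count by the number of games in its phase.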
trainingWithAllMethods.py:
This script trains the agent for ten different regimes, 1-10 decks, for each policy search method. The Q-tables for these are then saved as CSV files.
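One way to write a Q-table to CSV for each regime is sketched below using the standard `csv` module. The column layout, the `(state, action)` key format and the file-naming scheme are hypothetical; the files in QTables.zip may be laid out differently.

```python
import csv
import io

METHODS = ("QL", "SARSA", "TD")


def regime_filename(method, n_decks):
    """Hypothetical naming scheme: one CSV file per method and deck multiple."""
    return f"{method}_{n_decks}_decks.csv"


def write_q_table(q, f):
    """Write a Q-table keyed by (state, action) to an open text file, one row per entry."""
    writer = csv.writer(f)
    writer.writerow(["state", "action", "q_value"])
    for (state, action), value in sorted(q.items()):
        writer.writerow([state, action, value])


# One file name per method and deck multiple (1-10 decks).
filenames = [regime_filename(m, n) for m in METHODS for n in range(1, 11)]

# Demonstration with an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_q_table({(12, "hit"): 0.18, (12, "stick"): -0.5}, buffer)
csv_text = buffer.getvalue()
```

To write to disk, open the target with `open(path, "w", newline="")` and pass the file object to `write_q_table`.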
validating.py:
This script allows one to reduce the parameter space and request that the agent plays the dealer using the
optimal policy found from the corresponding Q-tables. Please make sure that the compressed QTables file
is extracted and placed in the working directory before running this script.
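Playing with the optimal policy amounts to reading a saved Q-table and always taking the highest-valued action. The sketch below assumes a hypothetical CSV layout with `state`, `action` and `q_value` columns and integer states; the actual format of the files in QTables.zip is not specified here.

```python
import csv
import io


def read_q_table(f):
    """Read a Q-table from an open CSV file with state, action and q_value columns."""
    q = {}
    for row in csv.DictReader(f):
        q[(int(row["state"]), row["action"])] = float(row["q_value"])
    return q


def greedy_action(q, state, actions=("hit", "stick")):
    """Return the action with the highest Q value in the given state (unseen pairs count as 0)."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Demonstration with an in-memory CSV instead of an extracted file.
sample = io.StringIO("state,action,q_value\n12,hit,0.18\n12,stick,-0.5\n")
q_table = read_q_table(sample)
best = greedy_action(q_table, 12)
```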
QTables.zip:
This compressed file contains the Q-tables for the various regimes the agent was trained on and is available for download. Please unzip this
file and place it in the working directory before running the validating.py script.
The next task is to separate the dealer from the dealing class into its own class, which will allow dealer actions to be implemented. Furthermore, there is a clear need to find ways to reduce the size of the Q-tables. For a formal review of this project, please see my ResearchGate account.
