Sketchnote by Tomomi Imura
Reinforcement learning get three main things: agent, states, and actions wey dey for each state. If agent do action for one state, e go get reward. Imagine say you dey play Super Mario game. You be Mario, you dey one level for the game, you stand near cliff edge. Coin dey above you. You be Mario, for one level, for one position... na your state be dat. If you waka go right (action), you go fall for cliff and your score go low. But if you press jump button, you go collect coin and you go still dey alive. Dat one na better result and e go give you better score.
With reinforcement learning and simulator (the game), you fit learn how to play the game well to get better reward like to dey alive and collect plenty points.
🎥 Click the image above to hear Dmitry discuss Reinforcement Learning
For this lesson, we go dey use Python code. You suppose fit run the Jupyter Notebook code for this lesson, whether for your computer or for cloud.
You fit open the lesson notebook and follow the lesson to build.
Note: If you dey open this code for cloud, you go need to fetch the
rlboard.pyfile wey dey used for the notebook code. Put am for the same folder as the notebook.
For this lesson, we go look the world of Peter and the Wolf, wey dey inspired by one musical fairy tale by one Russian composer, Sergei Prokofiev. We go use Reinforcement Learning make Peter fit explore him environment, collect sweet apples and avoid the wolf.
Reinforcement Learning (RL) na one learning method wey go help us learn the best way wey one agent go behave for one environment by doing plenty experiments. The agent for this environment suppose get one goal, wey reward function go define.
Make we make Peter world simple, e go be one square board wey get size width x height, like this:
Each cell for this board fit be:
- ground, wey Peter and other creatures fit waka on top.
- water, wey you no fit waka on top.
- tree or grass, place wey you fit rest.
- apple, wey Peter go happy to find to chop.
- wolf, wey dey dangerous and Peter suppose avoid.
One Python module dey, rlboard.py, wey get the code to work with this environment. Because this code no dey important to understand our concepts, we go just import the module and use am to create the sample board (code block 1):
from rlboard import *
width, height = 8,8
m = Board(width,height)
m.randomize(seed=13)
m.plot()This code suppose print one picture of the environment wey resemble the one wey dey above.
For our example, Peter goal na to find apple, avoid wolf and other wahala. To do this, e fit waka around until e find apple.
So, for any position, e fit choose one of these actions: up, down, left, and right.
We go define these actions as dictionary, and match them to pairs of coordinate changes. For example, to move right (R) go mean (1,0). (code block 2):
actions = { "U" : (0,-1), "D" : (0,1), "L" : (-1,0), "R" : (1,0) }
action_idx = { a : i for i,a in enumerate(actions.keys()) }To summarize, the strategy and goal for this scenario na:
-
The strategy, for our agent (Peter) na the policy. Policy na function wey dey return action for any state. For our case, the state na the board, including the current position of the player.
-
The goal, for reinforcement learning na to learn better policy wey go help us solve the problem well. But first, make we try the simplest policy wey we dey call random walk.
Make we first solve our problem by using random walk strategy. For random walk, we go dey choose the next action randomly from the allowed actions, until we reach the apple (code block 3).
-
Implement random walk with the code below:
def random_policy(m): return random.choice(list(actions)) def walk(m,policy,start_position=None): n = 0 # number of steps # set initial position if start_position: m.human = start_position else: m.random_start() while True: if m.at() == Board.Cell.apple: return n # success! if m.at() in [Board.Cell.wolf, Board.Cell.water]: return -1 # eaten by wolf or drowned while True: a = actions[policy(m)] new_pos = m.move_pos(m.human,a) if m.is_valid(new_pos) and m.at(new_pos)!=Board.Cell.water: m.move(a) # do the actual move break n+=1 walk(m,random_policy)
The call to
walksuppose return the length of the path, wey fit change from one run to another. -
Run the walk experiment many times (like 100), and print the results (code block 4):
def print_statistics(policy): s,w,n = 0,0,0 for _ in range(100): z = walk(m,policy) if z<0: w+=1 else: s += z n += 1 print(f"Average path length = {s/n}, eaten by wolf: {w} times") print_statistics(random_policy)
You go notice say the average length of the path na around 30-40 steps, wey plenty pass the average distance to the nearest apple wey be around 5-6 steps.
You fit also see how Peter dey move during random walk:
To make our policy smarter, we need to sabi which moves dey "better" pass others. To do this, we need to define our goal.
The goal fit dey defined with reward function, wey go return score for each state. The higher the number, the better the reward function. (code block 5)
move_reward = -0.1
goal_reward = 10
end_reward = -10
def reward(m,pos=None):
pos = pos or m.human
if not m.is_valid(pos):
return end_reward
x = m.at(pos)
if x==Board.Cell.water or x == Board.Cell.wolf:
return end_reward
if x==Board.Cell.apple:
return goal_reward
return move_rewardOne interesting thing about reward functions be say most times, we dey only get better reward for the end of the game. This mean say our algorithm suppose remember "better" steps wey lead to good reward for the end, and make them more important. Same way, all moves wey lead to bad results suppose dey discouraged.
The algorithm wey we go talk about na Q-Learning. For this algorithm, the policy dey defined by one function (or data structure) wey we dey call Q-Table. E dey record how "good" each action dey for one state.
E dey called Q-Table because e dey easy to represent am as table, or multi-dimensional array. Since our board get size width x height, we fit represent the Q-Table with numpy array wey get shape width x height x len(actions): (code block 6)
Q = np.ones((width,height,len(actions)),dtype=np.float)*1.0/len(actions)Notice say we dey initialize all the values for the Q-Table with equal value, for our case - 0.25. This one mean "random walk" policy, because all moves for each state dey equally good. We fit pass the Q-Table to the plot function to visualize the table on the board: m.plot(Q).
For the center of each cell, arrow dey show the preferred direction of movement. Since all directions dey equal, dot go show.
Now we need to run simulation, explore our environment, and learn better Q-Table values wey go help us find the path to the apple faster.
Once we start to move, each action go get reward, i.e. we fit choose the next action based on the highest immediate reward. But for most states, the move no go reach our goal to find apple, so we no fit decide immediately which direction better.
Remember say no be the immediate result wey matter, but the final result wey we go get for the end of the simulation.
To handle this delayed reward, we need to use dynamic programming principles wey go help us think about the problem recursively.
Suppose we dey state s, and we wan move to next state s'. If we do am, we go get immediate reward r(s,a), wey reward function define, plus future reward. If we assume say our Q-Table dey show the "goodness" of each action well, then for state s' we go choose action a wey get maximum value of Q(s',a'). So, the best future reward we fit get for state s go be maxa'Q(s',a') (maximum dey calculated for all possible actions a' for state s').
This give us Bellman formula to calculate Q-Table value for state s, given action a:
Here γ na discount factor wey dey decide how you go prefer current reward over future reward.
With the equation above, we fit write pseudo-code for our learning algorithm:
- Initialize Q-Table Q with equal numbers for all states and actions
- Set learning rate α ← 1
- Repeat simulation many times
- Start for random position
- Repeat
- Choose action a for state s
- Do action by moving to new state s'
- If we meet end-of-game condition, or total reward too small - stop simulation
- Calculate reward r for new state
- Update Q-Function with Bellman equation: Q(s,a) ← (1-α)Q(s,a)+α(r+γ maxa'Q(s',a'))
- s ← s'
- Update total reward and reduce α.
For the algorithm above, we no talk how we go choose action for step 2.1. If we dey choose action randomly, we go explore the environment randomly, and we fit die often or go places we no suppose go. Another way na to exploit Q-Table values wey we don sabi, and choose the best action (with higher Q-Table value) for state s. But this one go stop us from exploring other states, and we fit no find the best solution.
So, the best way na to balance exploration and exploitation. We fit do am by choosing action for state s with probabilities wey match Q-Table values. For the beginning, when Q-Table values dey equal, e go mean random selection, but as we learn more about the environment, we go dey follow the best route while still allowing the agent to choose unexplored path sometimes.
We don ready to implement the learning algorithm. Before we do am, we need one function wey go change Q-Table numbers to vector of probabilities for actions.
-
Create function
probs():def probs(v,eps=1e-4): v = v-v.min()+eps v = v/v.sum() return v
We dey add small
epsto the original vector to avoid division by 0 for the beginning, when all vector components dey equal.
Run the learning algorithm for 5000 experiments, wey we dey call epochs: (code block 8)
for epoch in range(5000):
# Pick initial point
m.random_start()
# Start travelling
n=0
cum_reward = 0
while True:
x,y = m.human
v = probs(Q[x,y])
a = random.choices(list(actions),weights=v)[0]
dpos = actions[a]
m.move(dpos,check_correctness=False) # we allow player to move outside the board, which terminates episode
r = reward(m)
cum_reward += r
if r==end_reward or cum_reward < -1000:
lpath.append(n)
break
alpha = np.exp(-n / 10e5)
gamma = 0.5
ai = action_idx[a]
Q[x,y,ai] = (1 - alpha) * Q[x,y,ai] + alpha * (r + gamma * Q[x+dpos[0], y+dpos[1]].max())
n+=1After we run this algorithm, the Q-Table go update with values wey show how good different actions dey for each step. We fit try visualize the Q-Table by drawing vector for each cell wey go point the direction of movement. To make am simple, we go draw small circle instead of arrow head.
Since Q-Table dey show how "good" each action dey for each state, e dey easy to use am to navigate our world well. For the simplest case, we fit choose action wey get the highest Q-Table value: (code block 9)
def qpolicy_strict(m):
x,y = m.human
v = probs(Q[x,y])
a = list(actions)[np.argmax(v)]
return a
walk(m,qpolicy_strict)If you try di code wey dey up many times, you fit notice say sometimes e go "hang", and you go need press di STOP button for di notebook to stop am. Dis one dey happen because e fit get situations wey two states go dey "point" to each oda based on di best Q-Value, and di agent go dey waka between those states forever.
Task 1: Change di
walkfunction make e get limit for di maximum length of path wey e fit waka, like 100 steps, and check how di code wey dey up go dey return dis value sometimes.
Task 2: Change di
walkfunction make e no dey go back to places wey e don already pass before. Dis one go stopwalkfrom dey loop, but di agent fit still dey "trap" for one place wey e no fit comot.
Better navigation policy na di one wey we use during training, wey join exploitation and exploration. For dis policy, we go dey choose each action with one kind probability, wey dey proportional to di values for di Q-Table. Dis strategy fit still make di agent go back to position wey e don already explore before, but as you fit see from di code wey dey down, e dey result to very short average path to di place wey we dey find (remember say print_statistics dey run di simulation 100 times): (code block 10)
def qpolicy(m):
x,y = m.human
v = probs(Q[x,y])
a = random.choices(list(actions),weights=v)[0]
return a
print_statistics(qpolicy)After you run dis code, you suppose see say di average path length go reduce well well, e go dey between 3-6.
As we don talk before, di learning process na balance between exploration and di knowledge wey we don gather about di problem space. We don see say di result of di learning (di ability to help agent find short path go di goal) don improve, but e go still make sense to check how di average path length dey behave during di learning process:
Di learnings fit summarize like dis:
-
Average path length dey increase. Wetin we dey see here be say for di beginning, di average path length dey increase. Dis one fit be because when we no sabi anything about di environment, we go likely dey trap for bad states, like water or wolf. As we dey learn more and dey use di knowledge, we fit explore di environment longer, but we still no sabi where di apples dey well.
-
Path length dey reduce as we dey learn more. Once we don sabi enough, e go dey easier for di agent to reach di goal, and di path length go start to reduce. But we still dey open to exploration, so we dey sometimes waka comot from di best path, dey try new options, wey go make di path longer pass di best one.
-
Length go increase suddenly. Wetin we dey see for dis graph be say at one point, di length go increase suddenly. Dis one dey show say di process dey stochastic, and sometimes we fit "spoil" di Q-Table coefficients by dey overwrite dem with new values. To reduce dis one, we suppose dey reduce di learning rate (like towards di end of training, we go dey adjust Q-Table values small small).
Overall, e dey important to remember say di success and quality of di learning process dey depend well well on parameters, like learning rate, learning rate decay, and discount factor. Dem dey call dem hyperparameters, to separate dem from parameters, wey we dey optimize during training (like Q-Table coefficients). Di process to find di best hyperparameter values na hyperparameter optimization, and e deserve separate topic.
Disclaimer:
Dis docu wey you dey see don use AI translation service Co-op Translator take translate am. Even though we dey try make sure say e correct, abeg no forget say machine translation fit get mistake or no too accurate. Di original docu for di language wey dem first write am na di main correct one. If na important matter, e go better make professional human translator check am. We no go fit take blame for any misunderstanding or wrong interpretation wey fit happen because of dis translation.







