Temporal Credit Assignment

Offline and Online Temporal Credit Assignment

The project explores the efficacy of different reinforcement learning algorithms for continuous, single-input-multiple-output (SIMO) bioprocess optimisation and control, simplified here to discrete time and a tabular (look-up table) representation. All algorithms and studies use the same ε-greedy exploration policy during training. The dynamic system is defined as follows:

where the state of the system, S, comprises the biomass concentration, X, and the nitrate concentration, N, and the nitrate inflow rate defines the range of controls available to the agent.
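
To make the tabular set-up and the ε-greedy exploration policy concrete, a minimal sketch is given below. The discretisation levels, array layout and function names are assumptions made purely for illustration, not the project's actual implementation.

```python
import numpy as np

# Hypothetical discretisation of the state (biomass X, nitrate N) and of the
# single control (nitrate inflow rate); the bin and action counts are
# illustrative only, not the values used in the project.
N_X_BINS, N_N_BINS, N_ACTIONS = 10, 10, 5

# Tabular state-action value function: one entry per (X bin, N bin, action).
Q = np.zeros((N_X_BINS, N_N_BINS, N_ACTIONS))

def epsilon_greedy(Q, state, epsilon=0.1):
    """Select a nitrate-inflow action for a discretised state (x_bin, n_bin)."""
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)   # explore: random control level
    return int(np.argmax(Q[state]))           # exploit: greedy control level
```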

The agent's objective function is a waste minimisation function, and therefore rewards can only be allocated at the end of each run. As a result, all updates are completed offline.
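
A minimal sketch of this offline scheme is shown below, assuming a hypothetical `env` interface in which `step` returns only the next state and a termination flag, and the waste-based reward only becomes available once the run has finished; the method names are placeholders, not the project's API.

```python
def run_episode(env, policy):
    """Roll out one run; because the waste-minimisation reward is only known
    at the terminal time, transitions are stored and credit is assigned
    to individual actions afterwards (offline)."""
    trajectory = []
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        next_state, done = env.step(action)            # assumed (next_state, done) signature
        trajectory.append((state, action, next_state))
        state = next_state
    terminal_reward = env.terminal_waste_penalty()     # hypothetical terminal objective
    return trajectory, terminal_reward
```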

The reward allocated to the action taken at time t was assigned according to four policies:

(T is the terminal time). Here, the project considers SARSA (1), Q Learning (2) and Monte Carlo Learning (3). The training curves for each are displayed below:
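
For reference, the standard tabular update rules behind these three algorithms are sketched below. This is a reconstruction from the textbook definitions rather than the project code; α denotes the learning rate and γ the discount factor.

```latex
% SARSA (on-policy temporal-difference) update
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

% Q Learning (off-policy temporal-difference) update
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

% Monte Carlo update towards the observed return G_t
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ G_t - Q(s_t, a_t) \right]
```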

Time Based Allocation

In the case of both SARSA (1) and Q Learning (2), credit assignment must be done offline after each run. This slows training, so the following results are presented for deterministic environments.
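
A sketch of what offline credit assignment looks like for SARSA is given below, assuming the stored `trajectory` and a per-step `rewards` sequence produced by the chosen allocation rule after the run; swapping the bootstrap term for `Q[s_next].max()` gives the corresponding Q Learning update. This is an illustrative reconstruction, not the project code.

```python
def offline_sarsa_update(Q, trajectory, rewards, alpha=0.1, gamma=1.0):
    """Replay a stored run and apply SARSA updates after the fact.
    rewards[t] is the credit allocated to step t by the allocation rule."""
    for t, (s, a, s_next) in enumerate(trajectory):
        if t + 1 < len(trajectory):
            a_next = trajectory[t + 1][1]              # action actually taken next
            td_target = rewards[t] + gamma * Q[s_next][a_next]
        else:
            td_target = rewards[t]                     # terminal step: no bootstrap
        Q[s][a] += alpha * (td_target - Q[s][a])
```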

1) Training curve: SARSA

2) Training curve: Q Learning

3) Training curve: Monte Carlo Learning

Clearly, SARSA performs the worst of the three algorithms, but none of them learns particularly well.

Time and Quality Based Allocation

Again, allocation is made offline at the end of each run, which slows the learning process. A deterministic environment is used for both SARSA and Q Learning due to unstable learning, while a stochastic environment is used for Monte Carlo Learning, emphasising the difference in performance between the algorithms.

1) Training curve: SARSA

2) Training curve: Q Learning

3) Training curve: Monte Carlo Learning

SARSA and Q Learning suffer because the time and quality based rule destabilises the state-action value function. Monte Carlo Learning outperforms the other two algorithms because it applies a variable learning rate and updates on the basis of the experienced return, which helps to stabilise the agent's estimate of the state-action value function; this update equates the expected return of a state-action pair to the mean observed return.

As such, temporal credit assignment approaches should avoid discounting with respect to time, because this does not provide consistent reward allocation with respect to the objective function: the same state-action pair may be experienced in a 'good' run as well as a 'bad' run. Including the variational derivative in the reward allocation stabilises training (the variance in the reward achieved is reduced).

Readers should note the similarity here between Monte Carlo Learning and Distributional Q Learning: Monte Carlo Learning in this case essentially learns the mean of a distribution of rewards. It is felt that Distributional Q Learning would enable the agent to learn within a stochastic environment under this credit assignment policy, although of course the updates here are performed offline (so it is not strictly Q Learning).
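
To make the Monte Carlo update concrete, a sketch of the running-mean rule is given below, assuming per-visit returns have already been computed for each step of the stored run; the 1/N(s, a) step size is the variable learning rate referred to above. The names are illustrative assumptions, not the project's implementation.

```python
from collections import defaultdict

visit_counts = defaultdict(int)   # N(s, a): number of observed returns per pair

def monte_carlo_update(Q, trajectory, returns):
    """Move each visited state-action value towards the mean observed return."""
    for (s, a, _), G in zip(trajectory, returns):
        visit_counts[(s, a)] += 1
        step = 1.0 / visit_counts[(s, a)]      # variable learning rate
        Q[s][a] += step * (G - Q[s][a])        # incremental running mean of returns
```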

Quality Based Allocation

As the reward allocation no longer depends upon satisfaction of some objective function, rewards may be allocated online in the case of SARSA (1) and Q Learning (2). This should better demonstrate the true utility of these algorithms.
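
A minimal sketch of the online update is shown below for Q Learning, assuming the quality-based reward `r` for the pair `(s, a)` is observed immediately after the step; the SARSA variant replaces the max over actions with the value of the action actually selected next. Again, the function and parameter names are assumptions for illustration.

```python
def online_q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """Single Q Learning update applied as soon as the quality-based reward
    for (s, a) is observed, with no need to wait for the end of the run."""
    td_target = r + gamma * Q[s_next].max()
    Q[s][a] += alpha * (td_target - Q[s][a])
```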

(1) Training curve: SARSA

(2) Training curve: Q Learning