Hi,
I am running your model on Pong and the R2D2 model does not seem to be converging at all. In contrast, your Ape-X implementation works and starts converging nicely after 2-3 hours.
Here are the results from your R2D2 implementation after training for 32 hours on a 1080 Ti with 4 workers:
Note that there are several points where your implementation differs from the papers for both Ape-X and R2D2, such as the worker epsilons being below 0.4 and held constant (which has a significant impact on convergence speed), or the DeepMind R2D2 model taking the last action and last reward as additional inputs.
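For reference, here is a rough sketch of the two paper details I mean; the function and argument names are mine, not from this repo, and the framework choice (PyTorch) is just for illustration:

```python
import torch
import torch.nn.functional as F

# Sketch only -- apex_epsilon / build_core_input are illustrative names,
# not functions from this repository.

def apex_epsilon(worker_id, num_workers, base_eps=0.4, alpha=7.0):
    """Fixed per-actor epsilon from the Ape-X paper:
    eps_i = base_eps ** (1 + worker_id / (num_workers - 1) * alpha)."""
    if num_workers <= 1:
        return base_eps
    return base_eps ** (1 + worker_id / (num_workers - 1) * alpha)

def build_core_input(obs_embedding, prev_action, prev_reward, num_actions):
    """Concatenate the frame embedding with the one-hot previous action
    and the previous reward before feeding the R2D2 recurrent core."""
    one_hot = F.one_hot(prev_action, num_actions).float()
    return torch.cat([obs_embedding, one_hot, prev_reward.unsqueeze(-1)], dim=-1)

# With 4 workers this gives worker 0 eps = 0.4 and worker 3 eps ~= 6.6e-4,
# rather than one shared constant epsilon.
```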
Did you manage to get any convergence yourself? If so, how can I replicate it?