Description
In the PPO algorithm described here [https://arxiv.org/pdf/1707.06347.pdf],
training uses (state, action, reward) tuples, with [s1, a1, r1, s2, a2, r2, ..., sn, an, rn] as the experience for training
the actor model and the critic model. In games, the PPO elements (state, action, reward) are straightforward, as we all know.
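For instance, here is a rough sketch of what I mean by "straightforward" in a game (CartPole via Gymnasium is only my own illustration, not something from the paper):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset()

experience = []  # [s1, a1, r1, s2, a2, r2, ...]
for _ in range(10):
    action = env.action_space.sample()                                # a_t: push cart left or right
    next_state, reward, terminated, truncated, _ = env.step(action)  # r_t comes from the game itself
    experience += [state, action, reward]
    state = next_state
    if terminated or truncated:
        break
```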
But I am not clear about what (state, action, reward) are in the NLP scenario.
For example, if
prompt = P1P2P3...Pn (a string with n tokens)
generated = W1W2W3...Wn (a string with n tokens)
is the action a single token, generated one at a time, like:
s1 = P1P2P3..Pn
a1 = W1
reward1=reward_model_trained_phase2(P1..PnW1) ?
then the experience is [P1..Pn, W1, R1, P1..PnW1, W2, R2, ...]?
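To make this first reading concrete, here is a toy sketch of how I imagine that per-token experience would be collected (the vocabulary, sample_next_token, and reward_model below are made-up placeholders, not names from the paper or any library):

```python
import random

# Toy stand-ins, purely hypothetical, just to make the shape of the data concrete:
VOCAB = ["W1", "W2", "W3", "W4", "<eos>"]
def sample_next_token(context):   # pretend actor/policy model
    return random.choice(VOCAB)
def reward_model(tokens):         # pretend RM trained in phase 2, scores the whole text
    return random.random()

prompt_tokens = ["P1", "P2", "P3", "P4"]   # P1..Pn

experience = []                            # [s1, a1, r1, s2, a2, r2, ...]
context = list(prompt_tokens)
while True:
    state = list(context)                  # s_t: prompt + everything generated so far
    action = sample_next_token(context)    # a_t: one single token W_t
    context.append(action)
    reward = reward_model(context)         # r_t: RM score of P1..Pn W1..W_t
    experience += [state, action, reward]
    if action == "<eos>" or len(experience) // 3 >= 8:
        break
```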
Or is the action just the full generated output:
s1 = P1P2P3..Pn
a1 = W1W2..Wn
reward1=reward_model_trained_phase2(P1..PnW1..Wn) ?
But in this way, what are s2, a2, r2 and the following sn, an, rn in the experience? Do they come from other prompt-generation pairs? And what does the critic model do in this situation? Since the reward is calculated directly by the RM model, the actor model would be instructed directly by the RM model, not by the critic model.
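For contrast, here is a toy sketch of this second reading, where the whole generation is one action and the RM gives a single reward at the end (again with made-up placeholder functions), which is exactly where I stop seeing what s2, a2, r2 and the critic are for:

```python
# Toy sketch of the "one action = full generation" reading.
def generate_full_answer(prompt_tokens):   # pretend actor: returns W1..Wn in one shot
    return ["W1", "W2", "W3", "<eos>"]
def reward_model(tokens):                  # pretend RM trained in phase 2
    return 0.7

prompt_tokens = ["P1", "P2", "P3", "P4"]

s1 = list(prompt_tokens)                   # the prompt is the only state
a1 = generate_full_answer(s1)              # the whole answer is one action
r1 = reward_model(s1 + a1)                 # one reward for the whole thing

experience = [s1, a1, r1]                  # ...then what are s2, a2, r2 supposed to be?
# Do they have to come from a different prompt, e.g. s2 = another prompt Q1..Qm?
# And the critic seems redundant here, since r1 already comes straight from the RM.
```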
Any simple and vivid example is appreciated.