Description
In the PPO algorithm described here [https://arxiv.org/pdf/1707.06347.pdf],
training uses (state, action, reward) tuples, with [s1, a1, r1, s2, a2, r2, ..., sn, an, rn] as the experience for training
the actor model and the critic model. In games, the PPO elements (state, action, reward) are straightforward, as we all know.
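For instance, here is a rough sketch of what I mean by "straightforward" in a game (CartPole via Gymnasium is only my own illustration, not something from the paper):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset()

experience = []  # [s1, a1, r1, s2, a2, r2, ...]
for _ in range(10):
    action = env.action_space.sample()                                # a_t: push cart left or right
    next_state, reward, terminated, truncated, _ = env.step(action)  # r_t comes from the game itself
    experience += [state, action, reward]
    state = next_state
    if terminated or truncated:
        break
```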
But I am not clear about what (state, action, reward) are in the NLP scenario.
For example, if
prompt = P1P2P3...Pn (a string with n tokens)
generated = W1W2W3...Wn (a string with n tokens)
is the action a single token, generated one at a time, like:
s1 = P1P2P3..Pn
a1 = W1
reward1=reward_model_trained_phase2(P1..PnW1) ?
then the experience is [P1..Pn, W1, R1, P1..PnW1, W2, R2, ...]?
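To make this first reading concrete, here is a toy sketch of how I imagine that per-token experience would be collected (the vocabulary, sample_next_token, and reward_model below are made-up placeholders, not names from the paper or any library):

```python
import random

# Toy stand-ins, purely hypothetical, just to make the shape of the data concrete:
VOCAB = ["W1", "W2", "W3", "W4", "<eos>"]
def sample_next_token(context):   # pretend actor/policy model
    return random.choice(VOCAB)
def reward_model(tokens):         # pretend RM trained in phase 2, scores the whole text
    return random.random()

prompt_tokens = ["P1", "P2", "P3", "P4"]   # P1..Pn

experience = []                            # [s1, a1, r1, s2, a2, r2, ...]
context = list(prompt_tokens)
while True:
    state = list(context)                  # s_t: prompt + everything generated so far
    action = sample_next_token(context)    # a_t: one single token W_t
    context.append(action)
    reward = reward_model(context)         # r_t: RM score of P1..Pn W1..W_t
    experience += [state, action, reward]
    if action == "<eos>" or len(experience) // 3 >= 8:
        break
```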
Or is the action just the full generated output:
s1 = P1P2P3..Pn
a1 = W1W2..Wn
reward1=reward_model_trained_phase2(P1..PnW1..Wn) ?
But in this way, what are s2, a2, r2 and the following sn, an, rn in the experience? Do they come from other prompt-generation pairs? And what does the critic model do in this situation? Since the reward is calculated directly by the RM model, the actor model would be instructed directly by the RM model, not by the critic model.
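For contrast, here is a toy sketch of this second reading, where the whole generation is one action and the RM gives a single reward at the end (again with made-up placeholder functions), which is exactly where I stop seeing what s2, a2, r2 and the critic are for:

```python
# Toy sketch of the "one action = full generation" reading.
def generate_full_answer(prompt_tokens):   # pretend actor: returns W1..Wn in one shot
    return ["W1", "W2", "W3", "<eos>"]
def reward_model(tokens):                  # pretend RM trained in phase 2
    return 0.7

prompt_tokens = ["P1", "P2", "P3", "P4"]

s1 = list(prompt_tokens)                   # the prompt is the only state
a1 = generate_full_answer(s1)              # the whole answer is one action
r1 = reward_model(s1 + a1)                 # one reward for the whole thing

experience = [s1, a1, r1]                  # ...then what are s2, a2, r2 supposed to be?
# Do they have to come from a different prompt, e.g. s2 = another prompt Q1..Qm?
# And the critic seems redundant here, since r1 already comes straight from the RM.
```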
Any simple and vivid example is appreciated.