
[Need Help] What is (state, action, reward) in the NLP scenario for PPO in deepspeed-chat #552

Open
@valkryhx

Description

In the PPO algorithm described here [https://arxiv.org/pdf/1707.06347.pdf],
training works on (state, action, reward) tuples and uses [s1,a1,r1, s2,a2,r2, ..., sn,an,rn] as the experience to train
the actor model and the critic model. The PPO elements (state, action, reward) in games are straightforward, as we know.
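For concreteness, in a game setting I would collect experience roughly like this (a toy sketch, with made-up numbers):

```python
# Toy PPO rollout in a game (e.g. CartPole); numbers are made up.
# Each step i contributes one (s_i, a_i, r_i) tuple to the experience.
trajectory = [
    # (state observation,          action, reward)
    ([0.01, 0.02, 0.00, -0.01],    1,      1.0),   # s1, a1, r1
    ([0.03, 0.21, -0.01, -0.30],   0,      1.0),   # s2, a2, r2
    # ... up to (sn, an, rn) when the episode ends
]
```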

But I am not very clear about what (state, action, reward) is here in the NLP scenario.
For example, if
prompt = P1P2P3...Pn (a string with n tokens)
generated = W1W2W3...Wm (a string with m tokens)

does the action advance one token at a time, like:
s1 = P1P2P3...Pn
a1 = W1
reward1 = reward_model_trained_phase2(P1...PnW1)?
Then the experience would be [ P1...Pn, W1, R1, P1...PnW1, W2, R2, ... ]
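In code, this first reading would look roughly like the sketch below (`reward_model` is just a dummy stand-in I made up for the phase-2 RM, not the actual DeepSpeed-Chat API):

```python
def reward_model(tokens):
    """Dummy stand-in for the phase-2 RM: returns a made-up scalar score."""
    return 0.1 * len(tokens)

prompt = ["P1", "P2", "P3"]        # toy prompt tokens
generated = ["W1", "W2", "W3"]     # toy generated tokens

# Interpretation (a): one environment step per generated token.
experience = []
state = list(prompt)
for token in generated:
    action = token
    next_state = state + [action]
    reward = reward_model(next_state)   # score the growing sequence
    experience.append((list(state), action, reward))
    state = next_state
# experience == [ (P1..Pn, W1, R1), (P1..PnW1, W2, R2), ... ]
```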

Or is the action just the full generated output:
s1 = P1P2P3...Pn
a1 = W1W2...Wm
reward1 = reward_model_trained_phase2(P1...PnW1...Wm)?
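As a sketch, this second reading would give only one tuple per prompt (again with a dummy stand-in for the RM):

```python
def reward_model(tokens):
    """Dummy stand-in for the phase-2 RM: returns a made-up scalar score."""
    return 0.1 * len(tokens)

prompt = ["P1", "P2", "P3"]
generated = ["W1", "W2", "W3"]

# Interpretation (b): the whole generation is a single action.
s1 = list(prompt)
a1 = list(generated)                     # full output as one action
r1 = reward_model(prompt + generated)    # one scalar reward for the full output
experience = [(s1, a1, r1)]              # then where do s2, a2, r2 come from?
```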
But in this way, what are s2, a2, r2 and the following sn, an, rn in the experience? Do they come from other prompt-generated pairs? And what does the critic model do in this situation? Since the reward is calculated directly by the RM, the actor model would be guided directly by the RM rather than by the critic model.

Any simple and vivid example is appreciated.
