Offline GRPO #10

@zihaolucky

Hello, I'm interested in implementing offline GRPO. Does it work well to simply use static prompts/completions/rewards? The GRPO formula itself doesn't change:

        # For batch_size=1, unpack the single feature
        prompt = features["prompt"][0]
        completions_list = features["completions"][0]
        rewards_list = features["rewards"][0]

        reward_mean = np.mean(rewards_list)
        reward_std = np.std(rewards_list)

        tokenized_examples = defaultdict(list)

        idx = indices[0]  # Since batch_size=1, indices is a single-element list
        for completion, reward in zip(completions_list, rewards_list):
            batch = self._tokenize_single(prompt, completion)

            # Append each tokenized field to the batch. Iterate over the keys of
            # `batch` (assumed to be a dict), since `tokenized_examples` starts empty.
            for key in batch:
                tokenized_examples[key].append(batch[key])

            tokenized_examples["group_id"].append(idx)
            tokenized_examples["group_size"].append(len(completions_list))

            advantage = (reward - reward_mean) / (reward_std + 1e-4)
            tokenized_examples["advantage"].append(advantage)
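
To make the advantage step above easier to check in isolation, here is a minimal standalone sketch of the same group-relative normalization. The function names `group_advantages` and `build_offline_group` and the toy prompt and rewards are hypothetical, made up for illustration. Tokenization is left out, and this is not the trainer's actual code.

```python
import numpy as np


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions.

    Mirrors the snippet above: (reward - mean) / (std + eps), using
    NumPy's default population standard deviation (ddof=0).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def build_offline_group(prompt, completions, rewards, group_id):
    """Pair each static completion with its group-relative advantage.

    Hypothetical helper: returns plain dicts rather than tokenized tensors.
    """
    if len(completions) != len(rewards):
        raise ValueError("completions and rewards must have the same length")
    advantages = group_advantages(rewards)
    return [
        {
            "prompt": prompt,
            "completion": completion,
            "advantage": float(advantage),
            "group_id": group_id,
            "group_size": len(completions),
        }
        for completion, advantage in zip(completions, advantages)
    ]


# Toy data, invented for illustration only.
examples = build_offline_group(
    prompt="2 + 2 =",
    completions=["4", "5", "22", "4"],
    rewards=[1.0, 0.0, 0.0, 1.0],
    group_id=0,
)
for example in examples:
    print(example["completion"], round(example["advantage"], 3))
```

One thing this makes visible: if every completion in a group has the same reward, all advantages are zero, so that group contributes nothing to the policy gradient.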
