Formalizations around grading/reward model #162

@jamesbraza

Description

Currently, as of v0.14.0, we have a few different techniques for grading.

In summary, we rely on Environment.step or a tool call to invoke custom grading behavior. This works fine when running entire rollouts.
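
For context, here is a hedged sketch of that status-quo pattern. The import path, the MyState model, the Gym-style (obs, reward, done, truncated) return shape, and the _grade helper are all assumptions for illustration, not aviary's exact code:

from aviary.core import Environment, Message, ToolRequestMessage
from pydantic import BaseModel


class MyState(BaseModel):
    done: bool = False


class GradedEnv(Environment[MyState]):
    async def step(
        self, action: ToolRequestMessage
    ) -> tuple[list[Message], float, bool, bool]:
        obs = await self.exec_tool_calls(action)  # Run the requested tool calls
        reward = self._grade(obs)  # Custom grading behavior lives here
        return obs, reward, self.state.done, False

    def _grade(self, obs: list[Message]) -> float:  # Hypothetical grading hook
        ...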

However, when trying to do patterns like zero-shot evaluation (e.g. no agent/rollout involved, just an LLM prompt followed by grading), we have no standard interface to use for something like a ZeroShotEvaluator. It would be nice to build something like the following:

class Environment(ABC, Generic[TEnvState]):
    ...

    # Reward to use as a placeholder without a reward model
    PLACEHOLDER_REWARD: ClassVar[float] = 0.0

    async def get_reward(self, obs: list[Message]) -> float:
        """Compute a reward given the input messages."""
        return self.PLACEHOLDER_REWARD


class HotPotQAEnv(Environment[HotPotQAEnvState]):
    ...

    async def get_reward(self, obs: list[Message]) -> float:
        answer = obs[-1].content  # Assume the answer is in the last message
        if answer is None:
            return self.incorrect_reward
        is_correct = await eval_answer(
            normalize_answer(answer),
            self.normalized_correct_answer,
            self.evaluation_mode,
        )
        return self.correct_reward if is_correct else self.incorrect_reward
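
With get_reward in place, a ZeroShotEvaluator could be a thin wrapper over any environment. This is a rough sketch; the evaluator's shape and the llm_call callable are made up for illustration:

from collections.abc import Awaitable, Callable


class ZeroShotEvaluator:
    """Grade a single LLM completion, with no agent/rollout involved."""

    def __init__(self, env: Environment) -> None:
        self.env = env

    async def evaluate(
        self, prompt: str, llm_call: Callable[[str], Awaitable[str]]
    ) -> float:
        completion = await llm_call(prompt)  # Just an LLM prompt...
        return await self.env.get_reward(  # ...then grading
            [
                Message(role="user", content=prompt),
                Message(role="assistant", content=completion),
            ]
        )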


Labels: enhancement (New feature or request)
