Formalizations around grading/reward model #162

@jamesbraza

Description

Currently, as of v0.14.0, we have a few different techniques for grading.

In summary, we rely on Environment.step or a tool call to invoke custom grading behavior. This works fine when running entire rollouts.
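
For context, here is a hedged sketch of that status-quo pattern. The import path, the MyState model, the Gym-style (obs, reward, done, truncated) return shape, and the _grade helper are all assumptions for illustration, not aviary's exact code:

from aviary.core import Environment, Message, ToolRequestMessage
from pydantic import BaseModel


class MyState(BaseModel):
    done: bool = False


class GradedEnv(Environment[MyState]):
    async def step(
        self, action: ToolRequestMessage
    ) -> tuple[list[Message], float, bool, bool]:
        obs = await self.exec_tool_calls(action)  # Run the requested tool calls
        reward = self._grade(obs)  # Custom grading behavior lives here
        return obs, reward, self.state.done, False

    def _grade(self, obs: list[Message]) -> float:  # Hypothetical grading hook
        ...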

However, when trying to do patterns like zero-shot evaluation (e.g. no agent/rollout involved, just an LLM prompt followed by grading), we have no standard interface to use for something like a ZeroShotEvaluator. It would be nice to build something like the following:

class Environment(ABC, Generic[TEnvState]):
    ...

    # Reward to use as a placeholder without a reward model
    PLACEHOLDER_REWARD: ClassVar[float] = 0.0

    async def get_reward(self, obs: list[Message]) -> float:
        """Compute a reward given the input messages."""
        return self.PLACEHOLDER_REWARD


class HotPotQAEnv(Environment[HotPotQAEnvState]):
    ...

    async def get_reward(self, obs: list[Message]) -> float:
        answer = obs[-1].content  # Assume the answer is in the last message
        if answer is None:
            return self.incorrect_reward
        is_correct = await eval_answer(
            normalize_answer(answer),
            self.normalized_correct_answer,
            self.evaluation_mode,
        )
        return self.correct_reward if is_correct else self.incorrect_reward
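
With get_reward in place, a ZeroShotEvaluator could be a thin wrapper over any environment. This is a rough sketch; the evaluator's shape and the llm_call callable are made up for illustration:

from collections.abc import Awaitable, Callable


class ZeroShotEvaluator:
    """Grade a single LLM completion, with no agent/rollout involved."""

    def __init__(self, env: Environment) -> None:
        self.env = env

    async def evaluate(
        self, prompt: str, llm_call: Callable[[str], Awaitable[str]]
    ) -> float:
        completion = await llm_call(prompt)  # Just an LLM prompt...
        return await self.env.get_reward(  # ...then grading
            [
                Message(role="user", content=prompt),
                Message(role="assistant", content=completion),
            ]
        )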


Labels: enhancement (New feature or request)
