Currently, as of v0.14.0, we have a few different techniques for grading:
- GSM8k is graded via string processing in its `submit_answer` tool: https://github.com/Future-House/aviary/blob/v0.14.0/packages/gsm8k/src/aviary/envs/gsm8k/env.py#L123-L146
- HotPotQA is graded via string processing in its `submit_answer` tool: https://github.com/Future-House/aviary/blob/v0.14.0/packages/hotpotqa/src/aviary/envs/hotpotqa/env.py#L353-L367
- paper-qa, as of Future-House/paper-qa#768 ("Moved to MultipleChoiceQuestion/MultipleChoiceEvaluation from aviary"), is graded inside `GradablePaperQAEnvironment.step` via LLM extraction of the MC option, then string processing
In summary, we rely on `Environment.step` or a tool call to invoke custom grading behavior. This works fine when doing entire rollouts. For context, the current pattern amounts to the sketch below.
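For illustration only (all names here are hypothetical, not the actual GSM8k/HotPotQA code), grading today is coupled to a tool the agent must call during a rollout:

```python
# Minimal sketch of today's pattern: grading lives inside a tool, so it
# only runs during a rollout. Names are illustrative, not aviary's API.
from dataclasses import dataclass


@dataclass
class ExampleEnvState:
    reward: float = 0.0
    done: bool = False


class ExampleEnv:
    correct_reward: float = 1.0
    incorrect_reward: float = 0.0

    def __init__(self, correct_answer: str) -> None:
        self.correct_answer = correct_answer
        self.state = ExampleEnvState()

    def submit_answer(self, answer: str) -> str:
        """Tool called by the agent; grading happens here via string processing."""
        self.state.reward = (
            self.correct_reward
            if answer.strip().lower() == self.correct_answer.strip().lower()
            else self.incorrect_reward
        )
        self.state.done = True
        return "Answer submitted."
```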
However, when trying to do patterns like zero-shot evaluation (e.g. no agent/rollout involved, just an LLM prompt and then grading), we have no standard interface to use for something like a `ZeroShotEvaluator`. It would be nice to build something like this, possibly:
```python
class Environment(ABC, Generic[TEnvState]):
    ...

    # Reward to use as a placeholder without a reward model
    PLACEHOLDER_REWARD: ClassVar[float] = 0.0

    async def get_reward(self, obs: list[Message]) -> float:
        """Compute a reward given the input messages."""
        return self.PLACEHOLDER_REWARD


class HotPotQAEnv(Environment[HotPotQAEnvState]):
    ...

    async def get_reward(self, obs: list[Message]) -> float:
        answer = obs[-1].content  # Assume answer is in the last message
        if answer is None:
            return self.incorrect_reward
        return (
            self.correct_reward
            if await eval_answer(
                normalize_answer(answer),
                self.normalized_correct_answer,
                self.evaluation_mode,
            )
            else self.incorrect_reward
        )
```
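With that in place, zero-shot evaluation could be a thin wrapper that never touches `Environment.step`. A rough sketch, where `llm_call` stands in for whatever LLM client is used (it is a placeholder, not part of aviary) and `Environment`/`Message` are as above:

```python
# Rough sketch of a ZeroShotEvaluator built on the proposed get_reward.
# llm_call is a hypothetical async helper that sends a prompt to an LLM
# and returns the completion text.

class ZeroShotEvaluator:
    def __init__(self, env: Environment) -> None:
        self.env = env

    async def evaluate(self, prompt: str) -> float:
        completion = await llm_call(prompt)  # hypothetical LLM client
        obs = [
            Message(role="user", content=prompt),
            Message(role="assistant", content=completion),
        ]
        # Grade the raw completion directly, no agent or Environment.step
        return await self.env.get_reward(obs)
```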