Implement TrajectoryRubric and ExponentialDiscountingTrajectoryRubric #338
## Summary
Initial implementation of trajectory-based rubrics for delayed rewards, as specified in RFC 004's "Delayed Rewards" section (see #337).
## What's Implemented
### New Files
- `src/openenv/core/rubrics/__init__.py`
- `src/openenv/core/rubrics/base.py`: `Rubric` base class with an nn.Module-like API
- `src/openenv/core/rubrics/trajectory.py`: `TrajectoryRubric` and `ExponentialDiscountingTrajectoryRubric`
- `tests/core/test_rubrics/test_base_rubric.py`
- `tests/core/test_rubrics/test_trajectory_rubric.py`

### `Rubric` Base Class
- `forward(action, observation) -> float`: Abstract method to implement
- `__call__()`: Sync evaluation with pre/post hooks
- `children()`, `named_children()`, `rubrics()`, `named_rubrics()`: Iteration over child rubrics
- `get_rubric(path)`: Access nested rubrics by dot-separated path
- `state_dict()` / `load_state_dict()`: Serialization support
- `last_score`: Tracks the most recent evaluation result
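For orientation, here is a minimal self-contained sketch of what this API shape implies. It is a toy re-implementation for illustration, not the actual `base.py` code, and the internals shown (such as the `_children` dict) are assumptions:

```python
# Toy sketch of the nn.Module-like API above; illustration only, not the
# actual openenv implementation. Internal details (_children) are assumptions.
from abc import ABC, abstractmethod


class Rubric(ABC):
    def __init__(self):
        self._children = {}      # name -> child Rubric (assumed internal layout)
        self.last_score = None   # most recent evaluation result

    @abstractmethod
    def forward(self, action, observation):
        """Subclasses implement the actual scoring logic and return a float."""

    def __call__(self, action, observation):
        # Sync evaluation; pre/post hooks would wrap this call.
        score = self.forward(action, observation)
        self.last_score = score
        return score

    def named_children(self):
        yield from self._children.items()

    def get_rubric(self, path):
        # Resolve nested rubrics by a dot-separated path, e.g. "outer.inner".
        rubric = self
        for name in path.split("."):
            rubric = rubric._children[name]
        return rubric

    def state_dict(self):
        # Recursively collect child state for serialization.
        return {name: child.state_dict() for name, child in self.named_children()}
```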
### `TrajectoryRubric`

- Accumulates `(action, observation)` pairs internally
- Returns `intermediate_reward` until `observation.done=True`
- `score_trajectory(trajectory)`: Compute the final score
- `compute_step_rewards()`: Define the credit-assignment strategy
- `reset()`: Clear the trajectory on `env.reset()`
- `trajectory`: Read-only property for the current trajectory
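A toy version of that accumulate-until-done lifecycle, with an invented `Observation` type (the real one lives in openenv), might look like:

```python
# Toy sketch of the accumulate-until-done lifecycle described above; the
# Observation type is invented for illustration and is not the openenv type.
from collections import namedtuple

Observation = namedtuple("Observation", ["done", "reward"])


class ToyTrajectoryRubric:
    def __init__(self, intermediate_reward=0.0):
        self.intermediate_reward = intermediate_reward
        self._trajectory = []

    @property
    def trajectory(self):
        return tuple(self._trajectory)  # read-only view of (action, observation)

    def reset(self):
        self._trajectory.clear()  # call alongside env.reset()

    def score_trajectory(self, trajectory):
        # Subclasses define the final score; here, the last observed reward.
        _, last_obs = trajectory[-1]
        return last_obs.reward

    def __call__(self, action, observation):
        self._trajectory.append((action, observation))
        if not observation.done:
            return self.intermediate_reward  # no learning signal mid-episode
        return self.score_trajectory(self._trajectory)
```

The key design point is that the rubric, not the environment, owns the trajectory buffer, so environments only need to forward `(action, observation)` pairs.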
### `ExponentialDiscountingTrajectoryRubric`

Computes per-step rewards as `r_t = gamma^(T-1-t) * R_final`:

- `gamma=1.0`: Equal credit to all steps
- `gamma=0.0`: Only the final step gets reward
- `gamma=0.99`: Standard RL discounting (later steps get more)
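As a quick sanity check on the formula, here is the schedule computed by hand for a hypothetical four-step episode with `R_final = 1.0`:

```python
# r_t = gamma^(T-1-t) * R_final for a 4-step episode with R_final = 1.0
T, R_final = 4, 1.0

for gamma in (1.0, 0.0, 0.99):
    rewards = [gamma ** (T - 1 - t) * R_final for t in range(T)]
    print(gamma, [round(r, 4) for r in rewards])

# gamma=1.0  -> [1.0, 1.0, 1.0, 1.0]         equal credit to all steps
# gamma=0.0  -> [0.0, 0.0, 0.0, 1.0]         only the final step
# gamma=0.99 -> [0.9703, 0.9801, 0.99, 1.0]  later steps get more
```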
## Current Status

This PR provides the core infrastructure for trajectory-based rubrics. The classes are fully functional and tested, but not yet integrated with environments.
### What Works
### What's Missing (Follow-up PRs)
- Update the `Environment` base class to require a `rubric` attribute and call the rubric during `step()` (see the sketch after this list)
- Container rubrics: `Sequential`, `Gate`, `WeightedSum`, `RubricList`, `RubricDict`
- Example migration of `connect4_env` or `openspiel_env`
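A plausible shape for that integration, purely as a sketch (the `_transition` helper here is invented, and the actual `Environment` API may differ; this work is deferred to PR 4):

```python
# Hypothetical sketch of the proposed integration; _transition is an invented
# placeholder and the real Environment API may look different (PR 4).
class Environment:
    def __init__(self, rubric):
        self.rubric = rubric  # would become a required attribute

    def _transition(self, action):
        # Placeholder for real environment dynamics.
        raise NotImplementedError

    def step(self, action):
        observation = self._transition(action)
        reward = self.rubric(action, observation)  # rubric scores every step
        return observation, reward

    def reset(self):
        self.rubric.reset()  # keep the rubric's trajectory in sync
```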
## Follow-up Plan

- PR 3: Container Rubrics (next)
- PR 4: Environment Integration
- PR 5: Example Migration (migrate an existing game environment to use `ExponentialDiscountingTrajectoryRubric`)
- PR 6: LLMJudge (future)
## Test Plan
- `Rubric` base class (15 tests)
- `TrajectoryRubric` (23 tests)

## Usage Example
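A minimal sketch of intended usage. The import path follows the file layout above, but the `gamma` keyword and the `Observation` stand-in are assumptions rather than the PR's exact API:

```python
# Usage sketch; import path inferred from the file layout above, and the
# gamma keyword plus Observation stand-in are assumptions, not the exact API.
from collections import namedtuple

from openenv.core.rubrics.trajectory import ExponentialDiscountingTrajectoryRubric

Observation = namedtuple("Observation", ["done", "reward"])  # stand-in type

rubric = ExponentialDiscountingTrajectoryRubric(gamma=0.99)

# Intermediate steps return intermediate_reward until observation.done=True.
rubric("move_a", Observation(done=False, reward=0.0))
final_score = rubric("move_b", Observation(done=True, reward=1.0))

print(final_score, rubric.last_score)  # the result is also tracked on the rubric
rubric.reset()  # clear the stored trajectory before the next episode
```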
Depends on: #337