
Conversation


@Darktex Darktex commented Jan 28, 2026

Summary

Initial implementation of trajectory-based rubrics for delayed rewards, as specified in RFC 004's "Delayed Rewards" section (see #337).

What's Implemented

New Files

| File | Description |
| --- | --- |
| src/openenv/core/rubrics/__init__.py | Package exports |
| src/openenv/core/rubrics/base.py | Rubric base class with nn.Module-like API |
| src/openenv/core/rubrics/trajectory.py | TrajectoryRubric and ExponentialDiscountingTrajectoryRubric |
| tests/core/test_rubrics/test_base_rubric.py | Base class tests (15 tests) |
| tests/core/test_rubrics/test_trajectory_rubric.py | Trajectory rubric tests (23 tests) |

Rubric Base Class

  • forward(action, observation) -> float: Abstract method to implement
  • __call__(): Synchronous evaluation with pre/post hooks
  • Child auto-registration when rubrics are assigned as attributes
  • children(), named_children(), rubrics(), named_rubrics(): Iteration over child rubrics
  • get_rubric(path): Access nested rubrics by dot-separated path
  • state_dict() / load_state_dict(): Serialization support
  • last_score: Tracks the most recent evaluation result (see the usage sketch after this list)
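
To make this concrete, here is a minimal sketch of subclassing Rubric. It is not code from this PR: it assumes Rubric is exported from openenv.core.rubrics, and the rubric names are made up purely for illustration.

from openenv.core.rubrics import Rubric  # assumption: Rubric is exported by the package

class LengthPenaltyRubric(Rubric):
    """Hypothetical leaf rubric that penalizes long actions."""

    def forward(self, action, observation) -> float:
        # Assumes the action has a meaningful str() representation
        return max(0.0, 1.0 - 0.01 * len(str(action)))

class CompositeRubric(Rubric):
    """Hypothetical parent rubric showing child auto-registration."""

    def __init__(self):
        super().__init__()
        self.length = LengthPenaltyRubric()  # attribute assignment registers a child rubric

    def forward(self, action, observation) -> float:
        return self.length(action, observation)

rubric = CompositeRubric()
score = rubric("a fairly long action string", None)  # __call__ runs pre/post hooks around forward()
print(rubric.last_score)                 # most recent evaluation result
print(rubric.get_rubric("length"))       # nested access via dot-separated path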

TrajectoryRubric

  • Accumulates (action, observation) pairs internally
  • Returns intermediate_reward until observation.done=True
  • Abstract score_trajectory(trajectory): Compute the final score
  • Abstract compute_step_rewards(): Define the credit assignment strategy (see the sketch after this list)
  • reset(): Clears the trajectory on env.reset()
  • trajectory: Read-only property for the current trajectory
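
As a sketch of a custom credit-assignment strategy (not code from this PR; it assumes TrajectoryRubric is exported and that compute_step_rewards() takes no arguments, per the list above):

from openenv.core.rubrics import TrajectoryRubric  # assumption: exported by the package

class LastStepOnlyRubric(TrajectoryRubric):
    """Hypothetical rubric that gives the entire final score to the last step."""

    def score_trajectory(self, trajectory) -> float:
        _, final_obs = trajectory[-1]
        # Hypothetical success flag on the final observation
        return 1.0 if getattr(final_obs, "success", False) else 0.0

    def compute_step_rewards(self) -> list:
        if not self.trajectory:
            return []
        final = self.score_trajectory(self.trajectory)
        # Zero reward for every step except the last
        return [0.0] * (len(self.trajectory) - 1) + [final]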

ExponentialDiscountingTrajectoryRubric

  • Standard gamma-based discounting: r_t = gamma^(T-1-t) * R_final
  • gamma=1.0: Equal credit to all steps
  • gamma=0.0: Only the final step gets reward
  • gamma=0.99: Standard RL discounting; later steps receive more credit (worked values after this list)
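
For intuition, these are the values the formula yields for a three-step episode (T=3) with a final score of 1.0; this is plain arithmetic, not code from the PR.

# r_t = gamma**(T - 1 - t) * R_final, evaluated for a 3-step episode
T, R_final = 3, 1.0
for gamma in (1.0, 0.99, 0.0):
    print(gamma, [gamma ** (T - 1 - t) * R_final for t in range(T)])
# 1.0  -> [1.0, 1.0, 1.0]        equal credit to all steps
# 0.99 -> [0.9801, 0.99, 1.0]    later steps receive more credit (values rounded)
# 0.0  -> [0.0, 0.0, 1.0]        only the final step is rewarded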

Current Status

This PR provides the core infrastructure for trajectory-based rubrics. The classes are fully functional and tested, but not yet integrated with environments.

What Works

  • Creating custom trajectory rubrics by subclassing
  • Accumulating trajectories during episodes
  • Computing discounted per-step rewards
  • State serialization/deserialization
  • Hook-based observability
  • 38 tests all passing

What's Missing (Follow-up PRs)

| PR | Description | Dependencies |
| --- | --- | --- |
| Environment Integration | Update Environment base class to require rubric attribute and call rubric during step() | This PR |
| Container Rubrics | Sequential, Gate, WeightedSum, RubricList, RubricDict | This PR |
| LLMJudge | Rubric that calls LLM via MCP for evaluation | Container Rubrics |
| Example Migration | Add trajectory rubric to connect4_env or openspiel_env | Environment Integration |

Follow-up Plan

PR 3: Container Rubrics (next)

# New containers for rubric composition
Sequential(*rubrics)      # Fail-fast chain
Gate(rubric, threshold)   # Threshold gating  
WeightedSum(rubrics, weights)  # Weighted combination
RubricList(rubrics)       # Dynamic list container
RubricDict({name: rubric})  # Named rubric dispatch

PR 4: Environment Integration

class Environment(Generic[ActT, ObsT, StateT]):
    rubric: Rubric  # Required - must be set in __init__

    def step(self, action: ActT) -> ObsT:
        # ... execute action ...
        reward = self.rubric(action, observation)
        return observation.with_reward(reward)

PR 5: Example Migration

Migrate an existing game environment to use ExponentialDiscountingTrajectoryRubric:

class Connect4Rubric(ExponentialDiscountingTrajectoryRubric):
    def score_trajectory(self, trajectory):
        _, final_obs = trajectory[-1]
        if final_obs.winner == 'agent':
            return 1.0
        elif final_obs.winner == 'opponent':
            return 0.0
        return 0.5  # Draw

PR 6: LLMJudge (future)

class LLMJudge(Rubric):
    def __init__(self, prompt_template: str, endpoint: str):
        ...
    
    def forward(self, action, observation) -> float:
        # Call LLM via MCP for evaluation
        ...

Test Plan

  • Unit tests for Rubric base class (15 tests)
  • Unit tests for TrajectoryRubric (23 tests)
  • Various gamma values (0, 0.5, 0.99, 1.0); a sketch of one such test follows this list
  • Win/loss/draw outcomes
  • Edge cases (empty trajectory, single step, 100-step episodes)
  • State serialization roundtrip
  • Hook invocation on each step
  • Reset clears trajectory
  • Formatting check passes
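
As a rough illustration of the gamma-value tests referenced above (the actual tests under tests/core/test_rubrics/ may be structured differently; the observation stand-in and rubric subclass here are assumptions):

import pytest
from types import SimpleNamespace
from openenv.core.rubrics import ExponentialDiscountingTrajectoryRubric

class _AlwaysWinRubric(ExponentialDiscountingTrajectoryRubric):
    def score_trajectory(self, trajectory) -> float:
        return 1.0  # treat every episode as a win for this test

@pytest.mark.parametrize("gamma,expected", [
    (1.0, [1.0, 1.0, 1.0]),   # equal credit to all steps
    (0.0, [0.0, 0.0, 1.0]),   # only the final step is rewarded
])
def test_exponential_discounting(gamma, expected):
    rubric = _AlwaysWinRubric(gamma=gamma)
    for t in range(3):
        obs = SimpleNamespace(done=(t == 2))  # minimal observation stand-in with a done flag
        rubric(f"action-{t}", obs)
    assert rubric.compute_step_rewards() == pytest.approx(expected)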

Usage Example

from openenv.core.rubrics import ExponentialDiscountingTrajectoryRubric

class ChessRubric(ExponentialDiscountingTrajectoryRubric):
    def score_trajectory(self, trajectory):
        _, final_obs = trajectory[-1]
        outcome = final_obs.metadata.get('winner')
        if outcome == 'agent': return 1.0
        elif outcome == 'opponent': return 0.0
        return 0.5  # Draw

# Usage in environment
rubric = ChessRubric(gamma=0.99)
for action, obs in episode:
    reward = rubric(action, obs)  # 0.0 until done
step_rewards = rubric.compute_step_rewards()  # Discounted rewards
rubric.reset()  # Ready for next episode

Depends on: #337

@meta-cla meta-cla bot added the CLA Signed label Jan 28, 2026

greptile-apps bot commented Jan 28, 2026

Greptile Overview

Greptile Summary

This PR implements the core infrastructure for trajectory-based rubrics as specified in RFC 004's "Delayed Rewards" section. The implementation introduces two main classes:

Key Changes:

  • Rubric base class (base.py): Abstract base with nn.Module-inspired API - implements forward(), child auto-registration, pre/post hooks, and state serialization
  • TrajectoryRubric (trajectory.py): Abstract base for delayed reward computation - accumulates (action, observation) pairs internally and computes final score when done=True
  • ExponentialDiscountingTrajectoryRubric: Concrete implementation with standard gamma-based temporal discounting (r_t = gamma^(T-1-t) * R_final)

Design Alignment:

  • Follows RFC 004 specification exactly
  • Rewards remain inside environment boundary (server-side only)
  • No agent exposure - rubrics are internal environment components
  • Not yet integrated with Environment base class (planned for follow-up PR)
  • Memory-conscious: trajectories stored in CPU memory only

Test Coverage:

  • 38 tests total across base and trajectory rubrics
  • Covers edge cases: empty trajectories, single-step episodes, 100-step episodes
  • Tests various gamma values (0.0, 0.5, 0.99, 1.0)
  • Validates hooks, state serialization, reset behavior

Status:
This is pure infrastructure - no breaking changes, no environment integration yet. Follow-up PRs will add container rubrics (Sequential, Gate, WeightedSum) and integrate with Environment base class.

Confidence Score: 5/5

  • This PR is safe to merge: it adds pure infrastructure with no integration or breaking changes
  • Score reflects that this is well-designed infrastructure code that exactly matches the RFC 004 specification, has comprehensive test coverage (38 tests), introduces no breaking changes, and adds no environment integration yet. Code quality is high, with proper abstractions, error handling, and documentation.
  • No files require special attention - all implementations are clean and well-tested

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/openenv/core/rubrics/base.py | Implements nn.Module-like base class with forward(), hooks, child registration, and state serialization; well-structured |
| src/openenv/core/rubrics/trajectory.py | Trajectory accumulation with TrajectoryRubric base and ExponentialDiscountingTrajectoryRubric implementation; matches RFC 004 spec exactly |
| tests/core/test_rubrics/test_trajectory_rubric.py | Extensive tests for trajectory rubrics covering accumulation, discounting, reset behavior, edge cases, and various gamma values (23 tests) |

Sequence Diagram

sequenceDiagram
    participant Env as Environment
    participant TR as TrajectoryRubric
    participant Trajectory as Internal Trajectory Buffer
    
    Note over Env,Trajectory: Episode Start
    Env->>TR: reset()
    TR->>Trajectory: Clear buffer []
    
    Note over Env,Trajectory: Step 1 (not done)
    Env->>TR: __call__(action1, obs1)
    TR->>TR: forward(action1, obs1)
    TR->>Trajectory: Append (action1, obs1)
    TR-->>Env: Return intermediate_reward (0.0)
    
    Note over Env,Trajectory: Step 2 (not done)
    Env->>TR: __call__(action2, obs2)
    TR->>TR: forward(action2, obs2)
    TR->>Trajectory: Append (action2, obs2)
    TR-->>Env: Return intermediate_reward (0.0)
    
    Note over Env,Trajectory: Step 3 (done=True)
    Env->>TR: __call__(action3, obs3_done)
    TR->>TR: forward(action3, obs3_done)
    TR->>Trajectory: Append (action3, obs3_done)
    TR->>TR: score_trajectory(trajectory)
    Note right of TR: Subclass implements<br/>scoring logic
    TR-->>Env: Return final_score (e.g., 1.0)
    
    Note over Env,Trajectory: Post-Episode
    Env->>TR: compute_step_rewards()
    TR->>TR: Apply discounting strategy
    Note right of TR: ExponentialDiscounting:<br/>r_t = gamma^(T-1-t) * R_final
    TR-->>Env: [r_0, r_1, r_2]
