Conversation


zafstojano commented Jan 25, 2026

Summary

Hey there, I am one of the core contributors to Reasoning Gym, a suite of 100+ environments with verifiable rewards. I would be really happy to contribute this set of procedural data generators to OpenEnv!

Since these are all single-step environments, I went with the following design philosophy:

  • When the user connects to an environment, calling env.reset(...) instantiates the underlying dataset with the passed arguments:
    env.reset(
        dataset_name='figlet_font',
        dataset_config={"min_word_len": 4, "max_word_len": 6},
        seed=42,
        size=20
    )
  • Since it is a single-step environment, the episode finishes (done=True) immediately after a step. At that point, simply calling env.reset() with no arguments yields the next generated sample from the previously instantiated dataset.
  • Once the user calls env.reset(...) with a new dataset config, the environment re-instantiates the dataset and continues yielding data from there (a sketch of this reset/step logic follows this list):
    # Create composite dataset - note dataset_specs is list of dicts
    # User constructs the specs according to reasoning_gym API
    dataset_specs = [
        {"name": "leg_counting", "weight": 2, "config": {}},
        {"name": "figlet_font", "weight": 1, "config": {"min_word_len": 4, "max_word_len": 6}},
    ]
    result = env.reset(
        dataset_name='composite',
        dataset_specs=dataset_specs,
        seed=42,
        size=30
    )
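
For illustration, here is a minimal sketch of how the server side could implement this behaviour on top of reasoning_gym. The class and attribute names (_SingleStepSketch, _dataset, _current_entry) are hypothetical and not the PR's actual code; the real implementation lives in the environment server.

import reasoning_gym


class _SingleStepSketch:
    """Hypothetical illustration of the reset/step semantics described above."""

    def __init__(self):
        self._dataset = None        # rebuilt only when a new config is passed
        self._iterator = None
        self._current_entry = None  # entry the pending question came from

    def reset(self, dataset_name=None, dataset_config=None, seed=None, size=None):
        if dataset_name is not None:
            # New arguments -> re-instantiate the reasoning_gym dataset.
            cfg = dict(dataset_config or {})
            if seed is not None:
                cfg["seed"] = seed
            if size is not None:
                cfg["size"] = size
            self._dataset = reasoning_gym.create_dataset(dataset_name, **cfg)
            self._iterator = iter(self._dataset)
        try:
            self._current_entry = next(self._iterator)
        except StopIteration:
            # Dataset exhausted: wrap around to the first question.
            self._iterator = iter(self._dataset)
            self._current_entry = next(self._iterator)
        return {"question": self._current_entry["question"], "done": False}

    def step(self, answer: str):
        # Single-step episode: grade the answer and end immediately.
        score = self._dataset.score_answer(answer, entry=self._current_entry)
        return {
            "score": score,
            "correct_answer": self._current_entry["answer"],
            "done": True,
        }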

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • New environment
  • Refactoring

Alignment Checklist

Before submitting, verify:

  • I have read .claude/docs/PRINCIPLES.md and this PR aligns with our principles
  • I have checked .claude/docs/INVARIANTS.md and no invariants are violated
  • I have run /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

RFC Status

  • Not required (bug fix, docs, minor refactoring)
  • RFC exists: #___
  • RFC needed (will create before merge)

Test Plan

After building the Docker image, I created a small script to exercise the calls to the environment:

Sample script
from reasoning_gym_env import ReasoningGymAction, ReasoningGymObservation, ReasoningGymEnv


def test_simple_dataset():
    """Test simple dataset with leg_counting questions."""
    print("\n" + "="*70)
    print("TEST 1: Simple Dataset (leg_counting)")
    print("="*70)
    
    env = ReasoningGymEnv(base_url="http://localhost:8000")

    print("\n[1/3] Creating dataset with 10 leg_counting questions...")
    result = env.reset(
        dataset_name='leg_counting',
        seed=42,
        size=10
    )
    print(f"  → First question received: {result.observation.question}")

    print("\n[2/3] Submitting answer '4'...")    
    result = env.step(ReasoningGymAction(answer="4"))
    print(f"  → Score: {result.observation.score}")
    print(f"  → Correct answer: {result.observation.correct_answer}")
    print(f"  → Episode done: {result.done}")

    print("\n[3/3] Getting next question from same dataset...")
    result = env.reset()
    print(f"  → Next question: {result.observation.question}")

    env.close()
    print("\n✓ Simple dataset test complete")


def test_simple_dataset_with_config():
    """Test simple dataset with configuration."""
    print("\n" + "="*70)
    print("TEST 2: Simple Dataset with Config (figlet_font)")
    print("="*70)
    
    env = ReasoningGymEnv(base_url="http://localhost:8000")

    print("\n[1/1] Creating figlet_font dataset with config...")
    print("  → Config: min_word_len=4, max_word_len=6")
    print("  → Dataset size: 20 questions")
    result = env.reset(
        dataset_name='figlet_font',
        dataset_config={"min_word_len": 4, "max_word_len": 6},
        seed=42,
        size=20
    )
    print(f"  → Question received:")
    print(f"     {result.observation.question}")

    env.close()
    print("\n✓ Simple dataset with config test complete")


def test_composite_dataset():
    """Test composite dataset with multiple question types."""
    print("\n" + "="*70)
    print("TEST 3: Composite Dataset (leg_counting + figlet_font)")
    print("="*70)
    
    env = ReasoningGymEnv(base_url="http://localhost:8000")

    # Create composite dataset - note dataset_specs is list of dicts
    # User constructs the specs according to reasoning_gym API
    dataset_specs = [
        {"name": "leg_counting", "weight": 2, "config": {}},
        {"name": "figlet_font", "weight": 1, "config": {"min_word_len": 4, "max_word_len": 6}},
    ]

    print("\n[1/2] Creating composite dataset...")
    print("  → Dataset specs:")
    print(f"     - leg_counting (weight: 2)")
    print(f"     - figlet_font (weight: 1, min_word_len=4, max_word_len=6)")
    print(f"  → Total size: 30 questions")
    result = env.reset(
        dataset_name='composite',
        dataset_specs=dataset_specs,
        seed=42,
        size=30
    )
    print(f"  → Question received: {result.observation.question}...")
    print(f"  → Dataset metadata: {result.observation.dataset_metadata}")

    print("\n[2/2] Submitting answer 'my answer'...")
    result = env.step(ReasoningGymAction(answer="my answer"))
    print(f"  → Score: {result.observation.score}")
    print(f"  → Correct answer: {result.observation.correct_answer}")
    print(f"  → Episode done: {result.done}")

    env.close()
    print("\n✓ Composite dataset test complete")


def test_dataset_persistence():
    """Test dataset persistence across multiple resets."""
    print("\n" + "="*70)
    print("TEST 4: Dataset Persistence & Iterator Looping")
    print("="*70)
    
    env = ReasoningGymEnv(base_url="http://localhost:8000")

    print("\n[Step 1] Creating dataset with 5 questions (seed=42)...")
    result = env.reset(
        dataset_name='leg_counting',
        seed=42,
        size=5
    )
    print(f"  → Question 1: {result.observation.question}")

    print("\n[Step 2] Calling reset() with no params - should get question 2...")
    result = env.reset()  # No params - get question 2 from same dataset
    print(f"  → Question 2: {result.observation.question}")
    
    print("\n[Step 3] Getting question 3...")
    result = env.reset()  # Question 3
    print(f"  → Question 3: {result.observation.question}")
    
    print("\n[Step 4] Getting question 4...")
    result = env.reset()  # Question 4
    print(f"  → Question 4: {result.observation.question}")
    
    print("\n[Step 5] Getting question 5 (last question)...")
    result = env.reset()  # Question 5
    print(f"  → Question 5: {result.observation.question}")
    
    print("\n[Step 6] Getting next question - should loop back to question 1...")
    result = env.reset()  # Question 1 again (iterator loops)
    print(f"  → Question (looped): {result.observation.question}")

    print("\n[Step 7] Creating new dataset with different seed (seed=99, size=10)...")
    print("  → This should rebuild the dataset entirely")
    result = env.reset(
        dataset_name='leg_counting',
        seed=99,
        size=10
    )
    print(f"  → New dataset question: {result.observation.question}")

    env.close()
    print("\n✓ Dataset persistence test complete")


if __name__ == "__main__":
    import traceback
    
    tests = [
        ("Simple Dataset", test_simple_dataset),
        ("Simple Dataset with Config", test_simple_dataset_with_config),
        ("Composite Dataset", test_composite_dataset),
        ("Dataset Persistence", test_dataset_persistence),
    ]
    
    results = []
    
    for test_name, test_func in tests:
        try:
            test_func()
            results.append((test_name, "PASSED", None))
        except Exception as e:
            results.append((test_name, "FAILED", e))
            print(f"\n❌ {test_name} FAILED with error:")
            print(f"   {str(e)}")
            traceback.print_exc()
    
    # Print summary
    print("\n" + "="*70)
    print("TEST SUMMARY")
    print("="*70)
    for test_name, status, error in results:
        icon = "✅" if status == "PASSED" else "❌"
        print(f"{icon} {test_name}: {status}")
        if error:
            print(f"   Error: {str(error)}")
    
    passed = sum(1 for _, status, _ in results if status == "PASSED")
    total = len(results)
    print(f"\nResults: {passed}/{total} tests passed")
    
    if passed == total:
        print("🎉 All tests passed!")
    else:
        print(f"⚠️  {total - passed} test(s) failed")
Script Output
======================================================================
TEST 1: Simple Dataset (leg_counting)
======================================================================

[1/3] Creating dataset with 10 leg_counting questions...
  → First question received: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 3 sea slugs, 12 deers, 2 giraffes, 11 elephants?


[2/3] Submitting answer '4'...
  → Score: 0.0
  → Correct answer: 100
  → Episode done: True

[3/3] Getting next question from same dataset...
  → Next question: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 6 sheeps, 11 dogs, 12 praying mantiss?


✓ Simple dataset test complete

======================================================================
TEST 2: Simple Dataset with Config (figlet_font)
======================================================================

[1/1] Creating figlet_font dataset with config...
  → Config: min_word_len=4, max_word_len=6
  → Dataset size: 20 questions
  → Question received:
     What word does this say?

                                                     
                                                     
##         ####  ######   ######   #######  ######   
##          ##   ##   ##  ##   ##  ##       ##   ##  
##          ##   ######   ######   #####    ######   
##          ##   ##       ##       ##       ## ##    
#######    ####  ##       ##       #######  ##  ##   
                                                     


✓ Simple dataset with config test complete

======================================================================
TEST 3: Composite Dataset (leg_counting + figlet_font)
======================================================================

[1/2] Creating composite dataset...
  → Dataset specs:
     - leg_counting (weight: 2)
     - figlet_font (weight: 1, min_word_len=4, max_word_len=6)
  → Total size: 30 questions
  → Question received: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 6 sheeps, 11 dogs, 12 praying mantiss?
...
  → Dataset metadata: None

[2/2] Submitting answer 'my answer'...
  → Score: 0.0
  → Correct answer: 140
  → Episode done: True

✓ Composite dataset test complete

======================================================================
TEST 4: Dataset Persistence & Iterator Looping
======================================================================

[Step 1] Creating dataset with 5 questions (seed=42)...
  → Question 1: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 3 sea slugs, 12 deers, 2 giraffes, 11 elephants?


[Step 2] Calling reset() with no params - should get question 2...
  → Question 2: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 6 sheeps, 11 dogs, 12 praying mantiss?


[Step 3] Getting question 3...
  → Question 3: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 2 crabs, 10 lobsters, 1 human, 2 cows, 3 bees, 13 elephants, 9 dogs, 12 snakes, 5 shrimps?


[Step 4] Getting question 4...
  → Question 4: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 2 grasshoppers, 8 spiders, 1 tiger, 2 chickens, 5 starfishs, 13 ants, 2 snakes?


[Step 5] Getting question 5 (last question)...
  → Question 5: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 3 wasps, 10 jellyfishs, 9 elephants, 13 crabs?


[Step 6] Getting next question - should loop back to question 1...
  → Question (looped): Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 3 sea slugs, 12 deers, 2 giraffes, 11 elephants?


[Step 7] Creating new dataset with different seed (seed=99, size=10)...
  → This should rebuild the dataset entirely
  → New dataset question: Your task is to count how many legs there are in total when given a list of animals.

Now, how many legs are there in total if you have 12 bees, 7 horses, 9 cows, 11 elephants, 12 giraffes, 9 ducks, 2 woodlouses, 10 jellyfishs, 8 spiders?


✓ Dataset persistence test complete

======================================================================
TEST SUMMARY
======================================================================
✅ Simple Dataset: PASSED
✅ Simple Dataset with Config: PASSED
✅ Composite Dataset: PASSED
✅ Dataset Persistence: PASSED

Results: 4/4 tests passed
🎉 All tests passed!

Claude Code Review

Alignment Review Report                                                           
                                                                                  
Automated Checks                                                                  
                                                                                  
- Lint: ✅ PASS - 80 files already formatted                                      
- Debug code: ⚠️ FOUND - Multiple print statements detected (see details below)   
                                                                                  
Open RFCs Context                                                                 
                                                                                  
All RFCs are in "In Review" status:                                               
- RFC 000: Project Phases (In Review)                                             
- RFC 001: Abstractions (In Review)                                               
- RFC 002: Environment Spec (In Review)                                           
- RFC 003: MCP Support (In Review)                                                
                                                                                  
No direct conflicts identified between these changes and active RFCs.             
                                                                                  
Tier 1: Fixes Required                                                            
                                                                                  
Debug Code Analysis:
The check-debug.sh script flagged many print statements. However, upon analysis of the diff, all print statements in the new reasoning_gym_env code are in docstrings/examples, not in actual executable code:

- Lines 76, 81-88, 92, 300, 305: Print statements in README examples ✅
- Lines 483, 487-488, 492: Print statements in docstring examples in client.py ✅

The debug script also found print statements in existing src/openenv files, but those are:
1. Outside the scope of this PR (pre-existing)
2. Primarily in docstrings and example code
3. One test file (test_local_docker_provider.py) that is marked with a TODO to be removed/refactored

Conclusion: ✅ No Tier 1 fixes required in this PR's changes.
                                                                                  
Tier 2: Alignment Discussion                                                      
                                                                                  
Principle Conflicts                                                               
                                                                                  
None identified                                                                   
                                                                                  
RFC Conflicts                                                                     
                                                                                  
None identified                                                                   
                                                                                  
Architecture Review                                                               
                                                                                  
The Reasoning Gym environment implementation follows OpenEnv patterns correctly:

✅ Gymnasium API compliance - Uses standard reset/step/state signatures
✅ Type safety - Properly typed with generics: Environment[ReasoningGymAction, ReasoningGymObservation, State]
✅ Pydantic models - Action and Observation extend correct base classes
✅ Client-server separation - No server imports in client code; shared models in models.py
✅ Rewards in environment - Scoring handled server-side via reasoning_gym.score_answer()
✅ WebSocket support - Uses EnvClient with WebSocket connections
✅ Container isolation - Dockerfile follows standard patterns
✅ Concurrent sessions - Declares SUPPORTS_CONCURRENT_SESSIONS = True
                                                                                  
Summary                                                                           
                                                                                  
- ✅ 0 mechanical issues to fix                                                   
- ✅ 0 alignment points for human review                                          
- ✅ 0 RFC conflicts to discuss                                                   
                                                                                  
Overall Assessment: This PR is well-aligned with OpenEnv principles and invariants. The implementation follows established patterns from echo_env and other reference environments. All automated checks pass, and no architectural concerns were identified.

zafstojano and others added 6 commits January 25, 2026 16:51
Integrate reasoning_gym library to provide single-step reasoning tasks.
Each episode presents one question from a configurable dataset, the agent
submits an answer, and receives a score (0.0 to 1.0).

Features:
- Single-step episodes: reset() provides question, step() validates answer
- Dataset persistence: Dataset reused across resets until config changes
- Flexible configuration: Supports simple and composite datasets
- Concurrent sessions: Multiple clients can connect simultaneously

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Replace EchoEnv template content with accurate documentation for
Reasoning Gym environment. Update includes:

- Single-step reasoning task workflow
- Dataset configuration (simple and composite)
- Dataset persistence behavior
- Correct action/observation models (answer, score, question)
- Reward structure (score-based, not length-based)
- Use cases for LLM evaluation and agent training

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Show how to access the dataset_metadata field in the Quick Start example,
demonstrating the full observation interface.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

- Add comprehensive test suite with 26 tests covering environment behavior, models, client, and integration workflows
- Fix imports in server files to support both Docker (direct import) and local testing (relative import)
- Fix minor formatting issue in docstring

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
meta-cla bot added the CLA Signed label on Jan 25, 2026
zafstojano changed the title from "Feat/reasoning gym env" to "Add the Reasoning Gym set of environments" on Jan 25, 2026

greptile-apps bot commented Jan 25, 2026

Greptile Overview

Greptile Summary

Added Reasoning Gym environment integration to OpenEnv, providing 100+ single-step reasoning tasks with verifiable rewards.

Key Implementation Details:

  • Single-step episodes where reset() provides a question and step() validates the answer and returns done=True
  • Dataset persistence across resets - calling reset() without parameters reuses the existing dataset and provides the next question
  • Sequential iteration through questions with automatic wrap-around when the dataset is exhausted
  • Support for both simple datasets (e.g., leg_counting) with optional config parameters and composite datasets that blend multiple task types
  • Rewards computed inside environment using reasoning_gym.score_answer(), following the "rewards inside environment" principle
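
For reference, a hedged sketch of what that scoring call looks like when using the reasoning_gym library directly; the exact call site and entry handling in this PR may differ, and the indexed-entry access is an assumption about the dataset API:

import reasoning_gym

# Dataset entries are dicts carrying the question, the ground-truth answer, and
# metadata used for (possibly partial-credit) scoring.
dataset = reasoning_gym.create_dataset("leg_counting", seed=42, size=10)
entry = dataset[0]

score = dataset.score_answer(answer="4", entry=entry)  # float in [0.0, 1.0]
print(entry["question"])
print("expected:", entry["answer"], "score:", score)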

Architecture Alignment:

  • ✅ Follows Gymnasium API patterns with proper reset() and step() signatures
  • ✅ Type-safe implementation using generics: Environment[ReasoningGymAction, ReasoningGymObservation, State]
  • ✅ Client-server separation maintained (client imports only from models.py, not from server/)
  • ✅ WebSocket support via EnvClient base class
  • ✅ Container isolation with multi-stage Dockerfile
  • ✅ Comprehensive test coverage (18 tests covering reset, step, dataset persistence, error cases, and integration workflows)
  • ✅ Declares SUPPORTS_CONCURRENT_SESSIONS = True for concurrent WebSocket sessions

Design Philosophy:
The environment implements a stateful dataset pattern where the dataset configuration persists until explicitly changed. This allows efficient iteration through questions without recreating the dataset on each reset. When reset() is called with new parameters, the dataset is rebuilt; otherwise, it advances to the next question in the sequence.
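
A hedged sketch of how this pattern reads from the client side, written as a simple evaluation loop; my_model is a hypothetical callable that maps a question string to an answer string, and the dataset choice is illustrative:

from reasoning_gym_env import ReasoningGymAction, ReasoningGymEnv


def evaluate(my_model, n_questions: int = 10) -> float:
    """Hypothetical evaluation loop over one dataset; assumes the server runs locally."""
    env = ReasoningGymEnv(base_url="http://localhost:8000")
    result = env.reset(dataset_name="leg_counting", seed=42, size=n_questions)

    scores = []
    for _ in range(n_questions):
        answer = my_model(result.observation.question)
        result = env.step(ReasoningGymAction(answer=answer))  # episode ends here
        scores.append(result.observation.score)
        result = env.reset()  # no args: next question from the same dataset

    env.close()
    return sum(scores) / len(scores)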

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it's a well-implemented new environment that follows all OpenEnv patterns and principles
  • Score reflects comprehensive adherence to OpenEnv invariants and principles: proper client-server separation, rewards computed inside environment, Gymnasium API compliance, full type safety with generics, WebSocket support, container isolation, and comprehensive test coverage. The implementation follows established patterns from reference environments and includes 18 tests covering normal operations, edge cases, and integration workflows. No bugs, security issues, or alignment violations were found.
  • No files require special attention - all implementations are clean and follow established patterns

Important Files Changed

Filename: Overview

envs/reasoning_gym_env/models.py: Pydantic models for actions and observations - clean, properly typed, follows OpenEnv patterns
envs/reasoning_gym_env/client.py: Client implementation - correctly extends EnvClient with proper serialization methods, no server imports
envs/reasoning_gym_env/server/reasoning_gym_environment.py: Core environment logic - implements single-step reasoning tasks with dataset persistence, proper reward computation inside environment
envs/reasoning_gym_env/server/app.py: FastAPI app creation - follows standard create_app pattern with proper imports
tests/envs/test_reasoning_gym_environment.py: Comprehensive test suite - covers reset, step, dataset persistence, error cases, and integration workflows

Sequence Diagram

sequenceDiagram
    participant Client as ReasoningGymEnv<br/>(Client)
    participant WS as WebSocket<br/>Connection
    participant Server as FastAPI<br/>Server
    participant Env as ReasoningGymEnvironment
    participant RG as reasoning_gym<br/>Library

    Note over Client,RG: Initial Setup & First Episode
    Client->>Server: Connect (WebSocket)
    Server->>Env: Create environment instance
    
    Client->>WS: reset(dataset_name='leg_counting',<br/>seed=42, size=10)
    WS->>Server: Forward reset request
    Server->>Env: reset(...)
    Env->>RG: create_dataset('leg_counting',<br/>seed=42, size=10)
    RG-->>Env: Dataset instance
    Env->>Env: Create iterator from dataset
    Env->>Env: Get next question from iterator
    Env-->>Server: ReasoningGymObservation<br/>(question, done=False)
    Server-->>WS: Serialize observation
    WS-->>Client: StepResult with question
    
    Note over Client,RG: Agent Answers Question
    Client->>WS: step(ReasoningGymAction(answer="4"))
    WS->>Server: Forward step request
    Server->>Env: step(action)
    Env->>RG: score_answer(answer, entry)
    RG-->>Env: score (0.0-1.0)
    Env-->>Server: ReasoningGymObservation<br/>(score, correct_answer, done=True)
    Server-->>WS: Serialize observation
    WS-->>Client: StepResult with score
    
    Note over Client,RG: Next Question (Reuse Dataset)
    Client->>WS: reset() [no params]
    WS->>Server: Forward reset request
    Server->>Env: reset()
    Env->>Env: Reuse existing dataset
    Env->>Env: Get next question from iterator
    Note over Env: If iterator exhausted,<br/>wrap around to start
    Env-->>Server: ReasoningGymObservation<br/>(question, done=False)
    Server-->>WS: Serialize observation
    WS-->>Client: StepResult with question
    
    Note over Client,RG: New Dataset Configuration
    Client->>WS: reset(dataset_name='composite',<br/>dataset_specs=[...], seed=99, size=30)
    WS->>Server: Forward reset request
    Server->>Env: reset(...)
    Env->>RG: create_dataset('composite',<br/>datasets=specs, seed=99, size=30)
    RG-->>Env: New dataset instance
    Env->>Env: Create new iterator
    Env->>Env: Get first question
    Env-->>Server: ReasoningGymObservation<br/>(question, done=False)
    Server-->>WS: Serialize observation
    WS-->>Client: StepResult with question


zafstojano commented Jan 28, 2026

tagging @burtenshaw @Darktex for visibility :)
