
Update openenv examples to use environment_factory #5235

Open

sergiopaniego wants to merge 12 commits into main from update-openenv-examples

Conversation

@sergiopaniego
Member

@sergiopaniego sergiopaniego commented Mar 6, 2026

What does this PR do?

TODO:

  • Migrate notebooks
  • Update TRL-OpenEnv guide
  • Add multi-env example

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?


Note

Medium Risk
Changes are documentation- and example-heavy but substantially rewrite the OpenEnv guide and notebooks; risk is mainly broken instructions or non-working sample code for users.

Overview
Updates the OpenEnv documentation and examples to use GRPOTrainer(environment_factory=...) instead of custom rollout_func patterns, including new guidance on environment class/tool method requirements and reward access via the environments parameter.

Reorganizes the Examples Overview and notebooks index to group OpenEnv notebooks/scripts, and expands the OpenEnv guide with revised install/run instructions plus new sections on multi-environment training and server concurrency requirements.

Heavily refactors openenv_sudoku_grpo.ipynb to follow the environment_factory workflow (new SudokuEnv wrapper, updated dataset/prompts/config, and simplified reward functions), reducing notebook size and removing the manual rollout implementation.
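As a rough illustration of the environment-class pattern the rewritten guide describes (tool methods discovered by introspection, rewards exposed to reward functions via the environments parameter), a toy environment might look like the sketch below. The class, method, and property names here are invented for illustration and are not taken from the PR:

```python
class ToyEchoEnv:
    """Toy environment: reset() returns the first prompt, public methods
    become callable tools, and reward values are exposed as properties."""

    def __init__(self):
        self._scores = []

    def reset(self, **kwargs) -> str:
        self._scores = []
        return "Repeat this word back to me: hello"

    def echo(self, text: str) -> str:
        """Tool method (discovered by introspection): echo a word back."""
        self._scores.append(1.0 if text.strip().lower() == "hello" else 0.0)
        return f"You said: {text}"

    @property
    def echo_reward(self) -> float:
        return sum(self._scores) / len(self._scores) if self._scores else 0.0


def echo_reward(completions, environments=None, **kwargs):
    # Reward functions read per-sample rewards from the environment
    # instances the trainer passes in via the `environments` parameter.
    return [env.echo_reward for env in environments]
```

A factory such as `lambda: ToyEchoEnv()` would then be passed as `environment_factory` so the trainer can create one instance per generation.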

Written by Cursor Bugbot for commit cadf6a6.

sergiopaniego and others added 10 commits March 4, 2026 10:12
- catch.py: Format observations as readable text, normalize reward to 0-1,
  handle incomplete episodes
- echo.py: Rename step->echo and MyEchoEnv->EchoToolEnv, wrap in main()
- wordle.py: Normalize reward to 0-1, add RichProgressCallback
- sudoku.py: Fix cumulative message handling (diff-based), add board
  validation for move validity, add progress/hints/tried-moves to responses,
  add LoRA support, tune defaults for memory efficiency
- vllm_generation.py: Add </tool_call> stop token for tool calling loop
- grpo_trainer.py: Skip tool calls for environments that are done

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sergiopaniego
Member Author

PR: Update OpenEnv examples and docs to environment_factory (generated by Claude)

Summary

  • Migrated all OpenEnv example scripts and notebooks from rollout_func to environment_factory: sudoku, catch, browsergym_llm (scripts) and wordle, sudoku (notebooks)
  • New multi_env.py example script demonstrating multi-environment GRPO training (Wordle + Catch in the same run) with per-environment reward functions returning None for non-applicable environments
  • Rewrote docs/source/openenv.md entirely around environment_factory: environment class pattern, tool discovery via introspection, Echo and Wordle tutorials, multi-environment section, server concurrency configuration, and a migration guide from rollout_func
  • Reorganized example_overview.md and examples/notebooks/README.md with dedicated OpenEnv subsections
  • Updated all OpenEnv documentation URLs to current paths (from_hub removed, .html extensions)
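The multi-environment reward pattern in this summary — each reward function returns None for samples that came from a different environment — can be sketched as follows. The dummy classes below are stand-ins; only the None-returning shape is taken from the PR description:

```python
class WordleEnv:
    """Stand-in for the Wordle environment; `reward` is illustrative."""
    reward = 1.0


class CatchEnv:
    """Stand-in for the Catch environment."""
    reward = 0.5


def wordle_reward(completions, environments=None, **kwargs):
    # Returning None marks samples this function does not apply to,
    # so they are excluded from this reward's average.
    return [env.reward if isinstance(env, WordleEnv) else None
            for env in environments]


def catch_reward(completions, environments=None, **kwargs):
    return [env.reward if isinstance(env, CatchEnv) else None
            for env in environments]
```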

Trainer fixes (grpo_trainer.py)

  • Fixed self.llm → self.vllm_generation.llm for colocate mode in the tool-calling loop
  • Added server mode branch for max_model_len (was raising NotImplementedError)
  • Added done-check (_done/done) to stop calling tools for finished environments
  • Added size alignment between tool_mask/logprobs/completion_ids after the tool loop (re-tokenization can cause off-by-one mismatches)
  • Added truncation to max_completion_length to prevent size mismatch errors in long multi-turn episodes
  • Fixed nan metrics crashing JSON serialization in loggers (convert to None). This happens in multi-environment training where per-environment reward functions return None for non-applicable samples, resulting in nan averages
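The last fix addresses a concrete failure mode: averaging a metric over a batch where every per-environment reward was None yields nan, and strict JSON serialization rejects nan. A minimal sketch of the conversion idea (the real change lives in grpo_trainer.py's metric logging; sanitize_metrics is a hypothetical helper):

```python
import json
import math


def sanitize_metrics(metrics: dict) -> dict:
    # Replace nan averages with None so json.dumps(..., allow_nan=False)
    # emits null instead of raising ValueError.
    return {
        key: None if isinstance(val, float) and math.isnan(val) else val
        for key, val in metrics.items()
    }


safe = sanitize_metrics({"rewards/wordle": float("nan"), "rewards/catch": 0.5})
json.dumps(safe, allow_nan=False)  # serializes cleanly; a raw nan would raise
```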

vLLM fix (vllm_generation.py)

  • Added tools parameter to generate() so colocate mode stops generation at </tool_call>, allowing the tool-calling loop to execute tools between turns
  • Passed self.tools from the trainer to enable this

Not migrated

  • browsergym.py: requires multimodal tool responses (screenshots) not yet supported in _tool_call_loop. To migrate, _tool_call_loop needs to: (1) allow tool methods to return structured content ([{"type": "image", ...}]), (2) detect multimodal returns and build proper content blocks instead of str(), and (3) run image processing (pixel_values, image_grid_thw) per tool-calling iteration, not just once at generation start
  • nemo_gym: fundamentally incompatible, external agent servers own generation
  • browsergym notebook: blocked by Playwright/greenlet server issue and model compatibility (FunctionGemma, not Qwen3). To migrate: (1) upgrade openenv-core to async v0.2.2+ so Playwright runs in a dedicated thread pool instead of conflicting with FastAPI's event loop, and (2) investigate Qwen3 compatibility or adapt the notebook to work with FunctionGemma's tool-calling format

Not in this PR

  • Upgrade to openenv-core async v0.2.2+ (server + client .sync())
  • Update HF Space URLs to official openenv org (some scripts still point to personal spaces)

Required HF Space configuration

environment_factory creates N concurrent environment instances (one per generation), each opening a WebSocket to the server. By default, OpenEnv servers only allow 1 concurrent session, so the Spaces need to be configured for concurrency:

  1. In server/environment.py, declare concurrent session support:

     SUPPORTS_CONCURRENT_SESSIONS: bool = True

  2. In server/app.py, set the concurrency limit:

     app = create_app(
         create_my_environment,
         MyAction,
         MyObservation,
         max_concurrent_envs=64,  # must be >= generation_batch_size
     )

max_concurrent_envs should be >= per_device_train_batch_size * gradient_accumulation_steps. For example, with gradient_accumulation_steps=64 and batch size 1, you need at least 64 concurrent sessions.

This applies to all Spaces used with environment_factory: echo, wordle, catch, sudoku, browsergym_llm, carla.
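The sizing rule above is simple arithmetic, but it is easy to get wrong when tuning batch settings; a quick sanity check using the example's assumed config values:

```python
# Values from the example above (assumed, not read from a real config file).
per_device_train_batch_size = 1
gradient_accumulation_steps = 64

# One generation batch opens this many concurrent WebSocket sessions,
# each backed by its own environment instance on the server.
required_sessions = per_device_train_batch_size * gradient_accumulation_steps

max_concurrent_envs = 64  # the value set in server/app.py above
assert max_concurrent_envs >= required_sessions
```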

@sergiopaniego sergiopaniego marked this pull request as ready for review March 13, 2026 14:47
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

# When tools are provided, stop generation after the first tool call so the tool-calling
# loop can execute the tool and feed the result back before the next generation.
if tools and "stop" not in generation_kwargs:
    generation_kwargs["stop"] = ["</tool_call>"]
Member

Why do you need this? The model should output an EOS right after </tool_call>. (Btw, we have no guarantee that all models use the string </tool_call>, so we can't use it here.)

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


"id": "0f9RqHh7KyJl"
},
"metadata": {},
"source": "# @title SudokuEnv class (click to expand)\nimport re\nfrom collections import defaultdict\n\nfrom textarena_env import TextArenaAction, TextArenaEnv\n\n\ndef _is_valid_board_state(board_str: str) -> bool:\n return \"R1\" in board_str and \"R9\" in board_str and \"|\" in board_str\n\n\ndef _parse_board(board_str: str) -> list[list[int]]:\n grid = [[0] * 9 for _ in range(9)]\n if not _is_valid_board_state(board_str):\n return grid\n for line in board_str.split(\"\\n\"):\n line_stripped = line.strip()\n if line_stripped and line_stripped[0] == \"R\" and len(line_stripped) > 1 and line_stripped[1].isdigit():\n row = int(line_stripped[1]) - 1\n cell_part = line_stripped[2:]\n col = 0\n for char in cell_part:\n if char == \".\":\n grid[row][col] = 0\n col += 1\n elif char.isdigit():\n grid[row][col] = int(char)\n col += 1\n return grid\n\n\ndef _count_filled_cells(board_str: str) -> int:\n if not _is_valid_board_state(board_str):\n return 0\n grid = _parse_board(board_str)\n return sum(1 for row in grid for cell in row if cell != 0)\n\n\ndef _get_valid_numbers(grid: list[list[int]], row: int, col: int) -> set[int]:\n if grid[row][col] != 0:\n return set()\n used = set()\n for c in range(9):\n if grid[row][c] != 0:\n used.add(grid[row][c])\n for r in range(9):\n if grid[r][col] != 0:\n used.add(grid[r][col])\n box_row, box_col = 3 * (row // 3), 3 * (col // 3)\n for r in range(box_row, box_row + 3):\n for c in range(box_col, box_col + 3):\n if grid[r][c] != 0:\n used.add(grid[r][c])\n return set(range(1, 10)) - used\n\n\ndef _extract_empty_cells_with_candidates(board_str: str, sort_by_difficulty: bool = True):\n grid = _parse_board(board_str)\n cells_with_candidates = []\n for row in range(9):\n for col in range(9):\n if grid[row][col] == 0:\n candidates = _get_valid_numbers(grid, row, col)\n cells_with_candidates.append((row + 1, col + 1, candidates))\n if sort_by_difficulty:\n cells_with_candidates.sort(key=lambda x: len(x[2]))\n return 
cells_with_candidates\n\n\ndef _extract_empty_cells(board_str: str) -> list[tuple[int, int]]:\n empty_cells = []\n if not _is_valid_board_state(board_str):\n return empty_cells\n for line in board_str.split(\"\\n\"):\n line_stripped = line.strip()\n if line_stripped and line_stripped[0] == \"R\" and len(line_stripped) > 1 and line_stripped[1].isdigit():\n row = int(line_stripped[1])\n cell_part = line_stripped[2:]\n col = 0\n for char in cell_part:\n if char == \".\":\n col += 1\n empty_cells.append((row, col))\n elif char.isdigit():\n col += 1\n return empty_cells\n\n\ndef _extract_board_only(text: str) -> str:\n if not text:\n return \"\"\n lines = text.split(\"\\n\")\n board_lines = []\n in_board = False\n for line in lines:\n stripped = line.strip()\n if stripped.startswith(\"C1\") or (\n stripped and stripped[0] == \"R\" and len(stripped) > 1 and stripped[1].isdigit()\n ):\n in_board = True\n if in_board and (stripped.startswith(\"-\") or stripped.startswith(\"R\") or stripped.startswith(\"C1\")):\n board_lines.append(line)\n elif (\n in_board\n and stripped\n and not stripped.startswith(\"-\")\n and not (stripped[0] == \"R\" and len(stripped) > 1 and stripped[1].isdigit())\n ):\n break\n return \"\\n\".join(board_lines) if board_lines else \"\"\n\n\ndef _make_compact_prompt(board, step, successful_moves, failed_moves, difficulty=\"easy\"):\n cells_filled = len(successful_moves)\n summary = f\"Step {step}. 
Progress: {cells_filled} cells filled.\"\n board_only = _extract_board_only(board) if board else \"No board available.\"\n tried_moves_hint = \"\"\n all_tried = successful_moves + failed_moves\n if all_tried:\n tried_moves_hint = f\"\\n\\nMOVES ALREADY TRIED (do not repeat): {', '.join(all_tried)}\"\n hints = \"\"\n if difficulty == \"easy\" and board:\n cells_with_candidates = _extract_empty_cells_with_candidates(board, sort_by_difficulty=True)\n if cells_with_candidates:\n guaranteed = []\n other_hints = []\n for row, col, candidates in cells_with_candidates[:10]:\n if len(candidates) == 1:\n num = list(candidates)[0]\n guaranteed.append(f\"[{row} {col} {num}]\")\n elif len(candidates) <= 3:\n nums = \",\".join(str(n) for n in sorted(candidates))\n other_hints.append(f\"({row},{col})->{nums}\")\n if guaranteed:\n hints = f\"\\n\\nGUARANTEED MOVES: {', '.join(guaranteed[:5])}\"\n if other_hints:\n hints += f\"\\nOther options: {' | '.join(other_hints[:5])}\"\n elif difficulty == \"medium\" and board:\n cells_with_candidates = _extract_empty_cells_with_candidates(board, sort_by_difficulty=False)\n if cells_with_candidates:\n cell_hints = []\n for row, col, candidates in cells_with_candidates[:10]:\n nums = \",\".join(str(n) for n in sorted(candidates))\n cell_hints.append(f\"({row},{col})->{nums}\")\n if cell_hints:\n hints = f\"\\n\\nEmpty cells: {' | '.join(cell_hints)}\"\n return f\"{summary}\\n\\nBoard:\\n{board_only}{tried_moves_hint}{hints}\\n\\nYour move:\"\n\n\nclass SudokuEnv:\n def __init__(self):\n self.client = TextArenaEnv(base_url=\"https://openenv-sudoku.hf.space\")\n self.difficulty = \"easy\"\n self.max_turns = 100\n self._turn = 0\n self._move_counts = defaultdict(int)\n self._successful_moves = []\n self._failed_moves = []\n self._valid_move_scores = []\n self._empty_cell_scores = []\n self._correct_scores = []\n self._repetition_scores = []\n self._last_board_state = \"\"\n self._last_full_content = \"\"\n self._initial_filled = 0\n 
self._max_filled = 0\n\n def reset(self, **kwargs) -> str | None:\n result = self.client.reset()\n observation = result.observation\n self.done = False\n self._turn = 0\n self._move_counts = defaultdict(int)\n self._successful_moves = []\n self._failed_moves = []\n self._valid_move_scores = []\n self._empty_cell_scores = []\n self._correct_scores = []\n self._repetition_scores = []\n self._last_board_state = \"\"\n self._initial_filled = 0\n self._max_filled = 0\n\n # Store full message content for diffing (messages are cumulative)\n self._last_full_content = observation.messages[0].content if observation.messages else \"\"\n\n for message in observation.messages:\n if message.content and _is_valid_board_state(message.content):\n self._last_board_state = message.content\n self._initial_filled = _count_filled_cells(self._last_board_state)\n self._max_filled = self._initial_filled\n break\n\n return _make_compact_prompt(\n board=self._last_board_state,\n step=1,\n successful_moves=[],\n failed_moves=[],\n difficulty=self.difficulty,\n )\n\n def place(self, row: int, col: int, number: int) -> str:\n \"\"\"Place a number on the Sudoku board.\n\n Args:\n row: Row number (1-9).\n col: Column number (1-9).\n number: The digit to place (1-9).\n\n Returns:\n The updated board state or feedback from the environment.\n \"\"\"\n if self.done:\n raise ValueError(\"Game over.\")\n\n self._turn += 1\n move = f\"[{row} {col} {number}]\"\n\n # Step environment\n result = self.client.step(TextArenaAction(message=move))\n observation = result.observation\n correct_score = float(result.reward or 0.0)\n self.done = result.done\n\n # Only check the NEW content for feedback (messages are cumulative)\n full_content = observation.messages[0].content if observation.messages else \"\"\n new_content = full_content[len(self._last_full_content):]\n self._last_full_content = full_content\n\n new_content_lower = new_content.lower()\n valid_move = not any(\n kw in new_content_lower for kw in 
[\"invalid\", \"error\", \"cannot\", \"already\", \"violation\", \"lost\"]\n )\n got_warning = \"please resubmit\" in new_content_lower or \"avoid penalties\" in new_content_lower\n\n # Update board state from new content\n board_state = \"\"\n if full_content and \"|\" in full_content and \"R1\" in full_content:\n board_state = full_content\n\n # Empty cell score\n if self._last_board_state and move:\n empty_cells = _extract_empty_cells(self._last_board_state)\n targets_empty = (row, col) in empty_cells\n empty_cell_score = 1.0 if targets_empty else -1.0\n else:\n empty_cell_score = 0.0\n\n # Repetition score\n is_new_move = self._move_counts[move] == 0\n repetition_count = self._move_counts[move]\n self._move_counts[move] += 1\n if repetition_count > 0:\n repetition_score = -min(2 ** (repetition_count - 1), 10.0)\n else:\n repetition_score = 0.0\n\n # Valid move score\n if valid_move and is_new_move:\n valid_move_score = 1.0\n if move:\n self._successful_moves.append(move)\n elif got_warning:\n valid_move_score = -0.5\n if move:\n self._failed_moves.append(move)\n else:\n valid_move_score = 0.0\n\n # Update board state\n if board_state and _is_valid_board_state(board_state):\n self._last_board_state = board_state\n current_filled = _count_filled_cells(self._last_board_state)\n if current_filled > self._max_filled:\n self._max_filled = current_filled\n\n self._valid_move_scores.append(valid_move_score)\n self._empty_cell_scores.append(empty_cell_score)\n self._correct_scores.append(correct_score)\n self._repetition_scores.append(repetition_score)\n\n # Return a compact prompt for the next turn\n return _make_compact_prompt(\n board=self._last_board_state,\n step=self._turn + 1,\n successful_moves=self._successful_moves,\n failed_moves=self._failed_moves,\n difficulty=self.difficulty,\n )\n\n # ── Reward properties ──\n\n @property\n def correct_reward(self) -> float:\n return self._correct_scores[-1] if self._correct_scores else 0.0\n\n @property\n def 
valid_move_reward(self) -> float:\n return sum(self._valid_move_scores) / len(self._valid_move_scores) if self._valid_move_scores else 0.0\n\n @property\n def empty_cell_reward(self) -> float:\n return sum(self._empty_cell_scores) / len(self._empty_cell_scores) if self._empty_cell_scores else 0.0\n\n @property\n def repetition_reward(self) -> float:\n return sum(self._repetition_scores) / len(self._repetition_scores) if self._repetition_scores else 0.0\n\n @property\n def progress_reward(self) -> float:\n remaining_to_fill = 81 - self._initial_filled\n if remaining_to_fill > 0:\n return (self._max_filled - self._initial_filled) / remaining_to_fill\n return 1.0",

Notebook uses cumulative content instead of new content for board state

Medium Severity

In the notebook's SudokuEnv.place() method, board_state is assigned from full_content (the entire cumulative game history) instead of new_content (the delta since the last turn). The corresponding sudoku.py script correctly uses new_content. This causes self._last_board_state to hold the full game text, which makes _extract_board_only() return the initial board (not the current one) and _extract_empty_cells() return cells from all historical boards. As a result, the hints shown to the model are stale, and the empty_cell_reward is incorrectly permissive (cells filled in earlier turns still appear as empty).


Comment on lines +1510 to +1511
if getattr(env, "_done", False) or getattr(env, "done", False):
    continue
Member

This feels like extending the assumptions we make about what an environment is. To ensure max compatibility, we should aim for a definition of an environment as minimal as possible, so unless this is strictly necessary I’d prefer not to introduce additional structure.

In particular, I don’t think we need special handling for done. The environment can simply include that signal in the observation it returns, and the model should learn not to query the environment again after that point.

If respecting done is important, the environment can raise an exception when it is called after termination. That makes the constraint explicit and helps the model learn that calling the environment again in that state is invalid.

Comment on lines +1522 to +1544
for i in range(len(tool_mask)):
    comp_len = len(completion_ids[i])
    mask_len = len(tool_mask[i])
    if mask_len > comp_len:
        tool_mask[i] = tool_mask[i][:comp_len]
    elif mask_len < comp_len:
        tool_mask[i] = tool_mask[i] + [1] * (comp_len - mask_len)
if logprobs is not None:
    for i in range(len(logprobs)):
        comp_len = len(completion_ids[i])
        lp_len = len(logprobs[i])
        if lp_len > comp_len:
            logprobs[i] = logprobs[i][:comp_len]
        elif lp_len < comp_len:
            logprobs[i] = logprobs[i] + [0.0] * (comp_len - lp_len)

# Truncate to max_completion_length to prevent size mismatches downstream
for i in range(len(completion_ids)):
    if len(completion_ids[i]) > self.max_completion_length:
        completion_ids[i] = completion_ids[i][: self.max_completion_length]
        tool_mask[i] = tool_mask[i][: self.max_completion_length]
        if logprobs is not None:
            logprobs[i] = logprobs[i][: self.max_completion_length]
Member

Everything here can be removed; we're addressing the issue in #5224.

Comment on lines +2339 to +2342
metrics = {}
for key, val in self._metrics[mode].items():
    avg = sum(val) / len(val)
    metrics[key] = None if math.isnan(avg) else avg
Member

Why do you need this?

3 participants