Update openenv examples to use environment_factory #5235

sergiopaniego wants to merge 12 commits into `main`

Conversation
- catch.py: Format observations as readable text, normalize reward to 0-1, handle incomplete episodes
- echo.py: Rename step -> echo and MyEchoEnv -> EchoToolEnv, wrap in main()
- wordle.py: Normalize reward to 0-1, add RichProgressCallback
- sudoku.py: Fix cumulative message handling (diff-based), add board validation for move validity, add progress/hints/tried-moves to responses, add LoRA support, tune defaults for memory efficiency
- vllm_generation.py: Add </tool_call> stop token for tool calling loop
- grpo_trainer.py: Skip tool calls for environments that are done

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR: Update OpenEnv examples and docs to use `environment_factory`

Summary

Trainer fixes
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
# When tools are provided, stop generation after the first tool call so the tool-calling
# loop can execute the tool and feed the result back before the next generation.
if tools and "stop" not in generation_kwargs:
    generation_kwargs["stop"] = ["</tool_call>"]
```
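For context, a minimal sketch of the loop this stop string supports. The `generate` and `run_tool` callables here are hypothetical stand-ins for the real generation backend and tool dispatcher, not the trainer's actual API:

```python
def tool_calling_loop(prompt, generate, run_tool, max_rounds=4):
    """Alternate between model generation and tool execution.

    `generate` and `run_tool` are hypothetical callables standing in for
    the real generation backend and tool dispatcher.
    """
    transcript = prompt
    for _ in range(max_rounds):
        # Generation halts at the stop string so the tool can run
        # before the model continues.
        chunk = generate(transcript, stop=["</tool_call>"])
        transcript += chunk
        if "<tool_call>" not in chunk:
            break  # no tool requested; the model answered directly
        result = run_tool(chunk)
        transcript += f"\n<tool_response>{result}</tool_response>\n"
    return transcript
```

Each round either ends the episode (plain answer) or feeds a tool result back into the context before regenerating.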
Why do you need this? The model should output an EOS right after `</tool_call>`. (Btw, we have no guarantee that all models use the string `</tool_call>`, so we can't use it here.)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Notebook cell `0f9RqHh7KyJl`:

```python
# @title SudokuEnv class (click to expand)
import re
from collections import defaultdict

from textarena_env import TextArenaAction, TextArenaEnv


def _is_valid_board_state(board_str: str) -> bool:
    return "R1" in board_str and "R9" in board_str and "|" in board_str


def _parse_board(board_str: str) -> list[list[int]]:
    grid = [[0] * 9 for _ in range(9)]
    if not _is_valid_board_state(board_str):
        return grid
    for line in board_str.split("\n"):
        line_stripped = line.strip()
        if line_stripped and line_stripped[0] == "R" and len(line_stripped) > 1 and line_stripped[1].isdigit():
            row = int(line_stripped[1]) - 1
            cell_part = line_stripped[2:]
            col = 0
            for char in cell_part:
                if char == ".":
                    grid[row][col] = 0
                    col += 1
                elif char.isdigit():
                    grid[row][col] = int(char)
                    col += 1
    return grid


def _count_filled_cells(board_str: str) -> int:
    if not _is_valid_board_state(board_str):
        return 0
    grid = _parse_board(board_str)
    return sum(1 for row in grid for cell in row if cell != 0)


def _get_valid_numbers(grid: list[list[int]], row: int, col: int) -> set[int]:
    if grid[row][col] != 0:
        return set()
    used = set()
    for c in range(9):
        if grid[row][c] != 0:
            used.add(grid[row][c])
    for r in range(9):
        if grid[r][col] != 0:
            used.add(grid[r][col])
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    for r in range(box_row, box_row + 3):
        for c in range(box_col, box_col + 3):
            if grid[r][c] != 0:
                used.add(grid[r][c])
    return set(range(1, 10)) - used


def _extract_empty_cells_with_candidates(board_str: str, sort_by_difficulty: bool = True):
    grid = _parse_board(board_str)
    cells_with_candidates = []
    for row in range(9):
        for col in range(9):
            if grid[row][col] == 0:
                candidates = _get_valid_numbers(grid, row, col)
                cells_with_candidates.append((row + 1, col + 1, candidates))
    if sort_by_difficulty:
        cells_with_candidates.sort(key=lambda x: len(x[2]))
    return cells_with_candidates


def _extract_empty_cells(board_str: str) -> list[tuple[int, int]]:
    empty_cells = []
    if not _is_valid_board_state(board_str):
        return empty_cells
    for line in board_str.split("\n"):
        line_stripped = line.strip()
        if line_stripped and line_stripped[0] == "R" and len(line_stripped) > 1 and line_stripped[1].isdigit():
            row = int(line_stripped[1])
            cell_part = line_stripped[2:]
            col = 0
            for char in cell_part:
                if char == ".":
                    col += 1
                    empty_cells.append((row, col))
                elif char.isdigit():
                    col += 1
    return empty_cells


def _extract_board_only(text: str) -> str:
    if not text:
        return ""
    lines = text.split("\n")
    board_lines = []
    in_board = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("C1") or (
            stripped and stripped[0] == "R" and len(stripped) > 1 and stripped[1].isdigit()
        ):
            in_board = True
        if in_board and (stripped.startswith("-") or stripped.startswith("R") or stripped.startswith("C1")):
            board_lines.append(line)
        elif (
            in_board
            and stripped
            and not stripped.startswith("-")
            and not (stripped[0] == "R" and len(stripped) > 1 and stripped[1].isdigit())
        ):
            break
    return "\n".join(board_lines) if board_lines else ""


def _make_compact_prompt(board, step, successful_moves, failed_moves, difficulty="easy"):
    cells_filled = len(successful_moves)
    summary = f"Step {step}. Progress: {cells_filled} cells filled."
    board_only = _extract_board_only(board) if board else "No board available."
    tried_moves_hint = ""
    all_tried = successful_moves + failed_moves
    if all_tried:
        tried_moves_hint = f"\n\nMOVES ALREADY TRIED (do not repeat): {', '.join(all_tried)}"
    hints = ""
    if difficulty == "easy" and board:
        cells_with_candidates = _extract_empty_cells_with_candidates(board, sort_by_difficulty=True)
        if cells_with_candidates:
            guaranteed = []
            other_hints = []
            for row, col, candidates in cells_with_candidates[:10]:
                if len(candidates) == 1:
                    num = list(candidates)[0]
                    guaranteed.append(f"[{row} {col} {num}]")
                elif len(candidates) <= 3:
                    nums = ",".join(str(n) for n in sorted(candidates))
                    other_hints.append(f"({row},{col})->{nums}")
            if guaranteed:
                hints = f"\n\nGUARANTEED MOVES: {', '.join(guaranteed[:5])}"
            if other_hints:
                hints += f"\nOther options: {' | '.join(other_hints[:5])}"
    elif difficulty == "medium" and board:
        cells_with_candidates = _extract_empty_cells_with_candidates(board, sort_by_difficulty=False)
        if cells_with_candidates:
            cell_hints = []
            for row, col, candidates in cells_with_candidates[:10]:
                nums = ",".join(str(n) for n in sorted(candidates))
                cell_hints.append(f"({row},{col})->{nums}")
            if cell_hints:
                hints = f"\n\nEmpty cells: {' | '.join(cell_hints)}"
    return f"{summary}\n\nBoard:\n{board_only}{tried_moves_hint}{hints}\n\nYour move:"


class SudokuEnv:
    def __init__(self):
        self.client = TextArenaEnv(base_url="https://openenv-sudoku.hf.space")
        self.difficulty = "easy"
        self.max_turns = 100
        self._turn = 0
        self._move_counts = defaultdict(int)
        self._successful_moves = []
        self._failed_moves = []
        self._valid_move_scores = []
        self._empty_cell_scores = []
        self._correct_scores = []
        self._repetition_scores = []
        self._last_board_state = ""
        self._last_full_content = ""
        self._initial_filled = 0
        self._max_filled = 0

    def reset(self, **kwargs) -> str | None:
        result = self.client.reset()
        observation = result.observation
        self.done = False
        self._turn = 0
        self._move_counts = defaultdict(int)
        self._successful_moves = []
        self._failed_moves = []
        self._valid_move_scores = []
        self._empty_cell_scores = []
        self._correct_scores = []
        self._repetition_scores = []
        self._last_board_state = ""
        self._initial_filled = 0
        self._max_filled = 0

        # Store full message content for diffing (messages are cumulative)
        self._last_full_content = observation.messages[0].content if observation.messages else ""

        for message in observation.messages:
            if message.content and _is_valid_board_state(message.content):
                self._last_board_state = message.content
                self._initial_filled = _count_filled_cells(self._last_board_state)
                self._max_filled = self._initial_filled
                break

        return _make_compact_prompt(
            board=self._last_board_state,
            step=1,
            successful_moves=[],
            failed_moves=[],
            difficulty=self.difficulty,
        )

    def place(self, row: int, col: int, number: int) -> str:
        """Place a number on the Sudoku board.

        Args:
            row: Row number (1-9).
            col: Column number (1-9).
            number: The digit to place (1-9).

        Returns:
            The updated board state or feedback from the environment.
        """
        if self.done:
            raise ValueError("Game over.")

        self._turn += 1
        move = f"[{row} {col} {number}]"

        # Step environment
        result = self.client.step(TextArenaAction(message=move))
        observation = result.observation
        correct_score = float(result.reward or 0.0)
        self.done = result.done

        # Only check the NEW content for feedback (messages are cumulative)
        full_content = observation.messages[0].content if observation.messages else ""
        new_content = full_content[len(self._last_full_content):]
        self._last_full_content = full_content

        new_content_lower = new_content.lower()
        valid_move = not any(
            kw in new_content_lower for kw in ["invalid", "error", "cannot", "already", "violation", "lost"]
        )
        got_warning = "please resubmit" in new_content_lower or "avoid penalties" in new_content_lower

        # Update board state from new content
        board_state = ""
        if full_content and "|" in full_content and "R1" in full_content:
            board_state = full_content

        # Empty cell score
        if self._last_board_state and move:
            empty_cells = _extract_empty_cells(self._last_board_state)
            targets_empty = (row, col) in empty_cells
            empty_cell_score = 1.0 if targets_empty else -1.0
        else:
            empty_cell_score = 0.0

        # Repetition score
        is_new_move = self._move_counts[move] == 0
        repetition_count = self._move_counts[move]
        self._move_counts[move] += 1
        if repetition_count > 0:
            repetition_score = -min(2 ** (repetition_count - 1), 10.0)
        else:
            repetition_score = 0.0

        # Valid move score
        if valid_move and is_new_move:
            valid_move_score = 1.0
            if move:
                self._successful_moves.append(move)
        elif got_warning:
            valid_move_score = -0.5
            if move:
                self._failed_moves.append(move)
        else:
            valid_move_score = 0.0

        # Update board state
        if board_state and _is_valid_board_state(board_state):
            self._last_board_state = board_state
            current_filled = _count_filled_cells(self._last_board_state)
            if current_filled > self._max_filled:
                self._max_filled = current_filled

        self._valid_move_scores.append(valid_move_score)
        self._empty_cell_scores.append(empty_cell_score)
        self._correct_scores.append(correct_score)
        self._repetition_scores.append(repetition_score)

        # Return a compact prompt for the next turn
        return _make_compact_prompt(
            board=self._last_board_state,
            step=self._turn + 1,
            successful_moves=self._successful_moves,
            failed_moves=self._failed_moves,
            difficulty=self.difficulty,
        )

    # ── Reward properties ──

    @property
    def correct_reward(self) -> float:
        return self._correct_scores[-1] if self._correct_scores else 0.0

    @property
    def valid_move_reward(self) -> float:
        return sum(self._valid_move_scores) / len(self._valid_move_scores) if self._valid_move_scores else 0.0

    @property
    def empty_cell_reward(self) -> float:
        return sum(self._empty_cell_scores) / len(self._empty_cell_scores) if self._empty_cell_scores else 0.0

    @property
    def repetition_reward(self) -> float:
        return sum(self._repetition_scores) / len(self._repetition_scores) if self._repetition_scores else 0.0

    @property
    def progress_reward(self) -> float:
        remaining_to_fill = 81 - self._initial_filled
        if remaining_to_fill > 0:
            return (self._max_filled - self._initial_filled) / remaining_to_fill
        return 1.0
```
Notebook uses cumulative content instead of new content for board state
Medium Severity
In the notebook's SudokuEnv.place() method, board_state is assigned from full_content (the entire cumulative game history) instead of new_content (the delta since the last turn). The corresponding sudoku.py script correctly uses new_content. This causes self._last_board_state to hold the full game text, which makes _extract_board_only() return the initial board (not the current one) and _extract_empty_cells() return cells from all historical boards. As a result, the hints shown to the model are stale, and the empty_cell_reward is incorrectly permissive (cells filled in earlier turns still appear as empty).
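The diff-based handling the script uses can be illustrated in isolation. This is a sketch of the pattern (the class name is invented, not the notebook's code):

```python
class CumulativeMessageDiffer:
    """Track cumulative message content and expose only the per-turn delta.

    TextArena-style environments resend the full game history each turn;
    feedback checks should look only at the newly appended text.
    """

    def __init__(self):
        self._last_full_content = ""

    def new_content(self, full_content: str) -> str:
        # The new text is whatever was appended since the last call.
        delta = full_content[len(self._last_full_content):]
        self._last_full_content = full_content
        return delta
```

Checking `delta` rather than `full_content` is what keeps stale history (old boards, old error messages) from leaking into validity checks and rewards.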
```python
if getattr(env, "_done", False) or getattr(env, "done", False):
    continue
```
This feels like extending the assumptions we make about what an environment is. To ensure max compatibility, we should aim for a definition of an environment as minimal as possible, so unless this is strictly necessary I’d prefer not to introduce additional structure.
In particular, I don’t think we need special handling for done. The environment can simply include that signal in the observation it returns, and the model should learn not to query the environment again after that point.
If respecting done is important, the environment can raise an exception when it is called after termination. That makes the constraint explicit and helps the model learn that calling the environment again in that state is invalid.
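The raise-on-termination alternative could look like this. A minimal sketch with a hypothetical `EchoEnv`, not TRL's actual environment interface:

```python
class EchoEnv:
    """Minimal environment that refuses calls after termination."""

    def __init__(self, max_turns: int = 3):
        self.max_turns = max_turns
        self._turn = 0
        self.done = False

    def echo(self, message: str) -> str:
        if self.done:
            # Make the constraint explicit: calling a finished
            # environment is invalid, so fail loudly instead of
            # relying on the trainer to check a done flag.
            raise RuntimeError("Environment is done; call reset() first.")
        self._turn += 1
        if self._turn >= self.max_turns:
            self.done = True
        return f"echo: {message} (done={self.done})"
```

The done signal still reaches the model through the observation string, so nothing extra is assumed about the environment's structure.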
```python
for i in range(len(tool_mask)):
    comp_len = len(completion_ids[i])
    mask_len = len(tool_mask[i])
    if mask_len > comp_len:
        tool_mask[i] = tool_mask[i][:comp_len]
    elif mask_len < comp_len:
        tool_mask[i] = tool_mask[i] + [1] * (comp_len - mask_len)
if logprobs is not None:
    for i in range(len(logprobs)):
        comp_len = len(completion_ids[i])
        lp_len = len(logprobs[i])
        if lp_len > comp_len:
            logprobs[i] = logprobs[i][:comp_len]
        elif lp_len < comp_len:
            logprobs[i] = logprobs[i] + [0.0] * (comp_len - lp_len)
```
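The alignment logic in this hunk can be factored into one standalone helper. A sketch, with `pad_value` generalizing the `1` used for the tool mask and the `0.0` used for logprobs:

```python
def align_to_completion(seqs, completion_ids, pad_value):
    """Truncate or right-pad each per-sample sequence so its length
    matches the corresponding completion's token count."""
    for i in range(len(seqs)):
        comp_len = len(completion_ids[i])
        if len(seqs[i]) > comp_len:
            seqs[i] = seqs[i][:comp_len]        # too long: clip
        elif len(seqs[i]) < comp_len:
            seqs[i] = seqs[i] + [pad_value] * (comp_len - len(seqs[i]))  # too short: pad
    return seqs
```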
```python
# Truncate to max_completion_length to prevent size mismatches downstream
for i in range(len(completion_ids)):
    if len(completion_ids[i]) > self.max_completion_length:
        completion_ids[i] = completion_ids[i][: self.max_completion_length]
        tool_mask[i] = tool_mask[i][: self.max_completion_length]
        if logprobs is not None:
            logprobs[i] = logprobs[i][: self.max_completion_length]
```
everything here can be removed, we're addressing the issue in #5224
```python
metrics = {}
for key, val in self._metrics[mode].items():
    avg = sum(val) / len(val)
    metrics[key] = None if math.isnan(avg) else avg
```
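Isolated, the NaN guard in that hunk works like this (a sketch; `metric_lists` here is a plain dict standing in for `self._metrics[mode]`):

```python
import math


def average_metrics(metric_lists):
    """Average each metric's recorded values, mapping a NaN average to
    None so it serializes cleanly (e.g. in JSON logs)."""
    metrics = {}
    for key, val in metric_lists.items():
        avg = sum(val) / len(val)
        metrics[key] = None if math.isnan(avg) else avg
    return metrics
```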
What does this PR do?
TODO:
Before submitting
Who can review?
Note
Medium Risk
Changes are documentation- and example-heavy but substantially rewrite the OpenEnv guide and notebooks; risk is mainly broken instructions or non-working sample code for users.
Overview
Updates the OpenEnv documentation and examples to use `GRPOTrainer(environment_factory=...)` instead of custom `rollout_func` patterns, including new guidance on environment class/tool method requirements and reward access via the `environments` parameter.

Reorganizes the Examples Overview and notebooks index to group OpenEnv notebooks/scripts, and expands the OpenEnv guide with revised install/run instructions plus new sections on multi-environment training and server concurrency requirements.
Heavily refactors `openenv_sudoku_grpo.ipynb` to follow the `environment_factory` workflow (new `SudokuEnv` wrapper, updated dataset/prompts/config, and simplified reward functions), reducing notebook size and removing the manual rollout implementation.

Written by Cursor Bugbot for commit cadf6a6. This will update automatically on new commits. Configure here.