
Update openenv examples to use environment_factory #5235

Open

sergiopaniego wants to merge 12 commits into main from update-openenv-examples

Conversation

@sergiopaniego
Member

@sergiopaniego sergiopaniego commented Mar 6, 2026

What does this PR do?

TODO:

  • Migrate notebooks
  • Update TRL-OpenEnv guide
  • Add multi-env example

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?


Note

Medium Risk
Changes are documentation- and example-heavy but substantially rewrite the OpenEnv guide and notebooks; risk is mainly broken instructions or non-working sample code for users.

Overview
Updates the OpenEnv documentation and examples to use GRPOTrainer(environment_factory=...) instead of custom rollout_func patterns, including new guidance on environment class/tool method requirements and reward access via the environments parameter.

Reorganizes the Examples Overview and notebooks index to group OpenEnv notebooks/scripts, and expands the OpenEnv guide with revised install/run instructions plus new sections on multi-environment training and server concurrency requirements.

Heavily refactors openenv_sudoku_grpo.ipynb to follow the environment_factory workflow (new SudokuEnv wrapper, updated dataset/prompts/config, and simplified reward functions), reducing notebook size and removing the manual rollout implementation.
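As a rough illustration of the environment-class pattern the rewritten guide describes (tool methods discovered by introspection, rewards exposed to reward functions via the environments parameter), a toy environment might look like the sketch below. The class, method, and property names here are invented for illustration and are not taken from the PR:

```python
class ToyEchoEnv:
    """Toy environment: reset() returns the first prompt, public methods
    become callable tools, and reward values are exposed as properties."""

    def __init__(self):
        self._scores = []

    def reset(self, **kwargs) -> str:
        self._scores = []
        return "Repeat this word back to me: hello"

    def echo(self, text: str) -> str:
        """Tool method (discovered by introspection): echo a word back."""
        self._scores.append(1.0 if text.strip().lower() == "hello" else 0.0)
        return f"You said: {text}"

    @property
    def echo_reward(self) -> float:
        return sum(self._scores) / len(self._scores) if self._scores else 0.0


def echo_reward(completions, environments=None, **kwargs):
    # Reward functions read per-sample rewards from the environment
    # instances the trainer passes in via the `environments` parameter.
    return [env.echo_reward for env in environments]
```

A factory such as `lambda: ToyEchoEnv()` would then be passed as `environment_factory` so the trainer can create one instance per generation.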

Written by Cursor Bugbot for commit cadf6a6.

sergiopaniego and others added 10 commits March 4, 2026 10:12
- catch.py: Format observations as readable text, normalize reward to 0-1,
  handle incomplete episodes
- echo.py: Rename step->echo and MyEchoEnv->EchoToolEnv, wrap in main()
- wordle.py: Normalize reward to 0-1, add RichProgressCallback
- sudoku.py: Fix cumulative message handling (diff-based), add board
  validation for move validity, add progress/hints/tried-moves to responses,
  add LoRA support, tune defaults for memory efficiency
- vllm_generation.py: Add </tool_call> stop token for tool calling loop
- grpo_trainer.py: Skip tool calls for environments that are done

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sergiopaniego
Member Author

PR: Update OpenEnv examples and docs to environment_factory (generated by Claude)

Summary

  • Migrated all OpenEnv example scripts and notebooks from rollout_func to environment_factory: sudoku, catch, browsergym_llm (scripts) and wordle, sudoku (notebooks)
  • New multi_env.py example script demonstrating multi-environment GRPO training (Wordle + Catch in the same run) with per-environment reward functions returning None for non-applicable environments
  • Rewrote docs/source/openenv.md entirely around environment_factory: environment class pattern, tool discovery via introspection, Echo and Wordle tutorials, multi-environment section, server concurrency configuration, and a migration guide from rollout_func
  • Reorganized example_overview.md and examples/notebooks/README.md with dedicated OpenEnv subsections
  • Updated all OpenEnv documentation URLs to current paths (from_hub removed, .html extensions)
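The multi-environment reward pattern in this summary — each reward function returns None for samples that came from a different environment — can be sketched as follows. The dummy classes below are stand-ins; only the None-returning shape is taken from the PR description:

```python
class WordleEnv:
    """Stand-in for the Wordle environment; `reward` is illustrative."""
    reward = 1.0


class CatchEnv:
    """Stand-in for the Catch environment."""
    reward = 0.5


def wordle_reward(completions, environments=None, **kwargs):
    # Returning None marks samples this function does not apply to,
    # so they are excluded from this reward's average.
    return [env.reward if isinstance(env, WordleEnv) else None
            for env in environments]


def catch_reward(completions, environments=None, **kwargs):
    return [env.reward if isinstance(env, CatchEnv) else None
            for env in environments]
```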

Trainer fixes (grpo_trainer.py)

  • Fixed self.llm → self.vllm_generation.llm for colocate mode in the tool-calling loop
  • Added server mode branch for max_model_len (was raising NotImplementedError)
  • Added done-check (_done/done) to stop calling tools for finished environments
  • Added size alignment between tool_mask/logprobs/completion_ids after the tool loop (re-tokenization can cause off-by-one mismatches)
  • Added truncation to max_completion_length to prevent size mismatch errors in long multi-turn episodes
  • Fixed nan metrics crashing JSON serialization in loggers (convert to None). This happens in multi-environment training where per-environment reward functions return None for non-applicable samples, resulting in nan averages
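The last fix addresses a concrete failure mode: averaging a metric over a batch where every per-environment reward was None yields nan, and strict JSON serialization rejects nan. A minimal sketch of the conversion idea (the real change lives in grpo_trainer.py's metric logging; sanitize_metrics is a hypothetical helper):

```python
import json
import math


def sanitize_metrics(metrics: dict) -> dict:
    # Replace nan averages with None so json.dumps(..., allow_nan=False)
    # emits null instead of raising ValueError.
    return {
        key: None if isinstance(val, float) and math.isnan(val) else val
        for key, val in metrics.items()
    }


safe = sanitize_metrics({"rewards/wordle": float("nan"), "rewards/catch": 0.5})
json.dumps(safe, allow_nan=False)  # serializes cleanly; a raw nan would raise
```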

vLLM fix (vllm_generation.py)

  • Added tools parameter to generate() so colocate mode stops generation at </tool_call>, allowing the tool-calling loop to execute tools between turns
  • Passed self.tools from the trainer to enable this

Not migrated

  • browsergym.py: requires multimodal tool responses (screenshots) not yet supported in _tool_call_loop. To migrate, _tool_call_loop needs to: (1) allow tool methods to return structured content ([{"type": "image", ...}]), (2) detect multimodal returns and build proper content blocks instead of str(), and (3) run image processing (pixel_values, image_grid_thw) per tool-calling iteration, not just once at generation start
  • nemo_gym: fundamentally incompatible, external agent servers own generation
  • browsergym notebook: blocked by Playwright/greenlet server issue and model compatibility (FunctionGemma, not Qwen3). To migrate: (1) upgrade openenv-core to async v0.2.2+ so Playwright runs in a dedicated thread pool instead of conflicting with FastAPI's event loop, and (2) investigate Qwen3 compatibility or adapt the notebook to work with FunctionGemma's tool-calling format

Not in this PR

  • Upgrade to openenv-core async v0.2.2+ (server + client .sync())
  • Update HF Space URLs to official openenv org (some scripts still point to personal spaces)

Required HF Space configuration

environment_factory creates N concurrent environment instances (one per generation), each opening a WebSocket to the server. By default, OpenEnv servers only allow 1 concurrent session, so the Spaces need to be configured for concurrency:

  1. In server/environment.py, declare concurrent session support:

     SUPPORTS_CONCURRENT_SESSIONS: bool = True

  2. In server/app.py, set the concurrency limit:

     app = create_app(
         create_my_environment,
         MyAction,
         MyObservation,
         max_concurrent_envs=64,  # must be >= generation_batch_size
     )

max_concurrent_envs should be >= per_device_train_batch_size * gradient_accumulation_steps. For example, with gradient_accumulation_steps=64 and batch size 1, you need at least 64 concurrent sessions.

This applies to all Spaces used with environment_factory: echo, wordle, catch, sudoku, browsergym_llm, carla.
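The sizing rule above is simple arithmetic, but it is easy to get wrong when tuning batch settings; a quick sanity check using the example's assumed config values:

```python
# Values from the example above (assumed, not read from a real config file).
per_device_train_batch_size = 1
gradient_accumulation_steps = 64

# One generation batch opens this many concurrent WebSocket sessions,
# each backed by its own environment instance on the server.
required_sessions = per_device_train_batch_size * gradient_accumulation_steps

max_concurrent_envs = 64  # the value set in server/app.py above
assert max_concurrent_envs >= required_sessions
```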

@sergiopaniego sergiopaniego marked this pull request as ready for review March 13, 2026 14:47
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

# When tools are provided, stop generation after the first tool call so the tool-calling
# loop can execute the tool and feed the result back before the next generation.
if tools and "stop" not in generation_kwargs:
    generation_kwargs["stop"] = ["</tool_call>"]
Member

Why do you need this? The model should output an EOS right after </tool_call>. (Btw, we have no guarantee that all models use the string </tool_call>, so we can't use it here.)

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


"id": "0f9RqHh7KyJl"
},
"metadata": {},
"source": "# @title SudokuEnv class (click to expand)\nimport re\nfrom collections import defaultdict\n\nfrom textarena_env import TextArenaAction, TextArenaEnv\n\n\ndef _is_valid_board_state(board_str: str) -> bool:\n return \"R1\" in board_str and \"R9\" in board_str and \"|\" in board_str\n\n\ndef _parse_board(board_str: str) -> list[list[int]]:\n grid = [[0] * 9 for _ in range(9)]\n if not _is_valid_board_state(board_str):\n return grid\n for line in board_str.split(\"\\n\"):\n line_stripped = line.strip()\n if line_stripped and line_stripped[0] == \"R\" and len(line_stripped) > 1 and line_stripped[1].isdigit():\n row = int(line_stripped[1]) - 1\n cell_part = line_stripped[2:]\n col = 0\n for char in cell_part:\n if char == \".\":\n grid[row][col] = 0\n col += 1\n elif char.isdigit():\n grid[row][col] = int(char)\n col += 1\n return grid\n\n\ndef _count_filled_cells(board_str: str) -> int:\n if not _is_valid_board_state(board_str):\n return 0\n grid = _parse_board(board_str)\n return sum(1 for row in grid for cell in row if cell != 0)\n\n\ndef _get_valid_numbers(grid: list[list[int]], row: int, col: int) -> set[int]:\n if grid[row][col] != 0:\n return set()\n used = set()\n for c in range(9):\n if grid[row][c] != 0:\n used.add(grid[row][c])\n for r in range(9):\n if grid[r][col] != 0:\n used.add(grid[r][col])\n box_row, box_col = 3 * (row // 3), 3 * (col // 3)\n for r in range(box_row, box_row + 3):\n for c in range(box_col, box_col + 3):\n if grid[r][c] != 0:\n used.add(grid[r][c])\n return set(range(1, 10)) - used\n\n\ndef _extract_empty_cells_with_candidates(board_str: str, sort_by_difficulty: bool = True):\n grid = _parse_board(board_str)\n cells_with_candidates = []\n for row in range(9):\n for col in range(9):\n if grid[row][col] == 0:\n candidates = _get_valid_numbers(grid, row, col)\n cells_with_candidates.append((row + 1, col + 1, candidates))\n if sort_by_difficulty:\n cells_with_candidates.sort(key=lambda x: len(x[2]))\n return 
cells_with_candidates\n\n\ndef _extract_empty_cells(board_str: str) -> list[tuple[int, int]]:\n empty_cells = []\n if not _is_valid_board_state(board_str):\n return empty_cells\n for line in board_str.split(\"\\n\"):\n line_stripped = line.strip()\n if line_stripped and line_stripped[0] == \"R\" and len(line_stripped) > 1 and line_stripped[1].isdigit():\n row = int(line_stripped[1])\n cell_part = line_stripped[2:]\n col = 0\n for char in cell_part:\n if char == \".\":\n col += 1\n empty_cells.append((row, col))\n elif char.isdigit():\n col += 1\n return empty_cells\n\n\ndef _extract_board_only(text: str) -> str:\n if not text:\n return \"\"\n lines = text.split(\"\\n\")\n board_lines = []\n in_board = False\n for line in lines:\n stripped = line.strip()\n if stripped.startswith(\"C1\") or (\n stripped and stripped[0] == \"R\" and len(stripped) > 1 and stripped[1].isdigit()\n ):\n in_board = True\n if in_board and (stripped.startswith(\"-\") or stripped.startswith(\"R\") or stripped.startswith(\"C1\")):\n board_lines.append(line)\n elif (\n in_board\n and stripped\n and not stripped.startswith(\"-\")\n and not (stripped[0] == \"R\" and len(stripped) > 1 and stripped[1].isdigit())\n ):\n break\n return \"\\n\".join(board_lines) if board_lines else \"\"\n\n\ndef _make_compact_prompt(board, step, successful_moves, failed_moves, difficulty=\"easy\"):\n cells_filled = len(successful_moves)\n summary = f\"Step {step}. 
Progress: {cells_filled} cells filled.\"\n board_only = _extract_board_only(board) if board else \"No board available.\"\n tried_moves_hint = \"\"\n all_tried = successful_moves + failed_moves\n if all_tried:\n tried_moves_hint = f\"\\n\\nMOVES ALREADY TRIED (do not repeat): {', '.join(all_tried)}\"\n hints = \"\"\n if difficulty == \"easy\" and board:\n cells_with_candidates = _extract_empty_cells_with_candidates(board, sort_by_difficulty=True)\n if cells_with_candidates:\n guaranteed = []\n other_hints = []\n for row, col, candidates in cells_with_candidates[:10]:\n if len(candidates) == 1:\n num = list(candidates)[0]\n guaranteed.append(f\"[{row} {col} {num}]\")\n elif len(candidates) <= 3:\n nums = \",\".join(str(n) for n in sorted(candidates))\n other_hints.append(f\"({row},{col})->{nums}\")\n if guaranteed:\n hints = f\"\\n\\nGUARANTEED MOVES: {', '.join(guaranteed[:5])}\"\n if other_hints:\n hints += f\"\\nOther options: {' | '.join(other_hints[:5])}\"\n elif difficulty == \"medium\" and board:\n cells_with_candidates = _extract_empty_cells_with_candidates(board, sort_by_difficulty=False)\n if cells_with_candidates:\n cell_hints = []\n for row, col, candidates in cells_with_candidates[:10]:\n nums = \",\".join(str(n) for n in sorted(candidates))\n cell_hints.append(f\"({row},{col})->{nums}\")\n if cell_hints:\n hints = f\"\\n\\nEmpty cells: {' | '.join(cell_hints)}\"\n return f\"{summary}\\n\\nBoard:\\n{board_only}{tried_moves_hint}{hints}\\n\\nYour move:\"\n\n\nclass SudokuEnv:\n def __init__(self):\n self.client = TextArenaEnv(base_url=\"https://openenv-sudoku.hf.space\")\n self.difficulty = \"easy\"\n self.max_turns = 100\n self._turn = 0\n self._move_counts = defaultdict(int)\n self._successful_moves = []\n self._failed_moves = []\n self._valid_move_scores = []\n self._empty_cell_scores = []\n self._correct_scores = []\n self._repetition_scores = []\n self._last_board_state = \"\"\n self._last_full_content = \"\"\n self._initial_filled = 0\n 
self._max_filled = 0\n\n def reset(self, **kwargs) -> str | None:\n result = self.client.reset()\n observation = result.observation\n self.done = False\n self._turn = 0\n self._move_counts = defaultdict(int)\n self._successful_moves = []\n self._failed_moves = []\n self._valid_move_scores = []\n self._empty_cell_scores = []\n self._correct_scores = []\n self._repetition_scores = []\n self._last_board_state = \"\"\n self._initial_filled = 0\n self._max_filled = 0\n\n # Store full message content for diffing (messages are cumulative)\n self._last_full_content = observation.messages[0].content if observation.messages else \"\"\n\n for message in observation.messages:\n if message.content and _is_valid_board_state(message.content):\n self._last_board_state = message.content\n self._initial_filled = _count_filled_cells(self._last_board_state)\n self._max_filled = self._initial_filled\n break\n\n return _make_compact_prompt(\n board=self._last_board_state,\n step=1,\n successful_moves=[],\n failed_moves=[],\n difficulty=self.difficulty,\n )\n\n def place(self, row: int, col: int, number: int) -> str:\n \"\"\"Place a number on the Sudoku board.\n\n Args:\n row: Row number (1-9).\n col: Column number (1-9).\n number: The digit to place (1-9).\n\n Returns:\n The updated board state or feedback from the environment.\n \"\"\"\n if self.done:\n raise ValueError(\"Game over.\")\n\n self._turn += 1\n move = f\"[{row} {col} {number}]\"\n\n # Step environment\n result = self.client.step(TextArenaAction(message=move))\n observation = result.observation\n correct_score = float(result.reward or 0.0)\n self.done = result.done\n\n # Only check the NEW content for feedback (messages are cumulative)\n full_content = observation.messages[0].content if observation.messages else \"\"\n new_content = full_content[len(self._last_full_content):]\n self._last_full_content = full_content\n\n new_content_lower = new_content.lower()\n valid_move = not any(\n kw in new_content_lower for kw in 
[\"invalid\", \"error\", \"cannot\", \"already\", \"violation\", \"lost\"]\n )\n got_warning = \"please resubmit\" in new_content_lower or \"avoid penalties\" in new_content_lower\n\n # Update board state from new content\n board_state = \"\"\n if full_content and \"|\" in full_content and \"R1\" in full_content:\n board_state = full_content\n\n # Empty cell score\n if self._last_board_state and move:\n empty_cells = _extract_empty_cells(self._last_board_state)\n targets_empty = (row, col) in empty_cells\n empty_cell_score = 1.0 if targets_empty else -1.0\n else:\n empty_cell_score = 0.0\n\n # Repetition score\n is_new_move = self._move_counts[move] == 0\n repetition_count = self._move_counts[move]\n self._move_counts[move] += 1\n if repetition_count > 0:\n repetition_score = -min(2 ** (repetition_count - 1), 10.0)\n else:\n repetition_score = 0.0\n\n # Valid move score\n if valid_move and is_new_move:\n valid_move_score = 1.0\n if move:\n self._successful_moves.append(move)\n elif got_warning:\n valid_move_score = -0.5\n if move:\n self._failed_moves.append(move)\n else:\n valid_move_score = 0.0\n\n # Update board state\n if board_state and _is_valid_board_state(board_state):\n self._last_board_state = board_state\n current_filled = _count_filled_cells(self._last_board_state)\n if current_filled > self._max_filled:\n self._max_filled = current_filled\n\n self._valid_move_scores.append(valid_move_score)\n self._empty_cell_scores.append(empty_cell_score)\n self._correct_scores.append(correct_score)\n self._repetition_scores.append(repetition_score)\n\n # Return a compact prompt for the next turn\n return _make_compact_prompt(\n board=self._last_board_state,\n step=self._turn + 1,\n successful_moves=self._successful_moves,\n failed_moves=self._failed_moves,\n difficulty=self.difficulty,\n )\n\n # ── Reward properties ──\n\n @property\n def correct_reward(self) -> float:\n return self._correct_scores[-1] if self._correct_scores else 0.0\n\n @property\n def 
valid_move_reward(self) -> float:\n return sum(self._valid_move_scores) / len(self._valid_move_scores) if self._valid_move_scores else 0.0\n\n @property\n def empty_cell_reward(self) -> float:\n return sum(self._empty_cell_scores) / len(self._empty_cell_scores) if self._empty_cell_scores else 0.0\n\n @property\n def repetition_reward(self) -> float:\n return sum(self._repetition_scores) / len(self._repetition_scores) if self._repetition_scores else 0.0\n\n @property\n def progress_reward(self) -> float:\n remaining_to_fill = 81 - self._initial_filled\n if remaining_to_fill > 0:\n return (self._max_filled - self._initial_filled) / remaining_to_fill\n return 1.0",

Notebook uses cumulative content instead of new content for board state

Medium Severity

In the notebook's SudokuEnv.place() method, board_state is assigned from full_content (the entire cumulative game history) instead of new_content (the delta since the last turn). The corresponding sudoku.py script correctly uses new_content. This causes self._last_board_state to hold the full game text, which makes _extract_board_only() return the initial board (not the current one) and _extract_empty_cells() return cells from all historical boards. As a result, the hints shown to the model are stale, and the empty_cell_reward is incorrectly permissive (cells filled in earlier turns still appear as empty).


Comment on lines +1510 to +1511
if getattr(env, "_done", False) or getattr(env, "done", False):
    continue
Member

This feels like extending the assumptions we make about what an environment is. To ensure max compatibility, we should aim for a definition of an environment as minimal as possible, so unless this is strictly necessary I’d prefer not to introduce additional structure.

In particular, I don’t think we need special handling for done. The environment can simply include that signal in the observation it returns, and the model should learn not to query the environment again after that point.

If respecting done is important, the environment can raise an exception when it is called after termination. That makes the constraint explicit and helps the model learn that calling the environment again in that state is invalid.

Comment on lines +1522 to +1544
for i in range(len(tool_mask)):
    comp_len = len(completion_ids[i])
    mask_len = len(tool_mask[i])
    if mask_len > comp_len:
        tool_mask[i] = tool_mask[i][:comp_len]
    elif mask_len < comp_len:
        tool_mask[i] = tool_mask[i] + [1] * (comp_len - mask_len)
if logprobs is not None:
    for i in range(len(logprobs)):
        comp_len = len(completion_ids[i])
        lp_len = len(logprobs[i])
        if lp_len > comp_len:
            logprobs[i] = logprobs[i][:comp_len]
        elif lp_len < comp_len:
            logprobs[i] = logprobs[i] + [0.0] * (comp_len - lp_len)

# Truncate to max_completion_length to prevent size mismatches downstream
for i in range(len(completion_ids)):
    if len(completion_ids[i]) > self.max_completion_length:
        completion_ids[i] = completion_ids[i][: self.max_completion_length]
        tool_mask[i] = tool_mask[i][: self.max_completion_length]
        if logprobs is not None:
            logprobs[i] = logprobs[i][: self.max_completion_length]
Member

Everything here can be removed; we're addressing the issue in #5224.

Comment on lines +2339 to +2342
metrics = {}
for key, val in self._metrics[mode].items():
    avg = sum(val) / len(val)
    metrics[key] = None if math.isnan(avg) else avg
Member

Why do you need this?

3 participants