Use cases, pain points, and background
Building a new environment in NeMo Gym today requires copying an existing benchmark, understanding which parts to keep and which to replace, manually wiring YAML configs, and hoping you picked the right example to copy from. There is no guided path. The result is that:
- New contributors copy the wrong template. Someone building a judge-based benchmark might copy `example_single_tool_call` (which has a trivial `verify()`) instead of `math_with_judge` (which has the judge wiring they need). Nothing tells them which one to start from.
- Boilerplate is re-implemented across benchmarks. LLM-as-a-judge logic (calling a judge model server, prompt-template formatting, position-bias-aware answer swapping, response parsing) is independently implemented in `math_with_judge`, `equivalence_llm_judge`, `terminus_judge`, and others. The same goes for the multi-turn correction loops in the `proof_refinement_agent` and `multi_turn_agent` environments.
- YAML wiring is error-prone. Each new environment needs a config that correctly references agent servers, resources servers, and model servers (and optionally judge model servers), with the right `type` and `name` cross-references. Getting this wrong produces runtime errors that are hard to debug.
Description
Add a CLI command that scaffolds a new environment from composable options. Directional examples:

```
gym init env style=nemo verifier=judge agent=multistep
gym init env style=nemo verifier=rm agent=multiturn
gym init env style=nemo verifier=custom agent=custom
gym init env style=gymnasium
```
`style=nemo` - NeMo Gym environment style

Two composable flags:
`agent=` (interaction pattern)
- `multistep` (default) — scaffolds a resources server wired to `simple_agent`. The agent runs the tool-call loop; the user just implements tool endpoints and `verify()`.
- `multiturn` — scaffolds a resources server wired to `multi_turn_agent`. The agent runs the outer generate-verify-feedback loop; the user implements `verify()` with correction feedback.
- `custom` — scaffolds both a custom agent server and a resources server. Full control over the agent loop.
`verifier=` (verification strategy)
- `custom` (default) — empty `verify()` for the user to implement with their own logic.
- `judge` — `verify()` pre-wired with LLM-as-a-judge: judge model server reference in config, prompt template scaffolding, position-bias-aware dual evaluation, response parsing. Based on patterns from `math_with_judge` / `equivalence_llm_judge`.
- `rm` — `verify()` pre-wired to call a reward model server and return its score as the reward.
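To make the `verifier=judge` option concrete, here is a minimal sketch of what the pre-wired position-bias-aware dual evaluation could look like. All names here (`verify`, `parse_vote`, the prompt template, the injected `judge` callable) are illustrative assumptions, not NeMo Gym's actual API; a real scaffold would call the configured judge model server instead of taking a callable.

```python
# Hypothetical sketch: the judge is asked twice with answer positions
# swapped, and the two votes are averaged to cancel position bias.

JUDGE_PROMPT = (
    "Which answer better matches the reference?\n"
    "Reference: {reference}\nAnswer A: {a}\nAnswer B: {b}\n"
    "Reply with 'A' or 'B'."
)

def parse_vote(text: str) -> str:
    """Take the first A/B token in the judge's reply; default to 'B'."""
    for ch in text.strip().upper():
        if ch in ("A", "B"):
            return ch
    return "B"

def verify(response: str, reference: str, judge) -> float:
    """Score `response` in [0, 1] using two judge calls with swapped slots."""
    # Round 1: candidate response in slot A, reference in slot B.
    first = judge(JUDGE_PROMPT.format(reference=reference, a=response, b=reference))
    # Round 2: positions swapped, so a slot-A-biased judge cannot win both.
    second = judge(JUDGE_PROMPT.format(reference=reference, a=reference, b=response))
    score = 0.0
    if parse_vote(first) == "A":   # response occupied slot A in round 1
        score += 0.5
    if parse_vote(second) == "B":  # response occupied slot B in round 2
        score += 0.5
    return score
```

A judge that blindly prefers slot A scores 0.5 under this scheme rather than 1.0, which is the point of the dual evaluation.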
The CLI generates the full environment: `resources_servers/<name>/` with `app.py`, a config YAML, a `data/example.jsonl` placeholder, tests, and `requirements.txt`. For `agent=custom`, it also generates `responses_api_agents/<name>/`.
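The generation step itself can be sketched as a template-stamping function. The directory names mirror the list above; the file bodies and the `init_env` signature are placeholder assumptions for illustration, not the real generator.

```python
# Hypothetical scaffolder sketch: stamp a fixed template tree under
# resources_servers/<name>/, plus responses_api_agents/<name>/ for
# agent=custom. Template contents are stand-ins.
from pathlib import Path

SCAFFOLD = {
    "app.py": "# TODO: implement tool endpoints and verify()\n",
    "config.yaml": "# TODO: wire agent/resources/model server references\n",
    "data/example.jsonl": "{}\n",
    "tests/test_app.py": "def test_placeholder():\n    assert True\n",
    "requirements.txt": "",
}

def init_env(root: Path, name: str, agent: str = "multistep") -> list[Path]:
    """Create the environment skeleton; return every file written."""
    created = []
    targets = [root / "resources_servers" / name]
    if agent == "custom":
        # Custom agents also get their own agent-server package.
        targets.append(root / "responses_api_agents" / name)
    for base in targets:
        for rel, body in SCAFFOLD.items():
            path = base / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(body)
            created.append(path)
    return created
```

In a real implementation the `verifier=` and `agent=` choices would select different `app.py` and config templates rather than a single fixed dict.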
`style=gymnasium` - Classic Gymnasium-compatible APIs

For users coming from the OpenAI Gym / Gymnasium ecosystem who expect `reset()` / `step()` / reward semantics. Scaffolds an environment class with the familiar APIs, with NeMo Gym handling the translation to its own architecture behind the scenes.
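As a rough shape of what the `style=gymnasium` scaffold could emit, here is a toy environment class exposing the Gymnasium-style `reset()`/`step()` tuples. The class name, the counter logic, and the `_observe` hook are invented for illustration; the real scaffold would route observations and rewards through NeMo Gym's servers.

```python
# Hedged sketch of a style=gymnasium scaffold output. The toy dynamics
# (count to 3) exist only so the API shape is runnable; a generated class
# would delegate to NeMo Gym behind these methods.

class ScaffoldedEnv:
    def reset(self, seed=None):
        self._state = 0
        return self._observe(), {}  # Gymnasium: (observation, info)

    def step(self, action):
        self._state += action
        terminated = self._state >= 3
        reward = 1.0 if terminated else 0.0
        # Gymnasium: (observation, reward, terminated, truncated, info)
        return self._observe(), reward, terminated, False, {}

    def _observe(self):
        # Placeholder: the real scaffold would pull observations from the
        # resources server here.
        return self._state
```

Note the 5-tuple `step()` return (with separate `terminated`/`truncated` flags) follows the modern Gymnasium API rather than the legacy 4-tuple Gym API.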
Design
What files should be touched? What logic should be written?
Out of scope
What are some items that this issue could be mistaken to cover that this issue should explicitly NOT cover?
Acceptance Criteria