[RL] Improve Prime verifier integration by taivu1998 · Pull Request #4825 · marin-community/marin

taivu1998 · 2026-04-16T09:42:56Z

Add the env prepare lifecycle hook and an OpenAI-compatible verifier client, rewrite PrimeIntellectEnv for supported single-turn chat environments, and thread response loss masks through rollout batching. This makes Prime verifier training usable while preserving existing single-turn RL behavior.

taivu1998 · 2026-04-16T09:43:50Z

🤖 Problem: PrimeIntellectEnv was not reliable for Marin RL because setup happened inside sampling, verifier generation only cleanly matched one backend path, Prime rollouts used a custom result processor, and Marin had no first-class way to mask response tokens for future multi-turn verifier trajectories.

Approach: Add a generic environment prepare lifecycle hook, add an OpenAI-compatible verifier client path for inference contexts, rewrite PrimeIntellectEnv around an explicit Phase 1 support matrix for single-turn chat verifier environments, remove the Prime-only vLLM processor, and add response_loss_mask plumbing from Rollout through training-batch construction.

Key code: Prime setup and support gates live in lib/marin/src/marin/rl/environments/prime_intellect_env.py, the verifier client contract lives in lib/marin/src/marin/rl/environments/inference_ctx/base.py and lib/marin/src/marin/rl/environments/inference_ctx/openai_compat.py, and the generic loss-mask threading lives in lib/marin/src/marin/rl/types.py and lib/marin/src/marin/rl/train_batch.py.

Tests: Verified with ./infra/pre-commit.py --all-files --fix, UV_CACHE_DIR=/tmp/uv-cache-pr-push uv run pytest tests/rl -m 'not slow' -q, and focused RL slices covering PrimeIntellectEnv, inference contexts, train-batch masking, replay/storage, and rollout worker behavior.

This gives RL environments an explicit one-time setup lifecycle instead of pushing provisioning into sample-time code. It also defines a verifier-facing inference contract that supports chat prompts across backends while keeping unsupported OpenAI features explicit.

taivu1998 changed the title ~~[RL] Stage Prime verifier integration through Phase 1 and loss-mask plumbing~~ [RL] Improve Prime verifier integration Apr 16, 2026

taivu1998 marked this pull request as ready for review April 20, 2026 16:47

taivu1998 added 3 commits April 22, 2026 10:34

[rl] Rewrite Prime verifier env for Phase 1

227cade

[rl] Add rollout response loss masks

90c3bda

taivu1998 force-pushed the tdv/prime-env-pr1-prepare-hook branch from 9acdb27 to 90c3bda Compare April 22, 2026 17:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RL] Improve Prime verifier integration#4825

[RL] Improve Prime verifier integration#4825
taivu1998 wants to merge 3 commits intomarin-community:mainfrom
taivu1998:tdv/prime-env-pr1-prepare-hook

taivu1998 commented Apr 16, 2026 •

edited

Loading

Uh oh!

taivu1998 commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

taivu1998 commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taivu1998 commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

taivu1998 commented Apr 16, 2026 •

edited

Loading