
[RL] Improve Prime verifier integration#4825

Open
taivu1998 wants to merge 3 commits into marin-community:main from taivu1998:tdv/prime-env-pr1-prepare-hook


taivu1998 commented Apr 16, 2026

Add the env prepare lifecycle hook and an OpenAI-compatible verifier client, rewrite PrimeIntellectEnv for supported single-turn chat environments, and thread response loss masks through rollout batching. This makes Prime verifier training usable while preserving existing single-turn RL behavior.

Author

🤖 Problem: PrimeIntellectEnv was not reliable for Marin RL for four reasons: setup happened inside sampling, verifier generation cleanly matched only one backend path, Prime rollouts used a custom result processor, and Marin had no first-class way to mask response tokens for future multi-turn verifier trajectories.

Approach: Add a generic environment prepare lifecycle hook, add an OpenAI-compatible verifier client path for inference contexts, rewrite PrimeIntellectEnv around an explicit Phase 1 support matrix for single-turn chat verifier environments, remove the Prime-only vLLM processor, and add response_loss_mask plumbing from Rollout through training-batch construction.
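For readers who have not opened the diff, the prepare lifecycle hook described above might look roughly like the sketch below. This is an illustration only: `Env`, `ensure_prepared`, and `ToyVerifierEnv` are hypothetical names chosen for the example and are not taken from the Marin code. The point is that provisioning runs once, before sampling, and that the default hook is a no-op so existing environments keep working.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Env(ABC):
    """Hypothetical base class; not Marin's actual environment API."""

    def __init__(self) -> None:
        self._prepared = False

    def prepare(self) -> None:
        """One-time setup hook. No-op by default so existing envs are unaffected."""

    def ensure_prepared(self) -> None:
        # Idempotent guard: prepare() runs at most once per instance.
        if not self._prepared:
            self.prepare()
            self._prepared = True

    @abstractmethod
    def sample(self, n: int) -> list[str]:
        ...


class ToyVerifierEnv(Env):
    """Stand-in for an env whose setup used to happen inside sample()."""

    def __init__(self) -> None:
        super().__init__()
        self.prepare_calls = 0
        self.dataset: list[str] = []

    def prepare(self) -> None:
        # Expensive provisioning (installs, dataset loading) would live here.
        self.prepare_calls += 1
        self.dataset = ["2+2=?", "3*3=?"]

    def sample(self, n: int) -> list[str]:
        # Sampling no longer provisions anything; it only checks the guard.
        if not self._prepared:
            raise RuntimeError("call ensure_prepared() before sample()")
        return [self.dataset[i % len(self.dataset)] for i in range(n)]
```

A worker would call `env.ensure_prepared()` once at startup and `env.sample(...)` afterwards, so repeated sampling never re-runs setup.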

Key code: Prime setup and support gates live in lib/marin/src/marin/rl/environments/prime_intellect_env.py, the verifier client contract lives in lib/marin/src/marin/rl/environments/inference_ctx/base.py and lib/marin/src/marin/rl/environments/inference_ctx/openai_compat.py, and the generic loss-mask threading lives in lib/marin/src/marin/rl/types.py and lib/marin/src/marin/rl/train_batch.py.
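As a rough illustration of the loss-mask threading mentioned above, the sketch below carries an optional per-token response mask from a rollout into a padded training-row mask. Only the name `response_loss_mask` comes from this PR; the `Rollout` fields, `build_loss_mask`, and the padding convention are assumptions made for the example and may differ from what `types.py` and `train_batch.py` actually do.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical, simplified rollout record (not Marin's actual type)."""

    prompt_tokens: list[int]
    response_tokens: list[int]
    # None is assumed to mean "train on every response token",
    # which preserves the existing single-turn behavior.
    response_loss_mask: list[float] | None = None


def build_loss_mask(rollout: Rollout, max_len: int) -> list[float]:
    """Return a mask of length max_len: 0 for prompt and padding positions."""
    n_resp = len(rollout.response_tokens)
    resp_mask = rollout.response_loss_mask
    if resp_mask is None:
        resp_mask = [1.0] * n_resp
    if len(resp_mask) != n_resp:
        raise ValueError("response_loss_mask must match response_tokens in length")
    # Prompt tokens never contribute to the loss.
    mask = [0.0] * len(rollout.prompt_tokens) + list(resp_mask)
    # Truncate to max_len, then right-pad with zeros.
    mask = mask[:max_len]
    return mask + [0.0] * (max_len - len(mask))
```

With no mask supplied, every response token is trained on. A multi-turn trajectory could instead zero out the positions that hold environment or tool text interleaved in the response.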

Tests: Verified with ./infra/pre-commit.py --all-files --fix, UV_CACHE_DIR=/tmp/uv-cache-pr-push uv run pytest tests/rl -m 'not slow' -q, and focused RL slices covering PrimeIntellectEnv, inference contexts, train-batch masking, replay/storage, and rollout worker behavior.

taivu1998 changed the title from "[RL] Stage Prime verifier integration through Phase 1 and loss-mask plumbing" to "[RL] Improve Prime verifier integration" on Apr 16, 2026
taivu1998 marked this pull request as ready for review on April 20, 2026 16:47
This gives RL environments an explicit one-time setup lifecycle instead of pushing provisioning into sample-time code. It also defines a verifier-facing inference contract that supports chat prompts across backends while keeping unsupported OpenAI features explicit.
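One way to keep unsupported OpenAI features explicit, as described above, is to reject them up front instead of silently dropping them. The sketch below is an assumption about the shape of such a check. The supported parameter set and the name `validate_chat_request` are hypothetical and do not reflect which features the real `openai_compat.py` accepts.

```python
from __future__ import annotations

# Hypothetical allow-list; the real support matrix may differ.
SUPPORTED_PARAMS = frozenset({"temperature", "max_tokens", "n", "stop"})
VALID_ROLES = frozenset({"system", "user", "assistant"})


def validate_chat_request(messages: list[dict], **params) -> dict:
    """Validate a chat-style request, failing loudly on unsupported options."""
    unsupported = sorted(set(params) - SUPPORTED_PARAMS)
    if unsupported:
        # Explicit failure instead of quietly ignoring the option.
        raise NotImplementedError(f"unsupported parameters: {unsupported}")
    for message in messages:
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("message content must be a string")
    return {"messages": list(messages), **params}
```

Failing at request construction surfaces a backend mismatch at the call site instead of as a confusing reward or decoding error later in training.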
taivu1998 force-pushed the tdv/prime-env-pr1-prepare-hook branch from 9acdb27 to 90c3bda on April 22, 2026 17:44
