[RL] Improve Prime verifier integration#4825
[RL] Improve Prime verifier integration#4825taivu1998 wants to merge 3 commits intomarin-community:mainfrom
Conversation
|
🤖 Problem: PrimeIntellectEnv was not reliable for Marin RL because setup happened inside sampling, verifier generation only cleanly matched one backend path, Prime rollouts used a custom result processor, and Marin had no first-class way to mask response tokens for future multi-turn verifier trajectories. Approach: Add a generic environment prepare lifecycle hook, add an OpenAI-compatible verifier client path for inference contexts, rewrite PrimeIntellectEnv around an explicit Phase 1 support matrix for single-turn chat verifier environments, remove the Prime-only vLLM processor, and add response_loss_mask plumbing from Rollout through training-batch construction. Key code: Prime setup and support gates live in lib/marin/src/marin/rl/environments/prime_intellect_env.py, the verifier client contract lives in lib/marin/src/marin/rl/environments/inference_ctx/base.py and lib/marin/src/marin/rl/environments/inference_ctx/openai_compat.py, and the generic loss-mask threading lives in lib/marin/src/marin/rl/types.py and lib/marin/src/marin/rl/train_batch.py. Tests: Verified with ./infra/pre-commit.py --all-files --fix, UV_CACHE_DIR=/tmp/uv-cache-pr-push uv run pytest tests/rl -m 'not slow' -q, and focused RL slices covering PrimeIntellectEnv, inference contexts, train-batch masking, replay/storage, and rollout worker behavior. |
This gives RL environments an explicit one-time setup lifecycle instead of pushing provisioning into sample-time code. It also defines a verifier-facing inference contract that supports chat prompts across backends while keeping unsupported OpenAI features explicit.
9acdb27 to
90c3bda
Compare
Add the env prepare lifecycle hook and an OpenAI-compatible verifier client, rewrite PrimeIntellectEnv for supported single-turn chat environments, and thread response loss masks through rollout batching. This makes Prime verifier training usable while preserving existing single-turn RL behavior.