Ahmed rl blog v2 #12

Open
ahmeda14960 wants to merge 6 commits into kevin/rl-blog from ahmed_rl_blog_v2

Conversation

@ahmeda14960

Contributor

No description provided.

ahmeda14960 and others added 6 commits March 9, 2026 16:38
Fix grammatical errors and typos in the intro paragraph (learns→learn,
missing "is", its/it's, notoriety, October 2025, etc.) and add a
collapsible dropdown summarizing prior JAX RL repos we evaluated and
why none fit our needs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add markdown="1" so Kramdown renders markdown inside <details>
- Bold and enlarge the dropdown summary text with more whitespace
- Remove GitHub star counts from all library entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove AMA inline notes, fix typos, reframe speculative claims as
hypotheses, remove non-RL resource contention bullet, add stable run
evidence after vLLM fix, expand evaluation lessons with estimator
details, and differentiate lesson 3 from lesson 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

To build Marin RL, we needed more than "a PPO implementation in JAX." We needed LLM post-training rather than classic control, TPU-first execution, asynchronous actor/trainer separation, fast weight sync, high end-to-end throughput, custom reward/verifier logic for math and code, and checkpointing with restart on preemptible TPU jobs. One important constraint: it was much easier for us to use smaller preemptible TPU slices and many small inference workers than to reserve one large stable TPU job, which pushed us toward a looser worker-based design.

**[Tunix](https://github.com/google/tunix)** — The closest open-source match to "LLM RL in JAX on TPUs." It supports PPO/GRPO-style methods, TPU execution, and checkpoint-and-resume. However, the async/disaggregated pieces only arrived incrementally through late September and October 2025, and in fall 2025 it did not yet look like a mature async RL base. Its disaggregated mode is also a fairly tight sub-mesh TPU job, whereas we wanted a looser worker-based async design with smaller preemptible slices.

Member

I think it may also need Pathways/GKE, which we can't use. Can you check?

@@ -148,3 +173,44 @@
Thanks to Romain for driving the RL PR forward and suggesting that we break it down into more manageable sub-PRs.
Thanks to David for reviewing the bf16 weight transfer optimization PR and for giving numerous pointers and guidance during weekly meetings.
And thanks to Percy for helping set goals and milestones for RL to keep us on course.

Member

Thank TRC!


## Acknowledgements

Thanks to Ahmed for reviewing numerous RL PRs including loss improvements, weight transfer, Qwen support, logging, and codebase cleanup.

Member

Maybe we should just promote the Marin folks to coauthors on this piece?

3 participants