Ahmed rl blog v2 by ahmeda14960 · Pull Request #12 · marin-community/marin-community.github.io

ahmeda14960 · 2026-03-10T21:21:55Z

No description provided.

Fix grammatical errors and typos in the intro paragraph (learns→learn, missing "is", its/it's, notoriety, October 2025, etc.) and add a collapsible dropdown summarizing prior JAX RL repos we evaluated and why none fit our needs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add markdown="1" so Kramdown renders markdown inside <details>
- Bold and enlarge the dropdown summary text with more whitespace
- Remove GitHub star counts from all library entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove AMA inline notes, fix typos, reframe speculative claims as hypotheses, remove non-RL resource contention bullet, add stable run evidence after vLLM fix, expand evaluation lessons with estimator details, and differentiate lesson 3 from lesson 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dlwh · 2026-05-21T22:28:46Z

+
+To build Marin RL, we needed more than "a PPO implementation in JAX." We needed LLM post-training rather than classic control, TPU-first execution, asynchronous actor/trainer separation, fast weight sync, high end-to-end throughput, custom reward/verifier logic for math and code, and checkpointing with restart on preemptible TPU jobs. One important constraint: it was much easier for us to use smaller preemptible TPU slices and many small inference workers than to reserve one large stable TPU job, which pushed us toward a looser worker-based design.
+
+**[Tunix](https://github.com/google/tunix)** — The closest open-source match to "LLM RL in JAX on TPUs." It supports PPO/GRPO-style methods, TPU execution, and checkpoint-and-resume. However, the async/disaggregated pieces only arrived incrementally through late September and October 2025, and in fall 2025 it did not yet look like a mature async RL base. Its disaggregated mode is also a fairly tight sub-mesh TPU job, whereas we wanted a looser worker-based async design with smaller preemptible slices.


I think it may also need Pathways/GKE, which we can't use. Can you check?

dlwh · 2026-05-21T22:37:59Z

@@ -148,3 +173,44 @@
 Thanks to Romain for driving forward the RL PR and suggesting to break it down into more manageable sub-PRs.
 Thanks to David for reviewing the bf16 weight transfer optimization PR and for giving numerous pointers and guidance during weekly meetings.
 And thanks to Percy for helping set goals and milestones for RL to keep us on course.


dlwh · 2026-05-21T22:39:12Z


 ## Acknowledgements

 Thanks to Ahmed for reviewing numerous RL PRs including loss improvements, weight transfer, Qwen support, logging, and codebase cleanup.


Maybe we should just promote the Marin folks to coauthors on this piece?

ahmeda14960 and others added 6 commits March 9, 2026 16:38

new updates

f13945a

better plots

5e173c4

final fixes

59900f9

dlwh approved these changes May 21, 2026


dlwh reviewed May 21, 2026


ahmeda14960 wants to merge 6 commits into kevin/rl-blog from ahmed_rl_blog_v2



