Ahmed rl blog v2#12
Open
ahmeda14960 wants to merge 6 commits into
Fix grammatical errors and typos in the intro paragraph (learns→learn, missing "is", its/it's, notoriety, October 2025, etc.) and add a collapsible dropdown summarizing prior JAX RL repos we evaluated and why none fit our needs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add markdown="1" so Kramdown renders markdown inside <details>
- Bold and enlarge the dropdown summary text with more whitespace
- Remove GitHub star counts from all library entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove AMA inline notes, fix typos, reframe speculative claims as hypotheses, remove non-RL resource contention bullet, add stable run evidence after vLLM fix, expand evaluation lessons with estimator details, and differentiate lesson 3 from lesson 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dlwh
approved these changes
May 21, 2026
| To build Marin RL, we needed more than "a PPO implementation in JAX." We needed LLM post-training rather than classic control, TPU-first execution, asynchronous actor/trainer separation, fast weight sync, high end-to-end throughput, custom reward/verifier logic for math and code, and checkpointing with restart on preemptible TPU jobs. One important constraint: it was much easier for us to use smaller preemptible TPU slices and many small inference workers than to reserve one large stable TPU job, which pushed us toward a looser worker-based design. | ||
| **[Tunix](https://github.com/google/tunix)** — The closest open-source match to "LLM RL in JAX on TPUs." It supports PPO/GRPO-style methods, TPU execution, and checkpoint-and-resume. However, the async/disaggregated pieces only arrived incrementally through late September and October 2025, and in fall 2025 it did not yet look like a mature async RL base. Its disaggregated mode is also a fairly tight sub-mesh TPU job, whereas we wanted a looser worker-based async design with smaller preemptible slices. |
Member
I think it may also need Pathways/GKE, which we can't use. Can you check?
dlwh
reviewed
May 21, 2026
| @@ -148,3 +173,44 @@ | |||
| Thanks to Romain for driving forward the RL PR and suggesting to break it down into more manageable sub-PRs. | |||
| Thanks to David for reviewing the bf16 weight transfer optimization PR and for giving numerous pointers and guidance during weekly meetings. | |||
| And thanks to Percy for helping set goals and milestones for RL to keep us on course. | |||
dlwh
reviewed
May 21, 2026
| ## Acknowledgements | ||
| Thanks to Ahmed for reviewing numerous RL PRs including loss improvements, weight transfer, Qwen support, logging, and codebase cleanup. |
Member
Maybe we should just promote the Marin folks to coauthors on this piece?
No description provided.