
add x-token alignment foundation and arrow dataset wiring#2253

Open
avenkateshha wants to merge 1 commit into NVIDIA-NeMo:main from avenkateshha:xtoken/stack-pr1-foundation

Conversation

@avenkateshha

What does this PR do?

Port the baseline x-token alignment utilities and hook up arrow_text dataset/config support needed for cross-tokenizer off-policy distillation on current main.

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
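A minimal sketch of enabling the new arrow_text dataset, modeled on the config quoted later in this review thread (the path here is illustrative, not a real location):

```yaml
data:
  dataset_name: "arrow_text"
  # Glob over local Arrow shard files; replace with your own shard path.
  arrow_files: "/path/to/shards/data-*.arrow"
```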

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...


Made-with: Cursor
@copy-pr-bot

copy-pr-bot bot commented Apr 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@avenkateshha avenkateshha requested a review from a team as a code owner April 12, 2026 19:01
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 14, 2026
# --teacher-model Qwen/Qwen3-8B-Base \
# --initial-projection-path cross_tokenizer_data/transformation_counts_via_multitoken.pt

token_aligner:
Contributor


What if we move the token_aligner section below the distillation part, so that it is clear at first glance that this is a distillation config?


data:
  dataset_name: "arrow_text"
  arrow_files: "/lustre/fsw/portfolios/llmservice/users/sdiao/data/climb_nm5.5_phase3_400b_shuffled_text_only_global_shuffle/data-00[0-4][0-9][0-9]-of-02476.arrow"
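As an aside on the arrow_files glob above: the bracket pattern data-00[0-4][0-9][0-9]-of-02476 selects shards 00000 through 00499, i.e. the first 500 of the 2476 shards. A quick stdlib check (shard names are generated here for illustration, not read from disk):

```python
import fnmatch

# Shard-name pattern from the config above: [0-4][0-9][0-9] covers 000-499.
pattern = "data-00[0-4][0-9][0-9]-of-02476.arrow"

# Generate all 2476 shard names and filter with the glob.
names = [f"data-{i:05d}-of-02476.arrow" for i in range(2476)]
selected = fnmatch.filter(names, pattern)

print(len(selected))              # 500
print(selected[0], selected[-1])  # data-00000-of-02476.arrow data-00499-of-02476.arrow
```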
Contributor


If I understand correctly, the .arrow files serve as the local cache that Hugging Face datasets writes when a dataset is downloaded via the load_dataset function. Could we use the official Hugging Face dataset directly instead of relying on these local Arrow cache files? That way, users could conveniently reproduce the exact configuration and ensure consistency across datasets.

WDYT about this? @yuki-97

Comment thread on nemo_rl/algorithms/x_token/minimal_projection_generator.py
@chtruong814 chtruong814 added waiting-on-customer Waiting on the original author to respond and removed needs-follow-up Issue needs follow-up labels Apr 17, 2026

Labels

community-request waiting-on-customer Waiting on the original author to respond


4 participants