Adding EOS Tokens to Qwen Models #2512

ariG23498 · 2025-03-18T15:45:20Z

Current status 👇

from torchtune.models.qwen2 import Qwen2Tokenizer
from torchtune.data import Message

messages = [
    Message(role="user", content="Hello world!", masked=True),
    Message(role="assistant", content="How are you?", masked=False),
]

tokenizer = Qwen2Tokenizer(
    path="Qwen2-0.5B-Instruct/vocab.json",
    merges_file="Qwen2-0.5B-Instruct/merges.txt",
)
tokenized_text = tokenizer.tokenize_messages(messages)
print(tokenized_text)


# (
# [None, 151644, 872, 198, 9707, 1879, 0, 151645, 198, 151644, 77091, 198, 4340, 525, 498, 30, 151645, 198, 151645],
# [True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False, True]
# )

I'm not sure why None shows up as the first token. Am I missing something obvious here? Any help would be great.

After the initial review, I will add tests.

pytorch-bot · 2025-03-18T15:45:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2512

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Cancelled Jobs, 1 Unrelated Failure

As of commit c82effe with merge base 20bdf10:

NEW FAILURES - The following jobs have failed:

Build Docs / build_docs (3.11) (gh)
Process completed with exit code 2.
GPU tests / gpu_test (3.11, stable) (gh)
Process completed with exit code 2.
Unit Test / unit_tests (3.11) (gh)
Process completed with exit code 2.

CANCELLED JOBS - The following jobs were cancelled. Please retry:

GPU tests / gpu_test (3.9, stable) (gh)
Process completed with exit code 2.
Unit Test / unit_tests (3.10) (gh)
##[error]The operation was canceled.
Unit Test / unit_tests (3.9) (gh)
##[error]The operation was canceled.

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

GPU tests / gpu_test (3.10, stable) (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ariG23498 · 2025-04-28T06:07:27Z

A gentle ping on this.

joecummings · 2025-04-28T16:10:08Z

Thanks for the ping, sorry this fell through the cracks!

The None comes from the misuse of bos_id here. bos_id is not actually defined for Qwen2, so it shows up as None.
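To illustrate the explanation above, here is a minimal sketch (not torchtune's actual code) of how an undefined BOS id can end up in a token list. `SketchTokenizer` and its two methods are hypothetical names invented for this example, and the token ids are simply copied from the output earlier in the thread.

```python
from typing import List, Optional


class SketchTokenizer:
    """Hypothetical stand-in for a tokenizer whose BOS token is undefined."""

    def __init__(self, bos_id: Optional[int], eos_id: int) -> None:
        self.bos_id = bos_id
        self.eos_id = eos_id

    def wrap_unguarded(self, tokens: List[int]) -> List[Optional[int]]:
        # Prepends bos_id unconditionally, so an undefined BOS leaks in as None.
        return [self.bos_id] + tokens + [self.eos_id]

    def wrap_guarded(self, tokens: List[int]) -> List[int]:
        # Only prepends bos_id when the tokenizer actually defines one.
        prefix = [] if self.bos_id is None else [self.bos_id]
        return prefix + tokens + [self.eos_id]


# Assumption: 151645 is used as the end id only because it is the final
# token in the output posted above.
tok = SketchTokenizer(bos_id=None, eos_id=151645)
unguarded = tok.wrap_unguarded([9707, 1879, 0])
guarded = tok.wrap_guarded([9707, 1879, 0])
print(unguarded)
print(guarded)
```

The guarded variant is only one possible fix; dropping the BOS step entirely for tokenizers that have no BOS token would work just as well.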

ariG23498 · 2025-04-29T02:35:28Z

Interesting.

Can I get a quick review on the changes I have made so far, just to be sure about my implementation.

joecummings · 2025-04-30T20:21:04Z

> Interesting.
>
> Can I get a quick review on the changes I have made so far, just to be sure about my implementation.

Code looks nice and clean and I don't see any obvious errors (with the exception of the bos_id thing I already pointed out to you). Of course, the real test will be the unit tests :)

ysurs · 2025-05-04T12:18:46Z

Hey @ariG23498, @joecummings, I am new to torchtune and OSS, and I was thinking of taking Qwen2_5. Will this PR solve the issue for both Qwen2 and Qwen2_5?

I think qwen2_5's tokenizer inherits from qwen2's, so there is no need to implement this separately for qwen2_5?

ariG23498 · 2025-05-06T10:44:16Z

@joecummings I think the PR is in a good state to review. The tests pass on my end.

@ysurs I checked the Qwen 2.5 tokenizer. It does inherit from Qwen 2's tokenizer, but I think the edits still need to be made there, since the tokenizer API will have changed.
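One way to read the point about inheritance: if a subclass overrides a method whose parent signature has gained a new keyword argument, the subclass does not pick up that argument automatically. The sketch below uses hypothetical class names (`BaseTokenizer`, `StaleChild`, `UpdatedChild`) and a simplified signature. It assumes the subclass overrides `tokenize_messages` and is not a copy of the torchtune classes.

```python
from typing import List


class BaseTokenizer:
    """Hypothetical parent whose method has gained an add_end_tokens flag."""

    eos_id = 2  # arbitrary id chosen for the example

    def tokenize_messages(self, ids: List[int], add_end_tokens: bool = True) -> List[int]:
        return ids + [self.eos_id] if add_end_tokens else list(ids)


class StaleChild(BaseTokenizer):
    """Overrides the method with the old signature, so the new flag is rejected."""

    def tokenize_messages(self, ids: List[int]) -> List[int]:
        return ids + [self.eos_id]


class UpdatedChild(BaseTokenizer):
    """Overrides the method and forwards the new flag to the parent."""

    def tokenize_messages(self, ids: List[int], add_end_tokens: bool = True) -> List[int]:
        return super().tokenize_messages(ids, add_end_tokens=add_end_tokens)


try:
    StaleChild().tokenize_messages([5, 6], add_end_tokens=False)
    stale_ok = True
except TypeError:
    # The stale override does not accept the new keyword argument.
    stale_ok = False

updated = UpdatedChild().tokenize_messages([5, 6], add_end_tokens=False)
print(stale_ok, updated)
```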

codecov-commenter · 2025-05-07T23:14:37Z

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 60.70%. Comparing base (fb6d4cd) to head (3d979b7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| torchtune/models/qwen2/_tokenizer.py | 93.33% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2512      +/-   ##
==========================================
- Coverage   64.26%   60.70%   -3.56%     
==========================================
  Files         427      427              
  Lines       25988    25997       +9     
==========================================
- Hits        16700    15781     -919     
- Misses       9288    10216     +928
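The headline percentages in the report can be reproduced from the Hits and Lines rows. The short check below is only a sanity calculation on the numbers shown above; `coverage_pct` is a helper name made up for this example.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return coverage as a percentage rounded to two decimals."""
    return round(100 * hits / lines, 2)


# Hits and Lines are taken from the "main" and "#2512" columns of the report.
base = coverage_pct(hits=16700, lines=25988)
head = coverage_pct(hits=15781, lines=25997)
delta = round(head - base, 2)
print(base, head, delta)
```

The head column has 919 fewer hits even though the PR adds only 9 lines. That may reflect test jobs that did not run on the head commit rather than the patch itself, but this is an inference and the report does not say so.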

ariG23498 · 2025-05-08T16:03:24Z

Hi @joecummings,

The PR had missing and incorrect docstrings, which are now fixed. Can you trigger the tests, so that I can be certain about the changes?

joecummings · 2025-05-08T16:47:57Z

> Hi @joecummings,
>
> The PR had missing and incorrect docstrings, which are now fixed. Can you trigger the tests, so that I can be certain about the changes?

If you look at the contributing guide, you should be able to run these tests very quickly on your local machine to ensure that you're passing the linter.

ebsmothers

Thanks for the PR @ariG23498! Two small comments, but other than that this looks good to me

tests/torchtune/models/qwen2/test_qwen2_tokenizer.py

ebsmothers · 2025-05-09T18:08:36Z

torchtune/models/qwen2/_tokenizer.py

-            mask.append(mask[-1])
+        if add_end_tokens:
+            tokens = tokens + [self.eos_id]
+            mask = mask + [mask[-1] if mask else True]


Noob q: why is this different than what we're doing for Llama3? ref

ariG23498 · 2025-05-13

If we always added True to the mask for the end token, this part of the test would fail:

torchtune/tests/torchtune/models/qwen2/test_qwen2_tokenizer.py

Line 65 in fb6d4cd

expected_mask = [True] * 67 + [False] * 121

because the expected mask ends in False.

Thanks for catching this. I wanted to ask about the test itself, but forgot.

Do you think I should change this test?
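The difference between the two choices discussed in this review thread can be seen on a toy mask. This is a sketch under assumptions: `EOS_ID`, `add_eos_inherit`, and `add_eos_always_masked` are names invented here, and the five-token input stands in for the much longer sequence in the real test.

```python
from typing import List, Tuple

EOS_ID = 151645  # assumption: the final token id from the output earlier in the thread


def add_eos_inherit(tokens: List[int], mask: List[bool]) -> Tuple[List[int], List[bool]]:
    """Append EOS and reuse the last mask value, as in the diff above."""
    return tokens + [EOS_ID], mask + [mask[-1] if mask else True]


def add_eos_always_masked(tokens: List[int], mask: List[bool]) -> Tuple[List[int], List[bool]]:
    """Alternative: append EOS and always mask it out."""
    return tokens + [EOS_ID], mask + [True]


tokens = [1, 2, 3, 4, 5]
mask = [True, True, False, False, False]  # masked prompt, unmasked response

_, inherited = add_eos_inherit(tokens, mask)
_, always = add_eos_always_masked(tokens, mask)
print(inherited)  # the trailing entry copies the last response value
print(always)     # the trailing entry is True regardless of the response
```

With the inherited value, a mask shaped like `[True] * n + [False] * m` keeps ending in False after the EOS is appended, which is the shape the quoted `expected_mask` has. The always-masked variant would add a trailing True instead.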

ariG23498 · 2025-05-27T02:26:14Z

Hi folks!

Let me know if I need to work on anything else to get this over the finish line 🤗

adding add_end_token to Qwen

ca473b3

facebook-github-bot added the CLA Signed label Mar 18, 2025

Merge branch 'main' into aritra/qwen_eos

914567a

ariG23498 added 2 commits May 6, 2025 09:37

Merge branch 'main' into aritra/qwen_eos

d276fa4

fix tests

8034467

adding docstrings

22d24cc

fix: docstrings and style

24b9531

ebsmothers reviewed May 9, 2025

ariG23498 added 2 commits May 13, 2025 14:47

Merge branch 'main' into aritra/qwen_eos

ff7fe87

remove print statements

3d979b7

joecummings requested a review from ebsmothers May 13, 2025 13:55

Merge branch 'main' into aritra/qwen_eos

c82effe
