
SFT for Qwen Models #594

@gabiguetta

My setup is the following:

I take the Qwen3-1.7B base model from here.
For the tokenizer and chat_template, I use the ones Qwen uses for their instruct model here.
I then take OLMo's dolci-sft-instruct dataset, pretokenize it using the open-instruct script here, and use it to train the Qwen base model, hoping to reach Qwen3-Instruct performance on IFEval.
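
For reference, the model/tokenizer pairing looks roughly like this in transformers (a sketch; the Hugging Face IDs Qwen/Qwen3-1.7B-Base and Qwen/Qwen3-1.7B are my assumption of the checkpoints in question):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base (non-instruct) weights from the first link above.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B-Base")

# Tokenizer + chat_template taken from the instruct release (second link).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Quick eyeball of what a rendered conversation looks like under this template.
messages = [{"role": "user", "content": "Write a haiku about tokenizers."}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```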

Result:
IFEval degrades throughout training, with the model giving responses unrelated to the instructions.

Debugging:
I immediately suspected the chat_template and tokenization, but while debugging the pretokenization in open-instruct I saw overall reasonable results and correct usage of Qwen's tokenizer and chat_template. I nevertheless tried switching to OLMo's chat_template, but to no avail; I got the same results. The same held for trying padding instead of packing.
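
The sanity check I describe was along these lines (a sketch; the dataset path and the input_ids column name are assumptions about how the pretokenized output is laid out):

```python
from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Hypothetical path to the pretokenized dolci-sft-instruct output.
ds = load_from_disk("/path/to/pretokenized/dolci-sft-instruct")

sample = ds[0]
# Decoding back to text should show a well-formed chat under the chosen template.
print(tokenizer.decode(sample["input_ids"]))
# And the tail tells you which special token the conversation actually ends with.
print(tokenizer.convert_ids_to_tokens(sample["input_ids"][-3:]))
```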

After some more digging, it appears the issue lies in olmo-core's assumption (I traced it to this method, I think) that tokenized SFT conversations end with an eos token (<|endoftext|> for OLMo), so that the Dataset object (whether the Packed or Padded one) can dissect them accordingly. For Qwen models, the default chat_template ends every message, including the last one in a conversation, with an <|im_end|> token. But then switching the chat_template to the OLMo one should have helped, shouldn't it? It didn't, because Qwen-Instruct's eos token is <|im_end|> (see here), and although olmo-core dissects data using each model's own eos token, for a Qwen model, even when using OLMo's chat_template, all messages and conversations still end with <|im_end|>, so they can no longer be dissected correctly.
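
To make this concrete, here's a toy reconstruction (not olmo-core's actual code) of that eos-based dissection and why it over-segments Qwen conversations under the default Qwen chat_template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Name a color."},
    {"role": "assistant", "content": "Blue."},
]
ids = tokenizer.apply_chat_template(messages, tokenize=True)

def dissect(token_ids, eos_id):
    """Split a flat token stream into documents at each eos token."""
    docs, current = [], []
    for t in token_ids:
        current.append(t)
        if t == eos_id:
            docs.append(current)
            current = []
    if current:
        docs.append(current)
    return docs

# Qwen-Instruct's eos is <|im_end|>, which also terminates every message,
# so this single conversation gets cut into several documents instead of one.
print(len(dissect(ids, tokenizer.eos_token_id)))
```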

Solution:
Eventually I modified Qwen's chat_template to end conversations with <|endoftext|>, like OLMo's chat_template does, AND changed its eos token from <|im_end|> to <|endoftext|> in tokenizer_config.json, so that olmo-core picks it up from there.
The next SFT training ran correctly: the samples right before they go into the model look fine, and IFEval on the resulting model looks great and is comparable to Qwen-Instruct.
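
For concreteness, the eos part of that override looks roughly like this (a sketch; the output directory is just illustrative, and the chat_template edit itself, i.e. making the rendered conversation end with <|endoftext|> the way OLMo's template does, has to be made in the Jinja template string and isn't shown here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Swap the eos token from <|im_end|> to <|endoftext|> so that olmo-core,
# which dissects on the tokenizer's eos, splits only at conversation boundaries.
tokenizer.eos_token = "<|endoftext|>"
assert tokenizer.eos_token_id == tokenizer.convert_tokens_to_ids("<|endoftext|>")

# Persist the change; save_pretrained rewrites tokenizer_config.json,
# which is where olmo-core picks the eos token up from.
tokenizer.save_pretrained("./qwen3-1.7b-tokenizer-endoftext-eos")
```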

I'm writing this because I can keep doing what I'm doing, but I'm not sure this is the intended way to solve it (overriding the tokenizer's eos token and chat_template for every model you want to train with olmo-core), and I'd be happy to hear your take on it.

Cheers,
Gabi.
