
SFT for Qwen Models #594

@gabiguetta

My setup is the following:

I take the Qwen3-1.7B base model from here.
For the tokenizer and chat_template, I use the ones Qwen uses for their instruct model here.
I then take OLMo's dolci-sft-instruct dataset, pretokenize it using the open-instruct script here, and use it to train the Qwen base model, hoping to reach Qwen3-Instruct performance on IFEval.
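
For reference, the model/tokenizer pairing looks roughly like this in transformers (a sketch; the Hugging Face IDs Qwen/Qwen3-1.7B-Base and Qwen/Qwen3-1.7B are my assumption of the checkpoints in question):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base (non-instruct) weights from the first link above.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B-Base")

# Tokenizer + chat_template taken from the instruct release (second link).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Quick eyeball of what a rendered conversation looks like under this template.
messages = [{"role": "user", "content": "Write a haiku about tokenizers."}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```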

Result:
IFEval degrades throughout training, with the model giving responses unrelated to the instructions.

Debugging:
I immediately suspected the chat_template and tokenization, but while debugging the pretokenization in open-instruct I saw overall reasonable results and correct usage of Qwen's tokenizer and chat_template. I nevertheless tried switching to OLMo's chat_template, but to no avail; I got the same results. The same held for trying padding instead of packing.
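
The sanity check I describe was along these lines (a sketch; the dataset path and the input_ids column name are assumptions about how the pretokenized output is laid out):

```python
from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Hypothetical path to the pretokenized dolci-sft-instruct output.
ds = load_from_disk("/path/to/pretokenized/dolci-sft-instruct")

sample = ds[0]
# Decoding back to text should show a well-formed chat under the chosen template.
print(tokenizer.decode(sample["input_ids"]))
# And the tail tells you which special token the conversation actually ends with.
print(tokenizer.convert_ids_to_tokens(sample["input_ids"][-3:]))
```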

After some more digging, it appears the issue lies in olmo-core's assumption (I traced it to this method, I think) that tokenized SFT conversations end with an eos token (<|endoftext|> for OLMo), so that the Dataset object (whether the Packed or Padded one) can dissect them accordingly. For Qwen models, the default chat_template ends every message, including the last one in a conversation, with an <|im_end|> token. But then switching the chat_template to the OLMo one should have helped, shouldn't it? It didn't, because Qwen-Instruct's eos token is <|im_end|> (see here), and although olmo-core dissects data using each model's own eos token, for a Qwen model, even when using OLMo's chat_template, all messages and conversations still end with <|im_end|>, so they can no longer be dissected correctly.
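
To make this concrete, here's a toy reconstruction (not olmo-core's actual code) of that eos-based dissection and why it over-segments Qwen conversations under the default Qwen chat_template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Name a color."},
    {"role": "assistant", "content": "Blue."},
]
ids = tokenizer.apply_chat_template(messages, tokenize=True)

def dissect(token_ids, eos_id):
    """Split a flat token stream into documents at each eos token."""
    docs, current = [], []
    for t in token_ids:
        current.append(t)
        if t == eos_id:
            docs.append(current)
            current = []
    if current:
        docs.append(current)
    return docs

# Qwen-Instruct's eos is <|im_end|>, which also terminates every message,
# so this single conversation gets cut into several documents instead of one.
print(len(dissect(ids, tokenizer.eos_token_id)))
```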

Solution:
Eventually I modified Qwen's chat_template to end conversations with <|endoftext|>, like OLMo's chat_template does, AND changed its eos token from <|im_end|> to <|endoftext|> in tokenizer_config.json, so that olmo-core picks it up from there.
The next SFT training ran correctly: the samples right before they go into the model look fine, and IFEval on the resulting model looks great and is comparable to Qwen-Instruct.
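
For concreteness, the eos part of that override looks roughly like this (a sketch; the output directory is just illustrative, and the chat_template edit itself, i.e. making the rendered conversation end with <|endoftext|> the way OLMo's template does, has to be made in the Jinja template string and isn't shown here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Swap the eos token from <|im_end|> to <|endoftext|> so that olmo-core,
# which dissects on the tokenizer's eos, splits only at conversation boundaries.
tokenizer.eos_token = "<|endoftext|>"
assert tokenizer.eos_token_id == tokenizer.convert_tokens_to_ids("<|endoftext|>")

# Persist the change; save_pretrained rewrites tokenizer_config.json,
# which is where olmo-core picks the eos token up from.
tokenizer.save_pretrained("./qwen3-1.7b-tokenizer-endoftext-eos")
```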

I'm writing this because I can keep doing what I'm doing, but I'm not sure this is the intended way to solve it (overriding the tokenizer's eos token and chat_template for every model you want to train with olmo-core), and I'd be happy to hear your take on it.

Cheers,
Gabi.
