My setup is the following:
I take the Qwen3-1.7B base model from here.
Then, for the tokenizer and chat_template, I use the ones Qwen uses for their instruct model here.
I then take OLMo's dolci-sft-instruct dataset, pretokenize it with the open-instruct script here, and use it to train the Qwen base model, hoping to match Qwen3-Instruct's performance on IFEval.
Result:
IFEval degrades throughout training, with the model producing responses unrelated to the instructions.
Debugging:
I immediately suspected the chat_template and tokenization, but while debugging the pretokenization in open-instruct I saw overall reasonable results and correct usage of Qwen's tokenizer and chat_template. I nevertheless tried switching to OLMo's chat_template, to no avail: same results. Same for using padding instead of packing.
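To make the check concrete, here's a minimal sketch of what I looked at (the model ID and example messages are placeholders, not the actual open-instruct code path):

```python
# Render a conversation with Qwen's chat_template and inspect the result.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")  # tokenizer + instruct template
messages = [
    {"role": "user", "content": "List three colors."},
    {"role": "assistant", "content": "Red, green, blue."},
]
ids = tok.apply_chat_template(messages, tokenize=True)
print(tok.decode(ids))                  # the full rendered conversation
print(tok.eos_token, tok.eos_token_id)  # the token olmo-core will dissect on
```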
After some more debugging, the issue appears to lie in olmo-core's assumption (I traced it to this method, I think) that tokenized SFT conversations end with an eos token (<|endoftext|> for OLMo), which the Dataset object (whether the Packed or Padded one) uses to dissect them into individual conversations. Qwen's default chat_template, however, ends every message, including the last one in a conversation, with an <|im_end|> token. But then switching to OLMo's chat_template should have helped, shouldn't it? It didn't, because Qwen-Instruct's eos token is <|im_end|> (see here), and while olmo-core dissects the data on each model's eos token, for a Qwen model every message still ends with <|im_end|> even under OLMo's chat_template (OLMo's template appends the tokenizer's eos_token rather than a literal <|endoftext|>, I believe), so the conversations can no longer be dissected correctly.
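Here's a simplified illustration of the boundary problem (this is not olmo-core's actual code, just a toy model of the splitting step; the token ids are Qwen3's vocab ids for these special tokens):

```python
# Toy version of the dissection step: split a packed token stream into
# documents at the eos token id.
def split_on_eos(token_ids, eos_id):
    docs, start = [], 0
    for i, t in enumerate(token_ids):
        if t == eos_id:
            docs.append(token_ids[start : i + 1])
            start = i + 1
    return docs

IM_END, ENDOFTEXT = 151645, 151643  # Qwen3 ids for <|im_end|> / <|endoftext|>

# One two-turn conversation; Qwen's template ends *every* message with <|im_end|>:
packed = [101, 102, IM_END, 103, 104, IM_END]

print(split_on_eos(packed, IM_END))     # 2 "documents": the conversation is cut in half
print(split_on_eos(packed, ENDOFTEXT))  # []: no boundary found at all
```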
Solution:
Eventually I modified Qwen's chat_template to end conversations with <|endoftext|>, like OLMo's chat_template does, AND changed its eos token from <|im_end|> to <|endoftext|> in tokenizer_config.json so that olmo-core picks it up from there.
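In Python terms, the patch amounts to something like this (I actually edited the chat_template and tokenizer_config.json directly; the model ID and output path are placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# 1) olmo-core reads the eos token from the tokenizer config, so swap it:
tok.eos_token = "<|endoftext|>"  # was <|im_end|>

# 2) Terminate the rendered *conversation* with <|endoftext|>, like OLMo's
#    template, while individual messages keep their <|im_end|> delimiters:
tok.chat_template = tok.chat_template + "{{ '<|endoftext|>' }}"

tok.save_pretrained("qwen3-1.7b-for-olmo-core")
```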
The next SFT run trained correctly: I looked at the samples right before they enter the model and they look fine, and IFEval on the resulting model looks great, comparable to Qwen3-Instruct.
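The kind of spot check I mean, as a small helper (hypothetical names; assuming the batch is a dict with an input_ids tensor, as I see coming out of the dataloader):

```python
def spot_check(batch, tok, n=1):
    """Decode the first n sequences of a dataloader batch and show where
    the eos token falls, right before the batch enters the model."""
    for seq in batch["input_ids"][:n]:
        ids = seq.tolist()
        print(tok.decode(ids))
        print("eos at:", [i for i, t in enumerate(ids) if t == tok.eos_token_id])
```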
I'm writing this because I can keep doing what I'm doing, but I'm not sure it's the intended way to solve this (overriding the tokenizer's eos token and chat_template of every model intended to be trained with olmo-core), and I'd be happy to hear your take on it.
Cheers,
Gabi.