Now the mask starts at `(batch, 1, 1, seq_len)`, and broadcasting only masks the padded key columns instead of knocking out entire query rows.
Rows 2 and 3 still attend to the earlier valid tokens, so the logits stay finite and the model trains normally.

**Lessons learned.** Masks are just tensors, so broadcast semantics matter. Printing the exact shapes before and after each operation (or writing a quick unit test) is a cheap way to catch mistakes that otherwise only show up hours into training.
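
As a concrete example of the kind of quick check this suggests, here is a minimal test sketch, assuming a standard PyTorch scaled-dot-product setup; the names and the mask construction are illustrative rather than the project's actual code:

```python
import torch

def test_padding_mask_broadcasts_over_query_rows():
    batch, heads, seq_len = 2, 4, 8
    scores = torch.randn(batch, heads, seq_len, seq_len)

    # Sequence 0 is full length; sequence 1 has 5 real tokens and 3 padding tokens.
    lengths = torch.tensor([8, 5])
    key_is_real = torch.arange(seq_len)[None, :] < lengths[:, None]  # (batch, seq_len)
    mask = key_is_real[:, None, None, :]                             # (batch, 1, 1, seq_len)
    assert mask.shape == (batch, 1, 1, seq_len)

    # Broadcasting copies the mask over heads and query rows, so only the padded
    # *key* columns are dropped; every real query row keeps unmasked entries.
    probs = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    assert torch.isfinite(probs[0]).all()         # unpadded sequence: all rows finite
    assert torch.isfinite(probs[1, :, :5]).all()  # padded sequence: real rows finite

test_padding_mask_broadcasts_over_query_rows()
```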

### We didn't add a special end-of-sentence token

This makes the supervised fine-tuning task harder, because the model has to predict the end of the sentence by itself.

To continue the SFT, we chose to temporarily use `___` as the end-of-sentence token.
We have now added an `<eos>` token to the GPT tokenizer and model and retrained it.
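
As a rough sketch of what that change involves, here is how it would look with the Hugging Face `transformers` API standing in for the project's own tokenizer and model classes (the `"gpt2"` checkpoint and variable names are placeholders, not the actual code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the new special token, then grow the embedding (and tied output)
# matrix so the new id gets trainable rows.
tokenizer.add_special_tokens({"eos_token": "<eos>"})
model.resize_token_embeddings(len(tokenizer))

# During SFT preprocessing, terminate every target with the token so the loss
# actually teaches the model to emit it when the response is finished.
target = "The answer is 42."
input_ids = tokenizer(target + tokenizer.eos_token)["input_ids"]
```

Whatever the library, the two requirements are the same: the new id needs embedding rows to train, and every fine-tuning target has to end with the token, otherwise the model never learns when to stop.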