Hi, MATE-KD is excellent work on NLP knowledge distillation. I have a question about the code for this paper.
In Section 4.1 of the paper, the authors state that two different teacher models (RoBERTa-large and BERT-base) are used in the two stages, but the code appears to use only one teacher model. Is that correct?
Also, shouldn't the two stages be trained separately? The code shows that within the training loop, the generator's parameters are updated for 10 steps, then the student model is updated for 100 steps, and this alternation repeats. That seems odd to me.
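For concreteness, the alternating pattern I mean looks roughly like this (a sketch of the schedule only, not the authors' actual code; all names and the loop structure are my own):

```python
# Sketch of the interleaved update schedule described above (hypothetical
# names; the real MATE-KD training loop may differ in details).
G_STEPS = 10    # generator updates per cycle, as seen in the code
S_STEPS = 100   # student updates per cycle, as seen in the code

def training_schedule(num_cycles):
    """Yield "generator" or "student" for each batch, in interleaved order."""
    for _ in range(num_cycles):
        for _ in range(G_STEPS):
            yield "generator"   # adversarial step: maximize student-teacher divergence
        for _ in range(S_STEPS):
            yield "student"     # distillation step: minimize divergence

# Two cycles: 10 generator steps, 100 student steps, repeated.
schedule = list(training_schedule(num_cycles=2))
```

So rather than two separate training phases, the code interleaves the two objectives within a single loop, which is what I found surprising.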