Hi, MATE-KD is excellent work on NLP knowledge distillation. I have a question about the code for this paper.
In Section 4.1 of the paper, the authors state that two different teacher models (RoBERTa-large and BERT-base) are used in the two stages, but the code appears to use only one teacher model. Is that correct?
Also, shouldn't the two stages be trained separately? The code shows that within the training loop, the generator's parameters are updated for 10 steps, then the student model is updated for 100 steps, and this alternation repeats. That seems odd to me.
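For concreteness, the alternating pattern I mean looks roughly like this (a sketch of the schedule only, not the authors' actual code; all names and the loop structure are my own):

```python
# Sketch of the interleaved update schedule described above (hypothetical
# names; the real MATE-KD training loop may differ in details).
G_STEPS = 10    # generator updates per cycle, as seen in the code
S_STEPS = 100   # student updates per cycle, as seen in the code

def training_schedule(num_cycles):
    """Yield "generator" or "student" for each batch, in interleaved order."""
    for _ in range(num_cycles):
        for _ in range(G_STEPS):
            yield "generator"   # adversarial step: maximize student-teacher divergence
        for _ in range(S_STEPS):
            yield "student"     # distillation step: minimize divergence

# Two cycles: 10 generator steps, 100 student steps, repeated.
schedule = list(training_schedule(num_cycles=2))
```

So rather than two separate training phases, the code interleaves the two objectives within a single loop, which is what I found surprising.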