How to train the feature extractors?

Hi, I am interested in how to extract multimodal features for emotion detection.
In your code, you use an existing model to extract video features without retraining it. How can I obtain that original model and its CNN weights?
For text, do you also use an existing model to extract features?