How can I use this code to train the regions & snippets RNN model?

In this code, I can only find how to train a multimodal RNN on images and their description sentences. I don't see any functions for training the model on regions and snippets, as in Figure 5 or Section 4.3 of the paper.
How can I train my own model? And how can I get results like those in Figure 5?