It seems that the default configuration splits a video into clips that contain only one frame each. So I think heatmap information from the previous frame (whether ground truth or prediction) is not used. If so, how can I set up the config to reproduce the results reported in the paper?