In multi-target, multi-class action detection, does the algorithm use a multi-object tracking (MOT) method to associate the same target across video frames and form tracklets? Such association would keep spatiotemporal feature extraction consistent for each individual target throughout the sequence.
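
To make the question concrete, the kind of cross-frame association meant here can be sketched as follows. This is only an illustration, assuming greedy IoU matching between consecutive frames; it is not the method of the algorithm being asked about. The names `iou`, `build_tracklets`, and `iou_thresh` are hypothetical, and a real MOT method would typically add motion models or appearance features.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_tracklets(
    frames: List[List[Box]], iou_thresh: float = 0.3
) -> Dict[int, List[Tuple[int, Box]]]:
    """Link per-frame detections into tracklets by greedy IoU matching.

    Assumption: a track that finds no match in the current frame is
    terminated (no re-identification after occlusion).
    """
    tracklets: Dict[int, List[Tuple[int, Box]]] = {}
    last_box: Dict[int, Box] = {}  # track id -> box in the previous frame
    next_id = 0

    for t, boxes in enumerate(frames):
        # Score every (active track, detection) pair, best overlap first.
        pairs = sorted(
            (
                (iou(prev, box), tid, j)
                for tid, prev in last_box.items()
                for j, box in enumerate(boxes)
            ),
            reverse=True,
        )
        used_tracks, used_boxes = set(), set()
        current: Dict[int, Box] = {}

        for score, tid, j in pairs:
            if score < iou_thresh:
                break  # remaining pairs overlap too little to be the same target
            if tid in used_tracks or j in used_boxes:
                continue
            used_tracks.add(tid)
            used_boxes.add(j)
            tracklets[tid].append((t, boxes[j]))
            current[tid] = boxes[j]

        # Unmatched detections start new tracklets.
        for j, box in enumerate(boxes):
            if j not in used_boxes:
                tracklets[next_id] = [(t, box)]
                current[next_id] = box
                next_id += 1

        last_box = current

    return tracklets
```

With tracklets of this form, per-target features (for example, crops along each tracklet's boxes) could be collected over time and passed to the action classifier, which is the consistency the question is about.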