Hello, could you describe how to train and run inference on a custom video dataset? Which parts need to be modified?