This is a great piece of work. Could I ask about some specific details regarding the training?
Judging by the per-stage parameter configurations, the stages described in the README, the training scripts, and the paper do not fully align. Which one should be followed in practice?
Also, is there any dependency or progressive relationship between the stages? From the scripts, it appears that each stage can be trained independently. Could you clarify this?
Finally, regarding training time: I am currently using 32 A800 GPUs (80GB each) to train Stage 3 (Image + Video + Ref-Video), but the training time is far longer than the expected 3–5 days. Could this be due to a mismatch in configuration, such as steps versus epochs, or some other setting?
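For context, here is the rough back-of-envelope estimate I am using to sanity-check the schedule. All numbers below (step count, seconds per step) are placeholder assumptions on my part, not values taken from the repo's scripts:

```python
# Back-of-envelope estimate of wall-clock training time from the
# configured optimizer-step count. The step count and per-step time
# are placeholder assumptions, not values from the repo.

def estimated_days(total_steps: int, sec_per_step: float) -> float:
    """Wall-clock days for a run of `total_steps` optimizer steps."""
    return total_steps * sec_per_step / 86400.0

# e.g. 100k steps at ~4 s/step is already ~4.6 days, so if an
# epoch-based setting is silently multiplying the effective step
# count, the schedule would blow well past the expected 3-5 days.
print(round(estimated_days(100_000, 4.0), 1))
```

If you could confirm the intended total optimizer steps (or epochs and dataset size) for Stage 3, I could check my configuration against that directly.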