Hi,
Thank you for the great work!
I am training my own AV-joint model based on Ovi. After the S1 AR training stage, the AR generation quality is slightly worse than that of the base model, but still acceptable.
However, after the S2 Causal CD training stage, the inference results are problematic: the generated video is almost static and sometimes becomes very blurry. When the person is talking, the corresponding audio is present, but its volume is very low.
So far, I have trained for roughly 2700 steps × 14 samples. Do you think this is simply because the training is still insufficient, or could there be something wrong with my training setup?