Is it theoretically possible to extend this to autoregressive video generation?

Thanks for this amazing work! 

I was wondering whether it is possible to build on top of this paradigm for autoregressive video generation, in the same way that bidirectional video diffusion models can be distilled for autoregressive video generation (CausVid, Diffusion Forcing, Self-Forcing, etc.).