Hi @ZouYa99 ,
Thank you for this impressive work! Turbo-VAED shows amazing results in bringing high-resolution video decoding to mobile devices. I’ve learned a lot from your analysis of decoder redundancy.
I have a question regarding the implementation of the Decoupled 3D Pixel Shuffle described in Section 3.3.
I noticed that the current strategy performs its two steps in the following order:
- Temporal Transform: Converting channels to the temporal dimension.
- 2D Pixel Shuffle: Converting the remaining channels to spatial dimensions.
The paper mentions that this method yields slightly inferior reconstruction quality compared to the standard 3D pixel shuffle.
My question: have you experimented with the reverse order, i.e., performing the 2D spatial pixel shuffle first and then the temporal transform?
Intuitively, it seems that performing the spatial shuffle first might better preserve the local spatial correlations inherent in the channel packing, potentially narrowing the quality gap with the standard 3D pixel shuffle.
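To make the two orderings concrete, here is a minimal NumPy sketch of what I have in mind. It is not the Turbo-VAED implementation: the function names, the `(C, T, H, W)` layout, and the assumption that both steps are plain reshape/transpose operations with no learned layer in between are all mine.

```python
import numpy as np

def temporal_shuffle(x, rt):
    # (C*rt, T, H, W) -> (C, T*rt, H, W): move a factor rt from channels to time.
    c, t, h, w = x.shape
    x = x.reshape(c // rt, rt, t, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # (C, T, rt, H, W)
    return x.reshape(c // rt, t * rt, h, w)

def spatial_shuffle(x, rs):
    # (C*rs*rs, T, H, W) -> (C, T, H*rs, W*rs): per-frame 2D pixel shuffle.
    c, t, h, w = x.shape
    x = x.reshape(c // (rs * rs), rs, rs, t, h, w)
    x = x.transpose(0, 3, 4, 1, 5, 2)  # (C, T, H, rs, W, rs)
    return x.reshape(c // (rs * rs), t, h * rs, w * rs)

def time_first(x, rt, rs):
    # Channel -> Time -> Space (the order described above).
    return spatial_shuffle(temporal_shuffle(x, rt), rs)

def space_first(x, rt, rs):
    # Channel -> Space -> Time (the reverse order I am asking about).
    return temporal_shuffle(spatial_shuffle(x, rs), rt)

def to_space_first_layout(x, rt, rs):
    # Hypothetical helper: reorder channels from the (C, rs, rs, rt) packing
    # read by time_first to the (C, rt, rs, rs) packing read by space_first.
    c, t, h, w = x.shape
    x = x.reshape(c // (rt * rs * rs), rs, rs, rt, t, h, w)
    x = x.transpose(0, 3, 1, 2, 4, 5, 6)
    return x.reshape(c, t, h, w)

if __name__ == "__main__":
    rt, rs = 2, 2
    x = np.arange(16 * 2 * 3 * 3, dtype=np.float32).reshape(16, 2, 3, 3)
    a = time_first(x, rt, rs)
    b = space_first(x, rt, rs)
    print(a.shape, b.shape)
    print(np.array_equal(a, b))
    print(np.array_equal(space_first(to_space_first_layout(x, rt, rs), rt, rs), a))
```

Under these assumptions the two orders produce outputs of the same shape and differ only in how they read the channel packing, so they coincide after a fixed channel permutation. If that also holds in the actual decoder, I would expect any quality difference to come from whatever layers sit between the two steps, or from which order is cheaper on the target hardware.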
Was the choice of the current order (Channel -> Time -> Space) driven by specific hardware constraints on mobile devices (e.g., operator compatibility/efficiency on the iPhone NPU), or was it an empirical finding that this order simply performs better?
Thank you again for your time :)