[Question] Order of operations in Decoupled 3D Pixel Shuffle #13

@piddnad

Hi @ZouYa99 ,

Thank you for this impressive work! Turbo-VAED shows amazing results in bringing high-resolution video decoding to mobile devices. I’ve learned a lot from your analysis on decoder redundancy.

I have a question regarding the implementation of the Decoupled 3D Pixel Shuffle described in Section 3.3.

I noticed that the current strategy performs the two steps in the following order:

  1. Temporal Transform: Converting channels to the temporal dimension.
  2. 2D Pixel Shuffle: Converting the remaining channels to spatial dimensions.
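To make sure I am reading the two steps correctly, here is a minimal NumPy sketch of how I understand them. It assumes that both steps are pure reshape/transpose rearrangements on a `(C, T, H, W)` tensor and that the channel layout follows the usual pixel-shuffle convention. The function names and the factors `rt` and `rs` are my own placeholders, not taken from the repository.

```python
import numpy as np

def temporal_transform(x, rt):
    """Move a factor rt from channels to time: (C*rt, T, H, W) -> (C, T*rt, H, W)."""
    c, t, h, w = x.shape
    assert c % rt == 0, "channels must be divisible by rt"
    x = x.reshape(c // rt, rt, t, h, w)      # split channels into (C, rt)
    x = x.transpose(0, 2, 1, 3, 4)           # (C, T, rt, H, W)
    return x.reshape(c // rt, t * rt, h, w)  # rt output frames per input frame

def pixel_shuffle_2d(x, rs):
    """Per-frame 2D pixel shuffle: (C*rs*rs, T, H, W) -> (C, T, H*rs, W*rs)."""
    c, t, h, w = x.shape
    assert c % (rs * rs) == 0, "channels must be divisible by rs*rs"
    co = c // (rs * rs)
    x = x.reshape(co, rs, rs, t, h, w)       # (C, rs_h, rs_w, T, H, W)
    x = x.transpose(0, 3, 4, 1, 5, 2)        # (C, T, H, rs_h, W, rs_w)
    return x.reshape(co, t, h * rs, w * rs)

def decoupled_shuffle(x, rt, rs):
    """Assumed current order: temporal transform first, then 2D pixel shuffle."""
    return pixel_shuffle_2d(temporal_transform(x, rt), rs)

# Hypothetical sizes: 3 output channels, rt = 2, rs = 2, so 3 * 2 * 2 * 2 = 24 input channels.
x = np.arange(24 * 4 * 5 * 6, dtype=np.float32).reshape(24, 4, 5, 6)
y = decoupled_shuffle(x, rt=2, rs=2)
print(x.shape, "->", y.shape)
```

With these placeholder sizes, the 24 input channels are spread over twice as many frames and a 2x larger spatial grid, and no values are created or dropped.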

The paper mentions that this method yields slightly inferior reconstruction quality compared to the standard 3D pixel shuffle.

My Question: I am curious if you have experimented with the reverse order: performing the 2D Spatial Pixel Shuffle first, and then handling the Temporal Transform?

Intuitively, it seems that performing the spatial shuffle first might better preserve the local spatial correlations inherent in the channel packing, potentially narrowing the quality gap with the standard 3D pixel shuffle.
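One thing I tried in order to test this intuition is the small NumPy check below. It rests on a strong assumption of mine: that both steps are pure rearrangements with no learned layer in between. The helper names `to_time` and `to_space` and all sizes are hypothetical. Under that assumption, the two orders differ only by a fixed permutation of the input channels.

```python
import numpy as np

def to_time(x, rt):
    """Channels -> time: (C*rt, T, H, W) -> (C, T*rt, H, W)."""
    c, t, h, w = x.shape
    x = x.reshape(c // rt, rt, t, h, w).transpose(0, 2, 1, 3, 4)
    return x.reshape(c // rt, t * rt, h, w)

def to_space(x, rs):
    """Channels -> space: (C*rs*rs, T, H, W) -> (C, T, H*rs, W*rs)."""
    c, t, h, w = x.shape
    co = c // (rs * rs)
    x = x.reshape(co, rs, rs, t, h, w).transpose(0, 3, 4, 1, 5, 2)
    return x.reshape(co, t, h * rs, w * rs)

C, rt, rs = 3, 2, 2                      # hypothetical sizes
n_in = C * rt * rs * rs
x = np.random.default_rng(0).standard_normal((n_in, 4, 5, 6))

time_first = to_space(to_time(x, rt), rs)    # Channel -> Time -> Space
space_first = to_time(to_space(x, rs), rt)   # Channel -> Space -> Time

# Channel permutation that maps one channel factorisation onto the other.
perm = np.arange(n_in).reshape(C, rt, rs * rs).transpose(0, 2, 1).reshape(-1)

orders_differ = not np.array_equal(time_first, space_first)
same_up_to_perm = np.array_equal(to_space(to_time(x[perm], rt), rs), space_first)
print(orders_differ, same_up_to_perm)
```

If this assumption holds for the actual implementation, the convolution that produces the packed channels could in principle learn either layout, so any quality difference between the two orders would have to come from something else, such as layers placed between the two steps or the training setup. I may well be missing such a detail.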

Was the choice of the current order (Channel -> Time -> Space) driven by specific hardware constraints on mobile devices (e.g., operator compatibility/efficiency on the iPhone NPU), or was it an empirical finding that this order simply performs better?

Thank you again for your time :)
