Why do we use MaskedLinear for the condition? #70

yangysc · 2025-07-26T02:32:52Z

Description

It seems that zuko simply concatenates the features and the condition, then passes them together through a MaskedLinear layer (please correct me if I missed something). What if we used MaskedLinear only for the features and a plain Linear layer for the condition?

zuko/zuko/flows/autoregressive.py

Lines 207 to 218 in 25fefe2

    
    def meta(self, c: Tensor, x: Tensor) -> Transform:
        if c is not None:
            x = torch.cat(broadcast(x, c, ignore=1), dim=-1)

        phi = self.hyper(x)
        phi = phi.unflatten(-1, (-1, self.total))
        phi = unpack(phi, self.shapes)

        return DependentTransform(self.univariate(*phi), 1)

    def forward(self, c: Tensor = None) -> Transform:
        return AutoregressiveTransform(partial(self.meta, c), self.passes)

where the hyper-network is defined as

zuko/zuko/flows/autoregressive.py

Line 152 in 25fefe2

self.hyper = MaskedMLP(adjacency, **kwargs)

Implementation

The change would look like this: MaskedLinear(cat(features, condition)) -> cat(MaskedLinear(features), Linear(condition))

Thanks in advance

Answered by francois-rozet

Sep 9, 2025


francois-rozet · 2025-09-09T13:04:41Z

Maintainer

Hi @yangysc, sorry for the delay, I was very busy with deadlines and my thesis.

The hyper-network self.hyper is a MaskedMLP. Its goal is to make the parameters $\phi_i$ of the transformation $y_i = f(x_i; \phi_i)$ depend only on the preceding features $x_{<i}$ and the context $c$. This is enforced by a series of masks, one per layer, that are derived from the ordering of the variables.
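To make the role of the masks concrete, here is a minimal sketch of MADE-style mask construction for a two-layer network, written with plain Python lists. It is an illustration under assumptions, not zuko's actual code: the helper names (`build_mask`, `compose`), the hidden degrees, and the convention of giving the context an order of -1 so that it precedes every feature are all hypothetical.

```python
# MADE-style masks for a two-layer hyper-network (illustrative, not zuko's code).
# Inputs are the features x_0, x_1, x_2 followed by one context variable c.

def build_mask(out_degrees, in_degrees, strict):
    """mask[j][k] is True when output unit j may read input unit k."""
    if strict:
        return [[o > i for i in in_degrees] for o in out_degrees]
    return [[o >= i for i in in_degrees] for o in out_degrees]


def compose(second, first):
    """Boolean matrix product: which inputs can reach each final output."""
    n_inputs = len(first[0])
    return [
        [any(s and first[h][k] for h, s in enumerate(row)) for k in range(n_inputs)]
        for row in second
    ]


feature_order = [0, 1, 2]   # x_0, x_1, x_2
context_order = [-1]        # assumption: the context precedes every feature
input_degrees = feature_order + context_order

# Assumption: a hidden unit of degree d may read inputs of order <= d.
hidden_degrees = [-1, 0, 1, -1, 0, 1]

mask_hidden = build_mask(hidden_degrees, input_degrees, strict=False)
# Output phi_i may only read hidden units of degree < i.
mask_output = build_mask(feature_order, hidden_degrees, strict=True)

# dependencies[i][k] is True when phi_i can depend on input k.
dependencies = compose(mask_output, mask_hidden)
for i, row in enumerate(dependencies):
    print(f"phi_{i} depends on (x_0, x_1, x_2, c): {row}")
```

With these degrees, $\phi_0$ can depend on $c$ only, $\phi_1$ on $x_0$ and $c$, and $\phi_2$ on $x_0$, $x_1$ and $c$, which is the autoregressive structure described above.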

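If the hyper-network were a single MaskedLinear layer, then what you propose would have (almost) worked (it would be MaskedLinear(features) + Linear(context)). However, we want $\phi_i$ to be a non-linear combination of $x_{<i}$ and $c$, which requires several layers. After the first layer, each hidden unit already mixes features and context, so the two can no longer be handled separately. Therefore, after the first layer we have to use MaskedLinear layers only.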
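Both points can be checked on a toy example. The sketch below uses plain Python lists instead of torch tensors, and every weight, mask, and helper name in it is a made-up illustrative value rather than anything taken from zuko. It assumes two features, one context variable, three hidden units, and a first-layer mask whose context column is all ones.

```python
# Toy check: 2 features (x_0, x_1), 1 context variable, 3 hidden units.
# All weights and masks are hypothetical values chosen for illustration.

def matvec(weight, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]

def apply_mask(weight, mask):
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weight, mask)]

def relu(vector):
    return [max(0, v) for v in vector]

weight_in = [[1, -2, 4], [3, 1, -1], [2, 2, 1]]  # columns: x_0, x_1, c
mask_in = [[0, 0, 1], [1, 0, 1], [1, 0, 1]]      # context column is never masked
weight_out = [[1, 1, 1], [2, -1, 1]]             # rows: phi_0, phi_1
mask_out = [[1, 0, 0], [1, 1, 1]]                # phi_0 reads the context-only unit
dense_out = [[1, 1, 1], [1, 1, 1]]               # what a plain Linear would do

def first_layer_joint(x, c):
    # MaskedLinear(cat(features, context))
    return matvec(apply_mask(weight_in, mask_in), x + c)

def first_layer_split(x, c):
    # MaskedLinear(features) + Linear(context)
    weight_x = [row[:2] for row in weight_in]
    mask_x = [row[:2] for row in mask_in]
    weight_c = [row[2:] for row in weight_in]
    masked = matvec(apply_mask(weight_x, mask_x), x)
    plain = matvec(weight_c, c)
    return [a + b for a, b in zip(masked, plain)]

def hyper(x, c, out_mask):
    hidden = relu(first_layer_joint(x, c))
    return matvec(apply_mask(weight_out, out_mask), hidden)

x_a, x_b, c = [2, 3], [7, 3], [5]  # x_b differs from x_a only in x_0

joint = first_layer_joint(x_a, c)
split = first_layer_split(x_a, c)
masked_a, masked_b = hyper(x_a, c, mask_out), hyper(x_b, c, mask_out)
dense_a, dense_b = hyper(x_a, c, dense_out), hyper(x_b, c, dense_out)

print("first layer, joint vs split:", joint, split)
print("phi_0 with masked second layer:", masked_a[0], masked_b[0])
print("phi_0 with dense second layer: ", dense_a[0], dense_b[0])
```

The joint and split first layers give the same hidden activations. With the masked second layer, $\phi_0$ stays the same when $x_0$ changes. With a dense second layer, $\phi_0$ changes with $x_0$, which breaks the autoregressive property.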

Please tell me if you have any other questions.
