Hi, thanks for the great work! I have a question regarding the positional encoding design.
The paper states that DCT-Basis coordinate encoding is used for pixel coordinates. However, the code appears to differ between the two settings:
In the T2I setting, RoPE is applied to the noisy pixel values:

class NerfEmbedder(nn.Module):
    def __init__(self, in_channels, hidden_size_input, max_freqs):
        super().__init__()
        self.max_freqs = max_freqs
        self.hidden_size_input = hidden_size_input
        self.embedder = nn.Sequential(
            nn.Linear(in_channels+max_freqs**2, hidden_size_input, bias=True),
        )

    @lru_cache
    def fetch_pos(self, patch_size, device, dtype):
        pos = precompute_freqs_cis_2d(self.max_freqs ** 2 * 2, patch_size, patch_size)
        pos = pos[None, :, :].to(device=device, dtype=dtype)
        return pos

In the C2I setting, DCT-Basis coordinate encoding is instead applied to the noisy pixel values:

class NerfEmbedder(nn.Module):
    def __init__(self, in_channels, hidden_size_input, max_freqs):
        super().__init__()
        self.max_freqs = max_freqs
        self.hidden_size_input = hidden_size_input
        self.embedder = nn.Sequential(
            nn.Linear(in_channels+max_freqs**2, hidden_size_input, bias=True),
        )

    @lru_cache
    def fetch_pos(self, patch_size, device, dtype):
        pos_x = torch.linspace(0, 1, patch_size, device=device, dtype=dtype)
        pos_y = torch.linspace(0, 1, patch_size, device=device, dtype=dtype)
        pos_y, pos_x = torch.meshgrid(pos_y, pos_x, indexing="ij")
        pos_x = pos_x.reshape(-1, 1, 1)
        pos_y = pos_y.reshape(-1, 1, 1)
        freqs = torch.linspace(0, self.max_freqs, self.max_freqs, dtype=dtype, device=device)
        freqs_x = freqs[None, :, None]
        freqs_y = freqs[None, None, :]
        coeffs = (1 + freqs_x * freqs_y) ** -1
        dct_x = torch.cos(pos_x * freqs_x * torch.pi)
        dct_y = torch.cos(pos_y * freqs_y * torch.pi)
        dct = (dct_x * dct_y * coeffs).view(1, -1, self.max_freqs ** 2)
        return dct
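
To make my reading of the C2I variant concrete, here is a minimal standalone sketch of the same DCT-Basis computation, written as a free function so the output shape can be checked. The function name `dct_basis_encoding` and the demo values (`patch_size=16`, `max_freqs=8`, `in_channels=3`, hidden size 32) are my own choices for illustration. The concatenation step at the end is an assumption inferred from the `in_channels+max_freqs**2` input width of the linear layer; the actual forward pass is not shown above and may differ.

```python
import torch


def dct_basis_encoding(patch_size: int, max_freqs: int, dtype=torch.float32) -> torch.Tensor:
    """Standalone version of the C2I fetch_pos logic (hypothetical helper name)."""
    # Normalized pixel coordinates in [0, 1] along each axis of the patch.
    pos = torch.linspace(0, 1, patch_size, dtype=dtype)
    pos_y, pos_x = torch.meshgrid(pos, pos, indexing="ij")
    pos_x = pos_x.reshape(-1, 1, 1)  # (P*P, 1, 1)
    pos_y = pos_y.reshape(-1, 1, 1)  # (P*P, 1, 1)

    # max_freqs evenly spaced frequencies in [0, max_freqs], as in the quoted code.
    freqs = torch.linspace(0, max_freqs, max_freqs, dtype=dtype)
    freqs_x = freqs[None, :, None]  # (1, F, 1)
    freqs_y = freqs[None, None, :]  # (1, 1, F)

    # Higher frequency pairs are down-weighted by 1 / (1 + fx * fy).
    coeffs = (1 + freqs_x * freqs_y) ** -1  # (1, F, F)
    dct_x = torch.cos(pos_x * freqs_x * torch.pi)  # (P*P, F, 1)
    dct_y = torch.cos(pos_y * freqs_y * torch.pi)  # (P*P, 1, F)
    return (dct_x * dct_y * coeffs).reshape(1, -1, max_freqs ** 2)  # (1, P*P, F*F)


patch_size, max_freqs, in_channels = 16, 8, 3
enc = dct_basis_encoding(patch_size, max_freqs)

# ASSUMPTION: the encoding is concatenated with the noisy pixel values along
# the channel axis before the linear layer (inferred from its input width).
pixels = torch.randn(2, patch_size ** 2, in_channels)
features = torch.cat([pixels, enc.expand(2, -1, -1)], dim=-1)
embedder = torch.nn.Linear(in_channels + max_freqs ** 2, 32, bias=True)
out = embedder(features)
```

If this reading is right, each pixel in the C2I setting receives `max_freqs ** 2` real-valued cosine features that depend only on its position within the patch, which is why I would like to understand how the RoPE-based `fetch_pos` in the T2I setting fits the same `in_channels+max_freqs**2` input layer.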
Could you please clarify why these two tasks adopt different strategies (RoPE vs. DCT-Basis) for encoding noisy pixel values? Was this difference intentional for architectural or performance reasons?
Thanks a lot for your time and for sharing this interesting work! 🙏