Summary
I'm studying the Codebook Matching architecture and noticed what appears to be a discrepancy between the paper description and the code implementation regarding the Encoder's input. I would greatly appreciate clarification on this design choice.
Paper Description
In the SIGGRAPH 2024 paper "Categorical Codebook Matching for Embodied Character Controllers," Section 3 states:
"To learn such setup in a supervised manner, we propose a technique that we call Codebook Matching which enforces similarity between both latent probability distributions Z_X and Z_Y. Since our model takes both X and Y as input during training, simply concatenating both inputs would cause X to have little or no impact during inference due to the identity mapping between Y when reconstructing the outputs. Instead, in our setup the inputs are given to a separate encoder block that only learns to sample from the motion manifold and which is formed only between the outputs."
The phrase "formed only between the outputs" seems to suggest that the Encoder should only take Y (outputs) as input, learning a manifold structure based solely on the output space.
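To make my reading of this passage concrete, here is a minimal sketch of how I initially pictured the setup. All module and variable names below are my own (hypothetical), and the matching term is only a placeholder for whichever loss the paper actually uses; the point is simply that the Encoder would see Y alone while a separate estimator branch sees X alone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of my literal reading of the paper, NOT the repository's code:
# the Encoder forms the codebook distribution Z_Y from the outputs Y only,
# and a separate estimator predicts Z_X from the inputs X only.
class CodebookMatchingSketch(nn.Module):
    def __init__(self, input_dim, output_dim, hidden_dim, codebook_size):
        super().__init__()
        self.encoder = nn.Sequential(          # consumes Y only
            nn.Linear(output_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, codebook_size),
        )
        self.estimator = nn.Sequential(        # consumes X only
            nn.Linear(input_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, codebook_size),
        )

    def forward(self, x, y):
        z_y = self.encoder(y)    # logits over codebook entries, from outputs
        z_x = self.estimator(x)  # logits over codebook entries, from inputs
        # "Codebook matching": push the two categorical distributions together,
        # sketched here as a KL term (placeholder for the paper's actual loss).
        matching_loss = F.kl_div(
            F.log_softmax(z_x, dim=-1),
            F.softmax(z_y, dim=-1),
            reduction="batchmean",
        )
        return z_x, z_y, matching_loss

# Toy usage with made-up dimensions:
model = CodebookMatchingSketch(input_dim=12, output_dim=24, hidden_dim=128, codebook_size=64)
x, y = torch.randn(8, 12), torch.randn(8, 24)
z_x, z_y, loss = model(x, y)
```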
Code Implementation
However, in the code implementation, the Encoder takes the concatenation of both Y and X as input:
File: `PyTorch/Models/CodebookMatching/Network.py`

Line 79:

```python
#Encode Y
target_logits = self.Encoder(torch.cat((t,x), dim=1))
```

Line 149:

```python
encoder=modules.LinearEncoder(input_dim + output_dim, encoder_dim, encoder_dim, codebook_size, dropout),
```

This pattern is consistent across all implementations:

- `PyTorch/Models/CodebookMatching/Network.py` (line 79)
- `PyTorch/ToyExample/CodebookMatching.py` (line 76)
Questions
I would like to understand:
- Is this an intentional design choice? Does the Encoder need both X and Y to learn a conditional encoding (i.e., encoding "the pattern of Y given X") rather than just encoding Y's features?
- Is this an implementation detail not explicitly mentioned in the paper? Perhaps the paper focuses on the high-level dual-encoder architecture, while the specific input concatenation is left as an implementation detail.
- Am I misunderstanding the paper's description? Does "formed only between the outputs" refer to the motion manifold's structure being determined by Y, while the Encoder still needs X to locate positions within that manifold? (A small code sketch contrasting the two encoder inputs follows this list.)
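To make the last question concrete, here is a toy contrast of the two readings, using made-up dimensions and plain linear layers as stand-ins for the repository's Encoder (the names below are illustrative only, not the actual modules):

```python
import torch
import torch.nn as nn

# Made-up dimensions, purely for illustration.
input_dim, output_dim, codebook_size = 12, 24, 64
x = torch.randn(8, input_dim)    # control / input features X
y = torch.randn(8, output_dim)   # pose / output features Y

# Reading that matches the code (Network.py line 79): the Encoder is
# conditioned on both Y and X, i.e. it encodes "Y given X".
encoder_yx = nn.Linear(output_dim + input_dim, codebook_size)
target_logits = encoder_yx(torch.cat((y, x), dim=1))

# My literal reading of "formed only between the outputs": the Encoder
# sees Y alone, and only the other branch ever sees X.
encoder_y = nn.Linear(output_dim, codebook_size)
target_logits_alt = encoder_y(y)
```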
Appreciation
Thank you for open-sourcing this excellent work! The codebase is well-structured and the demos are impressive. I'm learning a lot from studying the implementation, and any clarification on this architectural detail would be very helpful for my understanding.
Best regards