Summary
I'm studying the Codebook Matching architecture and noticed what appears to be a discrepancy between the paper description and the code implementation regarding the Encoder's input. I would greatly appreciate clarification on this design choice.
Paper Description
In the SIGGRAPH 2024 paper "Categorical Codebook Matching for Embodied Character Controllers," Section 3 states:
"To learn such setup in a supervised manner, we propose a technique that we call Codebook Matching which enforces similarity between both latent probability distributions Z_X and Z_Y. Since our model takes both X and Y as input during training, simply concatenating both inputs would cause X to have little or no impact during inference due to the identity mapping between Y when reconstructing the outputs. Instead, in our setup the inputs are given to a separate encoder block that only learns to sample from the motion manifold and which is formed only between the outputs."
The phrase "formed only between the outputs" seems to suggest that the Encoder should only take Y (outputs) as input, learning a manifold structure based solely on the output space.
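To make my reading of this passage concrete, here is a minimal sketch of how I initially pictured the setup. All module and variable names below are my own (hypothetical), and the matching term is only a placeholder for whichever loss the paper actually uses; the point is simply that the Encoder would see Y alone while a separate estimator branch sees X alone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of my literal reading of the paper, NOT the repository's code:
# the Encoder forms the codebook distribution Z_Y from the outputs Y only,
# and a separate estimator predicts Z_X from the inputs X only.
class CodebookMatchingSketch(nn.Module):
    def __init__(self, input_dim, output_dim, hidden_dim, codebook_size):
        super().__init__()
        self.encoder = nn.Sequential(          # consumes Y only
            nn.Linear(output_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, codebook_size),
        )
        self.estimator = nn.Sequential(        # consumes X only
            nn.Linear(input_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, codebook_size),
        )

    def forward(self, x, y):
        z_y = self.encoder(y)    # logits over codebook entries, from outputs
        z_x = self.estimator(x)  # logits over codebook entries, from inputs
        # "Codebook matching": push the two categorical distributions together,
        # sketched here as a KL term (placeholder for the paper's actual loss).
        matching_loss = F.kl_div(
            F.log_softmax(z_x, dim=-1),
            F.softmax(z_y, dim=-1),
            reduction="batchmean",
        )
        return z_x, z_y, matching_loss

# Toy usage with made-up dimensions:
model = CodebookMatchingSketch(input_dim=12, output_dim=24, hidden_dim=128, codebook_size=64)
x, y = torch.randn(8, 12), torch.randn(8, 24)
z_x, z_y, loss = model(x, y)
```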
Code Implementation
However, in the code implementation, the Encoder takes the concatenation of both Y and X as input:
File: `PyTorch/Models/CodebookMatching/Network.py`

Line 79:

```python
#Encode Y
target_logits = self.Encoder(torch.cat((t,x), dim=1))
```

Line 149:

```python
encoder=modules.LinearEncoder(input_dim + output_dim, encoder_dim, encoder_dim, codebook_size, dropout),
```

This pattern is consistent across all implementations:

- `PyTorch/Models/CodebookMatching/Network.py` (line 79)
- `PyTorch/ToyExample/CodebookMatching.py` (line 76)
Questions
I would like to understand:
- Is this an intentional design choice? Does the Encoder need both X and Y to learn a conditional encoding (i.e., encoding "the pattern of Y given X") rather than just encoding Y's features?
- Is this an implementation detail not explicitly mentioned in the paper? Perhaps the paper focuses on the high-level dual-encoder architecture, while the specific input concatenation is left as an implementation detail.
- Am I misunderstanding the paper's description? Does "formed only between the outputs" refer to the motion manifold's structure being determined by Y, while the Encoder still needs X to locate positions within that manifold? (A small code sketch contrasting the two encoder inputs follows this list.)
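To make the last question concrete, here is a toy contrast of the two readings, using made-up dimensions and plain linear layers as stand-ins for the repository's Encoder (the names below are illustrative only, not the actual modules):

```python
import torch
import torch.nn as nn

# Made-up dimensions, purely for illustration.
input_dim, output_dim, codebook_size = 12, 24, 64
x = torch.randn(8, input_dim)    # control / input features X
y = torch.randn(8, output_dim)   # pose / output features Y

# Reading that matches the code (Network.py line 79): the Encoder is
# conditioned on both Y and X, i.e. it encodes "Y given X".
encoder_yx = nn.Linear(output_dim + input_dim, codebook_size)
target_logits = encoder_yx(torch.cat((y, x), dim=1))

# My literal reading of "formed only between the outputs": the Encoder
# sees Y alone, and only the other branch ever sees X.
encoder_y = nn.Linear(output_dim, codebook_size)
target_logits_alt = encoder_y(y)
```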
Appreciation
Thank you for open-sourcing this excellent work! The codebase is well-structured and the demos are impressive. I'm learning a lot from studying the implementation, and any clarification on this architectural detail would be very helpful for my understanding.
Best regards