
Model Modification for Autoregressive Usage #12


Open · wants to merge 7 commits into main

Conversation

@RongLirr (Collaborator) commented May 4, 2025

This pull request modifies the training pipeline to support autoregressive generation of fluent sign language poses, conditioned on the whole disfluent sequence and the previously generated fluent history.

1. data/load_data.py (SignLanguagePoseDataset):

  • Modified __getitem__ to enable autoregressive training. Instead of returning a fixed initial segment of the fluent sequence, it now randomly samples a target chunk (length chunk_len) from the ground-truth fluent sequence, extracts the ground-truth fluent pose history preceding that chunk, and returns the history as conditions['previous_output'].

  • The full disfluent sequence remains as conditions['input_sequence'].

  • Replaced the custom global mean/std calculation with pose_anonymization.data.normalization.normalize_mean_std. Data is now normalized by calling this function on the Pose objects after loading.

  • Ensured the sampled target_chunk is always padded with zeros to the fixed chunk_len within __getitem__ if the sampled segment is shorter (e.g., at the end of a sequence or for short sequences); the corresponding target_mask is padded with True (masked). A sketch of the sampling-and-padding logic follows this list.

  • Parameter renaming: the fluent_frames parameter in __init__ is now internally referred to as chunk_len to better reflect its role in the autoregressive setup.
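
A minimal sketch of that sampling-and-padding logic. The function name sample_chunk and the (T, K, D) array layout are illustrative assumptions; the real logic lives inside SignLanguagePoseDataset.__getitem__:

```python
import numpy as np

def sample_chunk(fluent_data: np.ndarray, chunk_len: int):
    """fluent_data: (T, K, D) ground-truth fluent sequence."""
    total_len = len(fluent_data)
    start = np.random.randint(0, total_len)             # random chunk start
    target_chunk = fluent_data[start:start + chunk_len]
    history = fluent_data[:start]                       # becomes conditions['previous_output']

    # Zero-pad the target up to the fixed chunk_len; padded frames are
    # marked True (masked) in the target mask.
    pad = chunk_len - len(target_chunk)
    target_mask = np.zeros(chunk_len, dtype=bool)
    if pad > 0:
        target_chunk = np.concatenate(
            [target_chunk, np.zeros((pad,) + target_chunk.shape[1:])], axis=0)
        target_mask[-pad:] = True
    return target_chunk, target_mask, history
```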

@RongLirr (Collaborator, Author) commented May 4, 2025

2. core/models.py (SignLanguagePoseDiffusion):

  • Autoregressive Input: Modified the forward method signature to accept previous_output: Optional[torch.Tensor].

  • Updated the forward method's implementation: it encodes previous_output with self.fluent_encoder, then concatenates that embedding with the embeddings of the timestep (t), the disfluent sequence (disfluent_seq), and the noisy target chunk (fluent_clip / x) along the sequence dimension (dim=1) before feeding the result into the sequence_encoder, as sketched below.
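
A rough sketch of that concatenation. Only fluent_encoder, sequence_encoder, previous_output, and the dim=1 concatenation come from the description above; the stand-in encoders (disfluent_encoder, embed_timestep), the shared width, and the transformer stack are placeholder assumptions:

```python
import torch
import torch.nn as nn

class SignLanguagePoseDiffusionSketch(nn.Module):
    def __init__(self, pose_dim: int, c: int = 256):
        super().__init__()
        # Stand-in encoders: simple projections to a shared width C.
        self.fluent_encoder = nn.Linear(pose_dim, c)
        self.disfluent_encoder = nn.Linear(pose_dim, c)
        self.embed_timestep = nn.Embedding(1000, c)
        self.sequence_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=c, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, fluent_clip, t, disfluent_seq, previous_output=None):
        t_emb = self.embed_timestep(t).unsqueeze(1)            # (B, 1, C)
        disfluent_emb = self.disfluent_encoder(disfluent_seq)  # (B, L_d, C)
        parts = [t_emb, disfluent_emb]
        if previous_output is not None:
            # Fluent history encoded with the same fluent_encoder.
            parts.append(self.fluent_encoder(previous_output))
        parts.append(self.fluent_encoder(fluent_clip))         # noisy target chunk
        # Concatenate along the sequence dimension (dim=1).
        return self.sequence_encoder(torch.cat(parts, dim=1))
```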

3. core/training.py (PoseTrainingPortal):

  • Mask handling for the loss: the L2 loss is computed with a masked_l2_per_sample function so that padded (masked) frames do not contribute to the loss; see the sketch after this list.

  • Unnormalization:
    • evaluate_sampling now passes the normalized NumPy arrays to export_samples.
    • export_samples now uses the imported unnormalize_mean_std function on temporary Pose objects to unnormalize the data before saving .pose files.
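
A plausible shape for the masked loss, assuming (B, T, K, D) tensors and a per-frame mask where True marks padding; the actual masked_l2_per_sample in core/training.py may differ:

```python
import torch

def masked_l2_per_sample(pred: torch.Tensor,
                         target: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """pred/target: (B, T, K, D); mask: (B, T), True = padded frame."""
    valid = (~mask).float()                           # 1.0 where frames are real
    sq_err = ((pred - target) ** 2).sum(dim=(2, 3))   # (B, T) squared error per frame
    # Average over valid frames only, guarding against all-masked samples.
    return (sq_err * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1)
```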

@RongLirr (Collaborator, Author) commented May 7, 2025

  1. /infer_autoregressive.py

    • It iteratively generates chunks of the fluent pose sequence, feeding the previously generated chunk as a condition for the next; an illustrative version of this loop is sketched after the list.
    • Generation stops when the model predicts a "stop" signal (near-zero pose values) or a maximum length is reached.
    • Saves both the generated fluent poses and the original disfluent input poses in .pose and .npy formats.
  2. /data/load_data.py

    • In SignLanguagePoseDataset.__getitem__:
      • Modified the handling of previous_output (history_chunk). When the actual history length is zero (e.g., for the first chunk of a sequence), previous_output is now initialized as a single frame of zeros (np.zeros((1,) + ...)).
      • Reason: This change ensures that the pose_format.torch.masked.collator.zero_pad_collator works correctly. Previously, a mix of zero-length tensors (shape [0, K, D]) and very short tensors (e.g., shape [1, K, D]) for previous_output within the same batch could cause a RuntimeError during torch.stack inside the collator.
  3. /core/training.py

    • In PoseTrainingPortal.export_samples:
      • Added a call to normalize_pose_size(unnorm_pose) immediately after unnorm_pose = unnormalize_mean_std(pose_obj).
      • Reason: The unnormalize_mean_std function reverses the Z-score (mean/std) normalization; however, for valid visualization the pose must also be rescaled with normalize_pose_size (from pose_format.utils.generic). See the export sketch after this list.
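
An illustrative outline of the generation loop in /infer_autoregressive.py (item 1). The model.sample call, the stop threshold, and the single-batch shapes are assumptions, not the script's actual API:

```python
import torch

@torch.no_grad()
def generate(model, disfluent_seq, chunk_len, max_len=500, stop_eps=1e-4):
    """disfluent_seq: (1, T, K, D). Returns the generated fluent sequence."""
    history = torch.zeros(1, 1, *disfluent_seq.shape[2:])   # single zero frame
    chunks = []
    while sum(c.shape[1] for c in chunks) < max_len:
        chunk = model.sample(input_sequence=disfluent_seq,   # whole disfluent input
                             previous_output=history,        # fluent history so far
                             length=chunk_len)
        # Stop when the model emits a near-zero "stop" chunk.
        if chunk.abs().mean().item() < stop_eps:
            break
        chunks.append(chunk)
        history = torch.cat([history, chunk], dim=1)
    return torch.cat(chunks, dim=1) if chunks else history[:, 1:]
```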
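And a sketch of the export path from item 3, wrapping a normalized array in a temporary Pose before unnormalizing; the fps value and all-ones confidence are placeholder assumptions:

```python
import numpy as np
from pose_format import Pose
from pose_format.numpy import NumPyPoseBody
from pose_format.utils.generic import normalize_pose_size
from pose_anonymization.data.normalization import unnormalize_mean_std

def export_pose(norm_array: np.ndarray, header, out_path: str, fps: float = 25.0):
    """norm_array: normalized (frames, people, points, dims) array."""
    confidence = np.ones(norm_array.shape[:-1])      # placeholder confidence
    pose_obj = Pose(header, NumPyPoseBody(fps=fps, data=norm_array,
                                          confidence=confidence))
    unnorm_pose = unnormalize_mean_std(pose_obj)     # reverse Z-score normalization
    normalize_pose_size(unnorm_pose)                 # rescale for valid visualization
    with open(out_path, "wb") as f:
        unnorm_pose.write(f)
```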

@RongLirr (Collaborator, Author) commented May 8, 2025

1. config/option.py

Added parser.add_argument entries for '--lambda_vel' and '--load_num'.

2. training.py

Introduced a weight (lambda_vel) for the velocity loss term; a sketch of both changes follows.
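
A sketch of both pieces together; the argument defaults and the first-order-difference form of the velocity loss are assumptions:

```python
import argparse
import torch

# config/option.py: the two new flags (defaults here are guesses).
parser = argparse.ArgumentParser()
parser.add_argument('--lambda_vel', type=float, default=1.0,
                    help='weight on the velocity loss term')
parser.add_argument('--load_num', type=str, default='latest',
                    help='which checkpoint to load')

# training.py: position loss plus weighted first-order velocity loss.
def weighted_loss(pred: torch.Tensor, target: torch.Tensor, lambda_vel: float):
    pos_loss = ((pred - target) ** 2).mean()
    vel_pred = pred[:, 1:] - pred[:, :-1]        # frame-to-frame velocity
    vel_target = target[:, 1:] - target[:, :-1]
    vel_loss = ((vel_pred - vel_target) ** 2).mean()
    return pos_loss + lambda_vel * vel_loss
```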
