[Stage 2] Difficulty Reproducing Text-to-Motion FID Scores on A6000 (12.81 vs 11.79) - Hardware/Precision Inquiry #41

Description

@bring-nirachornkul

Dear Authors,

Thank you for your impressive work on MotionStreamer. We are currently working on reproducing the results for the Stage 2 (Text-to-Motion) model using your official codebase and instructions, but we are observing a consistent gap in the FID score compared to the paper.

1. Experimental Setup

We adhered strictly to the provided training scripts and configurations:

Codebase: Original train_t2m.py (Stage 2).

Stage 1 Checkpoint: We used the official pre-trained causal_TAE checkpoints provided in the repository.

Hardware: 8x NVIDIA A6000 GPUs.

Hyperparameters: Batch size 32 per GPU (Global Batch Size = 256), Learning Rate 1e-4, 100k Iterations.

2. Results Comparison

While the downloaded checkpoints match the paper's numbers almost exactly, our reproduction from scratch consistently lags behind by ~1.0 FID:

| Model Source | FID ↓ | R@1 ↑ | MM-D (Real) ↓ |
| --- | --- | --- | --- |
| Paper Reported | 11.79 | 0.631 | 15.15 |
| Downloaded Checkpoint | 11.80 | 0.631 | 15.15 |
| Our Reproduction (A6000) | 12.81 | 0.635 | 15.15 |
3. Questions

Since we are using the exact same code and global batch size, we suspect this might be related to hardware precision differences (e.g., the A6000 defaults to TF32, whereas older GPUs might use true FP32).

Hardware: Could you please share which specific GPUs (and how many) were used for the final experiments reported in the paper?

Precision: Did you explicitly disable TF32 (allow_tf32 = False) or use specific mixed-precision settings that might not be default in the current script?
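For reference, here is a minimal sketch of what we plan to test on our side. It uses only PyTorch's standard TF32 switches; placing it at the top of train_t2m.py before model construction is our assumption, not something from your scripts:

```python
import torch

# Force true FP32 matmuls/convolutions instead of TF32 on Ampere GPUs
# such as the A6000. Both flags are standard PyTorch backend switches.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Equivalent high-level control (PyTorch >= 1.12):
# "highest" = always FP32; "high" would permit TF32 for fp32 matmuls.
torch.set_float32_matmul_precision("highest")

# Sanity check: print the active precision settings before training.
print("matmul allow_tf32:", torch.backends.cuda.matmul.allow_tf32)
print("cudnn  allow_tf32:", torch.backends.cudnn.allow_tf32)
```

If you used a different mechanism (e.g., AMP/bf16 autocast), we would be happy to try matching that instead.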

Any insights on closing this gap would be greatly appreciated.

Thank you!

Bring Nirachornkul
