Dear Authors,
Thank you for your impressive work on MotionStreamer. We are currently working on reproducing the results for the Stage 2 (Text-to-Motion) model using your official codebase and instructions, but we are observing a consistent gap in the FID score compared to the paper.
**Experimental Setup**

We adhered strictly to the provided training scripts and configurations:
- Codebase: the original `train_t2m.py` (Stage 2).
- Stage 1 checkpoint: the official pre-trained `casual_TAE` checkpoints provided in the repository.
- Hardware: 8× NVIDIA A6000 GPUs.
- Hyperparameters: batch size 32 per GPU (global batch size 256), learning rate 1e-4, 100k iterations.
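As a hardware-side sanity check, the precision-related defaults can be inspected with standard PyTorch introspection calls. This is a minimal sketch; nothing in it is specific to `train_t2m.py`:

```python
# Standalone probe of precision-related defaults (standard PyTorch APIs only).
import torch

print(torch.__version__)                      # PyTorch build
print(torch.cuda.get_device_name(0))          # e.g. "NVIDIA RTX A6000"
print(torch.backends.cudnn.version())         # cuDNN build
print(torch.backends.cuda.matmul.allow_tf32)  # TF32 enabled for matmuls?
print(torch.backends.cudnn.allow_tf32)        # TF32 enabled for cuDNN convs?
print(torch.get_float32_matmul_precision())   # "highest" / "high" / "medium"
```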
**Results Comparison**

While the downloaded checkpoints match the paper perfectly, our reproduction from scratch consistently lags behind by ~1.0 FID.
| Model Source | FID ↓ | R@1 ↑ | MM-D (Real) ↓ |
|---|---|---|---|
| Paper Reported | 11.79 | 0.631 | 15.15 |
| Downloaded Checkpoint | 11.80 | 0.631 | 15.15 |
| Our Reproduction (A6000) | 12.81 | 0.635 | 15.15 |
**Questions**

Since we are using the exact same code and global batch size, we suspect the gap might be related to hardware precision differences (e.g., TF32 being enabled by default on the A6000, whereas older GPUs use true FP32).
1. Hardware: Could you please share which specific GPUs (and how many) were used for the final experiments reported in the paper?
2. Precision: Did you explicitly disable TF32 (`allow_tf32 = False`) or use specific mixed-precision settings that are not defaults in the current script? (See the sketch below for what we have in mind.)
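For reference, this is a minimal sketch of the precision ablation we have in mind. The toggles are standard PyTorch settings, not anything we found in `train_t2m.py`, so please correct us if the training script already handles precision explicitly:

```python
# Pin matmul/convolution math to true FP32 before training starts.
# (An assumption on our side; we did not spot these settings in train_t2m.py.)
import torch

# On Ampere GPUs such as the A6000, TF32 may be active depending on the
# PyTorch version: cuDNN convolutions default to allow_tf32=True, and
# matmuls also defaulted to True before PyTorch 1.12.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Equivalent high-level knob for matmuls on PyTorch >= 1.12:
torch.set_float32_matmul_precision("highest")  # "highest" = strict FP32
```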
Any insights on closing this gap would be greatly appreciated.
Thank you!
Bring Nirachornkul