[Stage 2] Difficulty Reproducing Text-to-Motion FID Scores on A6000 (12.81 vs 11.79) - Hardware/Precision Inquiry #41

Description

@bring-nirachornkul

Dear Authors,

Thank you for your impressive work on MotionStreamer. We are currently working on reproducing the results for the Stage 2 (Text-to-Motion) model using your official codebase and instructions, but we are observing a consistent gap in the FID score compared to the paper.

1. Experimental Setup

We adhered strictly to the provided training scripts and configurations:

Codebase: Original train_t2m.py (Stage 2).

Stage 1 Checkpoint: We used the official pre-trained causal_TAE checkpoints provided in the repository.

Hardware: 8x NVIDIA A6000 GPUs.

Hyperparameters: Batch size 32 per GPU (Global Batch Size = 256), Learning Rate 1e-4, 100k Iterations.

2. Results Comparison

While the downloaded checkpoints match the paper's numbers almost exactly, our reproduction from scratch consistently lags behind by ~1.0 FID:

| Model Source | FID ↓ | R@1 ↑ | MM-D (Real) ↓ |
| --- | --- | --- | --- |
| Paper Reported | 11.79 | 0.631 | 15.15 |
| Downloaded Checkpoint | 11.80 | 0.631 | 15.15 |
| Our Reproduction (A6000) | 12.81 | 0.635 | 15.15 |
3. Questions

Since we are using the exact same code and global batch size, we suspect this might be related to hardware precision differences (e.g., the A6000 defaults to TF32, whereas older GPUs might use true FP32).

Hardware: Could you please share which specific GPUs (and how many) were used for the final experiments reported in the paper?

Precision: Did you explicitly disable TF32 (allow_tf32 = False) or use specific mixed-precision settings that might not be default in the current script?
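For reference, here is a minimal sketch of what we plan to test on our side. It uses only PyTorch's standard TF32 switches; placing it at the top of train_t2m.py before model construction is our assumption, not something from your scripts:

```python
import torch

# Force true FP32 matmuls/convolutions instead of TF32 on Ampere GPUs
# such as the A6000. Both flags are standard PyTorch backend switches.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Equivalent high-level control (PyTorch >= 1.12):
# "highest" = always FP32; "high" would permit TF32 for fp32 matmuls.
torch.set_float32_matmul_precision("highest")

# Sanity check: print the active precision settings before training.
print("matmul allow_tf32:", torch.backends.cuda.matmul.allow_tf32)
print("cudnn  allow_tf32:", torch.backends.cudnn.allow_tf32)
```

If you used a different mechanism (e.g., AMP/bf16 autocast), we would be happy to try matching that instead.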

Any insights on closing this gap would be greatly appreciated.

Thank you!

Bring Nirachornkul
