Hi, thank you for your great work on FoundationStereo — the results are very impressive, especially the zero-shot generalization across challenging real-world datasets.
While reading the paper and using the released model, I had a few clarifying questions:
⸻
- Training datasets used for the zero-shot foundation model
In Section 4.1, it is mentioned that the foundation model was trained on:
“a mixed dataset consisting of our proposed FSD, together with Scene Flow, Sintel, CREStereo, FallingThings, InStereo2K, and Virtual KITTI 2.”
However, Table 1 also summarizes and compares additional datasets such as TartanAir and IRS. Could you please confirm:
• Were TartanAir or IRS used in any version of the model training?
• If not, is there a reason for excluding them (e.g., limited benefit, domain mismatch, or data quality concerns)?
⸻
- Released model: is it the zero-shot foundation model in the paper?
You have kindly released a pretrained FoundationStereo model. Could you please clarify:
• Does this released checkpoint correspond to the zero-shot foundation model described in the paper?
• Was it trained only on the datasets listed in Section 4.1, without using any of the evaluation/test sets (Middlebury, ETH3D, KITTI 2012/2015)?
⸻
- Table 2 results: which training data was used?
In Table 2, zero-shot generalization results are reported across four datasets (Middlebury, ETH3D, KITTI-12, KITTI-15).
• Can you confirm that the results in the second block of Table 2 (i.e., your strongest results) are based only on training with the datasets listed in Section 4.1, excluding any test-domain-specific data?
This clarification would help ensure reproducibility and give confidence to others using or fine-tuning the released model.