Questions on Compositionality in SEEM
Hi,
Thanks for the incredible work and for releasing the code! We are deeply interested in the compositionality of the model and have a few questions regarding its implementation and evaluation.
1. Visual & Text Prompts in Training
In the paper, it is mentioned that visual and text prompts were not trained together:
“In particular, the visual and textual prompts can be simply concatenated and fed to SEEM-Decoder, even though it was never trained in this way.”
However, upon inspecting the forward_seg function, it seems that both prompts are actually used during training. Could you clarify this discrepancy?
2. Output Index Selection for Visual vs. Text Prompts
The paper also states:
“Considering that visual prompts Pv come from image features while textual prompts Pt come from the text encoder, we select matched output indices for visual and textual prompts by matching them with the mask embeddings Om or class embeddings Oc, respectively.”
Equation 5 indicates that, for mask embeddings, output indices are selected by matching against the visual prompts. In the code, however, the text embeddings Pt from the text encoder are used to select the mask; see, for example, evaluate_grounding.
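To make the question concrete, here is a minimal sketch of the two matching behaviours being compared. It is illustrative only: `match_index`, the toy embeddings, and the use of cosine similarity are my assumptions, not code taken from the SEEM repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_index(prompt, output_embeddings):
    """Hypothetical helper: index of the output embedding most similar to the prompt."""
    sims = [cosine(prompt, emb) for emb in output_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy per-query outputs (assumed values; three queries, 2-D embeddings).
mask_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # stands in for Om
class_embeddings = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]  # stands in for Oc

text_prompt = [0.9, 0.1]  # stands in for Pt

# As I read the paper: a text prompt is matched against class embeddings.
idx_via_class = match_index(text_prompt, class_embeddings)

# As I read evaluate_grounding: the text prompt ends up selecting the mask.
idx_via_mask = match_index(text_prompt, mask_embeddings)

print(idx_via_class, idx_via_mask)
```

In this toy setup the two strategies pick different queries, which is why it matters which one the released code is meant to follow.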
3. Evaluation: grounding vs grounding_spatial
When evaluating the model, I noticed a significant performance drop when switching from grounding to grounding_spatial.
- Grounding only
- Grounding + Spatial
The only change I made was overriding the eval_type in the pipeline file:
XDecoderPipeline.py#L128
eval_type = "grounding_spatial"
From my understanding, no additional modifications should be required, since in the build method the evaluator for grounding_spatial and grounding_refcoco is the same. Could you confirm if this is indeed the case, or if extra steps are needed for proper evaluation?
✅ Any clarification on these points would be very helpful!