Worse performance with Evaluator type "grounding_spatial" #166

# Questions on Compositionality in SEEM  

Hi,  

Thanks for the incredible work and for releasing the code! We are deeply interested in the **compositionality** of the model and have a couple of questions regarding its implementation and evaluation.  

---

### 1. Visual & Text Prompts in Training  

In the paper, it is mentioned that **visual and text prompts were not trained together**:  

> *“In particular, the visual and textual prompts can be simply concatenated and fed to SEEM-Decoder, even though it was never trained in this way.”*  

However, upon inspecting the [`forward_seg`](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once/blob/7b2e76dbb17d0b7831c6813a921fe2bc8de22926/modeling/architectures/seem_model_v1.py#L353C1-L361C1) function, it seems that **both prompts are actually used during training**. Could you clarify this discrepancy?  
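To make sure we are reading the quote correctly: by "concatenated" we understand joining the two prompt sets along the token axis before they reach the decoder. Below is a minimal NumPy sketch of that reading. The helper `concat_prompts`, the `(num_tokens, batch, dim)` layout, and the 512-dimensional embedding are our assumptions for illustration and are not taken from the repository.

```python
import numpy as np

def concat_prompts(visual_prompts, text_prompts):
    """Join visual and textual prompt tokens along the token axis.

    Both inputs are assumed to be (num_tokens, batch, dim) arrays.
    This is a hypothetical helper, not a function from the SEEM code base.
    """
    if visual_prompts.shape[1:] != text_prompts.shape[1:]:
        raise ValueError("batch and embedding dimensions must match")
    return np.concatenate([visual_prompts, text_prompts], axis=0)

p_v = np.zeros((3, 2, 512))  # 3 visual prompt tokens (toy values)
p_t = np.ones((5, 2, 512))   # 5 text prompt tokens (toy values)
prompts = concat_prompts(p_v, p_t)
print(prompts.shape)  # (8, 2, 512)
```

If `forward_seg` does something equivalent to this during training, that is the part that seems to conflict with the quoted sentence.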

---
### 2. Output Index Selection for Visual vs. Text Prompts

> Considering that visual prompts Pv come from image features while textual prompts Pt come from the text encoder, we select matched output indices for visual and textual prompts by matching them with the mask embeddings Om or class embeddings Oc, respectively.

Equation 5 states that the output indices for the mask embeddings are selected by matching against the visual prompts. In the code, however, the text embeddings Pt from the text encoder appear to be used to select the mask; see, for example, [`evaluate_grounding`](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once/blob/7b2e76dbb17d0b7831c6813a921fe2bc8de22926/modeling/architectures/seem_model_v1.py#L750C1-L754C72). Could you clarify which behavior is intended?
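To make the question concrete, this is how we read the matching step in the quoted passage: each prompt picks the output query whose embedding is most similar to it, with visual prompts compared against mask embeddings and text prompts against class embeddings. The sketch below is only our interpretation. The function `select_output_index`, the cosine score, and the toy embeddings are assumptions, not code from the repository.

```python
import numpy as np

def select_output_index(prompt, outputs):
    """Return the index of the output embedding most similar to the prompt.

    prompt:  (dim,) a single prompt embedding
    outputs: (num_queries, dim) mask or class embeddings
    Cosine similarity is an assumption; the paper's Eq. 5 may use another score.
    """
    prompt = prompt / np.linalg.norm(prompt)
    outputs = outputs / np.linalg.norm(outputs, axis=1, keepdims=True)
    return int(np.argmax(outputs @ prompt))

o_m = np.eye(4, 8)        # stand-in for mask embeddings (4 queries)
o_c = np.eye(4, 8)[::-1]  # stand-in for class embeddings (4 queries)
p_v = o_m[2] + 0.1        # toy visual prompt, closest to mask query 2
p_t = o_c[1] + 0.1        # toy text prompt, closest to class query 1

idx_visual = select_output_index(p_v, o_m)  # visual prompt vs. mask embeddings
idx_text = select_output_index(p_t, o_c)    # text prompt vs. class embeddings
print(idx_visual, idx_text)
```

Under this reading, a text prompt would never be compared against the mask embeddings, which is why the linked lines in `evaluate_grounding` surprised us.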



  
---
### 3. Evaluation: `grounding` vs `grounding_spatial`  

When evaluating the model, I noticed a **significant performance drop** when switching from `grounding` to `grounding_spatial`.  

- **Grounding only**  
  ```
  mIoU: 90.9
  mDice: 94.9
  ```
- **Grounding + Spatial**  
  ```
  mIoU: 40.1
  mDice: 43.8
  ```
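For reference, these are the per-sample definitions I am assuming behind the numbers above. The repository's evaluator may threshold or average differently, so treat `iou_and_dice` as a hypothetical helper rather than the actual metric code. Scoring two empty masks as 100 is also my own convention.

```python
import numpy as np

def iou_and_dice(pred, target):
    """Return (IoU, Dice) in percent for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Assumed convention: two empty masks count as a perfect match.
    iou = 100.0 * inter / union if union else 100.0
    dice = 100.0 * 2 * inter / total if total else 100.0
    return float(iou), float(dice)

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 1, 1], [0, 0, 0]])
iou, dice = iou_and_dice(pred, target)  # intersection 2, union 4, total 6
print(f"IoU: {iou:.1f}, Dice: {dice:.1f}")  # IoU: 50.0, Dice: 66.7
```

mIoU and mDice would then be the means of these values over all evaluated samples.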

The **only change** I made was overriding the `eval_type` in the pipeline file:  

[`XDecoderPipeline.py#L128`](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once/blob/7b2e76dbb17d0b7831c6813a921fe2bc8de22926/pipeline/XDecoderPipeline.py#L128)  

```python
eval_type = "grounding_spatial"
```  

From my understanding, no additional modifications should be required, since in the `build` method the evaluator for **`grounding_spatial`** and **`grounding_refcoco`** is the same. Could you confirm if this is indeed the case, or if extra steps are needed for proper evaluation?  

---

✅ Any clarification on these points would be very helpful!  

