XM3600 Ablation: Train Connector Without XM3600 to Isolate Visual Diversity Contribution #21

Train a matched connector (same architecture, hyperparameters, and data volume) without XM3600 in the alignment mix, then compare CVQA scores against the XM3600-augmented model from Issue #17. This isolates whether culturally diverse image-caption pairs (visual domain diversity) improve performance on culturally grounded benchmarks, independent of multilingual text coverage.

### Context

- The bulk of visual training data (CC3M, COCO, LLaVA-Pretrain) is Western-centric. Translation diversifies text but not images, creating a structural mismatch with benchmarks like CVQA that evaluate on culturally diverse content from 31 language communities.
- XM3600 (Crossmodal-3600) provides image-caption pairs across 36 languages with geographically diverse imagery.
- This ablation addresses Gap 10 (visual domain diversity in alignment data). Results will be reported as a known limitation analysis rather than a claimed solution.
- The key comparison is the per-language CVQA breakdown, which shows which languages benefit most from XM3600 augmentation.

### Dependencies

- Completed training from Issue #17 (the XM3600-augmented model serves as the treatment).
- The same training pipeline and hyperparameters as Issue #17; the only change is that XM3600 is removed from the data mix.
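
One way to build the ablation mix while keeping data volume matched is sketched below. This is a minimal illustration, not the project's actual pipeline: the `drop_source` helper, the source keys, and the sample counts are all hypothetical, and rescaling the remaining sources proportionally is only one possible reading of "same data volume".

```python
def drop_source(mix, source, keep_total=True):
    """Return a copy of `mix` (source name -> sample count) without `source`.

    When keep_total is True, the remaining counts are scaled up
    proportionally so the total sample budget matches the original mix.
    """
    if source not in mix:
        raise KeyError(f"{source!r} is not in the data mix")

    rest = {name: count for name, count in mix.items() if name != source}
    if not keep_total:
        return rest

    total = sum(mix.values())
    rest_total = sum(rest.values())
    scaled = {name: round(count * total / rest_total) for name, count in rest.items()}

    # Rounding can leave the total off by a few samples; absorb the
    # difference in the largest source so the budget matches exactly.
    largest = max(scaled, key=scaled.get)
    scaled[largest] += total - sum(scaled.values())
    return scaled


# Hypothetical sample counts, for illustration only.
treatment_mix = {"cc3m": 1000, "coco": 500, "llava_pretrain": 400, "xm3600": 100}
ablation_mix = drop_source(treatment_mix, "xm3600")
print(ablation_mix)
```

Passing `keep_total=False` instead gives the plain "minus XM3600" mix, in which case the total number of training steps would need to be matched separately.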

## Acceptance Criteria

- [ ] A matched connector is trained on the alignment mix minus XM3600, with all other variables held constant (architecture, hyperparameters, total training steps, random seed if feasible).
- [ ] Both models (with and without XM3600) are evaluated on CVQA.
- [ ] Per-language CVQA scores are reported for both models, highlighting which languages show the largest delta.
- [ ] A brief analysis is written: does visual diversity help, and for which language/cultural clusters?
- [ ] Results are framed as a limitation analysis (known gap, not a solution).
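
The per-language comparison could be computed along the following lines. This is a sketch under stated assumptions: it assumes each evaluation yields a mapping from language code to CVQA accuracy in percent, `per_language_delta` is a hypothetical helper, and the scores shown are placeholders rather than measured results.

```python
def per_language_delta(with_xm3600, without_xm3600):
    """Compare per-language scores from the two models.

    Returns (language, with, without, delta) tuples sorted by delta,
    largest gain from XM3600 first. Only languages present in both
    score dicts are compared.
    """
    shared = set(with_xm3600) & set(without_xm3600)
    rows = [
        (lang, with_xm3600[lang], without_xm3600[lang],
         with_xm3600[lang] - without_xm3600[lang])
        for lang in shared
    ]
    # Sort by delta (descending), then by language code for stable output.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows


# Placeholder accuracies (percent), NOT real results.
treatment = {"sw": 50.0, "hi": 48.0, "es": 60.0}
ablation = {"sw": 45.0, "hi": 47.0, "es": 61.0}

for lang, with_score, without_score, delta in per_language_delta(treatment, ablation):
    print(f"{lang}: with={with_score:.1f} without={without_score:.1f} delta={delta:+.1f}")
```

With a single training run per condition, small deltas may fall within seed-to-seed noise, so the analysis should treat only the larger, consistent gaps as evidence.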

## Estimated Effort

2-3 days (one additional training run + CVQA evaluation + comparison write-up)
