Document Multimodal Training Language Set and Partition CVQA/Kaleidoscope for Transfer Analysis #23

Document the exact set of languages present in the multimodal training data (alignment plus any instruction tuning), then partition the languages in CVQA (31 languages) and Kaleidoscope (18 languages) into two groups: adapter-seen (languages present in the multimodal training mix) and adapter-unseen (languages supported by Tiny Aya Base but absent from multimodal training).

This partition enables Phase 3 reporting of delta_transfer, the accuracy gap between seen and unseen languages (seen minus unseen). delta_transfer tests whether the 3.35B multilingual backbone enables zero-shot visual transfer to the remaining 45+ unseen languages via English as a bridge.
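The partition rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the function name, the language codes, and the third `unsupported` bucket (for benchmark languages that Tiny Aya Base does not cover at all, which the two-group definition leaves open) are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set


def partition_languages(
    benchmark_langs: Iterable[str],
    training_langs: Set[str],
    backbone_langs: Set[str],
) -> Dict[str, List[str]]:
    """Split benchmark languages into adapter-seen / adapter-unseen groups."""
    groups: Dict[str, List[str]] = {"seen": [], "unseen": [], "unsupported": []}
    for lang in sorted(set(benchmark_langs)):
        if lang in training_langs:
            # Present in the multimodal training mix.
            groups["seen"].append(lang)
        elif lang in backbone_langs:
            # Supported by the backbone but absent from multimodal training.
            groups["unseen"].append(lang)
        else:
            # Assumption: languages outside the backbone's coverage are kept
            # apart so they do not distort the seen/unseen comparison.
            groups["unsupported"].append(lang)
    return groups


# Placeholder codes for illustration only, NOT the real language lists.
training = {"en", "hi", "es"}
backbone = {"en", "hi", "es", "sw", "ta"}
benchmark = ["hi", "sw", "es", "xx"]

result = partition_languages(benchmark, training, backbone)
print(result)
```

Running the same function once with the CVQA list and once with the Kaleidoscope list would give the two partitions this issue asks for.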

### Context

- Tiny Aya Base supports 70+ languages, but the multimodal adapter trains on only 20--25.
- The adapter-seen/unseen partition is a genuine research contribution (Gap 11), not just a reporting convenience. The result is informative either way: a small delta_transfer suggests strong zero-shot transfer, while a large delta_transfer quantifies the cost of a language's absence from multimodal training.
- This is a setup task for Phase 3 analysis. The actual delta_transfer evaluation happens in Phase 3, but the partition must be defined now while the training data composition is fresh.
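For reference, one way delta_transfer could be computed in Phase 3 once per-language accuracies exist is sketched below. It assumes an unweighted (macro) average over languages in each group and a seen-minus-unseen sign convention; neither choice is fixed by this issue, and the accuracy values are invented purely to exercise the function.

```python
from statistics import mean
from typing import Dict, Iterable


def delta_transfer(
    accuracy: Dict[str, float],
    seen: Iterable[str],
    unseen: Iterable[str],
) -> float:
    """Macro-averaged accuracy on seen languages minus that on unseen ones."""
    seen_scores = [accuracy[lang] for lang in seen if lang in accuracy]
    unseen_scores = [accuracy[lang] for lang in unseen if lang in accuracy]
    if not seen_scores or not unseen_scores:
        raise ValueError("both groups need at least one evaluated language")
    return mean(seen_scores) - mean(unseen_scores)


# Made-up accuracies for illustration; not real benchmark results.
acc = {"hi": 0.60, "es": 0.70, "sw": 0.50}
gap = delta_transfer(acc, seen=["hi", "es"], unseen=["sw"])
print(round(gap, 3))
```

A macro average treats every language equally regardless of how many benchmark questions it has; weighting by question count is an equally defensible alternative that should be stated alongside the reported number.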

### Dependencies

- Finalized training data mix from Issue #17 (need the actual language list, not just the planned list).
- CVQA and Kaleidoscope benchmark language lists.

## Acceptance Criteria

- [ ] A document or table listing every language in the multimodal training data, with source dataset attribution (e.g., "Hindi: LLaVA-Pretrain translated, XM3600, M3IT").
- [ ] CVQA languages (31) partitioned into adapter-seen and adapter-unseen groups, with the partition table saved in a shareable format (markdown, CSV, or similar).
- [ ] Kaleidoscope languages (18) partitioned similarly.
- [ ] Any ambiguous cases are flagged (e.g., languages with minimal representation in training data that are technically "seen" but underrepresented).
- [ ] The partition is reviewed by at least one collaborator for accuracy before Phase 3 evaluation begins.
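The first four criteria could be satisfied by a single generated table. The sketch below renders a markdown partition table with source attribution and an "underrepresented" flag. The dataset names, example counts, and the `MIN_EXAMPLES` threshold are all hypothetical placeholders, and for brevity it assumes every benchmark language is supported by Tiny Aya Base.

```python
from typing import Dict, Iterable

MIN_EXAMPLES = 1000  # hypothetical cutoff for flagging thin coverage


def render_partition_table(
    benchmark_langs: Iterable[str],
    training_counts: Dict[str, Dict[str, int]],
    min_examples: int = MIN_EXAMPLES,
) -> str:
    """Return a markdown table: language, group, source datasets, flag."""
    lines = [
        "| Language | Group | Sources | Flag |",
        "| --- | --- | --- | --- |",
    ]
    for lang in sorted(set(benchmark_langs)):
        # Keep only datasets that actually contribute examples.
        sources = {
            name: count
            for name, count in training_counts.get(lang, {}).items()
            if count > 0
        }
        total = sum(sources.values())
        group = "adapter-seen" if total > 0 else "adapter-unseen"
        flag = "underrepresented" if 0 < total < min_examples else "-"
        source_list = ", ".join(sorted(sources)) or "-"
        lines.append(f"| {lang} | {group} | {source_list} | {flag} |")
    return "\n".join(lines)


# Placeholder counts per language and dataset, NOT audited numbers.
counts = {
    "hi": {"dataset_a": 5000, "dataset_b": 200},
    "ta": {"dataset_a": 40},
}
table = render_partition_table(["hi", "ta", "sw"], counts)
print(table)
```

The real per-dataset counts would come from the finalized training mix in Issue #17, and the threshold is a judgment call that the reviewing collaborator should confirm.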

## Estimated Effort

0.5--1 day (primarily data auditing and documentation)
