Description
Document the exact set of languages present in the multimodal training data (alignment plus any instruction tuning). Then partition the languages in CVQA (31 languages) and Kaleidoscope (18 languages) into two groups: adapter-seen (languages present in the multimodal training mix) and adapter-unseen (languages supported by Tiny Aya Base but absent from multimodal training). This partition enables Phase 3 reporting of delta_transfer, the accuracy gap between seen and unseen languages. That gap tests whether the 3.35B multilingual backbone enables zero-shot visual transfer to the remaining 45+ unseen languages via English as a bridge.
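A minimal sketch of the partition step, assuming languages are identified by a consistent code (for example ISO 639-1) across all lists. The function name `partition_languages`, the extra `outside_base` bucket, and every language code below are hypothetical placeholders for illustration, not the real training, Tiny Aya Base, or benchmark lists.

```python
def partition_languages(benchmark_langs, training_langs, base_langs):
    """Split benchmark languages by their relationship to the training mix.

    adapter_seen:   present in the multimodal training mix
    adapter_unseen: supported by the base model but absent from the mix
    outside_base:   in neither list (flag these for manual review)
    """
    training = set(training_langs)
    base = set(base_langs)
    groups = {"adapter_seen": [], "adapter_unseen": [], "outside_base": []}
    for lang in sorted(set(benchmark_langs)):
        if lang in training:
            groups["adapter_seen"].append(lang)
        elif lang in base:
            groups["adapter_unseen"].append(lang)
        else:
            groups["outside_base"].append(lang)
    return groups


# Placeholder codes only; replace with the audited lists.
training_langs = ["en", "hi", "es"]
base_langs = ["en", "hi", "es", "sw", "am"]
benchmark_langs = ["hi", "sw", "es", "am", "xx"]

partition = partition_languages(benchmark_langs, training_langs, base_langs)
print(partition)
```

Running the same function once per benchmark (CVQA, Kaleidoscope) keeps the two partitions consistent with a single audited training list.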
Context
- Tiny Aya Base supports 70+ languages, but the multimodal adapter trains on only 20--25.
- The adapter-seen/unseen partition is a genuine research contribution (Gap 11), not just a reporting convenience. The result is informative regardless of direction: small delta_transfer suggests strong zero-shot transfer; large delta_transfer quantifies the cost of language absence in multimodal training.
- This is a setup task for Phase 3 analysis. The actual delta_transfer evaluation happens in Phase 3, but the partition must be defined now while the training data composition is fresh.
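For reference, one possible way to compute delta_transfer once Phase 3 accuracies exist. The issue only defines it as the accuracy gap between seen and unseen languages, so the sign convention (seen minus unseen) and the macro-average over languages are assumptions here, and the accuracy values are made-up placeholders rather than results.

```python
from statistics import mean


def delta_transfer(accuracy_by_lang, seen_langs, unseen_langs):
    """Macro-averaged accuracy gap: mean(seen) - mean(unseen).

    Assumed convention: a positive value means seen languages score higher.
    """
    seen_acc = mean(accuracy_by_lang[lang] for lang in seen_langs)
    unseen_acc = mean(accuracy_by_lang[lang] for lang in unseen_langs)
    return seen_acc - unseen_acc


# Placeholder accuracies for illustration only; not measured results.
accuracy_by_lang = {"hi": 0.60, "es": 0.70, "sw": 0.50, "am": 0.40}
gap = delta_transfer(accuracy_by_lang, seen_langs=["hi", "es"], unseen_langs=["sw", "am"])
print(f"delta_transfer = {gap:.3f}")
```

A macro-average weights every language equally regardless of how many benchmark items it has; a micro-average over items would be an equally defensible choice and should be stated alongside the number.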
Dependencies
- Finalized training data mix from Issue #17, "Train Adapter and Projection Layers on XM3600-Augmented Alignment Mix" (need the actual language list, not just the planned list).
- CVQA and Kaleidoscope benchmark language lists.
Acceptance Criteria
- A document or table listing every language in the multimodal training data, with source dataset attribution (e.g., "Hindi: LLaVA-Pretrain translated, XM3600, M3IT").
- CVQA languages (31) partitioned into adapter-seen and adapter-unseen groups, with the partition table saved in a shareable format (markdown, CSV, or similar).
- Kaleidoscope languages (18) partitioned similarly.
- Any ambiguous cases are flagged (e.g., languages with minimal representation in training data that are technically "seen" but underrepresented).
- The partition is reviewed by at least one collaborator for accuracy before Phase 3 evaluation begins.
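The deliverable described by these criteria could be generated along the following lines. This is a sketch under stated assumptions: the column names, the `LOW_REPRESENTATION_THRESHOLD` cutoff, and the per-language sources and sample counts are all hypothetical, and the real values must come from the audited training mix.

```python
import csv
import io

# Hypothetical cutoff below which a "seen" language is flagged as ambiguous.
LOW_REPRESENTATION_THRESHOLD = 1000

FIELDS = ["language", "group", "sources", "training_samples", "flag"]


def build_rows(benchmark_langs, training_sources, sample_counts,
               threshold=LOW_REPRESENTATION_THRESHOLD):
    """Build one table row per benchmark language with source attribution."""
    rows = []
    for lang in sorted(benchmark_langs):
        sources = training_sources.get(lang, [])
        seen = bool(sources)
        count = sample_counts.get(lang, 0)
        rows.append({
            "language": lang,
            "group": "adapter-seen" if seen else "adapter-unseen",
            "sources": "; ".join(sources),
            "training_samples": count,
            # Technically seen, but with very little training data.
            "flag": "underrepresented" if seen and count < threshold else "",
        })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text that can be saved or pasted for review."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder attribution and counts for illustration only.
training_sources = {"hi": ["XM3600", "M3IT"], "es": ["XM3600"]}
sample_counts = {"hi": 50000, "es": 200}
rows = build_rows(["hi", "es", "sw"], training_sources, sample_counts)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping the flag in its own column lets a reviewer filter ambiguous cases without changing the seen/unseen assignment itself.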
Estimated Effort
0.5--1 day (primarily data auditing and documentation)