Document Multimodal Training Language Set and Partition CVQA/Kaleidoscope for Transfer Analysis #23

@engichang1467

Description

Document the exact set of languages present in the multimodal training data (alignment plus any instruction tuning). Then partition the languages in CVQA (31 languages) and Kaleidoscope (18 languages) into two groups: adapter-seen (languages present in the multimodal training mix) and adapter-unseen (languages supported by Tiny Aya Base but absent from multimodal training). This partition enables Phase 3 reporting of delta_transfer, the accuracy gap between seen and unseen languages. That gap tests whether the 3.35B multilingual backbone enables zero-shot visual transfer to the remaining 45+ unseen languages, with English as a bridge.

Context

  • Tiny Aya Base supports 70+ languages, but the multimodal adapter trains on only 20--25.
  • The adapter-seen/unseen partition is a genuine research contribution (Gap 11), not just a reporting convenience. The result is informative regardless of direction: small delta_transfer suggests strong zero-shot transfer; large delta_transfer quantifies the cost of language absence in multimodal training.
  • This is a setup task for Phase 3 analysis. The actual delta_transfer evaluation happens in Phase 3, but the partition must be defined now while the training data composition is fresh.
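As a rough sketch of how delta_transfer could be computed once Phase 3 accuracies exist: the issue defines it only as "the accuracy gap between seen and unseen languages", so the macro-average over languages and the sign convention (seen minus unseen) used here are assumptions. The function name, language codes, and accuracy values are hypothetical placeholders, not results.

```python
from statistics import mean


def delta_transfer(accuracy_by_lang, seen_langs):
    """Macro-averaged accuracy on adapter-seen languages minus adapter-unseen ones.

    Assumption: per-language accuracies are averaged with equal weight, and
    the gap is reported as seen minus unseen.
    """
    seen = [acc for lang, acc in accuracy_by_lang.items() if lang in seen_langs]
    unseen = [acc for lang, acc in accuracy_by_lang.items() if lang not in seen_langs]
    if not seen or not unseen:
        raise ValueError("both groups must contain at least one evaluated language")
    return mean(seen) - mean(unseen)


# Placeholder accuracies for illustration only, not measured results.
example_accuracy = {"hi": 0.60, "es": 0.70, "sw": 0.50, "am": 0.40}
example_seen = {"hi", "es"}
gap = delta_transfer(example_accuracy, example_seen)
```

Under this convention a value near zero would point toward strong zero-shot transfer, and a large positive value would quantify the cost of a language being absent from multimodal training.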

Dependencies

Acceptance Criteria

  • A document or table listing every language in the multimodal training data, with source dataset attribution (e.g., "Hindi: LLaVA-Pretrain translated, XM3600, M3IT").
  • CVQA languages (31) partitioned into adapter-seen and adapter-unseen groups, with the partition table saved in a shareable format (markdown, CSV, or similar).
  • Kaleidoscope languages (18) partitioned similarly.
  • Any ambiguous cases are flagged (e.g., languages with minimal representation in training data that are technically "seen" but underrepresented).
  • The partition is reviewed by at least one collaborator for accuracy before Phase 3 evaluation begins.
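The criteria above could be met with a small script along these lines. This is a minimal sketch, not the project's tooling: it assumes the training-data audit yields a per-language example count, and the `min_examples` threshold for flagging underrepresented languages is a hypothetical choice that the team would need to set. The language codes and counts are placeholders, not the real CVQA, Kaleidoscope, or training lists.

```python
def partition_languages(benchmark_langs, training_counts, min_examples=1000):
    """Split benchmark languages by presence in the multimodal training mix.

    training_counts maps a language code to its number of multimodal training
    examples. A language with a nonzero count below min_examples (a
    hypothetical threshold) is still "adapter-seen" but flagged for review.
    """
    rows = []
    for lang in sorted(benchmark_langs):
        count = training_counts.get(lang, 0)
        group = "adapter-seen" if count > 0 else "adapter-unseen"
        flagged = 0 < count < min_examples
        rows.append((lang, group, flagged))
    return rows


def to_markdown(rows):
    """Render the partition as a markdown table for sharing and review."""
    lines = ["| Language | Group | Underrepresented |", "| --- | --- | --- |"]
    for lang, group, flagged in rows:
        lines.append(f"| {lang} | {group} | {'yes' if flagged else 'no'} |")
    return "\n".join(lines)


# Placeholder inputs for illustration only.
example_benchmark = {"hi", "es", "sw", "am"}
example_counts = {"hi": 50_000, "es": 120_000, "sw": 300}
rows = partition_languages(example_benchmark, example_counts)
table = to_markdown(rows)
```

Running the same function once per benchmark keeps the CVQA and Kaleidoscope tables consistent with a single training-language audit, and the flag column gives the reviewer an explicit list of borderline cases.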

Estimated Effort

0.5--1 day (primarily data auditing and documentation)
