Implement Cross-Modal Merging and Ablate Merge Ratios (alpha = 0.3--0.7) #18

@engichang1467

Description

Implement the cross-modal weight merging strategy from Aya Vision: linearly interpolate between the multimodal-trained weights and the original text-only Tiny Aya Base weights at a tunable ratio alpha. This is the primary mechanism for recovering the text-only capabilities that degrade during multimodal training (catastrophic forgetting).

Sweep alpha across at least five values in the range 0.3 to 0.7 and evaluate each on both visual grounding and text retention benchmarks to find the optimal balance.

Context

  • Aya Vision validated cross-modal merging at 8B and 32B scale. Whether it works at 3.35B, where models have less parameter redundancy, is an open question (Gap 3).
  • The merge formula is: W_merged = alpha * W_multimodal + (1 - alpha) * W_text_only.
  • A baseline without merging (alpha = 1.0, pure multimodal weights) should also be evaluated to isolate the merging contribution.
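
The merge formula above amounts to a per-parameter linear interpolation of two checkpoints. A minimal sketch of the merging step (function and variable names here are illustrative, not from the repo; in practice the dict values would be checkpoint tensors rather than scalars):

```python
def merge_weights(multimodal, text_only, alpha):
    """Linear interpolation of two checkpoints, parameter by parameter:
    W_merged = alpha * W_multimodal + (1 - alpha) * W_text_only.

    alpha = 1.0 returns the pure multimodal weights (no-merge baseline);
    alpha = 0.0 would return the text-only base unchanged.
    """
    if multimodal.keys() != text_only.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: alpha * multimodal[name] + (1 - alpha) * text_only[name]
        for name in multimodal
    }

# Toy example with scalar "parameters":
merged = merge_weights({"w": 1.0}, {"w": 0.0}, alpha=0.5)
# merged["w"] == 0.5
```

Both checkpoints must come from the same architecture (identical parameter names and shapes), which holds here since the multimodal model is initialized from the text-only base.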

Dependencies

Acceptance Criteria

  • Merging script is implemented and tested: given a multimodal checkpoint and the text-only base, produces a merged checkpoint at a specified alpha.
  • At least 5 merge ratios evaluated: alpha in {0.3, 0.4, 0.5, 0.6, 0.7}, plus the no-merge baseline (alpha = 1.0).
  • Each merged checkpoint is evaluated on at least one visual grounding benchmark (CVQA or xMMMU) and one text retention benchmark (m-ArenaHard or GlobalMGSM).
  • Results are recorded in a table: alpha vs. visual grounding score vs. text retention score.
  • Optimal alpha (or alpha range) is identified and documented, with a brief note on the tradeoff curve.
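
The sweep itself can be a thin loop over the merge step; a sketch under the assumption that the two benchmark harnesses are exposed as callables (`eval_visual`, `eval_text`, and `run_sweep` are hypothetical names for illustration):

```python
# Five merge ratios from the acceptance criteria, plus the alpha = 1.0
# no-merge baseline (pure multimodal weights).
ALPHAS = [0.3, 0.4, 0.5, 0.6, 0.7, 1.0]

def run_sweep(multimodal, text_only, eval_visual, eval_text):
    """Merge at each alpha and score both benchmarks.

    Returns rows of (alpha, visual grounding score, text retention score),
    i.e. the results table from the acceptance criteria.
    """
    rows = []
    for alpha in ALPHAS:
        merged = {
            name: alpha * multimodal[name] + (1 - alpha) * text_only[name]
            for name in multimodal
        }
        rows.append((alpha, eval_visual(merged), eval_text(merged)))
    return rows
```

Each evaluation is independent, so the six runs can be launched in parallel if compute allows; only the final (alpha, score, score) rows need to be collected into the table.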

Estimated Effort

2--3 days (merging implementation + 6 evaluation runs)
