Description
Implement the cross-modal weight merging strategy from Aya Vision: interpolate between the multimodal-trained weights and the original text-only Tiny Aya Base weights at a tunable ratio alpha. This is the primary mechanism for recovering text-only capabilities that degrade during multimodal training (catastrophic forgetting).
Sweep alpha across at least five values in the range 0.3 to 0.7 and evaluate each on both visual grounding and text retention benchmarks to find the optimal balance.
Context
- Aya Vision validated cross-modal merging at 8B and 32B scale. Whether it works at 3.35B, where models have less parameter redundancy, is an open question (Gap 3).
- The merge formula is: W_merged = alpha * W_multimodal + (1 - alpha) * W_text_only.
- A baseline without merging (alpha = 1.0, pure multimodal weights) should also be evaluated to isolate the merging contribution.
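The merge formula above can be sketched as a per-parameter linear interpolation over two checkpoint state dicts. This is a minimal illustration using plain Python lists in place of tensors; the function name and the toy parameter values are hypothetical, and a real implementation would operate on framework tensors instead.

```python
def merge_checkpoints(w_multimodal, w_text_only, alpha):
    """Interpolate two weight dicts: W_merged = alpha * W_mm + (1 - alpha) * W_text.

    Both dicts must map identical parameter names to same-shaped weights.
    alpha = 1.0 reproduces the pure multimodal checkpoint (no-merge baseline);
    alpha = 0.0 reproduces the text-only base.
    """
    if set(w_multimodal) != set(w_text_only):
        raise ValueError("checkpoints must contain the same parameter names")
    merged = {}
    for name, mm_weights in w_multimodal.items():
        text_weights = w_text_only[name]
        merged[name] = [
            alpha * m + (1.0 - alpha) * t
            for m, t in zip(mm_weights, text_weights)
        ]
    return merged

# Toy example (hypothetical values): alpha = 0.5 averages the two checkpoints.
mm = {"layer.weight": [1.0, 2.0]}
text = {"layer.weight": [3.0, 4.0]}
print(merge_checkpoints(mm, text, 0.5))  # {'layer.weight': [2.0, 3.0]}
```

With real models the same loop would run over `state_dict()` tensors, where the list comprehension collapses to a single `alpha * mm + (1 - alpha) * tt` tensor operation.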
Dependencies
- Trained checkpoint from issue #17, "Train Adapter and Projection Layers on XM3600-Augmented Alignment Mix" (adapter + projection alignment).
- Original Tiny Aya Base weights (the text-only reference for merging).
Acceptance Criteria
- Merging script is implemented and tested: given a multimodal checkpoint and the text-only base, produces a merged checkpoint at a specified alpha.
- At least 5 merge ratios evaluated: alpha in {0.3, 0.4, 0.5, 0.6, 0.7}, plus the no-merge baseline (alpha = 1.0).
- Each merged checkpoint is evaluated on at least one visual grounding benchmark (CVQA or xMMMU) and one text retention benchmark (m-ArenaHard or GlobalMGSM).
- Results are recorded in a table: alpha vs. visual grounding score vs. text retention score.
- Optimal alpha (or alpha range) is identified and documented, with a brief note on the tradeoff curve.
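The sweep described above can be sketched as a small driver loop. The evaluation callables here are stand-ins (any benchmark-scoring functions matching this signature would do), and the score values in the demo are hypothetical, not real benchmark results.

```python
def sweep_alphas(alphas, merge_fn, eval_visual, eval_text):
    """Run the alpha sweep and collect a results table.

    merge_fn(alpha)   -> merged checkpoint at that alpha
    eval_visual(ckpt) -> visual grounding score (e.g. CVQA or xMMMU)
    eval_text(ckpt)   -> text retention score (e.g. m-ArenaHard or GlobalMGSM)

    Returns rows of (alpha, visual_score, text_score) for the results table.
    """
    rows = []
    for alpha in alphas:
        ckpt = merge_fn(alpha)
        rows.append((alpha, eval_visual(ckpt), eval_text(ckpt)))
    return rows

# The five sweep values plus the no-merge baseline from the acceptance criteria.
ALPHAS = [0.3, 0.4, 0.5, 0.6, 0.7, 1.0]

# Demo with stub merge/eval functions (hypothetical scores for illustration).
table = sweep_alphas(
    ALPHAS,
    merge_fn=lambda a: {"alpha": a},
    eval_visual=lambda ckpt: ckpt["alpha"],        # stub: stands in for CVQA/xMMMU
    eval_text=lambda ckpt: 1.0 - ckpt["alpha"],    # stub: stands in for m-ArenaHard/GlobalMGSM
)
for alpha, vis, txt in table:
    print(f"{alpha:.1f}\t{vis:.3f}\t{txt:.3f}")
```

Each row of the returned table corresponds to one line of the "alpha vs. visual grounding score vs. text retention score" table in the acceptance criteria; the optimal alpha can then be read off where the two curves cross or where a weighted combination peaks.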
Estimated Effort
2–3 days (merging implementation + 6 evaluation runs)