Ablate Differential LoRA Learning Rate Alongside Merge Ratio Sweeps #19

@engichang1467

Description

Investigate the effect of differential learning rates for LoRA matrices A and B during multimodal adapter training. Maya found that using equal learning rates for LoRA A and B at 8B scale led to LoRA failure, while Aya Vision used differential rates successfully. Whether this failure mode reproduces at 3.35B is an open question.

This ablation should run early in Phase 2 alongside the merge ratio sweeps (Issue #18) so that findings can inform the remaining training runs. At minimum, compare equal vs. differential LoRA learning rates (e.g., lr_A != lr_B, following Aya Vision's configuration) and measure the effect on training stability and downstream performance.
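One way to implement the differential-rate configuration is a minimal sketch like the following, assuming PyTorch and PEFT-style parameter naming (`lora_A`/`lora_B`). The `LoRALinear` toy layer and the specific `lr_A`/`lr_B` values are illustrative placeholders, not Aya Vision's published recipe:

```python
# Sketch: differential learning rates for LoRA A and B via optimizer
# parameter groups. Names "lora_A"/"lora_B" follow the common PEFT
# convention; lr values below are illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: frozen base weight W plus low-rank update B @ A."""
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # base weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # base output + low-rank correction: x A^T B^T
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T

model = LoRALinear(16, 16)

# Route A and B parameters into separate optimizer groups so each
# gets its own learning rate (equal-rate baseline: set lr_A == lr_B).
lr_A, lr_B = 1e-4, 1e-3  # illustrative, not the Aya Vision values
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "lora_A" in n], "lr": lr_A},
    {"params": [p for n, p in model.named_parameters() if "lora_B" in n], "lr": lr_B},
]
optimizer = torch.optim.AdamW(param_groups)
```

The equal-rate arm of the ablation then differs only in setting `lr_A == lr_B`, which keeps the two runs otherwise identical.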

Context

  • Maya (arXiv:2412.07112) reported LoRA instability at 8B with equal A/B learning rates.
  • Aya Vision applied differential LoRA learning rates as part of their training recipe.
  • At 3.35B, LoRA dynamics may differ due to the smaller model having different parameter redundancy characteristics.
  • This directly feeds into the Phase 3 LoRA rank sweep (Gap 1).

Dependencies

Acceptance Criteria

  • At least two LoRA learning rate configurations compared: (1) equal A/B rates, (2) differential A/B rates (following Aya Vision's recipe or a principled variant).
  • Training stability documented: loss curves, gradient norms, and any divergence/instability observed under equal rates.
  • Both configurations evaluated on at least one benchmark to assess downstream impact.
  • A clear recommendation is recorded for which LoRA LR configuration to use in subsequent Phase 2 and Phase 3 training.
  • Results and rationale documented in a brief write-up or experiment log.
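To capture the stability signals named above, a small helper (a sketch, assuming a PyTorch training loop and PEFT-style `lora_A`/`lora_B` parameter names) can log per-matrix gradient norms each step, making divergence under equal rates visible in the experiment log:

```python
# Sketch: per-group L2 gradient norms for LoRA A vs. B parameters,
# for logging alongside loss curves during the ablation runs.
import torch

def grad_norms_by_group(named_params):
    """Return L2 gradient norms computed separately over lora_A and lora_B."""
    sq = {"lora_A": 0.0, "lora_B": 0.0}
    for name, p in named_params:
        if p.grad is None:
            continue
        for key in sq:
            if key in name:
                sq[key] += p.grad.detach().pow(2).sum().item()
    return {k: v ** 0.5 for k, v in sq.items()}

# Toy example with hand-set gradients (illustration only).
a = torch.nn.Parameter(torch.zeros(2))
a.grad = torch.tensor([3.0, 4.0])   # L2 norm 5
b = torch.nn.Parameter(torch.zeros(2))
b.grad = torch.tensor([0.0, 2.0])   # L2 norm 2
norms = grad_norms_by_group([("blk.lora_A", a), ("blk.lora_B", b)])
```

In a real run, `model.named_parameters()` would be passed in after `loss.backward()` and the resulting norms logged per step.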

Estimated Effort

1–2 days (2 training runs + comparison)
