Ablate Differential LoRA Learning Rate Alongside Merge Ratio Sweeps #19

@engichang1467

Description

Investigate the effect of differential learning rates for LoRA matrices A and B during multimodal adapter training. Maya found that using equal learning rates for LoRA A and B at 8B scale led to LoRA failure, while Aya Vision used differential rates successfully. Whether this failure mode reproduces at 3.35B is an open question.

This ablation should run early in Phase 2 alongside the merge ratio sweeps (Issue #18) so that findings can inform the remaining training runs. At minimum, compare equal vs. differential LoRA learning rates (e.g., lr_A != lr_B, following Aya Vision's configuration) and measure the effect on training stability and downstream performance.
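One way to implement the differential-rate configuration is a minimal sketch like the following, assuming PyTorch and PEFT-style parameter naming (`lora_A`/`lora_B`). The `LoRALinear` toy layer and the specific `lr_A`/`lr_B` values are illustrative placeholders, not Aya Vision's published recipe:

```python
# Sketch: differential learning rates for LoRA A and B via optimizer
# parameter groups. Names "lora_A"/"lora_B" follow the common PEFT
# convention; lr values below are illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: frozen base weight W plus low-rank update B @ A."""
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # base weight stays frozen
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # base output + low-rank correction: x A^T B^T
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T

model = LoRALinear(16, 16)

# Route A and B parameters into separate optimizer groups so each
# gets its own learning rate (equal-rate baseline: set lr_A == lr_B).
lr_A, lr_B = 1e-4, 1e-3  # illustrative, not the Aya Vision values
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "lora_A" in n], "lr": lr_A},
    {"params": [p for n, p in model.named_parameters() if "lora_B" in n], "lr": lr_B},
]
optimizer = torch.optim.AdamW(param_groups)
```

The equal-rate arm of the ablation then differs only in setting `lr_A == lr_B`, which keeps the two runs otherwise identical.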

Context

  • Maya (arXiv:2412.07112) reported LoRA instability at 8B with equal A/B learning rates.
  • Aya Vision applied differential LoRA learning rates as part of their training recipe.
  • At 3.35B, LoRA dynamics may differ due to the smaller model having different parameter redundancy characteristics.
  • This directly feeds into the Phase 3 LoRA rank sweep (Gap 1).

Dependencies

Acceptance Criteria

  • At least two LoRA learning rate configurations compared: (1) equal A/B rates, (2) differential A/B rates (following Aya Vision's recipe or a principled variant).
  • Training stability documented: loss curves, gradient norms, and any divergence/instability observed under equal rates.
  • Both configurations evaluated on at least one benchmark to assess downstream impact.
  • A clear recommendation is recorded for which LoRA LR configuration to use in subsequent Phase 2 and Phase 3 training.
  • Results and rationale documented in a brief write-up or experiment log.
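To capture the stability signals named above, a small helper (a sketch, assuming a PyTorch training loop and PEFT-style `lora_A`/`lora_B` parameter names) can log per-matrix gradient norms each step, making divergence under equal rates visible in the experiment log:

```python
# Sketch: per-group L2 gradient norms for LoRA A vs. B parameters,
# for logging alongside loss curves during the ablation runs.
import torch

def grad_norms_by_group(named_params):
    """Return L2 gradient norms computed separately over lora_A and lora_B."""
    sq = {"lora_A": 0.0, "lora_B": 0.0}
    for name, p in named_params:
        if p.grad is None:
            continue
        for key in sq:
            if key in name:
                sq[key] += p.grad.detach().pow(2).sum().item()
    return {k: v ** 0.5 for k, v in sq.items()}

# Toy example with hand-set gradients (illustration only).
a = torch.nn.Parameter(torch.zeros(2))
a.grad = torch.tensor([3.0, 4.0])   # L2 norm 5
b = torch.nn.Parameter(torch.zeros(2))
b.grad = torch.tensor([0.0, 2.0])   # L2 norm 2
norms = grad_norms_by_group([("blk.lora_A", a), ("blk.lora_B", b)])
```

In a real run, `model.named_parameters()` would be passed in after `loss.backward()` and the resulting norms logged per step.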

Estimated Effort

1–2 days (2 training runs + comparison)
