Description
Problem Statement
We need to determine the optimal configuration for sub-4B multilingual VLMs across several untested dimensions, such as adapter depth, LoRA rank, and connector architecture.
Proposed Solution
Implement an evaluation harness to systematically test and sweep:
- Merge ratios ($\alpha = 0.3$ to $0.7$).
- Vision encoders (CLIP vs. SigLIP2 vs. MoonViT-SO-400M).
- Connector architectures (2-layer MLP vs. 4-layer MLP vs. VICA sparse cross-attention vs. AlignVLM).
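The sweep above is a full cross-product over three dimensions. As a minimal sketch of what the harness could enumerate, the snippet below builds that grid and applies a simple linear weight merge for the $\alpha$ dimension. All names (`sweep_configs`, `merge_weights`, the encoder/connector identifiers) are hypothetical placeholders, not an agreed schema, and the merge shown is plain per-parameter interpolation over two state dicts represented as float dicts.

```python
from itertools import product

# Hypothetical sweep grid mirroring the dimensions listed above;
# identifier strings are placeholders, not real checkpoint names.
MERGE_RATIOS = [0.3, 0.4, 0.5, 0.6, 0.7]
VISION_ENCODERS = ["clip", "siglip2", "moonvit-so-400m"]
CONNECTORS = ["mlp-2", "mlp-4", "vica-sparse-xattn", "alignvlm"]

def sweep_configs():
    """Yield one config dict per point in the full cross-product."""
    for alpha, enc, conn in product(MERGE_RATIOS, VISION_ENCODERS, CONNECTORS):
        yield {"merge_alpha": alpha, "vision_encoder": enc, "connector": conn}

def merge_weights(base, finetuned, alpha):
    """Linear merge w = (1 - alpha) * base + alpha * finetuned,
    applied per-parameter over two state dicts (plain floats here)."""
    return {k: (1 - alpha) * base[k] + alpha * finetuned[k] for k in base}
```

With the values listed in this issue the grid has 5 × 3 × 4 = 60 runs, which is the budget the harness would need to schedule.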
Use Case
This drives the core of the Phase 3 evaluation and will generate the primary findings for our final paper on efficient model design.
Alternatives Considered
Relying exclusively on 8B and 32B scale findings from the Aya Vision paper and assuming they transfer perfectly to the 3.35B regime.
Additional Context
We will use Lightning Studio for experiment management. For the vision encoder ablation, we must report per-benchmark breakdowns to isolate whether MoonViT's native-resolution encoding and 2D RoPE provide specific advantages on text-centric multilingual tasks like MTVQA.
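The per-benchmark requirement means the harness should keep (encoder, benchmark) cells separate rather than a single averaged score per encoder. A minimal aggregation sketch follows; the records, the `"other-benchmark"` placeholder, and the score values are illustrative assumptions, not measured results.

```python
from collections import defaultdict

# Hypothetical result records; only MTVQA is named in this issue,
# "other-benchmark" and all scores are illustrative placeholders.
RESULTS = [
    {"encoder": "moonvit-so-400m", "benchmark": "MTVQA", "score": 0.41},
    {"encoder": "moonvit-so-400m", "benchmark": "other-benchmark", "score": 0.55},
    {"encoder": "siglip2", "benchmark": "MTVQA", "score": 0.37},
    {"encoder": "siglip2", "benchmark": "other-benchmark", "score": 0.56},
]

def per_benchmark_breakdown(results):
    """Mean score per (encoder, benchmark) cell, so an encoder-specific
    advantage on one benchmark is not averaged away across the suite."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in results:
        key = (r["encoder"], r["benchmark"])
        sums[key] += r["score"]
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}
```

Keeping the breakdown keyed by cell makes the MoonViT-vs-SigLIP2 comparison on MTVQA a single lookup rather than a re-run of the aggregation.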