
[Feature]: Comprehensive Multimodal Ablation Suite #13

@sanggusti

Description


Problem Statement

We need to determine the optimal configuration for sub-4B multilingual VLMs across several untested dimensions, including adapter depth, LoRA rank, and connector architecture.

Proposed Solution

Implement an evaluation harness to systematically sweep and evaluate:

  • Merge ratios ($\alpha=0.3$ to 0.7).
  • Vision encoders (CLIP vs. SigLIP2 vs. MoonViT-SO-400M).
  • Connector architectures (2-layer MLP vs. 4-layer MLP vs. VICA sparse cross-attention vs. AlignVLM).
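The three axes above form a full factorial grid. As a minimal sketch of how the harness could enumerate it (the dimension names, option strings, and `build_ablation_grid` helper are illustrative assumptions, not an existing API in this repo):

```python
# Hypothetical ablation-grid sketch; all names are illustrative placeholders.
from itertools import product

MERGE_RATIOS = [0.3, 0.4, 0.5, 0.6, 0.7]                # alpha sweep, 0.3 to 0.7
VISION_ENCODERS = ["clip", "siglip2", "moonvit-so-400m"]
CONNECTORS = ["mlp-2layer", "mlp-4layer", "vica-sparse-xattn", "alignvlm"]

def build_ablation_grid():
    """Enumerate every (merge ratio, vision encoder, connector) configuration."""
    return [
        {"merge_alpha": a, "vision_encoder": enc, "connector": conn}
        for a, enc, conn in product(MERGE_RATIOS, VISION_ENCODERS, CONNECTORS)
    ]

grid = build_ablation_grid()
print(len(grid))  # 5 * 3 * 4 = 60 runs
```

A full sweep at this granularity is 60 runs, which is worth keeping in mind when budgeting Lightning Studio compute; coarser alpha steps would shrink it proportionally.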

Use Case

This drives the core of Phase 3 Evaluation. It will generate the primary findings for our final paper regarding efficient model design.

Alternatives Considered

Relying exclusively on 8B and 32B scale findings from the Aya Vision paper and assuming they transfer perfectly to the 3.35B regime.

Additional Context

We will use Lightning Studio for our experiment management. For the vision encoder ablation, we must report per-benchmark breakdowns to isolate whether MoonViT's native-resolution encoding and 2D RoPE provide specific advantages on text-centric multilingual tasks like MTVQA.
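The per-benchmark comparison could be as simple as a score-delta table between two encoder runs. A minimal sketch, where the function name and the benchmark scores are placeholder assumptions (not real results):

```python
# Illustrative per-benchmark breakdown; scores below are made-up placeholders.
def per_benchmark_delta(results_a, results_b):
    """Score difference (a - b) on each benchmark present in both result dicts."""
    shared = results_a.keys() & results_b.keys()
    return {bench: round(results_a[bench] - results_b[bench], 2) for bench in sorted(shared)}

moonvit = {"MTVQA": 31.4, "DocVQA": 78.2}   # placeholder numbers
siglip2 = {"MTVQA": 27.9, "DocVQA": 76.5}   # placeholder numbers
print(per_benchmark_delta(moonvit, siglip2))  # → {'DocVQA': 1.7, 'MTVQA': 3.5}
```

Restricting to shared benchmarks keeps the comparison fair if one encoder run fails on a subset of tasks.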

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
