
[Feature]: Comprehensive Multimodal Ablation Suite #13

@sanggusti

Description


Problem Statement

We need to determine the optimal configuration for sub-4B multilingual VLMs across several untested dimensions, including adapter depth, LoRA rank, and connector architecture.

Proposed Solution

Implement an evaluation harness to systematically sweep and evaluate:

  • Merge ratios ($\alpha=0.3$ to 0.7).
  • Vision encoders (CLIP vs. SigLIP2 vs. MoonViT-SO-400M).
  • Connector architectures (2-layer MLP vs. 4-layer MLP vs. VICA sparse cross-attention vs. AlignVLM).
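The three axes above form a full factorial grid. As a minimal sketch of how the harness could enumerate it (the dimension names, option strings, and `build_ablation_grid` helper are illustrative assumptions, not an existing API in this repo):

```python
# Hypothetical ablation-grid sketch; all names are illustrative placeholders.
from itertools import product

MERGE_RATIOS = [0.3, 0.4, 0.5, 0.6, 0.7]                # alpha sweep, 0.3 to 0.7
VISION_ENCODERS = ["clip", "siglip2", "moonvit-so-400m"]
CONNECTORS = ["mlp-2layer", "mlp-4layer", "vica-sparse-xattn", "alignvlm"]

def build_ablation_grid():
    """Enumerate every (merge ratio, vision encoder, connector) configuration."""
    return [
        {"merge_alpha": a, "vision_encoder": enc, "connector": conn}
        for a, enc, conn in product(MERGE_RATIOS, VISION_ENCODERS, CONNECTORS)
    ]

grid = build_ablation_grid()
print(len(grid))  # 5 * 3 * 4 = 60 runs
```

A full sweep at this granularity is 60 runs, which is worth keeping in mind when budgeting Lightning Studio compute; coarser alpha steps would shrink it proportionally.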

Use Case

This drives the core of Phase 3 Evaluation. It will generate the primary findings for our final paper regarding efficient model design.

Alternatives Considered

Relying exclusively on 8B and 32B scale findings from the Aya Vision paper and assuming they transfer perfectly to the 3.35B regime.

Additional Context

We will use Lightning Studio for our experiment management. For the vision encoder ablation, we must report per-benchmark breakdowns to isolate whether MoonViT's native-resolution encoding and 2D RoPE provide specific advantages on text-centric multilingual tasks like MTVQA.
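The per-benchmark comparison could be as simple as a score-delta table between two encoder runs. A minimal sketch, where the function name and the benchmark scores are placeholder assumptions (not real results):

```python
# Illustrative per-benchmark breakdown; scores below are made-up placeholders.
def per_benchmark_delta(results_a, results_b):
    """Score difference (a - b) on each benchmark present in both result dicts."""
    shared = results_a.keys() & results_b.keys()
    return {bench: round(results_a[bench] - results_b[bench], 2) for bench in sorted(shared)}

moonvit = {"MTVQA": 31.4, "DocVQA": 78.2}   # placeholder numbers
siglip2 = {"MTVQA": 27.9, "DocVQA": 76.5}   # placeholder numbers
print(per_benchmark_delta(moonvit, siglip2))  # → {'DocVQA': 1.7, 'MTVQA': 3.5}
```

Restricting to shared benchmarks keeps the comparison fair if one encoder run fails on a subset of tasks.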

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
