Problem Statement
Many VQA benchmarks are partially solvable without images, so we must avoid conflating language-model priors with genuine multimodal capability.
Proposed Solution
Evaluate Tiny Aya Base (text-only, no image input) on all primary multimodal benchmarks (CVQA, XMMMU, Kaleidoscope, MaXM, MTVQA) before any training begins, to establish the vision-independent performance floor for each benchmark.
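A minimal sketch of that blind pass, assuming each benchmark exposes (question, answer) pairs; `load_benchmark`, `score`, and `answer_fn` are hypothetical names standing in for whatever loader, metric, and text-only inference call the real harness uses:

```python
# Sketch of the blind (text-only) baseline pass. Helper names below are
# assumptions, not the actual harness API.
from typing import Callable, Iterable, Tuple

BENCHMARKS = ["CVQA", "XMMMU", "Kaleidoscope", "MaXM", "MTVQA"]

def blind_baseline(
    answer_fn: Callable[[str], str],  # text-only model call; no image is passed
    load_benchmark: Callable[[str], Iterable[Tuple[str, str]]],  # hypothetical loader
    score: Callable[[str, str], float],  # per-item metric, e.g. exact match in [0, 1]
) -> dict:
    """Run the text-only model over every benchmark, questions only."""
    floors = {}
    for name in BENCHMARKS:
        items = list(load_benchmark(name))
        # Only the question text is shown to the model; the image is withheld,
        # so any score here reflects language priors alone.
        per_item = [score(answer_fn(question), gold) for question, gold in items]
        floors[name] = sum(per_item) / len(per_item)
    return floors
```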
Use Case
This is a critical Phase 1 testing step. We will calculate and report the vision gain ($A_{\text{vision}}$) per benchmark by subtracting this blind baseline score from the final Tiny Aya Vision score.
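Written out per benchmark $b$, with $S^{(b)}_{\text{blind}}$ and $S^{(b)}_{\text{vision}}$ as shorthand for the two scores (symbols introduced here for clarity only):

$$A_{\text{vision}}^{(b)} = S_{\text{vision}}^{(b)} - S_{\text{blind}}^{(b)}$$

A large positive gain indicates genuine use of the image, while a gain near zero flags a benchmark the model can largely solve blind.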
Alternatives Considered
Skipping blind baselines and assuming that standard benchmarks perfectly isolate visual capability; recent literature on MMMU and VQAv2 shows this assumption does not hold.
Additional Context
This directly addresses Gap 9 from our literature review, concerning the blind solvability of multilingual multimodal benchmarks.