Description
Problem Statement
Many VQA benchmarks are partially solvable without images, so we need to avoid conflating language-model priors with genuine multimodal capability.
Proposed Solution
Evaluate Tiny Aya Base (text-only, no image input) on all primary multimodal benchmarks (CVQA, XMMMU, Kaleidoscope, MaXM, MTVQA) before any training begins to establish the vision-independent performance floor.
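A minimal sketch of what this blind pass could look like, assuming a hypothetical `generate` text-only interface to Tiny Aya Base and benchmark examples loaded as dicts with `question`, optional `options`, `answer`, and an `image` field that is deliberately ignored (all names are placeholders, not a fixed API):

```python
from typing import Callable, Iterable


def blind_baseline_accuracy(
    examples: Iterable[dict],
    generate: Callable[[str], str],
) -> float:
    """Score a text-only model on a VQA benchmark without showing images."""
    correct, total = 0, 0
    for ex in examples:
        # Only the question (and any textual options) is passed to the model;
        # the image field is intentionally never used.
        prompt = ex["question"]
        if ex.get("options"):
            prompt += "\nOptions: " + ", ".join(ex["options"])
        prediction = generate(prompt).strip().lower()
        correct += prediction == ex["answer"].strip().lower()
        total += 1
    return correct / max(total, 1)
```

The same harness would be run per benchmark (CVQA, XMMMU, Kaleidoscope, MaXM, MTVQA) so the floor is reported per dataset rather than as a single aggregate.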
Use Case
This is a critical Phase 1 testing step. We will calculate and report the vision gain (the multimodal score minus this blind, text-only baseline) for each benchmark, as sketched below.
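A hedged sketch of the vision-gain report, assuming per-benchmark accuracy dicts for the blind baseline and the multimodal model (field names and the numbers in the example are illustrative placeholders, not real results):

```python
def vision_gain(multimodal: dict[str, float], blind: dict[str, float]) -> dict[str, float]:
    """Vision gain = multimodal accuracy minus text-only (blind) accuracy."""
    return {name: multimodal[name] - blind[name] for name in blind}


if __name__ == "__main__":
    blind_scores = {"CVQA": 0.30, "MaXM": 0.20}       # placeholder values
    multimodal_scores = {"CVQA": 0.55, "MaXM": 0.45}  # placeholder values
    print(vision_gain(multimodal_scores, blind_scores))
```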
Alternatives Considered
Skipping blind baselines and assuming standard benchmarks perfectly isolate visual capability, an assumption that recent literature on MMMU and VQAv2 has shown to be false.
Additional Context
This directly addresses Gap 9 from our literature review regarding the blind solvability of multilingual multimodal benchmarks.