[Feature]: Evaluate Vision-Independent Performance Floor (Blind Baselines) #12

Description

@sanggusti

Problem Statement

Many VQA benchmarks are partially solvable without images. We need to prevent conflating language model priors with actual multimodal capability.

Proposed Solution

Evaluate Tiny Aya Base (text-only, no image input) on all primary multimodal benchmarks (CVQA, XMMMU, Kaleidoscope, MaXM, MTVQA) before any training begins, to establish the vision-independent performance floor.
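
A minimal sketch of what this blind-evaluation loop could look like, assuming exact-match scoring. The `answer_fn` callable and the demo records are hypothetical placeholders for the project's actual Tiny Aya Base inference code and benchmark loaders:

```python
from typing import Callable, Iterable, Mapping

# Primary multimodal benchmarks named in this issue.
BENCHMARKS = ["CVQA", "XMMMU", "Kaleidoscope", "MaXM", "MTVQA"]

def blind_accuracy(
    examples: Iterable[Mapping[str, str]],
    answer_fn: Callable[[str], str],
) -> float:
    """Exact-match accuracy when the model sees only the question text.

    `answer_fn` is a placeholder wrapping text-only inference; no image is
    ever passed, so the score reflects language priors alone, i.e. the
    vision-independent performance floor.
    """
    examples = list(examples)
    correct = sum(answer_fn(ex["question"]) == ex["answer"] for ex in examples)
    return correct / len(examples)

# Usage with a trivial stand-in model that always answers "yes":
demo = [{"question": "Is there a dog?", "answer": "yes"},
        {"question": "What color is the sky?", "answer": "blue"}]
print(blind_accuracy(demo, lambda q: "yes"))  # 0.5
```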

Use Case

This is a critical Phase 1 testing step. We will calculate and report the vision gain ($A_{\text{vision}}$) by subtracting this blind baseline score from the final Tiny Aya Vision score per benchmark.
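
As a sketch of that reporting step (benchmark names come from this issue; the scores below are placeholder values for illustration, not results):

```python
# Hypothetical scores for illustration only.
blind_floor = {"CVQA": 0.31, "XMMMU": 0.27}   # text-only Tiny Aya Base
vision_score = {"CVQA": 0.52, "XMMMU": 0.40}  # final Tiny Aya Vision

# A_vision = vision-model score minus blind baseline, per benchmark.
a_vision = {b: round(vision_score[b] - blind_floor[b], 2) for b in vision_score}
print(a_vision)  # {'CVQA': 0.21, 'XMMMU': 0.13}
```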

Alternatives Considered

Skipping blind baselines and assuming that standard benchmarks perfectly isolate visual capability, an assumption that recent literature on MMMU and VQAv2 has shown to be false.

Additional Context

This directly addresses Gap 9 from our literature review regarding the blind solvability of multilingual multimodal benchmarks.

Metadata

Labels

enhancement (New feature or request)
