Blank page filtering

Skip comparisons where neither model produced meaningful text. Currently these waste judge calls and can distort Elo ratings, since a verdict between two near-empty outputs carries little signal.

Heuristic: if both outputs fall below a minimum character or word count, skip the pair.
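
A minimal sketch of the heuristic above, in Python. The names `is_blank` and `should_skip_pair` and the threshold values are hypothetical, not taken from the existing code. It also assumes an output counts as blank when it is under either the character limit or the word limit; requiring both would be a stricter reading of the same idea.

```python
# Hypothetical thresholds (assumptions); tune them against real outputs.
MIN_CHARS = 20
MIN_WORDS = 3


def is_blank(text, min_chars=MIN_CHARS, min_words=MIN_WORDS):
    """Return True if the text is too short to count as meaningful output."""
    stripped = (text or "").strip()  # treat None and whitespace-only as empty
    return len(stripped) < min_chars or len(stripped.split()) < min_words


def should_skip_pair(output_a, output_b):
    """Skip only when BOTH outputs are blank; a one-sided blank is still judged."""
    return is_blank(output_a) and is_blank(output_b)
```

Pairs where only one model returned nothing are kept on purpose, because a real output beating an empty one is a legitimate result for the rating.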