Skip comparisons where neither model produced meaningful text. These pairs currently waste judge calls and can distort Elo ratings, because a verdict on two empty or near-empty outputs is essentially noise. Heuristic: if both outputs fall below a minimum character or word threshold, skip the pair.
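
A minimal sketch of the heuristic above, in Python. The function names (`is_meaningful`, `should_skip_pair`) and the default thresholds (`min_chars=20`, `min_words=3`) are hypothetical placeholders, not values from the existing code; tune them against real outputs. It also assumes an output counts as "below threshold" if it fails either the character or the word check.

```python
from typing import Optional


def is_meaningful(text: Optional[str], min_chars: int = 20, min_words: int = 3) -> bool:
    """Return True if the output clears both thresholds.

    Thresholds are illustrative assumptions, not tuned values.
    None is treated the same as an empty string.
    """
    stripped = (text or "").strip()
    # str.split() with no argument splits on any run of whitespace.
    return len(stripped) >= min_chars and len(stripped.split()) >= min_words


def should_skip_pair(
    output_a: Optional[str],
    output_b: Optional[str],
    min_chars: int = 20,
    min_words: int = 3,
) -> bool:
    """Skip the comparison only when NEITHER output is meaningful."""
    return not is_meaningful(output_a, min_chars, min_words) and not is_meaningful(
        output_b, min_chars, min_words
    )


if __name__ == "__main__":
    # Hypothetical (output_a, output_b) pairs for illustration.
    pairs = [
        ("", "   "),
        ("", "A complete answer that explains the result in detail."),
    ]
    to_judge = [p for p in pairs if not should_skip_pair(*p)]
    print(f"judging {len(to_judge)} of {len(pairs)} pairs")
```

A pair where only one side is empty is still sent to the judge, since a model that answers should get credit over one that produced nothing. Skipped pairs should simply be left out of the rating update rather than recorded as ties.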