CER/WER metrics alongside VLM judge

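Add traditional OCR metrics, Character Error Rate (CER) and Word Error Rate (WER), as a complement or alternative to VLM-as-judge scoring. These metrics apply only when ground-truth text is available, since both are computed as an edit distance between the model output and a reference transcription, normalized by the reference length.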
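
As a rough illustration of what the metrics compute, here is a minimal pure-Python sketch. The function names (`edit_distance`, `cer`, `wer`) are hypothetical and not part of any existing code in this project. It assumes plain Levenshtein distance, whitespace tokenization for words, and no text normalization (case, punctuation, Unicode); a real implementation would need to decide on each of these. The handling of an empty reference is also just one possible convention.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _error_rate(ref: Sequence, hyp: Sequence) -> float:
    # Assumed convention: an empty reference scores 0.0 against an empty
    # hypothesis and 1.0 otherwise. Other tools may raise an error instead.
    if len(ref) == 0:
        return 0.0 if len(hyp) == 0 else 1.0
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return _error_rate(reference, hypothesis)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by number of reference words."""
    return _error_rate(reference.split(), hypothesis.split())
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis is much longer than the reference.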

Having both would also allow comparing VLM judge rankings against metric-based rankings, as a way to validate the judge approach.
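
One possible way to make that comparison concrete is a rank correlation between the two sets of scores. The sketch below computes Spearman's rank correlation in pure Python; using Spearman at all is an assumption here, not something this proposal specifies, and the names `average_ranks` and `spearman` are hypothetical. It assumes one judge score and one error rate per model (or per sample), in the same order.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n in ascending order, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder inputs for illustration only, not measured results.
judge_scores = [8.5, 6.0, 7.2]   # higher is better
cer_values = [0.04, 0.21, 0.10]  # lower is better
agreement = spearman(judge_scores, [-c for c in cer_values])
```

The error rates are negated before ranking because lower CER/WER is better while a higher judge score is better; a value near 1.0 would then suggest the judge and the metric order the systems similarly.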