Bug description
COMET scores differ significantly when pymarian-eval is run on CPU versus GPU with identical inputs and the same model. To me, this inconsistency suggests a potential bug in how the metric is computed.
How to reproduce
- Input files (attached): es.mt.txt, en.src.txt, es.ref.txt
- GPU command used:
pymarian-eval -m wmt22-comet-da -l comet -t es.mt.txt -s en.src.txt -r es.ref.txt -o {OUTPUT}.cpl -d 1
- CPU command used:
pymarian-eval -m wmt22-comet-da -l comet -t es.mt.txt -s en.src.txt -r es.ref.txt -o {OUTPUT}.cpl -c 8
- Results (rounded) for the same sample of three sentences:
- GPU results:
0.95, 0.85, 0.71
- CPU results:
0.37, 0.85, 0.31
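To quantify the gap, here is a minimal sketch that compares the two sets of rounded scores listed above and flags segments that disagree. The helper name `compare_scores` and the 0.01 tolerance are my own illustrative choices, not part of pymarian or a documented acceptable tolerance.

```python
# Sketch: compare segment-level scores from two runs and flag large gaps.
# `compare_scores` and `tol` are hypothetical, chosen only for illustration.

def compare_scores(gpu, cpu, tol=0.01):
    """Return (index, gpu_score, cpu_score, abs_diff) for each segment
    whose scores differ by more than `tol`."""
    if len(gpu) != len(cpu):
        raise ValueError("score lists must have the same length")
    return [
        (i, g, c, round(abs(g - c), 2))
        for i, (g, c) in enumerate(zip(gpu, cpu))
        if abs(g - c) > tol
    ]

# Rounded scores copied from the results above.
gpu_scores = [0.95, 0.85, 0.71]
cpu_scores = [0.37, 0.85, 0.31]

mismatches = compare_scores(gpu_scores, cpu_scores)
for i, g, c, d in mismatches:
    print(f"segment {i}: GPU={g:.2f} CPU={c:.2f} diff={d:.2f}")
```

On these values the first and third segments are flagged while the second matches, so the discrepancy is per-segment rather than a uniform offset.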
Context
- pymarian-eval version (installed with pip), obtained from pymarian-eval --version: 1.12.31