pymarian-eval huge inconsistency of COMET scores across CPU and GPU #1034

### Bug description
The COMET scores differ significantly when `pymarian-eval` is run on CPU versus GPU with the same model and the same input files. To me, this inconsistency suggests a potential bug in how the metric is computed on one of the two devices.

### How to reproduce
- Files used

[es.mt.txt](https://github.com/user-attachments/files/19645953/es.mt.txt)
[en.src.txt](https://github.com/user-attachments/files/19645952/en.src.txt)
[es.ref.txt](https://github.com/user-attachments/files/19645951/es.ref.txt)

- Commands 
1. GPU command used:

`pymarian-eval -m wmt22-comet-da -l comet -t es.mt.txt -s en.src.txt -r es.ref.txt -o {OUTPUT}.cpl -d 1`
 
2. CPU command used:

`pymarian-eval -m wmt22-comet-da -l comet -t es.mt.txt -s en.src.txt -r es.ref.txt -o {OUTPUT}.cpl -c 8`

- Results (rounded) for the same sample of three sentences
1. GPU results: `0.95, 0.85, 0.71`
2. CPU results: `0.37, 0.85, 0.31`
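
To make the mismatch easier to inspect per segment, the two runs can be compared with a small script. This is a minimal sketch, not part of `pymarian-eval`: the helper names `read_scores` and `compare_scores` and the tolerance of `0.01` are hypothetical, and it assumes each output file holds one score per line. The values below are the rounded scores reported above.

```python
from pathlib import Path


def read_scores(path):
    """Read scores from a file, assuming one float per non-empty line."""
    lines = Path(path).read_text().splitlines()
    return [float(line.strip()) for line in lines if line.strip()]


def compare_scores(gpu, cpu, tol=0.01):
    """Return (segment, gpu, cpu, abs_diff) for segments differing by more than tol."""
    if len(gpu) != len(cpu):
        raise ValueError("score lists have different lengths")
    mismatches = []
    for i, (g, c) in enumerate(zip(gpu, cpu), start=1):
        diff = abs(g - c)
        if diff > tol:
            mismatches.append((i, g, c, diff))
    return mismatches


# Rounded scores from this report; replace with read_scores(<file>) on real outputs.
gpu_scores = [0.95, 0.85, 0.71]
cpu_scores = [0.37, 0.85, 0.31]

mismatches = compare_scores(gpu_scores, cpu_scores)
for seg, g, c, diff in mismatches:
    print(f"segment {seg}: GPU={g:.2f} CPU={c:.2f} |diff|={diff:.2f}")
```

With the reported values, segments 1 and 3 are flagged while segment 2 agrees, which suggests the problem affects individual segments rather than shifting all scores uniformly.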

### Context
* pymarian-eval version (installed with `pip`) obtained from `pymarian-eval --version`: `1.12.31`

