Description
Hi COMET team,
I'm trying to reproduce the example results for the Unbabel/wmt22-comet-da model, but the output scores I obtained are quite different from those shown in the official documentation and example scripts.
Environment
Python version: 3.10.8
COMET version: 2.2.7
PyTorch version: 2.8.0
Operating system: Windows 11
GPU: none (CPU only)
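For reference, a small snippet like the following can be used to collect these versions (the distribution names "evaluate", "unbabel-comet", and "torch" are my assumption for how the packages are registered; adjust if yours differ):

```python
# Print the installed versions of the packages relevant to this report.
# Distribution names are assumptions; adjust if your environment differs.
from importlib.metadata import version, PackageNotFoundError
import platform

print("Python:", platform.python_version())
for dist in ("evaluate", "unbabel-comet", "torch"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```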
Code used
```python
from evaluate import load

# Load the COMET metric via the Hugging Face `evaluate` wrapper.
comet_metric = load('comet')

source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
reference = ["They were able to control the fire", "Schools and kindergartens opened"]

# Segment-level COMET scores, one per source/hypothesis/reference triple.
results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
print([round(v, 2) for v in results["scores"]])
```
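In case the wrapper's default checkpoint differs from what I expect, a variant I can also test is passing the checkpoint name explicitly. My assumption, based on the examples in the evaluate COMET metric card, is that the second argument to `load` selects the checkpoint:

```python
# Variant: explicitly request the wmt22-comet-da checkpoint instead of relying
# on the wrapper's default. (Assumption: the second argument to `load` is the
# checkpoint name, as the evaluate metric card examples suggest.)
from evaluate import load

comet_metric = load('comet', 'Unbabel/wmt22-comet-da')

source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
reference = ["They were able to control the fire", "Schools and kindergartens opened"]

results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
print([round(v, 2) for v in results["scores"]])
```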
Expected behavior
According to the example in metrics/comet/README.md, the expected output is [0.19, 0.92], but my result is [0.85, 0.97].
I used the same code as in the example and, as far as I can tell, loaded the correct model.
Expected results: `[0.19, 0.92]`
My results: `[0.85, 0.97]`
Questions
1. Has the Unbabel/wmt22-comet-da model been updated or rescaled recently?
2. Have there been any changes in score normalization (e.g., from z-scores to 0–1 scaling)?
3. Is a specific COMET or PyTorch version required to match the example results?

(A direct cross-check with the unbabel-comet package is sketched below.)
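This cross-check scores the same triples directly with the unbabel-comet package, bypassing the evaluate wrapper. It follows the download_model / load_from_checkpoint / predict usage from the COMET documentation, with gpus=0 since my machine has no GPU and an arbitrary batch size:

```python
# Cross-check: score the same triples directly with the unbabel-comet package.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Dem Feuer konnte Einhalt geboten werden",
     "mt": "The fire could be stopped",
     "ref": "They were able to control the fire"},
    {"src": "Schulen und Kindergärten wurden eröffnet.",
     "mt": "Schools and kindergartens were open",
     "ref": "Schools and kindergartens opened"},
]

# gpus=0 -> run on CPU; the batch size is arbitrary for two segments.
output = model.predict(data, batch_size=8, gpus=0)
print([round(s, 2) for s in output.scores])
```

If these segment-level scores also come out around 0.85 and 0.97, the discrepancy is presumably in the documented example values rather than in my setup.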
Additional context
I am using COMET for machine translation evaluation and need to ensure consistency with the official WMT22 scores. Any guidance or clarification would be greatly appreciated.
Thanks a lot for your time and support!