
Unable to reproduce example results using the Unbabel/wmt22-comet-da model #711

@starbuild7505

Description

Hi COMET team,
I'm trying to reproduce the example results for the Unbabel/wmt22-comet-da model, but the output scores I obtained are quite different from those shown in the official documentation and example scripts.

Environment
Python version: 3.10.8
COMET version: 2.2.7
PyTorch version: 2.8.0
Operating system: Windows 11
GPU: none (CPU only)
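
For completeness, this is roughly how I collected the versions above (a small sketch using only standard-library calls; it assumes COMET was installed via pip under the distribution name unbabel-comet):

import platform
from importlib.metadata import version

import torch

# Report the interpreter and the installed distributions used in the code below.
print("Python:", platform.python_version())
print("unbabel-comet:", version("unbabel-comet"))
print("evaluate:", version("evaluate"))
print("PyTorch:", torch.__version__)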

Code used
from evaluate import load

comet_metric = load('comet')
source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
reference = ["They were able to control the fire", "Schools and kindergartens opened"]
results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
print([round(v, 2) for v in results["scores"]])
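
In case it helps, I also plan to cross-check by loading the checkpoint directly with the unbabel-comet package instead of going through evaluate. This is only a minimal sketch, assuming the download_model / load_from_checkpoint API described in the COMET documentation:

from comet import download_model, load_from_checkpoint

# Download and load the same checkpoint that evaluate should be using.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Dem Feuer konnte Einhalt geboten werden",
     "mt": "The fire could be stopped",
     "ref": "They were able to control the fire"},
    {"src": "Schulen und Kindergärten wurden eröffnet.",
     "mt": "Schools and kindergartens were open",
     "ref": "Schools and kindergartens opened"},
]

# gpus=0 keeps everything on the CPU, matching my environment.
output = model.predict(data, batch_size=8, gpus=0)
print([round(s, 2) for s in output.scores])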

Expected behavior
According to the example in the metrics/comet/README.md, the expected output should be [0.19, 0.92], but my result is [0.85, 0.97].
I used the same code as in the example and have loaded the correct model.

Expected results

(screenshot of the README example output, [0.19, 0.92])

My results

(screenshots of my output, [0.85, 0.97])

Questions
Has the Unbabel/wmt22-comet-da model been updated or rescaled recently?
Are there any changes in score normalization (e.g., z-score → 0–1 scaling)?
Is there a specific COMET or PyTorch version required to match the example results? (See also the checkpoint-pinning sketch below.)
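
In case the README example was produced with an older default checkpoint, I was planning to pin the checkpoint explicitly when loading the metric and compare. A rough sketch, reusing the source / hypothesis / reference lists from the code above and assuming the evaluate comet metric passes its config name through to COMET's download_model (its README uses this pattern for other checkpoints):

from evaluate import load

# Load two named checkpoints instead of relying on the default, to see which
# one reproduces the README numbers. Whether download_model accepts both of
# these names in this setup is an assumption on my side.
metric_wmt22 = load('comet', 'Unbabel/wmt22-comet-da')
metric_wmt20 = load('comet', 'Unbabel/wmt20-comet-da')

for name, metric in [("wmt22-comet-da", metric_wmt22), ("wmt20-comet-da", metric_wmt20)]:
    results = metric.compute(predictions=hypothesis, references=reference, sources=source)
    print(name, [round(v, 2) for v in results["scores"]])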

Additional context
I am using COMET for machine translation evaluation and need to ensure consistency with the official WMT22 scores. Any guidance or clarification would be greatly appreciated.
Thanks a lot for your time and support!
