
Some questions are judged incorrectly when evaluating the MMBench dataset #1480

@michaelxu1107

Description

File produced by the inference step: Qwen3.5-27B_MMBench_DEV_CN_V11.xlsx

File produced by the judging step: Qwen3.5-27B_MMBench_DEV_CN_V11_GLM4.7_result.xlsx

Eval command: python run.py --data MMBench_DEV_CN_V11 --model Qwen3.5-27B --mode eval --reuse --judge GLM4.7 --verbose --judge-args '{"temperature":0, "chat_template_kwargs":{"enable_thinking":false}}'
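
To help narrow down which rows were misjudged, here is a minimal sketch of a consistency check over the judged result file. It assumes each row carries `index`, `answer`, `prediction`, and `hit` fields (these column names are an assumption and may need adjusting to the actual spreadsheet), and the helper names `extract_option` and `find_suspect_rows` are hypothetical. It only flags rows where the prediction starts with an explicit option letter that contradicts the judge's `hit` value; free-form predictions are skipped and still need manual review.

```python
import re

# Matches a prediction that begins with an explicit option letter,
# e.g. "A", "B. ...", "(C)". Anything else is treated as free-form.
_OPTION_RE = re.compile(r"^\(?([A-H])(?:[.:,、。)]|$)")


def extract_option(prediction):
    """Return the leading option letter of a prediction, or None if unclear."""
    match = _OPTION_RE.match(str(prediction).strip())
    return match.group(1) if match else None


def find_suspect_rows(rows):
    """Return (index, option, answer, hit) for rows whose `hit` value
    disagrees with an explicit option letter in the prediction.

    `rows` is an iterable of dicts; the keys used here are assumed names.
    """
    suspects = []
    for row in rows:
        option = extract_option(row["prediction"])
        if option is None:
            continue  # free-form answer: cannot be checked mechanically
        expected_hit = int(option == str(row["answer"]).strip())
        if int(row["hit"]) != expected_hit:
            suspects.append((row["index"], option, row["answer"], int(row["hit"])))
    return suspects
```

The rows could be loaded with something like `pandas.read_excel("Qwen3.5-27B_MMBench_DEV_CN_V11_GLM4.7_result.xlsx").to_dict("records")` and passed to `find_suspect_rows`; listing the returned indices in the issue would make the misjudged questions easier to reproduce.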
