
[Bench Eval] How to reproduce reported GEdit-Bench metrics? #143


Description

@ElliotQi

Hi maintainers, thanks for releasing the GEdit-Bench evaluation code.

I'm trying to reproduce the reported GEdit-Bench metrics using GPT-4.1 as the judge, but I'm seeing two issues that make the scores hard to reproduce:

1) Azure OpenAI GPT-4.1 gets blocked by content_filter (jailbreak detected)

When calling GPT-4.1 on Azure OpenAI from the evaluation script, some requests fail with HTTP 400:

{
  "error": {
    "message": "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry...",
    "code": "content_filter",
    "status": 400,
    "innererror": {
      "code": "ResponsibleAIPolicyViolation",
      "content_filter_result": {
        "jailbreak": { "detected": true, "filtered": true }
      }
    }
  }
}

I found that changing the prompt order can reduce or eliminate the Azure blocking, but then the overall evaluation scores become consistently higher than with the original prompt format.
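
For reference, here is a minimal sketch of how I currently catch the filtered requests so the run doesn't abort. The deployment name, environment variables, and the skip-on-filter fallback are my own placeholders, not from the GEdit-Bench eval code:

```python
import os
from openai import AzureOpenAI, BadRequestError

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-06-01",
)

def judge_once(messages, deployment="gpt-4.1"):
    try:
        resp = client.chat.completions.create(model=deployment, messages=messages)
        return resp.choices[0].message.content
    except BadRequestError as e:
        # Azure surfaces the jailbreak filter as HTTP 400 with code "content_filter".
        if getattr(e, "code", None) == "content_filter" or "content_filter" in str(e):
            return None  # mark the sample as filtered instead of failing the whole run
        raise
```

Skipping filtered samples obviously changes the score denominator, which is part of why I'd like guidance on the intended handling.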

2) OpenRouter’s GPT-4.1 gets higher scores

If I evaluate using GPT-4.1 through OpenRouter (same dataset, same eval code), the scores are also consistently much higher than the results reported in the paper (even after I work around the Azure content filter issue).
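
For context, this is roughly how the OpenRouter run is wired up. The base URL is OpenRouter's public OpenAI-compatible endpoint; the model slug "openai/gpt-4.1" and the env var name are assumptions on my side, and `judge_messages` stands in for the unchanged judge prompt:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def judge_once(judge_messages):
    resp = client.chat.completions.create(
        model="openai/gpt-4.1",
        messages=judge_messages,
    )
    return resp.choices[0].message.content
```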

Questions:

  1. Are there known phrases in the current judge prompt that can trigger Azure “jailbreak” false positives?
  2. How can we reproduce the reported metrics? Could you recommend alternative hosted judge models?
