
[Bench Eval] How to reproduce reported GEdit-Bench metrics? #143


Description

@ElliotQi

Hi maintainers, thanks for releasing the GEdit-Bench evaluation code.

I'm trying to reproduce the reported GEdit-Bench metrics using GPT-4.1 as the judge, but I'm seeing two issues that make the scores hard to reproduce:

1) Azure OpenAI GPT-4.1 gets blocked by content_filter (jailbreak detected)

When calling GPT-4.1 on Azure OpenAI from the evaluation script, some requests fail with HTTP 400:

{
  "error": {
    "message": "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry...",
    "code": "content_filter",
    "status": 400,
    "innererror": {
      "code": "ResponsibleAIPolicyViolation",
      "content_filter_result": {
        "jailbreak": { "detected": true, "filtered": true }
      }
    }
  }
}

I found that changing the prompt order can reduce or eliminate the Azure blocking, but then the overall evaluation scores become consistently higher than with the original prompt format.
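
For reference, here is a minimal sketch of how I currently catch the filtered requests so the run doesn't abort. The deployment name, environment variables, and the skip-on-filter fallback are my own placeholders, not from the GEdit-Bench eval code:

```python
import os
from openai import AzureOpenAI, BadRequestError

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-06-01",
)

def judge_once(messages, deployment="gpt-4.1"):
    try:
        resp = client.chat.completions.create(model=deployment, messages=messages)
        return resp.choices[0].message.content
    except BadRequestError as e:
        # Azure surfaces the jailbreak filter as HTTP 400 with code "content_filter".
        if getattr(e, "code", None) == "content_filter" or "content_filter" in str(e):
            return None  # mark the sample as filtered instead of failing the whole run
        raise
```

Skipping filtered samples obviously changes the score denominator, which is part of why I'd like guidance on the intended handling.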

2) OpenRouter’s GPT-4.1 gets higher scores

If I evaluate using GPT-4.1 through OpenRouter (same dataset, same eval code), the scores are also consistently much higher than the results reported in the paper (even after I work around the Azure content filter issue).
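
For context, this is roughly how the OpenRouter run is wired up. The base URL is OpenRouter's public OpenAI-compatible endpoint; the model slug "openai/gpt-4.1" and the env var name are assumptions on my side, and `judge_messages` stands in for the unchanged judge prompt:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def judge_once(judge_messages):
    resp = client.chat.completions.create(
        model="openai/gpt-4.1",
        messages=judge_messages,
    )
    return resp.choices[0].message.content
```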

Questions:

  1. Are there known phrases in the current judge prompt that can trigger Azure “jailbreak” false positives?
  2. How can we reproduce the reported metrics? Could you recommend alternative hosted judge models?
