Hi maintainers, thanks for releasing the GEdit-Bench evaluation code.
I'm trying to reproduce the reported GEdit-Bench metrics using GPT-4.1 as the judge, but I'm seeing two issues that make the scores hard to reproduce:
1) Azure OpenAI GPT-4.1 gets blocked by content_filter (jailbreak detected)
When calling GPT-4.1 on Azure OpenAI from the evaluation script, some requests fail with HTTP 400:
{
  "error": {
    "message": "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry...",
    "code": "content_filter",
    "status": 400,
    "innererror": {
      "code": "ResponsibleAIPolicyViolation",
      "content_filter_result": {
        "jailbreak": { "detected": true, "filtered": true }
      }
    }
  }
}

I found that changing the prompt order can reduce or eliminate the Azure blocking, but then the overall evaluation scores become consistently higher than with the original prompt format.
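For reference, this is roughly how I detect and skip the blocked requests on my side. It is only a minimal sketch of my local wrapper around the judge call, not the repo's code; the endpoint, deployment name, and `call_judge` helper are placeholders from my own setup:

import openai
from openai import AzureOpenAI

# Placeholder Azure client; endpoint/key/deployment are from my own resource.
client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint="https://<my-resource>.openai.azure.com",
    api_key="<AZURE_API_KEY>",
)

def call_judge(messages):
    """Send one judge request; report Azure content-filter blocks instead of crashing."""
    try:
        resp = client.chat.completions.create(
            model="gpt-4.1",   # my Azure deployment name
            messages=messages,
            temperature=0.0,
        )
        return resp.choices[0].message.content
    except openai.BadRequestError as e:
        # Azure returns HTTP 400 with error.code == "content_filter" for these blocks.
        if "content_filter" in str(e):
            print("Blocked by Azure content filter (jailbreak false positive), skipping sample")
            return None
        raise

Samples that return None here are the ones dropped from my score computation, which is part of why I suspect the numbers drift from the paper's.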
2) OpenRouter’s GPT-4.1 gets higher scores
If I evaluate using an OpenRouter GPT-4.1 API key (same dataset, same eval code logic), the scores are also consistently much higher than the results reported in the paper (even after working around the Azure content filter issue).
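For clarity, the only change on the OpenRouter side is the client configuration; the judge prompt and response parsing are unchanged. Again just a sketch of my local setup, and the model slug and environment variable name are my own choices:

import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so only the client differs.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def call_judge(messages):
    resp = client.chat.completions.create(
        model="openai/gpt-4.1",  # OpenRouter slug I use for GPT-4.1
        messages=messages,
        temperature=0.0,
    )
    return resp.choices[0].message.content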
Questions:
- Are there known phrases in the current judge prompt that can trigger Azure “jailbreak” false positives?
- How can I reproduce the reported metrics? Could you recommend any other hosted judge models?