A focused evaluation harness built to expose the real failure modes of LLM code reasoning. This isn't a pass/fail scoreboard; it's a diagnostic layer for models that pretend to understand requirements.
Benchmarks like HumanEval, MBPP, and SWE-Bench measure surface accuracy. xFail is designed to classify failure behavior and tie it to concrete model breakdowns.
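To make the distinction concrete, here is a minimal sketch of what classifying failure behavior (rather than just pass/fail) can look like. Everything in it is illustrative, not xFail's actual API: the `FailureMode` taxonomy, the `classify` function, and the assumed `solve` entry point are all hypothetical, and the test split (last case treated as the boundary case) is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto

class FailureMode(Enum):
    """Illustrative failure taxonomy (hypothetical, not xFail's)."""
    SYNTAX_ERROR = auto()   # generated code does not even run
    SPEC_MISREAD = auto()   # fails nominal cases: requirement misunderstood
    EDGE_CASE = auto()      # passes nominal cases, fails boundary cases
    PASSED = auto()

@dataclass
class EvalResult:
    task_id: str
    mode: FailureMode

def classify(task_id: str, code: str, tests: list) -> EvalResult:
    """Run candidate `code` against (args, expected) pairs and label the failure.

    Assumption: the candidate defines a function named `solve`, and the
    last test case in `tests` is the boundary case.
    """
    try:
        ns: dict = {}
        exec(code, ns)          # execute the candidate solution
        fn = ns["solve"]        # assumed entry-point name
    except Exception:
        return EvalResult(task_id, FailureMode.SYNTAX_ERROR)
    nominal, boundary = tests[:-1], tests[-1:]
    if any(fn(*args) != expected for args, expected in nominal):
        return EvalResult(task_id, FailureMode.SPEC_MISREAD)
    if any(fn(*args) != expected for args, expected in boundary):
        return EvalResult(task_id, FailureMode.EDGE_CASE)
    return EvalResult(task_id, FailureMode.PASSED)
```

A pass/fail scoreboard would score the `SPEC_MISREAD` and `EDGE_CASE` outcomes identically; separating them is what ties a failing sample back to a specific model breakdown.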