
Issue #1611 Fixes #1638

Merged
penguine-ip merged 2 commits into confident-ai:main from PradyMagal:main on May 29, 2025

@PradyMagal
Contributor

Fix TruthfulQA compatibility with AnthropicModel

Issue

TruthfulQA benchmark was incompatible with AnthropicModel, always failing with:
ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.

Root Cause

  • TruthfulQA attempted structured output (JSON schemas) with AnthropicModel
  • AnthropicModel responded with natural language explanations instead of JSON
  • JSON parsing failed but only TypeError was caught, not ValueError
  • Benchmark crashed instead of falling back to text-based prompting
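The third point can be checked in isolation. The sketch below is illustrative and is not deepeval's code: `parse_model_output` and `failure_is_caught` are hypothetical helpers. It shows that a JSON parsing failure on a prose reply surfaces as a `ValueError` (`json.JSONDecodeError` is a subclass of it), so an `except TypeError` clause never sees it.

```python
import json


def parse_model_output(raw: str) -> dict:
    """Hypothetical helper: parse a model reply that is expected to be JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Re-raise with the message reported in the issue.
        raise ValueError(
            "Evaluation LLM outputted an invalid JSON. "
            "Please use a better evaluation model."
        ) from exc


def failure_is_caught(handler_types, raw: str) -> bool:
    """Return True only if parsing fails and `except handler_types` catches it."""
    try:
        parse_model_output(raw)
    except handler_types:
        return True
    except Exception:
        # The failure escaped the handler under test.
        return False
    return False


prose_reply = "The answer is B because the claim is a common misconception."
print(failure_is_caught(TypeError, prose_reply))
print(failure_is_caught((TypeError, ValueError, AttributeError), prose_reply))
```

With only `TypeError` in the handler the failure escapes; with the broadened tuple it is caught.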

Line 175 (predict method):

# Before
except TypeError:

# After  
except (TypeError, ValueError, AttributeError):

Line 235 (batch_predict method):

# Before
except TypeError:

# After
except (TypeError, ValueError, AttributeError):
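A minimal sketch of the fallback pattern these two changes enable, under stated assumptions: the classes and the `predict` function below are hypothetical stand-ins, not the actual TruthfulQA benchmark code, and the `schema` keyword is assumed for illustration only. One stub model rejects the schema argument outright (`TypeError`); the other accepts it but answers in prose, so JSON parsing fails (`ValueError`).

```python
import json


class NoSchemaModel:
    """Hypothetical model whose generate() does not accept a schema at all."""

    def generate(self, prompt: str) -> str:
        return "A"


class ProseOnlyModel:
    """Hypothetical model that accepts a schema but replies in natural language."""

    def generate(self, prompt: str, schema=None):
        if schema is not None:
            # Simulates the structured-output path choking on a prose reply;
            # json.JSONDecodeError is a subclass of ValueError.
            return json.loads("I think the answer is B.")
        return "B"


def predict(model, prompt: str) -> str:
    """Try structured output first, then fall back to text-based prompting."""
    try:
        result = model.generate(prompt, schema=dict)
        return result["answer"]
    except (TypeError, ValueError, AttributeError):
        # Structured output is unsupported or unparseable: ask for plain text.
        return model.generate(prompt + "\nAnswer with a single letter.")


print(predict(NoSchemaModel(), "Which option is true?"))
print(predict(ProseOnlyModel(), "Which option is true?"))
```

Catching only `TypeError` would cover `NoSchemaModel` but let the `ProseOnlyModel` failure propagate, which matches the crash described in the issue.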

Result

  • ✅ TruthfulQA now works with AnthropicModel
  • ✅ Maintains backward compatibility with existing models
  • ✅ Graceful fallback to text-based prompting when structured output fails
  • ✅ More robust error handling for any model without structured output support

Testing

Confirmed working with:

from deepeval.benchmarks import TruthfulQA
from deepeval.benchmarks.tasks import TruthfulQATask
from deepeval.benchmarks.modes import TruthfulQAMode
from deepeval.models import AnthropicModel

benchmark = TruthfulQA(
    tasks=[TruthfulQATask.ADVERTISING, TruthfulQATask.FICTION],
    mode=TruthfulQAMode.MC2
)
benchmark.evaluate(model=AnthropicModel())

Output:
Filter: 100%|████████████████████████████████████████████████████████████████████████████| 817/817 [00:00<00:00, 24678.95 examples/s]
Processing Fiction: 100%|████████████████████████████████████████████████████████████████████████████| 30/30 [05:04<00:00, 10.14s/it]
TruthfulQA Task Accuracy (task=Fiction): 82.43333333333334
Overall TruthfulQA Accuracy: 82.43333333333334

Got it working: I replicated the code from the issue and it runs well.

The handler needed to account for ValueError, which is what the JSON parsing failure raises.
@vercel

vercel Bot commented May 28, 2025

@PradyMagal is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

@penguine-ip
Contributor

hey @PradyMagal thanks! But can you please undo poetry lock and all the adds to the toml file? We just spent a whole week cutting down deepeval's package size in prod

@PradyMagal
Contributor Author

Poetry Lock has been undone, same with the additions to the toml file

Let me know if there's anything I missed

@penguine-ip
Contributor

@PradyMagal perfect, thanks!

@penguine-ip penguine-ip merged commit d4876b8 into confident-ai:main May 29, 2025
0 of 4 checks passed