RAIL Score custom metric: 8-dimension responsible AI evaluation for DeepEval #2590
SumitVermakgp started this conversation in Ideas
Problem
Standard evaluation metrics (accuracy, BLEU, ROUGE, G-Eval criteria) measure quality and correctness but do not capture responsible AI dimensions like fairness, safety, privacy, or accountability. As LLM applications move into production, teams need structured evaluation across these dimensions to catch bias, safety gaps, and reliability issues before they reach users.
What RAIL Score provides
RAIL Score is a responsible AI evaluation API that scores LLM outputs across 8 dimensions, each on a 0-10 scale with a confidence estimate:
The Python SDK is available on PyPI:
pip install rail-score-sdk
Integration with DeepEval
The integration is a custom RAILScoreMetric class extending BaseMetric. It calls the RAIL Score API in measure() / a_measure(), normalizes scores to 0-1, and populates score_breakdown with all 8 dimension scores.
Basic usage
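A minimal sketch of how such a metric could be structured, not the implementation from the linked PR. It assumes a hypothetical `score_fn` callable standing in for the RAIL Score client (the real SDK interface is not shown here), a default threshold of 0.5, and an unweighted mean as the overall score. Only four of the dimensions named above are used, with made-up values. The import fallback lets the sketch run without DeepEval installed.

```python
from types import SimpleNamespace
from typing import Callable, Dict

try:
    from deepeval.metrics import BaseMetric
except ImportError:
    # Fallback so the sketch runs without DeepEval installed.
    BaseMetric = object


class RAILScoreMetric(BaseMetric):
    """Sketch of a custom metric wrapping a RAIL-style scorer."""

    def __init__(self, score_fn: Callable[[str], Dict[str, float]],
                 threshold: float = 0.5):
        # score_fn is a hypothetical stand-in for the RAIL Score client:
        # it takes the model output and returns {dimension: score on 0-10}.
        self.score_fn = score_fn
        self.threshold = threshold  # assumed default, on the 0-1 scale
        self.score = None
        self.score_breakdown = {}
        self.success = None

    def measure(self, test_case, *args, **kwargs) -> float:
        raw = self.score_fn(test_case.actual_output)
        if not raw:
            raise ValueError("scorer returned no dimension scores")
        # Normalize each 0-10 dimension score to 0-1.
        self.score_breakdown = {dim: value / 10.0 for dim, value in raw.items()}
        # Overall score as an unweighted mean (an assumption of this sketch).
        self.score = sum(self.score_breakdown.values()) / len(self.score_breakdown)
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case, *args, **kwargs) -> float:
        # Delegates to the sync path; a real client call would be awaited here.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return bool(self.success)

    @property
    def __name__(self):
        return "RAIL Score"


def fake_scorer(text: str) -> Dict[str, float]:
    # Made-up scores for illustration; not real RAIL Score output.
    return {"fairness": 8.0, "safety": 9.0, "privacy": 7.0, "accountability": 6.0}


metric = RAILScoreMetric(score_fn=fake_scorer, threshold=0.7)
# Any object with an actual_output attribute works for this sketch;
# with DeepEval this would be an LLMTestCase.
case = SimpleNamespace(input="What is my refund status?",
                       actual_output="Your refund was issued yesterday.")
score = metric.measure(case)
print(f"overall={score:.2f} passed={metric.is_successful()}")
```

Injecting the scorer keeps the metric testable offline and leaves the choice of API client, retries, and credentials outside the metric class.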
Per-dimension scores
After evaluation, per-dimension scores are accessible via score_breakdown:
Example output:
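As an illustration with made-up values (not real RAIL Score output), a breakdown already normalized to 0-1 could be inspected like this. The helper `flag_weak_dimensions` and its 0.5 cutoff are hypothetical and only show one way to surface the weakest dimensions.

```python
# Illustrative values only, already normalized to 0-1.
score_breakdown = {
    "fairness": 0.8,
    "safety": 0.9,
    "privacy": 0.4,
    "accountability": 0.6,
}


def flag_weak_dimensions(breakdown, threshold=0.5):
    """Return (dimension, score) pairs below threshold, lowest first."""
    weak = [(dim, s) for dim, s in breakdown.items() if s < threshold]
    return sorted(weak, key=lambda item: item[1])


# Print dimensions from weakest to strongest.
for dim, s in sorted(score_breakdown.items(), key=lambda item: item[1]):
    print(f"{dim:15s} {s:.2f}")

weak = flag_weak_dimensions(score_breakdown)
```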
Deep mode with domain context
For safety-sensitive domains, use deep mode to get per-dimension explanations:
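One way to keep the mode and domain settings explicit is a small validated config object. This is a sketch under assumptions: the class name `RAILMetricConfig`, the 0.5 threshold default, the `"healthcare"` domain value, and the idea that credits are charged per evaluated test case are all hypothetical. The credit costs for basic and deep mode and the remaining defaults are the ones listed in this post.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Credit costs stated in this post: basic = 1 credit, deep = 3 credits.
CREDITS_PER_MODE = {"basic": 1, "deep": 3}


@dataclass(frozen=True)
class RAILMetricConfig:
    threshold: float = 0.5  # assumed default; not stated in the post
    mode: str = "basic"
    domain: str = "general"
    dimensions: Optional[Tuple[str, ...]] = None  # None = all (assumption)
    strict_mode: bool = False
    async_mode: bool = True

    def __post_init__(self):
        # Reject unknown modes early, before any API credits are spent.
        if self.mode not in CREDITS_PER_MODE:
            raise ValueError(f"mode must be one of {sorted(CREDITS_PER_MODE)}")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")

    def estimated_credits(self, n_test_cases: int) -> int:
        # Assumes one API call, and therefore one charge, per test case.
        return CREDITS_PER_MODE[self.mode] * n_test_cases


deep_cfg = RAILMetricConfig(mode="deep", domain="healthcare", threshold=0.7)
print(deep_cfg.estimated_credits(10))
```

Estimating credits up front makes the cost difference between the two modes visible before a large evaluation run.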
Configuration options
threshold
mode: default "basic"; "basic" (fast, 1 credit) or "deep" (explanations, 3 credits)
domain: default "general"
dimensions
strict_mode: default False
async_mode: default True
Resources
pip install rail-score-sdk
A complete working example (metric class + evaluation script) is available in the linked PR.
Interested in feedback on this approach. Would a dedicated responsible AI evaluation metric category be useful for the DeepEval ecosystem?