RAIL Score custom metric: 8-dimension responsible AI evaluation for DeepEval #2590

SumitVermakgp · 2026-04-02T09:21:38Z

Problem

Standard evaluation metrics (accuracy, BLEU, ROUGE, G-Eval criteria) measure quality and correctness but do not capture responsible AI dimensions like fairness, safety, privacy, or accountability. As LLM applications move into production, teams need structured evaluation across these dimensions to catch bias, safety gaps, and reliability issues before they reach users.

What RAIL Score provides

RAIL Score is a responsible AI evaluation API that scores LLM outputs across 8 dimensions, each on a 0-10 scale with a confidence estimate:

| Dimension | What it measures |
| --- | --- |
| Fairness | Equitable treatment, absence of bias |
| Safety | Prevention of harmful content |
| Reliability | Factual accuracy, consistency |
| Transparency | Clear reasoning, disclosed limitations |
| Privacy | PII protection, data minimization |
| Accountability | Traceable decisions, auditable reasoning |
| Inclusivity | Accessible, culturally aware language |
| User Impact | Value delivered to the end user |

The Python SDK is available on PyPI: pip install rail-score-sdk

Integration with DeepEval

The integration is a custom RAILScoreMetric class that extends DeepEval's BaseMetric. Its measure() and a_measure() methods call the RAIL Score API, normalize the 0-10 scores to DeepEval's 0-1 range, and populate score_breakdown with all 8 dimension scores.
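The normalization and pass/fail step can be sketched in plain Python. This is a minimal sketch, not the code from the linked PR: it assumes normalization is a simple division by 10 (consistent with threshold=0.5 meaning 5/10) and that the API returns an overall score alongside the per-dimension scores. The names normalize_rail_scores and passes are hypothetical.

```python
from typing import Dict, Tuple

RAIL_MAX = 10.0  # RAIL Score reports each dimension on a 0-10 scale


def normalize_rail_scores(
    overall_raw: float, dimensions_raw: Dict[str, float]
) -> Tuple[float, Dict[str, float]]:
    """Map 0-10 RAIL scores onto the 0-1 range that DeepEval metrics use.

    Hypothetical helper: the input shape is an assumption, not the SDK's API.
    """

    def to_unit(value: float) -> float:
        # Clamp first so an out-of-range value cannot leave [0, 1].
        return min(max(value, 0.0), RAIL_MAX) / RAIL_MAX

    breakdown = {name: to_unit(v) for name, v in dimensions_raw.items()}
    return to_unit(overall_raw), breakdown


def passes(score: float, threshold: float = 0.5) -> bool:
    """A test case passes when the normalized overall score meets the threshold."""
    return score >= threshold


score, breakdown = normalize_rail_scores(8.2, {"safety": 9.0, "privacy": 5.0})
print(f"{score:.2f}", breakdown, passes(score))
```

In a metric class, the two return values would be assigned to self.score and self.score_breakdown, and passes() would back is_successful().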

Basic usage

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from rail_score_metric import RAILScoreMetric

metric = RAILScoreMetric(
    threshold=0.5,   # Pass if overall >= 5/10
    mode="basic",    # "basic" (fast) or "deep" (with explanations)
    domain="general" # "general", "healthcare", "finance", "legal", etc.
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)

evaluate(test_cases=[test_case], metrics=[metric])

Per-dimension scores

After evaluation, per-dimension scores are accessible via score_breakdown:

metric.measure(test_case)

print(f"Overall: {metric.score:.2f}")
for dim, val in metric.score_breakdown.items():
    # Pad the label so values line up as in the example output
    print(f"  {dim + ':':<17}{val:.2f}")

Example output:

Overall: 0.82
  fairness:        0.85
  safety:          0.90
  reliability:     0.80
  transparency:    0.75
  privacy:         0.50
  accountability:  0.80
  inclusivity:     0.85
  user_impact:     0.90
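An overall score can hide a single weak dimension: in the example output the overall is 0.82 while privacy sits at 0.50. A per-dimension floor check makes that visible. The sketch below assumes score_breakdown is a plain dict of floats keyed by dimension name; dimensions_below and the 0.6 floor are hypothetical and not part of the metric.

```python
from typing import Dict, List

# Values copied from the example output above.
example_breakdown: Dict[str, float] = {
    "fairness": 0.85,
    "safety": 0.90,
    "reliability": 0.80,
    "transparency": 0.75,
    "privacy": 0.50,
    "accountability": 0.80,
    "inclusivity": 0.85,
    "user_impact": 0.90,
}


def dimensions_below(breakdown: Dict[str, float], floor: float) -> List[str]:
    """Return the names of dimensions scoring under `floor`, sorted by name."""
    return sorted(name for name, value in breakdown.items() if value < floor)


weak = dimensions_below(example_breakdown, floor=0.6)
print(weak)  # ['privacy']
```

With a real metric instance, metric.score_breakdown would take the place of example_breakdown after measure() has run.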

Deep mode with domain context

For safety-sensitive domains, use deep mode to get per-dimension explanations:

metric = RAILScoreMetric(
    threshold=0.6,
    mode="deep",
    domain="healthcare",
)
# medical_test_case is an LLMTestCase, built the same way as test_case in "Basic usage"
metric.measure(medical_test_case)
print(metric.reason)  # Includes per-dimension explanations

Configuration options

| Parameter | Default | Description |
| --- | --- | --- |
| `threshold` | 0.5 | Minimum score (0-1) to pass |
| `mode` | `"basic"` | `"basic"` (fast, 1 credit) or `"deep"` (explanations, 3 credits) |
| `domain` | `"general"` | Evaluation domain context |
| `dimensions` | all 8 | Subset of dimensions to evaluate |
| `strict_mode` | `False` | Binary 0/1 scoring |
| `async_mode` | `True` | Async evaluation (DeepEval default) |
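Because deep mode costs three times as much as basic mode, it can help to estimate credit usage before running a large test suite. The sketch below takes the per-mode costs from the table and assumes that each test case triggers exactly one API call and that credits are charged per call; neither assumption is confirmed here. estimate_credits is a hypothetical helper, not part of the SDK.

```python
# Per-evaluation credit costs, taken from the configuration table.
CREDITS_PER_EVAL = {"basic": 1, "deep": 3}


def estimate_credits(num_test_cases: int, mode: str = "basic") -> int:
    """Estimate credits for a run, assuming one API call per test case."""
    if mode not in CREDITS_PER_EVAL:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(CREDITS_PER_EVAL)}")
    if num_test_cases < 0:
        raise ValueError("num_test_cases must be non-negative")
    return num_test_cases * CREDITS_PER_EVAL[mode]


print(estimate_credits(200, "basic"))  # 200
print(estimate_credits(200, "deep"))   # 600
```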

Resources

RAIL Score SDK on PyPI -- pip install rail-score-sdk
SDK Documentation
API Reference
Free API key signup

A complete working example (metric class + evaluation script) is available in the linked PR.

Interested in feedback on this approach. Would a dedicated responsible AI evaluation metric category be useful for the DeepEval ecosystem?
