Context
DeepEval evaluates LLM outputs on faithfulness, relevance, hallucination, and similar quality dimensions. One related dimension is missing: trust scoring, i.e., how trustworthy an output is given the sources it was built from.
Two responses can score identically on faithfulness but have very different trust profiles:
- Response A sourced from SEC filings (high trust)
- Response B sourced from unverified blog posts (low trust)
Possible metric
A `TrustScoreMetric` that evaluates:
- Source tier — were the retrieval sources authoritative (T1-T2) or unverified (T4-T5)?
- Provenance completeness — does the output carry metadata about its origin?
- Verification status — was the output human-reviewed?
A sketch of the proposed API:

```python
from deepeval.metrics import TrustScoreMetric  # proposed; does not exist yet
from deepeval.test_case import LLMTestCase

metric = TrustScoreMetric(
    threshold=0.7,
    # Lower tier = more authoritative source (T1 best, T5 worst)
    source_tiers={"SEC filings": 1, "news": 3, "forums": 4, "AI inference": 5},
)

test_case = LLMTestCase(
    input="What was Q3 revenue?",
    actual_output="Revenue was $4.2B",
    retrieval_context=["SEC 10-Q filing: Revenue $4.2B"],
)

metric.measure(test_case)
# trust_score: 0.95 (T1 source)
```
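This doesn't need new plumbing to prototype: DeepEval's existing custom-metric hook (subclassing `BaseMetric`) can already express a first approximation. Below is a minimal sketch, assuming each `retrieval_context` string contains a recognizable source name; the substring matching and the linear tier-to-score mapping are illustrative placeholders, not a proposed implementation.

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class NaiveTrustScoreMetric(BaseMetric):
    """Prototype: scores an output by the trust tier of its retrieval sources."""

    def __init__(self, threshold: float = 0.7, source_tiers: dict[str, int] | None = None):
        self.threshold = threshold
        # Substrings to look for in each retrieval_context entry,
        # mapped to tiers (1 = most authoritative, 5 = least).
        self.source_tiers = source_tiers or {}

    def _tier(self, context: str) -> int:
        # Naive matching: the first known source name found in the context wins;
        # unknown sources fall through to the least-trusted tier.
        for name, tier in self.source_tiers.items():
            if name.lower() in context.lower():
                return tier
        return 5

    def measure(self, test_case: LLMTestCase) -> float:
        contexts = test_case.retrieval_context or []
        if not contexts:
            # No provenance at all: nothing to trust.
            self.score = 0.0
        else:
            # Illustrative linear mapping: T1 -> 1.0, T5 -> 0.0,
            # averaged over all retrieved sources.
            self.score = sum((5 - self._tier(c)) / 4 for c in contexts) / len(contexts)
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Naive Trust Score"


metric = NaiveTrustScoreMetric(
    threshold=0.7,
    source_tiers={"SEC": 1, "news": 3, "forum": 4, "AI inference": 5},
)
metric.measure(
    LLMTestCase(
        input="What was Q3 revenue?",
        actual_output="Revenue was $4.2B",
        retrieval_context=["SEC 10-Q filing: Revenue $4.2B"],
    )
)
print(metric.score, metric.is_successful())  # 1.0 True under this toy mapping
```

A native metric could replace the naive substring matching with real provenance metadata, which is where the proposed source-tier, provenance, and verification inputs would come in.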
Why this matters
- EU AI Act Article 50 (transparency obligations, applicable from August 2, 2026): compliance will require transparency about how outputs were produced
- Enterprise RAG systems need to differentiate high-trust vs low-trust outputs
- Extends DeepEval's coverage from quality metrics to trust metrics
Reference
- AKF defines source tiers (T1-T5) and trust computation
Would the team consider trust scoring as an evaluation dimension?