feat: add regulatory compliance evaluation metrics with NIST AI RMF alignment#2717

Open
garyatwalAI wants to merge 12 commits into langfuse:main from garyatwalAI:cookbook/regulatory-compliance-evaluation

Conversation

@garyatwalAI

Summary

Adds a new mlflow-regulatory-compliance plugin package (under cookbook/) providing four governance-focused evaluation metrics for LLM and ML model assessment: PII detection, legal privilege detection, factual grounding, and bias detection. Includes a composite NIST AI RMF compliance score and structured compliance report generator.

  • PII Detection Metric: Detects 9 categories of personally identifiable information using pattern matching and validation algorithms (Luhn check for credit cards, area number validation for SSNs)
  • Legal Privilege Detection Metric: Flags attorney-client, work product, and settlement communications using multi-layer detection with confidence scoring and false positive mitigation
  • Factual Grounding Metric: Measures RAG faithfulness by checking if model claims are supported by provided context using token and n-gram overlap
  • Bias Detection Metric: Detects demographic bias across 5 dimensions (gender, racial/ethnic, age, disability, socioeconomic) with configurable sensitivity levels
  • NIST AI RMF Composite Score: Weighted aggregate score mapping to GOVERN, MAP, MEASURE, MANAGE functions
  • Compliance Report Generator: Structured report with per-function scores, pass/fail status, and remediation recommendations
  • Full MLflow Integration: Works with mlflow.evaluate() via standard make_metric() / extra_metrics interface
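
The Luhn check used for credit card validation is a standard checksum. A minimal standalone sketch (the plugin's actual implementation may differ in details):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Used to cut false positives when a 13-19 digit run looks like a
    credit card number but is not a checksum-valid one.
    """
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Running the pattern match first and the checksum second keeps the metric cheap: only digit runs that already look like card numbers pay the validation cost.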

Motivation

Organisations deploying AI in regulated industries (legal services, insurance, financial services, healthcare) need to evaluate models against governance and compliance criteria — not just accuracy and latency. Current MLflow evaluation metrics cover model quality but lack coverage for regulatory compliance dimensions that are increasingly required by frameworks like the NIST AI Risk Management Framework, ISO 42001, the EU AI Act, and state-level AI laws in Colorado, Texas, and California.

This plugin fills that gap by providing compliance evaluation metrics that integrate directly into existing MLflow evaluation workflows, so teams can assess governance alongside performance without switching tools.

NIST AI RMF Mapping

NIST Function   Metric                  What It Measures
GOVERN          Meta-assessment         Whether all governance evaluators are active
MAP             Factual Grounding       Risk of hallucination and ungrounded claims
MEASURE         PII + Bias Detection    Compliance with data protection and fairness requirements
MANAGE          Legal Privilege         Runtime prevention of privileged information disclosure
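
To illustrate how a weighted composite of this kind can be computed, here is a minimal sketch; the function name, default equal weighting, and example scores are illustrative assumptions, not the plugin's actual values:

```python
def nist_composite(scores, weights=None):
    """Weighted average of per-function scores, each normalized to [0, 1].

    `scores` maps NIST function names (GOVERN, MAP, MEASURE, MANAGE) to
    metric scores; `weights` defaults to equal weighting. The weights
    used by the plugin are configurable and not reproduced here.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Equal weighting across the four functions:
score = nist_composite({"GOVERN": 1.0, "MAP": 0.8, "MEASURE": 0.9, "MANAGE": 1.0})
```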

Example Usage

import mlflow
from mlflow_regulatory_compliance import (
    pii_detection_metric,
    legal_privilege_metric,
    factual_grounding_metric,
    bias_detection_metric,
    nist_composite_metric,
)

results = mlflow.evaluate(
    data=eval_df,
    predictions="predictions",
    extra_metrics=[
        pii_detection_metric,
        legal_privilege_metric,
        factual_grounding_metric,
        bias_detection_metric,
        nist_composite_metric,
    ],
)

print(f"NIST Compliance Score: {results.metrics['nist_compliance_score/v1/mean']:.2f}")

Tests

  • PII detection — all 9 categories, Luhn validation, false positives, edge cases
  • Legal privilege — all 3 categories, confidence scoring, false positives
  • Factual grounding — grounded, partial, hallucinated inputs
  • Bias detection — all 5 dimensions, sensitivity levels, custom terms
  • NIST composite — weighting, thresholds, aggregation
  • Compliance evaluator — metric selection, configuration
  • Report generator — output format, status determination, recommendations
  • Edge cases — empty input, None input, Unicode handling
  • 159 tests passing, ruff lint clean

Compatibility

  • MLflow >= 2.10
  • Python >= 3.9
  • No heavy ML dependencies in default mode (regex + pattern matching)

…gulatory-compliance

Initialise the mlflow-regulatory-compliance plugin package with pyproject.toml,
setup.py, requirements.txt, Apache 2.0 license, and shared utility modules
including PII/privilege regex patterns, Luhn validation, bias indicator lists,
and scoring helpers.

Detect 9 categories of personally identifiable information in model outputs:
email addresses, phone numbers (US/UK/international), SSN/NIN, credit card
numbers (with Luhn validation), physical addresses, names in identifying
context, dates of birth, IP addresses, and passport/driving licence numbers.
Returns per-row pii_detection_score with aggregate statistics.

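The SSN area-number validation mentioned above can be sketched as follows. This is a simplified standalone version: areas 000 and 666 and the 900-999 range are never issued, so matches with those prefixes are treated as false positives.

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def find_ssns(text):
    """Return SSN-shaped matches whose area number is plausibly valid."""
    hits = []
    for m in SSN_RE.finditer(text):
        area = int(m.group(1))
        # Areas 000, 666, and 900-999 are never assigned by the SSA.
        if area == 0 or area == 666 or area >= 900:
            continue
        hits.append(m.group(0))
    return hits
```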
Detect potentially privileged legal content across three categories:
attorney-client privilege, work product doctrine, and settlement/mediation
communications. Includes false positive mitigation for terms like
"attorney general" and confidence-weighted scoring.

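The false positive mitigation can be illustrated with a simple strip-then-match approach; the patterns below are illustrative stand-ins, not the plugin's actual pattern lists:

```python
import re

# Trigger phrases that suggest privileged content (illustrative subset).
PRIVILEGE_TERMS = re.compile(
    r"attorney[- ]client|work product|privileged and confidential", re.I
)
# Phrases containing a trigger word that are not privileged communications.
FALSE_POSITIVES = re.compile(r"attorney general", re.I)

def flag_privileged(text):
    """Return True if privilege markers appear outside known benign phrases."""
    cleaned = FALSE_POSITIVES.sub(" ", text)
    return bool(PRIVILEGE_TERMS.search(cleaned))
```

Removing the benign phrase before matching ensures "attorney general" never triggers the "attorney" patterns, without weakening detection elsewhere in the text.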
Measure how well model outputs are grounded in provided context for RAG
systems. Extracts claims via sentence segmentation, checks each against
context using combined token and n-gram overlap scoring. No heavy ML
dependencies required — uses regex-based tokenisation by default.

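A minimal sketch of the n-gram side of this scoring, assuming whitespace tokenisation and bigram overlap only (the plugin reportedly combines this with token-level overlap):

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def grounding_score(claim, context, n=2):
    """Fraction of the claim's n-grams that also appear in the context.

    1.0 means every claim n-gram is supported by the context;
    0.0 means the claim is fully unsupported.
    """
    claim_grams = ngrams(claim.lower().split(), n)
    if not claim_grams:
        return 0.0
    context_grams = ngrams(context.lower().split(), n)
    return len(claim_grams & context_grams) / len(claim_grams)
```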
Detect demographic bias across five dimensions: gender, racial/ethnic, age,
disability, and socioeconomic. Uses curated indicator lists with configurable
sensitivity levels (low/medium/high) and support for custom terms.

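One way such sensitivity levels can work is to vary the number of indicator hits required before a dimension is flagged; the thresholds and placeholder terms below are illustrative assumptions, not the plugin's curated lists:

```python
# Higher sensitivity flags on fewer indicator hits (assumed thresholds).
SENSITIVITY_MIN_HITS = {"low": 3, "medium": 2, "high": 1}

def bias_flags(text, indicators, sensitivity="medium"):
    """Flag each dimension whose indicator hits reach the sensitivity threshold.

    `indicators` maps a dimension name to a list of lowercase terms;
    in the plugin these would come from curated lists plus custom terms.
    """
    lowered = text.lower()
    threshold = SENSITIVITY_MIN_HITS[sensitivity]
    return {
        dim: sum(1 for term in terms if term in lowered) >= threshold
        for dim, terms in indicators.items()
    }
```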
Combine PII detection, legal privilege, factual grounding, and bias detection
into a single NIST AI RMF-aligned compliance score. Maps metrics to GOVERN,
MAP, MEASURE, and MANAGE functions with configurable weights and pass/fail
threshold determination.

Provide RegulatoryComplianceEvaluator class that produces a configurable list
of MLflow evaluation metrics for use with mlflow.evaluate(extra_metrics=...).
Supports enabling/disabling individual metrics and configuring bias sensitivity,
context columns, NIST thresholds, and custom weights.

Generate structured compliance reports mapping evaluation results to NIST AI
RMF functions. Outputs a pandas DataFrame with per-function scores, PASS/WARN/FAIL
status, remediation recommendations, and supporting evidence. Includes
to_dict() for mlflow.log_dict() integration.

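The PASS/WARN/FAIL status determination could look roughly like this; the thresholds and the list-of-dicts structure are illustrative assumptions (the plugin outputs a pandas DataFrame):

```python
def status_for(score, warn=0.7, fail=0.5):
    """Map a normalized compliance score to PASS/WARN/FAIL (assumed cutoffs)."""
    if score < fail:
        return "FAIL"
    if score < warn:
        return "WARN"
    return "PASS"

def build_report(function_scores):
    """Return one report row per NIST function with its score and status."""
    return [
        {"function": name, "score": round(score, 3), "status": status_for(score)}
        for name, score in function_scores.items()
    ]

report = build_report({"GOVERN": 1.0, "MAP": 0.62, "MEASURE": 0.41, "MANAGE": 0.9})
```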
159 tests covering all metrics individually and in combination:
- PII detection: all 9 categories, Luhn validation, false positives, edge cases
- Legal privilege: all 3 categories, confidence scoring, false positives
- Factual grounding: grounded, partial, hallucinated inputs, custom thresholds
- Bias detection: all 5 dimensions, sensitivity levels, custom terms
- NIST composite: weighting, thresholds, aggregation, function score structure
- Compliance evaluator: metric selection, configuration, enabled metrics
- Report generator: output format, status determination, recommendations
- Edge cases: empty input, None input, Unicode handling

Document all five compliance metrics, the RegulatoryComplianceEvaluator,
and the NIST report generator. Include quick start example, configuration
reference table, NIST AI RMF alignment mapping, and industry use cases
for insurance, legal, financial services, and healthcare AI.

@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@vercel

vercel bot commented Mar 25, 2026

@garyatwalAI is attempting to deploy a commit to the langfuse Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot bot added labels on Mar 25, 2026: size:XXL (This PR changes 1000+ lines, ignoring generated files), documentation (Improvements or additions to documentation), enhancement (New feature or request)
@garyatwalAI
Author

Hi @marcklingen @jannikmaierhoefer — would appreciate a review when you get a chance. This adds an mlflow-regulatory-compliance plugin package to the cookbook with NIST AI RMF-aligned evaluation metrics (PII detection, legal privilege detection, factual grounding, bias detection) that integrate with mlflow.evaluate(). All 159 tests passing. Thanks!
