feat: add regulatory compliance evaluation metrics with NIST AI RMF alignment#2717

Open
garyatwalAI wants to merge 12 commits into langfuse:main from garyatwalAI:cookbook/regulatory-compliance-evaluation

Conversation

@garyatwalAI

Summary

Adds a new mlflow-regulatory-compliance plugin package (under cookbook/) providing four governance-focused evaluation metrics for LLM and ML model assessment: PII detection, legal privilege detection, factual grounding, and bias detection. Includes a composite NIST AI RMF compliance score and structured compliance report generator.

  • PII Detection Metric: Detects 9 categories of personally identifiable information using pattern matching and validation algorithms (Luhn check for credit cards, area number validation for SSNs)
  • Legal Privilege Detection Metric: Flags attorney-client, work product, and settlement communications using multi-layer detection with confidence scoring and false positive mitigation
  • Factual Grounding Metric: Measures RAG faithfulness by checking if model claims are supported by provided context using token and n-gram overlap
  • Bias Detection Metric: Detects demographic bias across 5 dimensions (gender, racial/ethnic, age, disability, socioeconomic) with configurable sensitivity levels
  • NIST AI RMF Composite Score: Weighted aggregate score mapping to GOVERN, MAP, MEASURE, MANAGE functions
  • Compliance Report Generator: Structured report with per-function scores, pass/fail status, and remediation recommendations
  • Full MLflow Integration: Works with mlflow.evaluate() via standard make_metric() / extra_metrics interface
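
The Luhn check used for credit card validation is a standard checksum. A minimal standalone sketch (the plugin's actual implementation may differ in details):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Used to cut false positives when a 13-19 digit run looks like a
    credit card number but is not a checksum-valid one.
    """
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Running the pattern match first and the checksum second keeps the metric cheap: only digit runs that already look like card numbers pay the validation cost.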

Motivation

Organisations deploying AI in regulated industries (legal services, insurance, financial services, healthcare) need to evaluate models against governance and compliance criteria — not just accuracy and latency. Current MLflow evaluation metrics cover model quality but lack coverage for regulatory compliance dimensions that are increasingly required by frameworks like the NIST AI Risk Management Framework, ISO 42001, the EU AI Act, and state-level AI laws in Colorado, Texas, and California.

This plugin fills that gap by providing compliance evaluation metrics that integrate directly into existing MLflow evaluation workflows, so teams can assess governance alongside performance without switching tools.

NIST AI RMF Mapping

NIST Function   Metric                  What It Measures
GOVERN          Meta-assessment         Whether all governance evaluators are active
MAP             Factual Grounding       Risk of hallucination and ungrounded claims
MEASURE         PII + Bias Detection    Compliance with data protection and fairness requirements
MANAGE          Legal Privilege         Runtime prevention of privileged information disclosure
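
To illustrate how a weighted composite of this kind can be computed, here is a minimal sketch; the function name, default equal weighting, and example scores are illustrative assumptions, not the plugin's actual values:

```python
def nist_composite(scores, weights=None):
    """Weighted average of per-function scores, each normalized to [0, 1].

    `scores` maps NIST function names (GOVERN, MAP, MEASURE, MANAGE) to
    metric scores; `weights` defaults to equal weighting. The weights
    used by the plugin are configurable and not reproduced here.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Equal weighting across the four functions:
score = nist_composite({"GOVERN": 1.0, "MAP": 0.8, "MEASURE": 0.9, "MANAGE": 1.0})
```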

Example Usage

import mlflow
from mlflow_regulatory_compliance import (
    pii_detection_metric,
    legal_privilege_metric,
    factual_grounding_metric,
    bias_detection_metric,
    nist_composite_metric,
)

results = mlflow.evaluate(
    data=eval_df,
    predictions="predictions",
    extra_metrics=[
        pii_detection_metric,
        legal_privilege_metric,
        factual_grounding_metric,
        bias_detection_metric,
        nist_composite_metric,
    ],
)

print(f"NIST Compliance Score: {results.metrics['nist_compliance_score/v1/mean']:.2f}")

Tests

  • PII detection — all 9 categories, Luhn validation, false positives, edge cases
  • Legal privilege — all 3 categories, confidence scoring, false positives
  • Factual grounding — grounded, partial, hallucinated inputs
  • Bias detection — all 5 dimensions, sensitivity levels, custom terms
  • NIST composite — weighting, thresholds, aggregation
  • Compliance evaluator — metric selection, configuration
  • Report generator — output format, status determination, recommendations
  • Edge cases — empty input, None input, Unicode handling
  • 159 tests passing, ruff lint clean

Compatibility

  • MLflow >= 2.10
  • Python >= 3.9
  • No heavy ML dependencies in default mode (regex + pattern matching)

…gulatory-compliance

Initialise the mlflow-regulatory-compliance plugin package with pyproject.toml,
setup.py, requirements.txt, Apache 2.0 license, and shared utility modules
including PII/privilege regex patterns, Luhn validation, bias indicator lists,
and scoring helpers.

Detect 9 categories of personally identifiable information in model outputs:
email addresses, phone numbers (US/UK/international), SSN/NIN, credit card
numbers (with Luhn validation), physical addresses, names in identifying
context, dates of birth, IP addresses, and passport/driving licence numbers.
Returns per-row pii_detection_score with aggregate statistics.

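The SSN area-number validation mentioned above can be sketched as follows. This is a simplified standalone version: areas 000 and 666 and the 900-999 range are never issued, so matches with those prefixes are treated as false positives.

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def find_ssns(text):
    """Return SSN-shaped matches whose area number is plausibly valid."""
    hits = []
    for m in SSN_RE.finditer(text):
        area = int(m.group(1))
        # Areas 000, 666, and 900-999 are never assigned by the SSA.
        if area == 0 or area == 666 or area >= 900:
            continue
        hits.append(m.group(0))
    return hits
```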
Detect potentially privileged legal content across three categories:
attorney-client privilege, work product doctrine, and settlement/mediation
communications. Includes false positive mitigation for terms like
"attorney general" and confidence-weighted scoring.

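The false positive mitigation can be illustrated with a simple strip-then-match approach; the patterns below are illustrative stand-ins, not the plugin's actual pattern lists:

```python
import re

# Trigger phrases that suggest privileged content (illustrative subset).
PRIVILEGE_TERMS = re.compile(
    r"attorney[- ]client|work product|privileged and confidential", re.I
)
# Phrases containing a trigger word that are not privileged communications.
FALSE_POSITIVES = re.compile(r"attorney general", re.I)

def flag_privileged(text):
    """Return True if privilege markers appear outside known benign phrases."""
    cleaned = FALSE_POSITIVES.sub(" ", text)
    return bool(PRIVILEGE_TERMS.search(cleaned))
```

Removing the benign phrase before matching ensures "attorney general" never triggers the "attorney" patterns, without weakening detection elsewhere in the text.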
Measure how well model outputs are grounded in provided context for RAG
systems. Extracts claims via sentence segmentation, checks each against
context using combined token and n-gram overlap scoring. No heavy ML
dependencies required — uses regex-based tokenisation by default.

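A minimal sketch of the n-gram side of this scoring, assuming whitespace tokenisation and bigram overlap only (the plugin reportedly combines this with token-level overlap):

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def grounding_score(claim, context, n=2):
    """Fraction of the claim's n-grams that also appear in the context.

    1.0 means every claim n-gram is supported by the context;
    0.0 means the claim is fully unsupported.
    """
    claim_grams = ngrams(claim.lower().split(), n)
    if not claim_grams:
        return 0.0
    context_grams = ngrams(context.lower().split(), n)
    return len(claim_grams & context_grams) / len(claim_grams)
```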
Detect demographic bias across five dimensions: gender, racial/ethnic, age,
disability, and socioeconomic. Uses curated indicator lists with configurable
sensitivity levels (low/medium/high) and support for custom terms.

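One way such sensitivity levels can work is to vary the number of indicator hits required before a dimension is flagged; the thresholds and placeholder terms below are illustrative assumptions, not the plugin's curated lists:

```python
# Higher sensitivity flags on fewer indicator hits (assumed thresholds).
SENSITIVITY_MIN_HITS = {"low": 3, "medium": 2, "high": 1}

def bias_flags(text, indicators, sensitivity="medium"):
    """Flag each dimension whose indicator hits reach the sensitivity threshold.

    `indicators` maps a dimension name to a list of lowercase terms;
    in the plugin these would come from curated lists plus custom terms.
    """
    lowered = text.lower()
    threshold = SENSITIVITY_MIN_HITS[sensitivity]
    return {
        dim: sum(1 for term in terms if term in lowered) >= threshold
        for dim, terms in indicators.items()
    }
```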
Combine PII detection, legal privilege, factual grounding, and bias detection
into a single NIST AI RMF-aligned compliance score. Maps metrics to GOVERN,
MAP, MEASURE, and MANAGE functions with configurable weights and pass/fail
threshold determination.

Provide RegulatoryComplianceEvaluator class that produces a configurable list
of MLflow evaluation metrics for use with mlflow.evaluate(extra_metrics=...).
Supports enabling/disabling individual metrics and configuring bias sensitivity,
context columns, NIST thresholds, and custom weights.

Generate structured compliance reports mapping evaluation results to NIST AI
RMF functions. Outputs a pandas DataFrame with per-function scores, PASS/WARN/FAIL
status, remediation recommendations, and supporting evidence. Includes
to_dict() for mlflow.log_dict() integration.

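The PASS/WARN/FAIL status determination could look roughly like this; the thresholds and the list-of-dicts structure are illustrative assumptions (the plugin outputs a pandas DataFrame):

```python
def status_for(score, warn=0.7, fail=0.5):
    """Map a normalized compliance score to PASS/WARN/FAIL (assumed cutoffs)."""
    if score < fail:
        return "FAIL"
    if score < warn:
        return "WARN"
    return "PASS"

def build_report(function_scores):
    """Return one report row per NIST function with its score and status."""
    return [
        {"function": name, "score": round(score, 3), "status": status_for(score)}
        for name, score in function_scores.items()
    ]

report = build_report({"GOVERN": 1.0, "MAP": 0.62, "MEASURE": 0.41, "MANAGE": 0.9})
```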
159 tests covering all metrics individually and in combination:
- PII detection: all 9 categories, Luhn validation, false positives, edge cases
- Legal privilege: all 3 categories, confidence scoring, false positives
- Factual grounding: grounded, partial, hallucinated inputs, custom thresholds
- Bias detection: all 5 dimensions, sensitivity levels, custom terms
- NIST composite: weighting, thresholds, aggregation, function score structure
- Compliance evaluator: metric selection, configuration, enabled metrics
- Report generator: output format, status determination, recommendations
- Edge cases: empty input, None input, Unicode handling

Document all five compliance metrics, the RegulatoryComplianceEvaluator,
and the NIST report generator. Include quick start example, configuration
reference table, NIST AI RMF alignment mapping, and industry use cases
for insurance, legal, financial services, and healthcare AI.

@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@vercel

vercel bot commented Mar 25, 2026

@garyatwalAI is attempting to deploy a commit to the langfuse Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot bot added labels on Mar 25, 2026: size:XXL (This PR changes 1000+ lines, ignoring generated files), documentation (Improvements or additions to documentation), enhancement (New feature or request)
@garyatwalAI
Author

Hi @marcklingen @jannikmaierhoefer — would appreciate a review when you get a chance. This adds an mlflow-regulatory-compliance plugin package to the cookbook with NIST AI RMF-aligned evaluation metrics (PII detection, legal privilege detection, factual grounding, bias detection) that integrate with mlflow.evaluate(). All 159 tests passing. Thanks!
