feat: add regulatory compliance evaluation metrics with NIST AI RMF alignment #2717
Open
garyatwalAI wants to merge 12 commits into langfuse:main from
Conversation
…gulatory-compliance Initialise the mlflow-regulatory-compliance plugin package with pyproject.toml, setup.py, requirements.txt, Apache 2.0 license, and shared utility modules including PII/privilege regex patterns, Luhn validation, bias indicator lists, and scoring helpers.
Detect 9 categories of personally identifiable information in model outputs: email addresses, phone numbers (US/UK/international), SSN/NIN, credit card numbers (with Luhn validation), physical addresses, names in identifying context, dates of birth, IP addresses, and passport/driving licence numbers. Returns per-row pii_detection_score with aggregate statistics.
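The Luhn step described above can be sketched as a plain checksum filter. This is an illustrative implementation, not the plugin's actual code; the function name is assumed.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Used to reject regex matches that look like card numbers but fail
    the check digit, cutting false positives on random digit runs.
    """
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # shortest real card numbers are 13 digits
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A regex match such as "4111 1111 1111 1111" passes the checksum and is flagged as a card number, while a near-miss like "4111 1111 1111 1112" is discarded.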
Detect potentially privileged legal content across three categories: attorney-client privilege, work product doctrine, and settlement/mediation communications. Includes false positive mitigation for terms like "attorney general" and confidence-weighted scoring.
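The false-positive mitigation can be sketched as a two-pattern pass: broad indicator terms, minus matches inside known benign phrases such as "attorney general". The term lists and the saturating score below are assumptions for illustration; the plugin ships larger curated lists.

```python
import re

# Illustrative indicator terms, not the plugin's actual lists.
PRIVILEGE_TERMS = re.compile(
    r"\battorney\b|\bwork product\b|\bprivileged and confidential\b",
    re.IGNORECASE,
)
# Phrases that contain an indicator term but are not privileged content.
BENIGN = re.compile(r"\battorney general\b", re.IGNORECASE)

def privilege_score(text: str) -> float:
    """Confidence-weighted score in [0, 1]; 0 means no privileged content."""
    hits = len(PRIVILEGE_TERMS.findall(text))
    if hits == 0:
        return 0.0
    # Each benign phrase cancels one raw hit (false-positive mitigation).
    hits = max(0, hits - len(BENIGN.findall(text)))
    # Saturating confidence: several independent indicators raise the score.
    return min(1.0, hits / 3)
```

Under this sketch "The attorney general issued a statement." scores 0.0, while a memo marked privileged and confidential attorney work product saturates at 1.0.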
Measure how well model outputs are grounded in provided context for RAG systems. Extracts claims via sentence segmentation, checks each against context using combined token and n-gram overlap scoring. No heavy ML dependencies required — uses regex-based tokenisation by default.
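The combined token and n-gram overlap check can be sketched as below. Function names and the equal weighting of the two overlaps are assumptions; the plugin's exact formula may differ.

```python
import re

def _tokens(text: str) -> list:
    # Regex-based tokenisation: no heavy ML dependencies required.
    return re.findall(r"[a-z0-9']+", text.lower())

def _ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def grounding_score(claim: str, context: str, n: int = 2) -> float:
    """Fraction of a claim supported by the context, in [0, 1].

    Averages unigram token overlap with n-gram overlap so that both
    shared vocabulary and shared phrasing contribute.
    """
    c_tok, ctx_tok = _tokens(claim), _tokens(context)
    if not c_tok:
        return 0.0
    tok_overlap = len(set(c_tok) & set(ctx_tok)) / len(set(c_tok))
    c_ng, ctx_ng = _ngrams(c_tok, n), _ngrams(ctx_tok, n)
    ng_overlap = len(c_ng & ctx_ng) / len(c_ng) if c_ng else tok_overlap
    return (tok_overlap + ng_overlap) / 2
```

A claim copied verbatim from the context scores 1.0; a hallucinated claim shares at most a few stopwords and scores near 0.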
Detect demographic bias across five dimensions: gender, racial/ethnic, age, disability, and socioeconomic. Uses curated indicator lists with configurable sensitivity levels (low/medium/high) and support for custom terms.
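The sensitivity levels can be read as thresholds on indicator counts: a higher sensitivity flags a dimension on fewer matches. The tiny lists and thresholds below are stand-ins for illustration, not the plugin's curated data.

```python
# Illustrative indicator lists; the plugin ships larger curated lists
# covering all five dimensions.
BIAS_INDICATORS = {
    "gender": {"bossy", "hysterical"},
    "age": {"over the hill", "digital native"},
}
# Higher sensitivity means fewer matches are needed to raise a flag.
SENSITIVITY_THRESHOLDS = {"low": 3, "medium": 2, "high": 1}

def bias_flags(text: str, sensitivity: str = "medium",
               custom_terms=None) -> dict:
    """Flag each dimension whose indicator count meets the threshold."""
    lowered = text.lower()
    threshold = SENSITIVITY_THRESHOLDS[sensitivity]
    indicators = dict(BIAS_INDICATORS)
    if custom_terms:
        indicators["custom"] = set(custom_terms)
    return {
        dim: sum(term in lowered for term in terms) >= threshold
        for dim, terms in indicators.items()
    }
```

At high sensitivity a single matched term flags the dimension; at low sensitivity the same text passes unless several indicators co-occur.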
Combine PII detection, legal privilege, factual grounding, and bias detection into a single NIST AI RMF-aligned compliance score. Maps metrics to GOVERN, MAP, MEASURE, and MANAGE functions with configurable weights and pass/fail threshold determination.
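The composite can be sketched as a weighted average with a pass/fail threshold. The default weights and threshold below are assumptions for illustration; the PR makes both configurable.

```python
# Assumed defaults for illustration; the plugin accepts custom weights
# and a custom pass/fail threshold.
DEFAULT_WEIGHTS = {"pii": 0.3, "privilege": 0.2, "grounding": 0.3, "bias": 0.2}

def nist_composite(scores: dict, weights=None, threshold: float = 0.8) -> dict:
    """Weighted compliance score in [0, 1] plus pass/fail determination.

    `scores` holds per-metric compliance scores where 1.0 means fully
    compliant (e.g. no PII found, fully grounded output).
    """
    w = weights or DEFAULT_WEIGHTS
    composite = sum(scores[m] * w[m] for m in w) / sum(w.values())
    return {"composite_score": round(composite, 4),
            "passed": composite >= threshold}
```

For example, perfect PII, privilege, and bias scores with grounding at 0.5 yield a composite of 0.85, which passes the 0.8 threshold; grounding at 0.0 drops the composite to 0.7 and fails.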
Provide RegulatoryComplianceEvaluator class that produces a configurable list of MLflow evaluation metrics for use with mlflow.evaluate(extra_metrics=...). Supports enabling/disabling individual metrics and configuring bias sensitivity, context columns, NIST thresholds, and custom weights.
Generate structured compliance reports mapping evaluation results to NIST AI RMF functions. Outputs a pandas DataFrame with per-function scores, PASS/WARN/FAIL status, remediation recommendations, and supporting evidence. Includes to_dict() for mlflow.log_dict() integration.
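The per-function PASS/WARN/FAIL determination can be sketched as status bands over function scores. The band boundaries here are assumptions; the plugin's actual thresholds may differ, and its real output adds recommendations and evidence columns.

```python
def nist_report(function_scores: dict) -> list:
    """One row per NIST AI RMF function with a PASS/WARN/FAIL status.

    The rows feed directly into pandas.DataFrame(rows), and a
    to_dict()-style wrapper makes them loggable via mlflow.log_dict().
    """
    def status(score: float) -> str:
        # Assumed bands for illustration: >=0.8 PASS, >=0.6 WARN, else FAIL.
        if score >= 0.8:
            return "PASS"
        if score >= 0.6:
            return "WARN"
        return "FAIL"

    return [
        {"function": fn, "score": score, "status": status(score)}
        for fn, score in function_scores.items()
    ]
```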
159 tests covering all metrics individually and in combination:
- PII detection: all 9 categories, Luhn validation, false positives, edge cases
- Legal privilege: all 3 categories, confidence scoring, false positives
- Factual grounding: grounded, partial, hallucinated inputs, custom thresholds
- Bias detection: all 5 dimensions, sensitivity levels, custom terms
- NIST composite: weighting, thresholds, aggregation, function score structure
- Compliance evaluator: metric selection, configuration, enabled metrics
- Report generator: output format, status determination, recommendations
- Edge cases: empty input, None input, Unicode handling
Document all five compliance metrics, the RegulatoryComplianceEvaluator, and the NIST report generator. Include quick start example, configuration reference table, NIST AI RMF alignment mapping, and industry use cases for insurance, legal, financial services, and healthcare AI.
@garyatwalAI is attempting to deploy a commit to the langfuse Team on Vercel. A member of the Team first needs to authorize it.
Author
Hi @marcklingen @jannikmaierhoefer — would appreciate a review when you get a chance. This adds an
Summary
Adds a new mlflow-regulatory-compliance plugin package (under cookbook/) providing four governance-focused evaluation metrics for LLM and ML model assessment: PII detection, legal privilege detection, factual grounding, and bias detection. Includes a composite NIST AI RMF compliance score and a structured compliance report generator, integrated with mlflow.evaluate() via the standard make_metric()/extra_metrics interface.
Motivation
Organisations deploying AI in regulated industries (legal services, insurance, financial services, healthcare) need to evaluate models against governance and compliance criteria — not just accuracy and latency. Current MLflow evaluation metrics cover model quality but lack coverage for regulatory compliance dimensions that are increasingly required by frameworks like the NIST AI Risk Management Framework, ISO 42001, the EU AI Act, and state-level AI laws in Colorado, Texas, and California.
This plugin fills that gap by providing compliance evaluation metrics that integrate directly into existing MLflow evaluation workflows, so teams can assess governance alongside performance without switching tools.
NIST AI RMF Mapping
Example Usage
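A minimal sketch of the intended workflow, based on the description above. The import path, the constructor options, and the get_metrics() accessor are assumptions about the plugin's API, not confirmed names; only mlflow.evaluate() and its extra_metrics parameter are standard MLflow.

```python
import mlflow
import pandas as pd
from mlflow_regulatory_compliance import RegulatoryComplianceEvaluator  # assumed import path

# Options follow the PR description (bias sensitivity, context column,
# NIST thresholds); the exact signature is the plugin's, not MLflow's.
evaluator = RegulatoryComplianceEvaluator(bias_sensitivity="medium")

eval_df = pd.DataFrame({
    "inputs": ["Summarise the claim file."],
    "outputs": ["The claimant, john@example.com, reported water damage."],
    "context": ["Claim file: water damage reported in March."],
})

results = mlflow.evaluate(
    data=eval_df,
    predictions="outputs",
    extra_metrics=evaluator.get_metrics(),  # accessor name assumed
)
print(results.metrics)
```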
Tests
Compatibility