# GovOps: KPIs and Dashboards
## KPIs

### Policy Store Health Score
- Description: Composite metric of validation pass rate, conflict detection results, and Cedar analysis outcomes.
- Calculation: Weighted average of (validation pass rate × 0.4) + (conflict-free rate × 0.3) + (Cedar analysis pass rate × 0.3); see the sketch below
- Target: ≥ 95% health score
- Measurement: Real-time calculation across all policy stores, updated every 5 minutes
- Alert Threshold: < 90% triggers warning, < 85% triggers critical alert
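
A minimal sketch of this composite, assuming the three input rates arrive as fractions in [0, 1]; the function names and the alert mapping are illustrative, not an existing API:

```python
def policy_store_health_score(validation_pass_rate: float,
                              conflict_free_rate: float,
                              cedar_analysis_pass_rate: float) -> float:
    """Weighted composite per the formula above; inputs are fractions in [0, 1]."""
    return (validation_pass_rate * 0.4
            + conflict_free_rate * 0.3
            + cedar_analysis_pass_rate * 0.3)

def health_alert_level(score: float) -> str:
    """Map a score to the alert tiers defined above."""
    if score < 0.85:
        return "critical"
    if score < 0.90:
        return "warning"
    return "ok"

score = policy_store_health_score(0.98, 0.96, 0.94)
print(f"{score:.1%}", health_alert_level(score))  # 96.2% ok -> above the 95% target
```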
### Policy Evaluation Latency (P99)
- Description: Tracks policy evaluation response times to verify sub-second enforcement across all Cedarling instances.
- Calculation: 99th percentile of policy evaluation response times across all Cedarling instances (see the sketch below)
- Target: P99 < 100ms for standard policies, P99 < 500ms for complex policies with external data
- Measurement: Continuous monitoring via Hub System log collection, aggregated hourly
- Alert Threshold: P99 > 200ms triggers performance investigation
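
A minimal sketch of the percentile step, using the nearest-rank method over a batch of latencies pulled from Hub System logs; the sample data is hypothetical:

```python
import math

def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of evaluation latencies, in milliseconds."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical hourly sample: 985 fast evaluations and 15 slow ones.
sample = [12.0] * 985 + [180.0] * 15
print(p99(sample))  # 180.0 -> the slow tail breaches the 100ms standard-policy target
```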
### Policy Drift Incidents per Day
- Description: Counts divergence between deployed stores and GitHub Releases, including stale versions still in use.
- Calculation: Daily count of Cedarling instances running policy versions older than the latest GitHub Release (see the sketch below)
- Target: 0 drift incidents per day
- Measurement: Comparison of deployed versions (from Hub System) vs. latest GitHub Release tags
- Alert Threshold: > 0 incidents triggers immediate notification
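
A minimal sketch of the comparison, assuming the Hub System exposes a per-instance version map; the instance IDs are hypothetical, and a real implementation would order release tags semantically rather than exact-match:

```python
def drift_incidents(deployed: dict[str, str], latest_release: str) -> list[str]:
    """Cedarling instances whose policy version differs from the latest release tag."""
    return [instance for instance, version in deployed.items()
            if version != latest_release]

deployed = {"cedarling-01": "v2.4.0", "cedarling-02": "v2.3.1", "cedarling-03": "v2.4.0"}
stale = drift_incidents(deployed, latest_release="v2.4.0")
print(len(stale), stale)  # 1 ['cedarling-02'] -> breaches the zero-drift target
```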
### Policy Conflict Detection Rate
- Description: Percentage of newly authored policies that introduce conflicts before merge.
- Calculation: (Number of policies with detected conflicts / Total policies authored) × 100
- Target: < 5% conflict rate
- Measurement: Pre-merge analysis via Cedar Analysis Tools integration
- Alert Threshold: > 10% triggers review of policy authoring process
### Policy Regression Failure Rate
- Description: Percentage of policy changes that break historical decisions in regression testing.
- Calculation: (Number of policy changes causing regression failures / Total policy changes) × 100
- Target: < 2% regression failure rate
- Measurement: Automated regression testing against historical decision log before deployment
- Alert Threshold: > 5% triggers mandatory policy review workflow
### Policy Store Adoption Rate
- Description: Percentage of AI agents and infrastructure components actively using policy stores.
- Calculation: (Agents with active policy enforcement / Total registered agents) × 100
- Target: ≥ 98% adoption rate
- Measurement: Hub System tracking of Cedarling instance registration and activity
- Alert Threshold: < 95% triggers investigation of non-adopting agents
### Control Coverage Percentage
- Description: Percentage of controls (from profiles/baselines) mapped to component definitions, policies, and evidence.
- Calculation: (Controls with complete mappings / Total controls in active profiles) × 100
- Target: ≥ 90% coverage for critical controls, ≥ 75% for all controls
- Measurement: OSCAL artifact analysis across catalogs, profiles, and component definitions
- Alert Threshold: < 80% triggers coverage gap analysis
### Compliance Drift Frequency
- Description: Frequency of infrastructure components losing compliance over time.
- Calculation: (Components transitioning from compliant to non-compliant per week / Total components) × 100
- Target: < 5% of components per week
- Measurement: Continuous compliance assessment engine comparing current state to baseline
- Alert Threshold: > 10% triggers automated remediation workflow
### Automated Evidence Collection Completeness
- Description: Percentage of evidence items generated automatically rather than collected manually.
- Calculation: (Automated evidence items / Total evidence items) × 100
- Target: ≥ 85% automated evidence collection
- Measurement: Evidence collection engine tracking source (automated vs. manual) per control
- Alert Threshold: < 75% triggers review of automation opportunities
### Assessment Pass Rate (per Framework)
- Description: Percentage of OSCAL profile controls passing automated evaluation (NIST, ISO, GDPR, custom frameworks).
- Calculation: (Passing controls / Total controls in framework profile) × 100, calculated per framework
- Target: ≥ 95% pass rate per framework
- Measurement: Continuous compliance assessment engine evaluating controls against infrastructure
- Alert Threshold: < 90% triggers compliance remediation workflow
### Mean Time to Compliance Remediation (MTCR)
- Description: Average time from compliance violation detection to remediation completion.
- Calculation: Sum of remediation times / Number of remediated violations (see the sketch below)
- Target: < 24 hours for critical violations, < 72 hours for standard violations
- Measurement: Compliance engine tracking violation timestamps and remediation completion
- Alert Threshold: > 48 hours triggers escalation to governance officers
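
A minimal sketch of the averaging step, assuming each violation record carries detection and remediation timestamps; the records below are hypothetical:

```python
from datetime import datetime, timedelta
from statistics import mean

def mtcr(violations: list[dict]) -> timedelta:
    """Mean time from detection to remediation across remediated violations."""
    durations = [(v["remediated_at"] - v["detected_at"]).total_seconds()
                 for v in violations if v.get("remediated_at")]
    return timedelta(seconds=mean(durations))

violations = [
    {"detected_at": datetime(2025, 1, 6, 9, 0), "remediated_at": datetime(2025, 1, 6, 21, 0)},
    {"detected_at": datetime(2025, 1, 7, 8, 0), "remediated_at": datetime(2025, 1, 8, 14, 0)},
]
print(mtcr(violations))  # 21:00:00 -> within the 24-hour critical-violation target
```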
### OSCAL Artifact Freshness
- Description: Percentage of OSCAL artifacts (catalogs, profiles, component definitions) updated within last 90 days.
- Calculation: (Artifacts updated in last 90 days / Total active artifacts) × 100
- Target: ≥ 80% freshness rate
- Measurement: Version control tracking of OSCAL artifact modification dates
- Alert Threshold: < 70% triggers artifact review workflow
### Schema Compatibility Score
- Description: Measures backward/forward compatibility across schema versions, including number of breaking changes avoided.
- Calculation: (Compatible schema transitions / Total schema transitions) × 100, weighted by breaking change severity (see the sketch below)
- Target: ≥ 95% compatibility score
- Measurement: Schema registry compatibility analysis during version transitions
- Alert Threshold: < 90% triggers schema review and migration planning
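
The severity weighting is not pinned down above, so this sketch fixes one possible reading: each transition is penalized by an assumed per-severity weight, and the score is one minus the average penalty:

```python
# Assumed weights for illustration; the definition above only says transitions
# are "weighted by breaking change severity".
SEVERITY_WEIGHT = {"none": 0.0, "minor": 0.5, "major": 1.0}

def compatibility_score(transition_severities: list[str]) -> float:
    """1.0 means every transition was compatible; 0.0 means every one was breaking."""
    if not transition_severities:
        return 1.0
    penalty = sum(SEVERITY_WEIGHT[s] for s in transition_severities)
    return 1.0 - penalty / len(transition_severities)

print(f"{compatibility_score(['none', 'none', 'minor', 'none']):.1%}")  # 87.5%
```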
### Schema Validation Error Rate
- Description: Percentage of policy data validation failures due to schema mismatches.
- Calculation: (Schema validation errors / Total validation attempts) × 100
- Target: < 1% error rate
- Measurement: Protobuf validator tracking validation outcomes during policy evaluation
- Alert Threshold: > 2% triggers schema documentation and migration support
### Schema Version Adoption Rate
- Description: Percentage of policies using latest schema versions within 30 days of release.
- Calculation: (Policies on the latest schema version within 30 days of its release / Total policies) × 100
- Target: ≥ 85% adoption within 30 days
- Measurement: Schema registry tracking policy-to-schema version mappings
- Alert Threshold: < 75% triggers migration assistance workflow
### Policy Distribution Success Rate
- Description: Percentage of successful policy store deployments to Cedarling instances.
- Calculation: (Successful deployments / Total deployment attempts) × 100
- Target: ≥ 99% success rate
- Measurement: Hub System tracking deployment outcomes per Cedarling instance
- Alert Threshold: < 95% triggers distribution failure investigation
### Log Ingestion Throughput
- Description: Number of decision logs successfully ingested per second from all Cedarling instances.
- Calculation: Total logs ingested / Time period (logs per second)
- Target: ≥ 10,000 logs/second sustained throughput
- Measurement: Hub System log ingestion metrics
- Alert Threshold: < 8,000 logs/second triggers capacity scaling
### Log Ingestion Completeness
- Description: Percentage of expected logs successfully collected from Cedarling instances.
- Calculation: (Logs received / Expected logs based on agent activity) × 100 (see the sketch below)
- Target: ≥ 99.9% completeness
- Measurement: Comparison of expected log volume (from agent activity) vs. actual ingestion
- Alert Threshold: < 99% triggers connectivity and buffering investigation
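
A minimal sketch of the comparison, assuming the expected log volume is derived from agent activity counts; the figures are hypothetical:

```python
def ingestion_completeness(received: int, expected: int) -> float:
    """Fraction of expected decision logs actually ingested."""
    return received / expected if expected else 1.0

# Hypothetical hour: agent activity implies 3,600,000 logs; 3,598,200 arrived.
rate = ingestion_completeness(3_598_200, 3_600_000)
print(f"{rate:.3%}")  # 99.950% -> above the 99.9% target, no alert
```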
### Policy Store Release Adoption Time
- Description: Average time from GitHub Release to 95% of Cedarling instances adopting the new version.
- Calculation: Time difference between release timestamp and 95% adoption timestamp (see the sketch below)
- Target: < 4 hours for standard releases, < 1 hour for critical security updates
- Measurement: Hub System tracking version adoption across Cedarling instances
- Alert Threshold: > 8 hours triggers distribution optimization review
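
A minimal sketch, assuming the Hub System records one upgrade timestamp per Cedarling instance; the function and field names are illustrative:

```python
import math
from datetime import datetime

def adoption_time_hours(release_at: datetime,
                        upgrade_times: list[datetime],
                        fleet_size: int,
                        quorum: float = 0.95) -> float | None:
    """Hours from GitHub Release until `quorum` of the fleet runs the new
    version; None if the quorum has not been reached yet."""
    needed = math.ceil(fleet_size * quorum)
    done = sorted(upgrade_times)
    if len(done) < needed:
        return None
    return (done[needed - 1] - release_at).total_seconds() / 3600

release = datetime(2025, 1, 6, 12, 0)
upgrades = [datetime(2025, 1, 6, 12, 30), datetime(2025, 1, 6, 13, 0),
            datetime(2025, 1, 6, 14, 45)]
print(adoption_time_hours(release, upgrades, fleet_size=3))  # 2.75 -> under the 4-hour target
```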
### Audit Trail Completeness
- Description: Percentage of AI agent actions and policy decisions captured in audit logs.
- Calculation: (Logged actions / Total agent actions) × 100
- Target: 100% completeness (all actions logged)
- Measurement: Audit engine comparison of agent activity vs. log entries
- Alert Threshold: < 99.9% triggers critical audit integrity investigation
### Anomaly Detection Accuracy
- Description: Percentage of detected anomalies that are confirmed as actual violations or risks.
- Calculation: (Confirmed anomalies / Total detected anomalies) × 100
- Target: ≥ 80% accuracy (precision)
- Measurement: Audit engine tracking anomaly confirmation by governance officers
- Alert Threshold: < 70% triggers anomaly detection model tuning
### Mean Time to Detection (MTTD)
- Description: Average time from policy violation or compliance failure to detection.
- Calculation: Sum of detection times / Number of incidents
- Target: < 5 minutes for critical violations, < 1 hour for standard violations
- Measurement: Audit engine tracking violation timestamps vs. detection timestamps
- Alert Threshold: > 15 minutes triggers real-time monitoring optimization
### Audit Log Query Performance (P95)
- Description: 95th percentile response time for audit log queries and investigations.
- Calculation: 95th percentile of query response times across all audit log queries
- Target: P95 < 2 seconds for standard queries, P95 < 10 seconds for complex investigations
- Measurement: Analytics engine tracking query performance metrics
- Alert Threshold: P95 > 5 seconds triggers query optimization
### Agent Risk Level Distribution
- Description: Percentage of AI agents operating at each risk level (low, medium, high, critical).
- Calculation: (Agents at risk level / Total agents) × 100, calculated per risk level (see the sketch below)
- Target: ≥ 80% at low risk, < 5% at high/critical risk
- Measurement: Governance analysis engine classifying agents based on behavior patterns
- Alert Threshold: > 10% at high/critical triggers immediate governance review
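
A minimal sketch of the per-level breakdown, assuming the governance analysis engine has already labeled each agent; the fleet snapshot is hypothetical:

```python
from collections import Counter

def risk_distribution(agent_risk_levels: list[str]) -> dict[str, float]:
    """Share of agents at each risk level, as fractions of the fleet."""
    counts = Counter(agent_risk_levels)
    total = len(agent_risk_levels)
    return {level: counts.get(level, 0) / total
            for level in ("low", "medium", "high", "critical")}

dist = risk_distribution(["low"] * 42 + ["medium"] * 6 + ["high"] * 2)
print(dist)  # {'low': 0.84, 'medium': 0.12, 'high': 0.04, 'critical': 0.0}
# 84% low and 4% high+critical -> meets the >=80% low / <5% high/critical targets
```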
### Policy Violation Rate per Agent
- Description: Average number of policy violations per agent per day.
- Calculation: Total violations / (Number of agents × Days)
- Target: < 0.1 violations per agent per day
- Measurement: Audit engine counting violations from decision logs
- Alert Threshold: > 1 violation per agent per day triggers agent-specific review
### Sensitive Action Anomaly Detection Rate
- Description: Percentage of sensitive actions (high-risk operations) flagged as anomalies.
- Calculation: (Anomalous sensitive actions / Total sensitive actions) × 100
- Target: < 5% anomaly rate (most sensitive actions should be expected)
- Measurement: Anomaly detection engine analyzing temporal patterns of sensitive actions
- Alert Threshold: > 10% triggers review of agent behavior patterns and policies
### Agent-to-Resource Interaction Coverage
- Description: Percentage of agent-resource interactions monitored and evaluated.
- Calculation: (Monitored interactions / Total agent-resource interactions) × 100
- Target: 100% coverage (all interactions monitored)
- Measurement: Governance analysis engine tracking interaction graph completeness
- Alert Threshold: < 99% triggers monitoring gap investigation
### Dashboard Load Time (P95)
- Description: 95th percentile time to load and render dashboard visualizations.
- Calculation: 95th percentile of dashboard load times across all dashboard types
- Target: P95 < 3 seconds for standard dashboards, P95 < 5 seconds for complex executive dashboards
- Measurement: Frontend performance monitoring of dashboard rendering
- Alert Threshold: P95 > 5 seconds triggers dashboard optimization
### Real-Time Update Latency
- Description: Average delay between data change and dashboard update.
- Calculation: Average time difference between data event and dashboard refresh
- Target: < 5 seconds for real-time dashboards
- Measurement: WebSocket connection monitoring and data freshness tracking
- Alert Threshold: > 10 seconds triggers real-time update optimization
### KPI Calculation Accuracy
- Description: Percentage of KPI calculations verified as correct through manual audit.
- Calculation: (Verified correct KPIs / Total KPIs audited) × 100
- Target: 100% accuracy (all KPIs mathematically correct)
- Measurement: Periodic manual verification of KPI calculations against source data
- Alert Threshold: < 99% triggers KPI calculation engine review
### Unified Governance Score
- Description: Composite score combining policy health, compliance posture, and schema integrity.
- Calculation: Weighted average of (Policy Store Health Score × 0.4) + (Compliance Pass Rate × 0.4) + (Schema Compatibility Score × 0.2); see the sketch below
- Target: ≥ 90% unified score
- Measurement: Dashboard KPI Engine calculating composite metric hourly
- Alert Threshold: < 85% triggers executive governance review, < 80% triggers critical governance incident
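
A minimal sketch of the composite, reusing the component scores computed by the earlier sketches; inputs are fractions in [0, 1]:

```python
def unified_governance_score(policy_health: float,
                             compliance_pass_rate: float,
                             schema_compatibility: float) -> float:
    """Composite of the three component scores, per the weights above."""
    return (policy_health * 0.4
            + compliance_pass_rate * 0.4
            + schema_compatibility * 0.2)

score = unified_governance_score(0.962, 0.97, 0.875)
print(f"{score:.1%}")  # 94.8% -> above the 90% target, no governance review triggered
```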
### Top Risk Resolution Time
- Description: Average time to resolve top 5 identified governance risks.
- Calculation: Sum of resolution times for top 5 risks / 5
- Target: < 48 hours for critical risks, < 1 week for standard risks
- Measurement: Governance command center tracking risk lifecycle from identification to resolution
- Alert Threshold: > 72 hours triggers escalation workflow
### Audit Integrity Score
- Description: Composite metric of log completeness, immutability verification, and cryptographic integrity.
- Calculation: Weighted average of (Log Completeness × 0.5) + (Immutability Verification × 0.3) + (Cryptographic Integrity × 0.2)
- Target: 100% integrity score
- Measurement: Audit engine performing continuous integrity checks
- Alert Threshold: < 99.9% triggers critical audit integrity investigation
## Dashboards

### Policy Store Health Dashboard
**Purpose:** Show the correctness, deployability, and health of all policy stores.

**Visuals:**
- Validation pass/fail trendline
- Conflict detection heatmap
- Cedar Analysis results (always-allow / always-deny / unreachable rules)
- Version distribution across agents
- GitHub Release version adoption timeline
### Cedarling Runtime Dashboard
**Purpose:** Show the live state of Cedarling agents and enforcement.

**Visuals:**
- Policy decision latency (P50/P90/P99)
- Policy evaluation counts per agent
- Decision breakdown (permit/deny/error)
- Network partition map + “offline but enforcing cached policies”
### Control Mapping Dashboard
**Purpose:** Demonstrate traceability from controls → policies → components.

**Visuals:**
- Controls mapped / unmapped
- Baseline coverage by control severity
- Components missing mappings
- Drift detection over time
### Compliance Posture Dashboard
**Purpose:** Show real-time compliance posture.

**Visuals:**
- Compliance score (per framework)
- Per-component global compliance state
- Failed controls trend
- Remediation workflow status
### Evidence Collection Dashboard
**Purpose:** Visualize automated evidence generation.

**Visuals:**
- Auto vs manual evidence volume
- Evidence freshness (last update timestamps)
- Evidence coverage per control
- Downloadable OSCAL reports
### Schema Health Dashboard
**Purpose:** Show the health of Protobuf schemas governing typed policy data.

**Visuals:**
- Schema version lineage graph
- Compatibility matrix (color-coded breaking vs non-breaking)
- Schema usage across policies
- Schema validation errors over time
### AI Agent Behavior Dashboard
**Purpose:** Provide continuous oversight of AI agent decisioning.

**Visuals:**
- Per-agent risk level distribution
- Violations triggered by agents
- Temporal anomaly detection (spikes in sensitive actions)
- Agent-to-resource interaction graph
```mermaid
flowchart TB
subgraph Agents["AI Agents"]
A1["AI Agent 1"]
A2["AI Agent 2"]
A3["AI Agent N"]
end
subgraph Cedarling["Cedarling Enforcement Points"]
C1["Policy Evaluation\n(permit/deny/error)"]
C2["Risk Attribute Extraction"]
C3["Action & Resource Logging"]
end
subgraph Hub["Hub System"]
H1["Decision Logs"]
H2["Contextual Metadata"]
H3["Behavior Telemetry"]
end
subgraph Analysis["Governance Analysis Engine"]
G1["Risk Level Classification"]
G2["Anomaly Detection"]
G3["Violation Detection"]
G4["Agent-to-Resource Graph Builder"]
end
subgraph Dashboard["AI Agent Behavior Dashboard"]
D1["Real-Time Agent Risk Levels"]
D2["Policy Violations Timeline"]
D3["Anomaly Spike Chart"]
D4["Agent ↔ Resource Interaction Graph"]
end
%% Connections
A1 --> C1
A1 --> C2
A1 --> C3
A2 --> C1
A2 --> C2
A2 --> C3
A3 --> C1
A3 --> C2
A3 --> C3
C1 --> H1
C2 --> H2
C3 --> H3
H1 --> G3
H2 --> G1
H3 --> G2
H3 --> G4
G1 --> D1
G3 --> D2
G2 --> D3
G4 --> D4
```
### Distribution & Ingestion Dashboard
**Purpose:** Track policy distribution and log ingestion health.

**Visuals:**
- Policy store release adoption map
- Cedarling upgrade/status grid
- Log ingestion throughput and backlog
- Missed log packets or retries
### Risk & Incident Dashboard
**Purpose:** Identify risks before they become incidents.

**Visuals:**
- Drift score per component or agent
- Incidents by severity (policy violations, compliance failures)
- Root-cause analysis highlights
- MTTR (mean time to remediation)
### Executive Governance Dashboard
**Purpose:** Provide a one-page view for the CISO and governance officers.

**Visuals:**
- Unified Governance Score (policy + compliance + schema)
- Top 5 risks
- Top non-compliant components
- Policy conflicts/violations trend
- Compliance framework posture matrix
- Audit integrity and log completeness