Formal verification framework for dataset lineage, policy compliance, and training-time safety guarantees.
| Tag | Outcome | Success Metric |
|---|---|---|
| DSS-1 | Lineage Proof Framework — Lean proofs that every ETL transform is hash-chain-consistent | 100% of sample ETL DAG nodes emit a T(d) hash matching the formal spec; regression test passes on a 10k-row dataset |
| DSS-2 | Policy Filter Pack covering HIPAA PHI, COPPA, GDPR minor-data, and custom regex rules | ≥99% recall and zero false negatives on the curated PHI+COPPA benchmark |
| DSS-3 | Optimizer-Invariant Suite for SGD, AdamW, Lion (β-version) | Proof scripts compile in <3 s; runbook shows the proven bound rules out gradient explosion for the given hyperparameters |
| DSS-4 | Shape-Safety Verifier that consumes ONNX / PyTorch FX graphs, generates a Lean shape spec, and proves ∀ ℓ, out=shapeSpec… | End-to-end demo verifies the GPT-2 (124M) computation graph in <45 s |
| DSS-5 | ETL Guard Extractor (Rust/Python) auto-generated from Lean predicates | Guard library adds <5 ms overhead per 1M rows; published to PyPI & crates.io |
```bash
# Clone the repository
git clone https://github.com/SentinelOps-CI/dataset-safety-specs.git

# Install dependencies
pip install -r requirements.txt

# Build the project
lake build

# Run tests
lake exe test_suite

# Run benchmarks
lake exe benchmark_suite

# Extract guards from Lean predicates
lake exe extract_guard

# Verify shape safety of an ONNX model
lake exe shapesafe_verify model.onnx

# Convert an ONNX model to a Lean shape spec
python python/onnx2lean_shape.py model.onnx output.lean

# Create a distributable bundle
./bundle.sh bundle
```

- Dataset Lineage Module: Hash-chain consistent ETL transforms
- Data Policy Filters: HIPAA PHI, COPPA, GDPR compliance
- Optimizer Invariants: Formal proofs for SGD, AdamW, Lion stability
- Shape Safety Verifier: ONNX/PyTorch graph verification
- Guard Extractor: Auto-generated Rust/Python guards
Formal verification of ETL transform hash-chain consistency.
```lean
-- Create a dataset
let dataset := Lineage.mk_dataset ["row1", "row2", "row3"]

-- Apply a transform
let transform := {
  name := "filter_even"
  params := ["even_only"]
  transform_fn := fun d => { d with data := d.data.filter isEven }
}

-- Verify the transform
let verified := Lineage.verify_transform transform
```
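For intuition, here is a minimal Python sketch of the hash-chain idea the lineage spec formalizes: each transform's output hash is bound to the previous hash and the transform that produced it. The helper `chain_hash` and the exact hashing scheme are illustrative assumptions, not the spec's T(d) definition.

```python
import hashlib

def chain_hash(prev_hash: str, transform_name: str, rows: list[str]) -> str:
    """Hash-chain step (illustrative): bind the new dataset state to the
    previous hash and the transform that produced it."""
    payload = prev_hash + transform_name + "".join(rows)
    return hashlib.sha256(payload.encode()).hexdigest()

# Genesis hash over the raw dataset
rows = ["row1", "row2", "row3"]
h0 = hashlib.sha256("".join(rows).encode()).hexdigest()

# Apply a transform and extend the chain
filtered = [r for r in rows if r != "row2"]
h1 = chain_hash(h0, "filter_even", filtered)

# Replaying the same transform reproduces the same hash; tampering with
# either the data or the recorded lineage breaks the chain.
assert h1 == chain_hash(h0, "filter_even", filtered)
```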
HIPAA PHI, COPPA, and GDPR compliance filters.

```lean
-- Check for PHI
let row := {
  phi := ["SSN: 123-45-6789"]
  age := some 25
  gdpr_special := []
  custom_fields := []
}

let has_phi := Policy.has_phi row
let is_minor := Policy.is_minor row
```
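As a rough illustration, a Python version of the two predicates might look like the sketch below. The SSN pattern and the under-13 cutoff (with unknown ages treated as minors, a fail-closed choice) are assumptions for illustration, not the shipped rule set.

```python
import re
from typing import Optional

# Illustrative PHI pattern; the real filter pack covers far more rules.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def has_phi(fields: list[str]) -> bool:
    """Return True if any field matches a PHI pattern (illustrative)."""
    return any(SSN_PATTERN.search(field) for field in fields)

def is_minor(age: Optional[int]) -> bool:
    """COPPA-style check: under 13, treating unknown ages as minors."""
    return age is None or age < 13

row = {"phi": ["SSN: 123-45-6789"], "age": 25}
print(has_phi(row["phi"]))   # True
print(is_minor(row["age"]))  # False
```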
Formal proofs for optimizer stability.

```lean
-- SGD energy stability theorem
theorem sgd_energy_stability
    (w : Vector) (grad : Gradient) (lr : LearningRate)
    (h_lr : lr > 0) (h_lr_bound : lr ≤ 1) :
    let w_new := sgd_update w grad lr
    energy w_new ≤ energy w := by sorry
```
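The theorem's shape can be sanity-checked numerically on a toy quadratic objective E(w) = ½‖w‖², whose gradient is w itself; `energy` and `sgd_update` below are stand-ins for the Lean definitions, not extracted code.

```python
import numpy as np

def energy(w: np.ndarray) -> float:
    """Toy energy E(w) = 0.5 * ||w||^2, so grad E(w) = w."""
    return 0.5 * float(w @ w)

def sgd_update(w: np.ndarray, grad: np.ndarray, lr: float) -> np.ndarray:
    """Plain SGD step."""
    return w - lr * grad

rng = np.random.default_rng(0)
w = rng.normal(size=8)
for _ in range(100):
    w_new = sgd_update(w, w, lr=0.5)   # gradient of the toy energy is w
    assert energy(w_new) <= energy(w)  # the stability property, checked numerically
    w = w_new
print(f"final energy: {energy(w):.3e}")
```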
ONNX/PyTorch graph shape verification.

```bash
# Convert an ONNX model to a Lean shape spec
python python/onnx2lean_shape.py model.onnx output.lean

# Verify shape safety
lake exe shapesafe_verify model.onnx
```
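Before conversion, it can help to see what static shape information an ONNX graph actually carries. The sketch below uses standard ONNX shape inference to print per-value shapes; it only previews the verifier's input, and the actual spec generation is done by `python/onnx2lean_shape.py`.

```python
import onnx
from onnx import shape_inference

# Annotate every intermediate value with an inferred static shape.
model = onnx.load("model.onnx")
inferred = shape_inference.infer_shapes(model)

# Each dimension is either a concrete size (dim_value) or a symbolic
# name (dim_param), e.g. a dynamic batch dimension.
for value in inferred.graph.value_info:
    dims = [d.dim_value or d.dim_param for d in value.type.tensor_type.shape.dim]
    print(value.name, dims)
```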
Auto-generated Rust and Python guards.

```bash
# Extract guards from Lean predicates
lake exe extract_guard
```

```python
# Use in Python
from ds_guard import phi_guard, coppa_guard
```
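A hypothetical usage sketch follows; the exact signatures `ds_guard` exports are assumptions here, and the dict-shaped row is illustrative.

```python
from ds_guard import phi_guard, coppa_guard

# Assumed call signatures: each guard takes a row and returns True
# when the row violates the corresponding policy (hypothetical).
row = {"phi": ["SSN: 123-45-6789"], "age": 25}

if phi_guard(row):
    raise ValueError("row contains PHI; filter it before training")
if coppa_guard(row):
    raise ValueError("row belongs to a minor; COPPA rules apply")
```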
Comprehensive regression tests for all deliverables:

```bash
# Run regression tests
python python/regression_tests.py
```

Tests include:

- 10k-row dataset processing
- GPT-2 shape proof performance (≤45 s)
- ETL throughput profiling
- Compliance validation
- SentinelOps schema compliance

Production-ready safety kernel for ML pipeline integration:
```python
from python.runtime_safety_kernel import create_safety_kernel, RuntimeConfig

# Create a safety kernel
config = RuntimeConfig(compliance_level="strict")
kernel = create_safety_kernel(config)

# Register model assets
kernel.register_model_asset(model_asset)

# Check data safety
safety_result = kernel.check_data_safety(data)
```
Generate compliance bundles matching the SentinelOps schema:

```python
from python.sentinelops_bundle import create_sentinelops_bundle

# Create a compliance bundle
success = create_sentinelops_bundle(
    dataset_path="dataset.parquet",
    dataset_name="my_dataset",
    output_path="compliance_bundle.zip",
    compliance_level="strict",
)
```
Robust extraction of ONNX node names with fallback strategies:

```python
from python.onnx_node_extractor import extract_onnx_node_names

# Extract node names
result = extract_onnx_node_names("model.onnx", "node_names_report.json")
```
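One plausible fallback strategy, synthesizing `<op_type>_<index>` names for unnamed nodes, is sketched below; the extractor's actual strategies may differ.

```python
import onnx

# ONNX allows nodes to have empty names; fall back to a deterministic
# "<op_type>_<index>" scheme so every node stays addressable.
model = onnx.load("model.onnx")
names = [
    node.name if node.name else f"{node.op_type}_{i}"
    for i, node in enumerate(model.graph.node)
]
print(f"extracted {len(names)} node names")
```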
CI workflow for experimental Lion optimizer proofs:

```yaml
# CI automatically detects experimental proofs
# and gates testing appropriately.
# See .github/workflows/lion_proof_gating.yml
```
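One way such detection could work is sketched below: scan the Lean sources for `sorry` placeholders and report the affected files so the workflow can gate them. This mechanism is an assumption; the real logic lives in `.github/workflows/lion_proof_gating.yml` and may differ.

```python
from pathlib import Path

# Flag Lean files that still contain `sorry` placeholders; a CI job can
# treat these as experimental and soft-fail or skip them.
experimental = [
    path
    for path in Path("src").rglob("*.lean")
    if "sorry" in path.read_text(encoding="utf-8")
]
for path in experimental:
    print(f"experimental proof (contains sorry): {path}")
```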
Complex mathematical proofs for optimizer stability:

```lean
-- See src/DatasetSafetySpecs/OptimizerInduction.lean
-- Contains hard induction sub-lemmas for the Lion optimizer
-- with proof status tracking and CI gating
```

- API Reference: see `docs/index.md` for complete API documentation
- Tutorial: "Prove Your First PHI Filter" in the documentation
- Contributing: see `CONTRIBUTING.md` for development guidelines
```bash
# Install dependencies
lake update

# Format code
lake exe lean --run src/format.lean

# Type check
lake build DatasetSafetySpecs
```

```bash
# Run all tests
python python/run_all_tests.py

# Run specific test suites
lake exe test_suite
lake exe benchmark_suite
lake exe lineage_regression_test
python python/regression_tests.py
```
The project includes a comprehensive bundling system:

```bash
# Create the complete bundle (Lean specs + Rust/Python guards)
./bundle.sh bundle

# Create specific bundles
./bundle.sh lean    # Lean specifications only
./bundle.sh python  # Python wheel only
./bundle.sh rust    # Rust crate only

# Clean the bundle directory
./bundle.sh clean
```

Use our issue templates for different types of issues:
- [POLICY]: Data policy and compliance issues
- [SHAPE]: Shape safety verification issues
- [BUG]: General bugs
- [FEATURE]: Feature requests
See CONTRIBUTING.md for development guidelines.
MIT License - see LICENSE for details.