pyos_data_validation is a lightweight Python package for defining, validating, and comparing data contracts for tabular datasets. It enables data scientists to formalize assumptions about their data, such as schema, missingness constraints, numeric ranges, and categorical domains, and to automatically validate new datasets against those expectations. The package supports reproducible workflows and CI-friendly automation by producing structured validation outputs and clear, actionable error messages suitable for use in unit tests and GitHub Actions.
| Function | Purpose | Primary Input | Primary Output | Use Case |
|---|---|---|---|---|
| `infer_contract()` | Learn a data contract from existing data | DataFrame | Contract | Bootstrap validation rules from sample data |
| `validate_contract()` | Validate data against a contract | DataFrame + Contract | ValidationResult | Check whether new data meets expectations |
| `compare_contracts()` | Detect schema and distribution drift | Two Contracts | DriftReport | Monitor data evolution over time |
| `summarize_violations()` | Rank and group validation issues | ValidationResult | Summary | Prioritize data quality fixes |
Infers a data contract from a pandas DataFrame by automatically learning column types, acceptable missingness levels, numeric bounds, and categorical domains.
Signature:

```python
infer_contract(df: pd.DataFrame) -> Contract
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `df` | `pd.DataFrame` | Yes | - | The input DataFrame to analyze and create a contract from |
Returns:

| Type | Description |
|---|---|
| `Contract` | A contract object containing schema definitions, column types, value constraints, and validation rules |
Contract Object Structure:

```python
Contract(
    columns={
        'column_name': ColumnRule(
            dtype='int64',                   # Inferred data type as string
            max_missing_frac=0.05,           # Observed missing fraction
            min_value=10,                    # Minimum value (numeric only)
            max_value=100,                   # Maximum value (numeric only)
            allowed_values={'A', 'B', 'C'}   # Unique values (categorical only)
        ),
        # ... more columns
    }
)
```

Inference Behavior:
| Column Type | dtype | max_missing_frac | min_value / max_value | allowed_values |
|---|---|---|---|---|
| Numeric (int, float) | String representation of dtype | Fraction of NaN values | Min and max of non-null values | None |
| String/Object | `'object'`, `'str'`, or `'string'` | Fraction of NaN values | None | Set of unique non-null values (as strings) |
| Boolean | `'bool'` | Fraction of NaN values | None | Set of unique values (as strings) |
| Categorical | `'category'` | Fraction of NaN values | None | Set of unique values (as strings) |
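The inference rules above can be sketched in plain pandas. This is an illustrative stand-in, not the package's actual implementation; `sketch_column_rule` and its dict layout are hypothetical, chosen only to mirror the `ColumnRule` fields:

```python
import pandas as pd

def sketch_column_rule(s: pd.Series) -> dict:
    """Illustrative only: how a per-column rule *could* be derived."""
    rule = {
        "dtype": str(s.dtype),
        "max_missing_frac": float(s.isna().mean()),  # observed NaN fraction
        "min_value": None,
        "max_value": None,
        "allowed_values": None,
    }
    if pd.api.types.is_numeric_dtype(s) and not pd.api.types.is_bool_dtype(s):
        non_null = s.dropna()
        rule["min_value"] = non_null.min()   # numeric columns get bounds
        rule["max_value"] = non_null.max()
    else:
        # non-numeric columns get a categorical domain instead
        rule["allowed_values"] = {str(v) for v in s.dropna().unique()}
    return rule

df = pd.DataFrame({"age": [23, 45, None], "city": ["NYC", "SF", "NYC"]})
print(sketch_column_rule(df["age"]))
print(sketch_column_rule(df["city"]))
```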
Raises:

| Exception | When |
|---|---|
| `TypeError` | If `df` is not a pandas DataFrame |
Example:

```python
import pandas as pd
from pyos_data_validation.infer_contract import infer_contract

# Sample data
df = pd.DataFrame({
    "age": [23, 45, 31, 28, 52],
    "income": [50000, 72000, 61000, 58000, 95000],
    "city": ["NYC", "SF", "NYC", "LA", "SF"],
    "score": [8.5, 9.2, 7.8, 8.1, 9.5]
})

# Infer contract from data
contract = infer_contract(df)

# Access contract details
print(contract.columns['age'].dtype)            # 'int64'
print(contract.columns['age'].min_value)        # 23
print(contract.columns['age'].max_value)        # 52
print(contract.columns['city'].allowed_values)  # {'NYC', 'SF', 'LA'}
print(contract.columns['score'].min_value)      # 7.8

# Check missingness
print(contract.columns['age'].max_missing_frac) # 0.0 (no missing values)
```

Validates a DataFrame against a contract by checking for schema compliance, type mismatches, missingness violations, range violations, and unexpected categorical values. All columns defined in the contract are treated as required.
Signature:

```python
validate_contract(
    df: pd.DataFrame,
    contract: Contract,
    strict: bool = True
) -> ValidationResult
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `df` | `pd.DataFrame` | Yes | - | The DataFrame to validate |
| `contract` | `Contract` | Yes | - | The contract defining expected schema and constraints |
| `strict` | `bool` | No | `True` | If True, extra columns cause validation issues; if False, extra columns are ignored |
Returns:

| Type | Description |
|---|---|
| `ValidationResult` | Object containing validation status and list of issues |
ValidationResult Object Structure:

```python
ValidationResult(
    ok=True,   # False if any validation issues occurred
    issues=[
        Issue(
            kind='missing_column' | 'extra_column' | 'dtype' |
                 'missingness' | 'range' | 'category',
            message='Detailed description of the issue',
            column='column_name',   # or None for dataset-level issues
            expected=<expected_value>,
            observed=<observed_value>
        ),
        # ... more issues
    ]
)
```

Validation Checks Performed:
| Check Type | `kind` Value | Description | Fails Validation? |
|---|---|---|---|
| Missing columns | `'missing_column'` | Expected columns not present in DataFrame | Yes |
| Extra columns | `'extra_column'` | DataFrame has columns not in contract | Yes (if `strict=True`) |
| Dtype mismatches | `'dtype'` | Column type doesn't match contract (with normalization for string types) | Yes |
| Missingness violations | `'missingness'` | Missing value fraction exceeds `max_missing_frac` | Yes |
| Range violations | `'range'` | Numeric values outside `[min_value, max_value]` | Yes |
| Category violations | `'category'` | Categorical values not in `allowed_values` | Yes |
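The range and category checks boil down to simple vectorized comparisons. A sketch under the documented semantics (NaN values are ignored); the function names here are hypothetical, not the package's API:

```python
import pandas as pd

def range_violations(s: pd.Series, min_value, max_value) -> pd.Series:
    """Values strictly outside [min_value, max_value] (NaN ignored)."""
    non_null = s.dropna()
    return non_null[(non_null < min_value) | (non_null > max_value)]

def category_violations(s: pd.Series, allowed: set) -> set:
    """Observed values not in the allowed set (NaN ignored)."""
    return {str(v) for v in s.dropna().unique()} - allowed

ages = pd.Series([25, 150, 30])
print(list(range_violations(ages, 23, 52)))              # [150]
cities = pd.Series(["NYC", "Paris", "SF"])
print(category_violations(cities, {"NYC", "SF", "LA"}))  # {'Paris'}
```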
Type Normalization:

- String types (`'object'`, `'str'`, `'string'`) are treated as equivalent
- Other dtypes must match exactly
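This normalization rule is easy to express directly. A hypothetical helper (not part of the package's API) that mirrors the documented behavior:

```python
# Any of these dtype strings are considered interchangeable.
STRING_DTYPES = {"object", "str", "string"}

def dtypes_match(expected: str, observed: str) -> bool:
    """String dtypes are equivalent; everything else must match exactly."""
    if expected in STRING_DTYPES and observed in STRING_DTYPES:
        return True
    return expected == observed

print(dtypes_match("object", "string"))  # True
print(dtypes_match("int64", "float64"))  # False
```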
Example:

```python
import pandas as pd
from pyos_data_validation.validate_contract import validate_contract

# Assume we have an inferred contract from training data
df_new = pd.DataFrame({
    "age": [25, 150, 30],            # 150 violates range
    "income": [52000, 68000, 59000],
    "city": ["NYC", "Paris", "SF"],  # "Paris" not in allowed_values
})

# Validate with strict mode (default)
result = validate_contract(df_new, contract, strict=True)

if result.ok:
    print("✅ Validation passed!")
else:
    print(f"❌ Validation failed with {len(result.issues)} issues:")
    for issue in result.issues:
        print(f"  [{issue.kind}] {issue.column}: {issue.message}")

# Example output:
# ❌ Validation failed with 2 issues:
#   [range] age: age: max value 150 exceeds 52
#   [category] city: city: invalid values {'Paris'}

# Validate with non-strict mode (extra columns allowed)
df_extra = df_new.copy()
df_extra['extra_col'] = [1, 2, 3]
result_lenient = validate_contract(df_extra, contract, strict=False)
print(f"Strict mode would have {len(result.issues)} issues")
print(f"Non-strict mode has {len(result_lenient.issues)} issues")
```

Accessing Issue Details:
```python
# Get all issues of a specific kind
dtype_issues = [issue for issue in result.issues if issue.kind == 'dtype']

# Get all issues for a specific column
age_issues = [issue for issue in result.issues if issue.column == 'age']

# Print detailed information
for issue in result.issues:
    print(f"Column: {issue.column}")
    print(f"Type: {issue.kind}")
    print(f"Expected: {issue.expected}")
    print(f"Observed: {issue.observed}")
    print(f"Message: {issue.message}")
    print("---")
```

Compares two contracts to identify schema drift (structural changes) and constraint drift (value range or domain changes). Useful for monitoring data evolution between training and production datasets.
Signature:

```python
compare_contracts(
    contract_a: Contract,
    contract_b: Contract
) -> DriftReport
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `contract_a` | `Contract` | Yes | - | The baseline contract (e.g., from training data) |
| `contract_b` | `Contract` | Yes | - | The comparison contract (e.g., from new/production data) |
Returns:

| Type | Description |
|---|---|
| `DriftReport` | Object containing detected drift categorized by type |
DriftReport Object Structure:

```python
DriftReport(
    added_columns=set(),     # Columns in contract_b but not in contract_a
    removed_columns=set(),   # Columns in contract_a but not in contract_b
    dtype_changes={},        # {column: (old_dtype, new_dtype)}
    range_changes=set(),     # Columns with min_value or max_value changes
    category_changes=set(),  # Columns with allowed_values changes
    missingness_changes={}   # {column: (old_max_missing_frac, new_max_missing_frac)}
)
```

Drift Detection Logic:
| Drift Type | Field | Detection Criteria |
|---|---|---|
| Column additions | `added_columns` | Column exists in `contract_b.columns` but not in `contract_a.columns` |
| Column removals | `removed_columns` | Column exists in `contract_a.columns` but not in `contract_b.columns` |
| Type changes | `dtype_changes` | For shared columns, `dtype` differs between contracts |
| Range changes | `range_changes` | For shared columns (same dtype), `min_value` or `max_value` differs |
| Category changes | `category_changes` | For shared columns (same dtype), `allowed_values` differs |
| Missingness changes | `missingness_changes` | For shared columns, `max_missing_frac` differs |
Important Notes:

- Range and category changes are only detected when dtypes match between contracts
- The comparison is directional: "added" means present in `contract_b` only
- All fields return empty collections when no drift is detected
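The column-level checks above reduce to set operations on the contracts' column mappings. A simplified sketch that models each contract as a plain `{column: dtype}` dict (the real function compares full `ColumnRule` objects, so this is illustrative only):

```python
def column_drift(cols_a: dict, cols_b: dict):
    """Directional drift sketch: cols_a is the baseline, cols_b the comparison."""
    added = set(cols_b) - set(cols_a)      # present only in the new contract
    removed = set(cols_a) - set(cols_b)    # present only in the baseline
    dtype_changes = {
        c: (cols_a[c], cols_b[c])
        for c in set(cols_a) & set(cols_b)  # shared columns only
        if cols_a[c] != cols_b[c]
    }
    return added, removed, dtype_changes

a = {"age": "int64", "city": "object"}
b = {"age": "float64", "city": "object", "tier": "object"}
print(column_drift(a, b))  # ({'tier'}, set(), {'age': ('int64', 'float64')})
```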
Raises:

| Exception | When |
|---|---|
| `TypeError` | If either contract is not a `Contract` instance |
| `TypeError` | If any column rule is not a `ColumnRule` instance |
| `ValueError` | If `max_missing_frac` is not numeric or outside the `[0, 1]` range |
| `ValueError` | If `min_value` exceeds `max_value` for any column |
Example:

```python
from pyos_data_validation.compare_contracts import compare_contracts

# Training data contract (January)
df_train = pd.DataFrame({
    "age": [23, 45, 31, 28],
    "income": [50000, 72000, 61000, 58000],
    "city": ["NYC", "SF", "NYC", "LA"]
})
contract_train = infer_contract(df_train)

# Production data contract (6 months later)
df_prod = pd.DataFrame({
    "age": [22, 48, 35, 29, 95],                     # Age range expanded
    "income": [51000, 73000, 62000, 59000, 120000],  # Income range expanded
    "city": ["NYC", "SF", "Berlin", "LA", "Tokyo"],  # New cities
    "subscription_tier": ["basic", "premium", "basic", "premium", "premium"]  # New column
})
contract_prod = infer_contract(df_prod)

# Compare contracts
drift = compare_contracts(contract_train, contract_prod)

# Check for schema drift
print(f"New columns: {drift.added_columns}")
# Output: {'subscription_tier'}
print(f"Removed columns: {drift.removed_columns}")
# Output: set()
print(f"Type changes: {drift.dtype_changes}")
# Output: {}

# Check for distribution drift
print(f"Range changes: {drift.range_changes}")
# Output: {'age', 'income'}
print(f"Category changes: {drift.category_changes}")
# Output: {'city'}
print(f"Missingness changes: {drift.missingness_changes}")
# Output: {} (if missingness rates are the same)

# Programmatic drift detection
has_schema_drift = bool(
    drift.added_columns or
    drift.removed_columns or
    drift.dtype_changes
)
has_distribution_drift = bool(
    drift.range_changes or
    drift.category_changes or
    drift.missingness_changes
)

if has_schema_drift:
    print("⚠️ Schema drift detected - model retraining may be required")
if has_distribution_drift:
    print("📊 Distribution drift detected - monitor model performance")
```

Accessing Detailed Drift Information:
```python
# Examine dtype changes in detail
for col, (old_dtype, new_dtype) in drift.dtype_changes.items():
    print(f"{col}: {old_dtype} → {new_dtype}")

# Examine missingness changes in detail
for col, (old_frac, new_frac) in drift.missingness_changes.items():
    if old_frac:  # avoid division by zero when the baseline had no missing values
        change_pct = (new_frac - old_frac) / old_frac * 100
        print(f"{col}: missingness changed by {change_pct:.1f}%")

# Check specific columns for range drift
if 'age' in drift.range_changes:
    old_range = (contract_train.columns['age'].min_value,
                 contract_train.columns['age'].max_value)
    new_range = (contract_prod.columns['age'].min_value,
                 contract_prod.columns['age'].max_value)
    print(f"Age range changed from {old_range} to {new_range}")
```

Converts a ValidationResult into an actionable summary by ranking issues by severity, counting issues by type, and highlighting the most critical problems to address first.
Signature:

```python
summarize_violations(
    result: ValidationResult,
    *,
    top_k: int = 5,
    weights: Optional[Dict[str, Union[int, float]]] = None
) -> Summary
```

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `result` | `ValidationResult` | Yes | - | The validation result to summarize |
| `top_k` | `int` | No | `5` | Number of top issues to highlight (must be positive) |
| `weights` | `Dict[str, Union[int, float]]` | No | `None` | Custom weights for issue severity ranking |
Default Weights (when `weights=None`):

| Issue Kind | Weight | Rationale |
|---|---|---|
| `missing_column` | 10 | Most severe - data cannot be processed |
| `extra_column` | 8 | High - unexpected data structure |
| `dtype` | 7 | High - may cause runtime errors |
| `range` | 5 | Medium - data quality issue |
| `category` | 5 | Medium - unexpected values |
| `missingness` | 3 | Low-Medium - may affect analysis |
Custom Weights Behavior:

- If `weights` is provided, it completely replaces the defaults
- Issue kinds not in the custom dict default to weight `1`
- All weights must be positive numeric values
Returns:

| Type | Description |
|---|---|
| `Summary` | Object containing prioritized issues and statistics |
Summary Object Structure:

```python
Summary(
    ok=True,              # Same as result.ok
    top_issues=[...],     # Top K most severe issues (sorted by weight)
    counts_by_kind={      # Count of issues by kind
        'dtype': 2,
        'range': 3,
        'category': 1
    }
)
```

Sorting Logic:
Issues are sorted by the following criteria (in order):

- Weight (descending) - higher weight means higher priority
- Column name - issues with `column=None` sort first, then alphabetically
- Kind (alphabetically)
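The ranking above can be reproduced with a single `sorted` call whose key tuple encodes the three criteria. A sketch using plain dicts as stand-ins for `Issue` objects; the default weights are taken from the table above, and unknown kinds fall back to weight 1:

```python
DEFAULT_WEIGHTS = {"missing_column": 10, "extra_column": 8, "dtype": 7,
                   "range": 5, "category": 5, "missingness": 3}

def rank_issues(issues, weights=None):
    """Sort by weight desc, then None-column first, then column, then kind."""
    w = weights or DEFAULT_WEIGHTS
    return sorted(
        issues,
        key=lambda i: (-w.get(i["kind"], 1),     # unknown kinds get weight 1
                       i["column"] is not None,  # False (None) sorts first
                       i["column"] or "",
                       i["kind"]),
    )

issues = [{"kind": "range", "column": "age"},
          {"kind": "missing_column", "column": "user_id"},
          {"kind": "dtype", "column": "age"}]
print([i["kind"] for i in rank_issues(issues)])
# ['missing_column', 'dtype', 'range']
```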
Raises:

| Exception | When |
|---|---|
| `TypeError` | If `result` is not a `ValidationResult` instance |
| `TypeError` | If `top_k` is not an integer |
| `TypeError` | If `weights` is not a dict or `None` |
| `ValueError` | If `top_k` is not positive |
| `ValueError` | If any weight value is non-numeric |
| `ValueError` | If any weight value is not positive |
Example - Basic Usage:

```python
from pyos_data_validation.summarize_violations import summarize_violations

# Assume we have validation results with issues
result = validate_contract(df_messy, contract)

# Basic summary with default settings
summary = summarize_violations(result)

print(f"Validation passed: {summary.ok}")
print(f"Total issues: {len(result.issues)}")

# Show top 5 most severe issues
print("\n🔴 Top Issues to Fix:")
for i, issue in enumerate(summary.top_issues, 1):
    print(f"{i}. [{issue.kind}] {issue.column}: {issue.message}")

# Example output:
# 🔴 Top Issues to Fix:
# 1. [missing_column] user_id: Missing required column: user_id
# 2. [extra_column] debug_flag: Unexpected extra column: debug_flag
# 3. [dtype] age: age: expected int64, got object
# 4. [range] salary: salary: max value 250000 exceeds 150000
# 5. [category] status: status: invalid values {'archived'}

# Group by issue type
print("\n📊 Issues by Type:")
for kind, count in summary.counts_by_kind.items():
    print(f"  {kind}: {count}")

# Example output:
# 📊 Issues by Type:
#   missing_column: 1
#   extra_column: 1
#   dtype: 2
#   range: 3
#   category: 1
```

pyos_data_validation is inspired by production-grade data validation frameworks but serves a different purpose:
| Feature | pyos_data_validation | Pandera | Great Expectations | Pydantic |
|---|---|---|---|---|
| Target Use Case | Educational, lightweight validation | Production data validation | Enterprise data quality | API input validation |
| Learning Curve | Low | Medium | High | Low-Medium |
| Contract Inference | ✅ Automatic | | ✅ Profiling | ❌ Manual only |
| Drift Detection | ✅ Built-in | ❌ No | ✅ Via profiling | ❌ No |
| Tabular Data Focus | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No (objects) |
| CI/CD Friendly | ✅ Simple integration | ✅ Yes | ✅ Yes | |
| Minimal Dependencies | ✅ pandas only | | ❌ Heavy | ✅ Minimal |
| Validation Customization | | ✅ Extensive | ✅ Extensive | ✅ Extensive |
When to use pyos_data_validation:
- Small to medium projects
- Educational purposes and learning data validation concepts
- Quick prototyping of validation logic
- Lightweight CI/CD checks without complex infrastructure
- When you need simple drift detection out of the box
When to use alternatives:
- Pandera: Production ML pipelines with complex validation rules and custom checks
- Great Expectations: Enterprise data quality monitoring with extensive reporting and data docs
- Pydantic: API request/response validation or configuration management with type safety
You can install this package locally into your preferred Python environment using pip:

```shell
pip install -e .
```
A typical workflow is to infer a contract from an existing dataset and then validate new data against it.

```python
import pandas as pd
from pyos_data_validation.infer_contract import infer_contract
from pyos_data_validation.validate_contract import validate_contract

# example data
df = pd.DataFrame({
    "age": [23, 45, 31],
    "income": [50000, 72000, 61000],
    "city": ["A", "B", "A"],
})

# infer a contract from the data
contract = infer_contract(df)

# validate the same (or new) data against the contract
result = validate_contract(df, contract)
print(result.ok)
print(result.issues)
```

The validation result contains a boolean pass/fail flag (`ok`) and a list of issues describing any contract violations. This makes the package suitable for lightweight data checks and CI-friendly workflows.
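Because the result exposes a single pass/fail flag, wiring validation into CI is mostly a matter of running a test that asserts `result.ok`. A hypothetical GitHub Actions workflow is sketched below; the file name, paths, and Python version are illustrative and not part of this repository:

```yaml
# .github/workflows/data-check.yml (illustrative sketch, not a file in this repo)
name: data-check
on: [push]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e .
      # a test in this suite can run validate_contract and assert result.ok
      - run: pytest tests/ -q
```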
This section is for contributors and developers working on the pyos_data_validation package.
Clone the repository:

```shell
git clone https://github.com/UBC-MDS/DSCI_524_G26_Data_Validation.git
cd DSCI_524_G26_Data_Validation
```

Create and activate the conda environment:

```shell
conda env create -f environment.yml
conda activate pyos_data_validation
```

Install the package in editable mode with development dependencies:

```shell
pip install -e ".[dev,tests]"
```

Run the test suite with coverage:

```shell
pytest --cov=pyos_data_validation --cov-report=term --cov-branch
```

Run tests for a specific module:

```shell
pytest tests/unit/test_infer_contract.py -v
```

Check code style with ruff:

```shell
ruff check .
```

Format code with ruff:

```shell
ruff format .
```

Generate the API reference documentation:

```shell
quartodoc build
```

Build the documentation locally using Quarto:

```shell
quarto render
```

The built documentation will be available in the docs/ directory. You can open it in your browser:

```shell
open docs/index.html
```

To view the website after rendering from the terminal, run:

```shell
quarto preview
```

Documentation is automatically built and deployed to GitHub Pages when changes are pushed to the main branch. The deployment is handled by the .github/workflows/docs-publish.yml workflow.
View the live documentation at: https://ubc-mds.github.io/DSCI_524_G26_Data_Validation/
- Manikanth Goud Gurujala
- Eduardo Rafael Sanchez Martinez
- Yonas Gebre Marie
- Rahiq Raees
- Copyright © 2026 Eduardo, Yonas, Manikanth, Rahiq.
- Free software distributed under the MIT License.