# data-quality-checker

Validate Polars DataFrames and log results to SQLite.
A lightweight Python library for data engineers who need to validate pipeline data without the overhead of heavyweight frameworks.
## Installation

```bash
pip install data-quality-checker
```

Or with uv:

```bash
uv add data-quality-checker
```

## Quick Start

```python
import polars as pl
from data_quality_checker import DataQualityChecker, DBConnector
# Setup
db = DBConnector("validation_logs.db")
checker = DataQualityChecker(db)
# Load data
df = pl.DataFrame({
"user_id": [1, 2, 3],
"email": ["a@test.com", "b@test.com", "c@test.com"],
"status": ["active", "inactive", "active"],
})
# Run checks
checker.is_column_unique(df, "user_id") # True
checker.is_column_not_null(df, "email") # True
checker.is_column_enum(df, "status", ["active", "inactive", "pending"]) # True
# View results
db.print_all_logs()
```

## Checks

| Check | Method | Description |
|---|---|---|
| Unique | `is_column_unique(df, column)` | All values are unique |
| Not Null | `is_column_not_null(df, column)` | No null values |
| Accepted Values | `is_column_enum(df, column, values)` | All values in accepted list |
| Referential Integrity | `are_tables_referential_integral(parent, child, pk, fk)` | All foreign keys exist in parent |
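Referential integrity is the one check not shown in the quick start. A minimal sketch, reusing the `checker` from above; the `users` and `orders` frames are illustrative:

```python
import polars as pl
from data_quality_checker import DataQualityChecker, DBConnector

checker = DataQualityChecker(DBConnector("validation_logs.db"))

# Hypothetical parent/child tables: every order should point at an existing user.
users = pl.DataFrame({"user_id": [1, 2, 3]})
orders = pl.DataFrame({"order_id": [10, 11], "user_id": [1, 4]})

# user_id 4 has no matching parent row, so the check fails (and is logged).
checker.are_tables_referential_integral(users, orders, "user_id", "user_id")  # False
```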
Each check:

- Returns `True` (pass) or `False` (fail)
- Logs the result to SQLite with timestamp and context
- Raises `ValueError` if the column doesn't exist
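Because failures come back as plain booleans while misconfiguration raises, the two are easy to tell apart in a pipeline. Continuing with the `checker` and `df` from the quick start:

```python
# A failed check returns False, so it can gate a pipeline step directly.
if not checker.is_column_unique(df, "user_id"):
    raise SystemExit("duplicate user_id values; aborting load")

# A missing column raises ValueError instead of returning False.
try:
    checker.is_column_not_null(df, "no_such_column")
except ValueError as exc:
    print(f"check misconfigured: {exc}")
```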
## CLI

Install and run checks from the command line with a YAML config file:

```bash
# Install
uv add data-quality-checker

# Run checks against a data file
dqc check data.csv --config checks.yml

# View logged results
dqc logs validation_logs.db
```

```yaml
# checks.yml
db: validation_logs.db
checks:
  - type: not_null
    column: uuid_inventario
  - type: unique
    column: uuid_inventario
  - type: accepted_values
    column: occurrence_type
    values:
      - "Monitoração de Drenagem Profunda"
      - "Monitoração de OAE"
```

| YAML type | Method | Extra Fields |
|---|---|---|
| `not_null` | `is_column_not_null()` | `column` |
| `unique` | `is_column_unique()` | `column` |
| `accepted_values` | `is_column_enum()` | `column`, `values` |
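Per the table above, each YAML entry maps onto one Python call, so the config is equivalent to roughly this script (a sketch, assuming the CLI simply dispatches each entry to the listed method):

```python
import polars as pl
from data_quality_checker import DataQualityChecker, DBConnector

checker = DataQualityChecker(DBConnector("validation_logs.db"))
df = pl.read_csv("data.csv")

# One call per entry in checks.yml.
checker.is_column_not_null(df, "uuid_inventario")
checker.is_column_unique(df, "uuid_inventario")
checker.is_column_enum(df, "occurrence_type", [
    "Monitoração de Drenagem Profunda",
    "Monitoração de OAE",
])
```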
Supported input formats:

- `.csv` — loaded with `polars.read_csv()`
- `.parquet` — loaded with `polars.read_parquet()`
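Internally that dispatch could look something like the following; `load_input` is a hypothetical name, not part of the package's public API:

```python
from pathlib import Path

import polars as pl

def load_input(path: str) -> pl.DataFrame:
    # Pick the Polars reader based on file extension.
    suffix = Path(path).suffix
    if suffix == ".csv":
        return pl.read_csv(path)
    if suffix == ".parquet":
        return pl.read_parquet(path)
    raise ValueError(f"unsupported file format: {suffix}")
```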
Exit codes:

- `0` — all checks passed
- `1` — one or more checks failed, or an error occurred
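The non-zero exit code lets `dqc` gate a larger pipeline. For example, invoked from a Python-driven job (paths are illustrative):

```python
import subprocess

# dqc exits 1 when any check fails, so a non-zero return code fails the job.
result = subprocess.run(["dqc", "check", "data.csv", "--config", "checks.yml"])
if result.returncode != 0:
    raise SystemExit("data quality checks failed")
```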
## API Reference

### `DBConnector`

Manages the SQLite connection and logging.

- `log(check_type, result, additional_params=None)` - Log a validation result
- `print_all_logs()` - Print all logged results
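`log()` can also record custom checks alongside the built-in ones. A sketch, assuming `additional_params` accepts an arbitrary dict of context:

```python
from data_quality_checker import DBConnector

db = DBConnector("validation_logs.db")

# Record a hand-rolled check next to the library's own results.
row_count_ok = True  # e.g. the outcome of a custom row-count assertion
db.log("min_row_count", row_count_ok, additional_params={"table": "users", "min_rows": 1})
db.print_all_logs()
```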
### `DataQualityChecker`

Validates Polars DataFrames.

- `is_column_unique(df, column)` - Check uniqueness
- `is_column_not_null(df, column)` - Check for nulls
- `is_column_enum(df, column, accepted_values)` - Check accepted values
- `are_tables_referential_integral(parent_df, child_df, parent_key, child_key)` - Check referential integrity
## Architecture

```
┌─────────────────┐
│ Data Engineer │
│ [Person] │
└────────┬─────────┘
│ Uses
▼
┌─────────────────────────┐
│ DataQualityChecker │
│ │
│ - is_column_unique() │
│ - is_column_not_null() │
│ - is_column_enum() │
│ - are_tables_ │
│ referential_integral()│
└────────┬────────────────┘
│ Logs via
▼
┌─────────────────────────┐
│ DBConnector │
│ │
│ - log() │
│ - print_all_logs() │
└────────┬────────────────┘
│ Writes to
▼
┌─────────────────────────┐
│ SQLite (.db file) │
│ │
│ validation_log table │
└─────────────────────────┘
```
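Because the log store is a plain SQLite file, it can be inspected with any SQLite client. The `validation_log` table name comes from the diagram; its exact columns aren't documented here, so the sketch below selects everything:

```python
import sqlite3

# Dump every logged result straight from the .db file.
con = sqlite3.connect("validation_logs.db")
for row in con.execute("SELECT * FROM validation_log"):
    print(row)
con.close()
```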
## Development

```bash
# Clone and install
git clone <repo-url>
cd data-quality-checker
uv sync --all-extras

# Run tests
uv run pytest tests/

# Run with coverage
uv run pytest tests/ --cov=src/data_quality_checker
```

## Scope

- Dataset size: < 100GB (single-machine processing)
- Input type: Polars DataFrames only
- Result storage: SQLite3
- Python: >= 3.9
## Related Topics

| Topic | Description |
|---|---|
| SQL Guide | Query patterns, optimization, window functions |
| Python & PySpark | Data processing with Python ecosystem |
| Pipeline Patterns | ETL/ELT design and orchestration |
| Data Quality | Validation, testing, and monitoring |
| CLI Tools | Essential command-line references |
| Architecture | Data platform design patterns |
## License

MIT