Source: Start Data Engineering - Demonstrate Python Expertise by Building Libraries Purpose: Agent-friendly reference for building the data-quality-checker library Status: Active project, follow this methodology
Question: How do data engineers demonstrate Python expertise in the age of drag-and-drop tools and LLM code generation?
Answer: Build and publish libraries that solve real problems and that other engineers use.
- Demonstrate Expertise - Show employers you can add value before the interview
- Reduce Duplication - Automate repetitive tasks across repos
- Career Advancement - Build a portfolio of published, used packages
- Economics of Coding - Showcase value in a changing landscape
- boto3 wrapper - Enforce company-standard S3 paths, secrets management, infrastructure selection
- Database connection manager - Handle connections to multiple pipeline databases
- Workflow automation - Convert manual processes (e.g., sales validation) to CLI/webapp
- Data quality checker ← THIS PROJECT
"I'm planning to write test scripts to make sure all the end-to-end pipeline tests are satisfied during the development phase. I was looking for a reference on which tools would be the best to use for SQL scripting or Python?"
- Who: Data Engineers validating pipeline data
- What: Tool to validate data quality and log results
- Accept input datasets
- Validate data against rules
- Log validation results
| Aspect | Decision | Rationale |
|---|---|---|
| Dataset Size | < 100GB | Small-to-medium datasets, single-machine processing |
| Input Type | Polars DataFrames only | Modern, fast, reduces complexity |
| Result Storage | SQLite3 table | Lightweight, no external dependencies |
| Validation Types | unique, not_null, accepted_values, relationships |
Match dbt out-of-box tests (most common needs) |
- Pandera - Too complex for simple use cases
- Cuallee - API doesn't match requirements
- Great Expectations - Feature-rich but heavyweight
Our Goal: Build exactly what's needed, no more, no less.
┌─────────────────┐
│ Data Engineer │
│ [Person] │
│ │
│ Validates data │
│ using library │
└────────┬────────┘
│ Uses
▼
┌─────────────────────────┐
│ Data Quality Checker │
│ [Software System] │
│ │
│ Validates Polars DFs, │
│ logs to SQLite, returns │
│ pass/fail │
└────────┬────────────────┘
│ Reads/Writes
▼
┌─────────────────────────┐
│ SQLite Database │
│ [External System] │
│ │
│ Stores validation │
│ results │
└─────────────────────────┘
┌────────────────────────────────────────────────┐
│ Data Quality Checker │
│ [Software System] │
│ │
│ ┌──────────────────────┐ │
│ │ DataQualityChecker │ │
│ │ [Class] │ │
│ │ │ │
│ │ - is_column_unique() │ │
│ │ - is_column_not_null()│ │
│ │ - is_column_enum() │ │
│ │ - are_tables_ │ │
│ │ referential_ │ │
│ │ integral() │ │
│ └──────────┬───────────┘ │
│ │ Uses │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ DBConnector │ │
│ │ [Class] │ │
│ │ │ │
│ │ - log() │ │
│ │ - print_all_logs() │ │
│ └──────────┬───────────┘ │
│ │ Logs to │
└─────────────┼──────────────────────────────────┘
▼
┌─────────────────────┐
│ SQLite Database │
│ [.db file] │
│ │
│ log table: │
│ - id │
│ - timestamp │
│ - check_type │
│ - result │
│ - additional_params│
└─────────────────────┘
Key Observations:
- Orthogonal Systems - Validator and Logger are independent (can swap DBConnector for different output)
- Future Extensibility - Can add InputReader class for S3, CSV, Parquet, SFTP, etc.
- Engineer-Defined Components - Function signatures defined by engineer, not LLM (prevents scope creep)
- LLM Role - Implementation only, not architecture
Recommended Reading: Python Design Patterns
# Initialize library project
uv init --lib data_quality_checker
cd data_quality_checker
# Install dependencies
uv add polars
uv add --dev pytest pytest-cov pytest-mock
# Create structure
mkdir -p ./src/data_quality_checker/connector
touch ./src/data_quality_checker/main.py
touch ./src/data_quality_checker/connector/output_log.py
mkdir -p ./tests/unit
touch ./tests/conftest.pyFile: ./src/data_quality_checker/main.py
from typing import List, Optional
import polars as pl
from .connector.output_log import DBConnector
class DataQualityChecker:
"""Validates Polars DataFrames against data quality rules."""
def __init__(self, db_connector: DBConnector):
"""
Initialize the data quality checker.
Args:
db_connector: DBConnector instance for logging results
"""
self.db_connector = db_connector
def is_column_unique(
self,
df: pl.DataFrame,
column: str
) -> bool:
"""
Check if column values are unique.
Args:
df: Polars DataFrame to validate
column: Column name to check
Returns:
True if all values are unique, False otherwise
"""
pass
def is_column_not_null(
self,
df: pl.DataFrame,
column: str
) -> bool:
"""
Check if column has no null values.
Args:
df: Polars DataFrame to validate
column: Column name to check
Returns:
True if no nulls, False otherwise
"""
pass
def is_column_enum(
self,
df: pl.DataFrame,
column: str,
accepted_values: List[str]
) -> bool:
"""
Check if column values are in accepted list.
Args:
df: Polars DataFrame to validate
column: Column name to check
accepted_values: List of valid values
Returns:
True if all values are in accepted list, False otherwise
"""
pass
def are_tables_referential_integral(
self,
parent_df: pl.DataFrame,
child_df: pl.DataFrame,
parent_key: str,
child_key: str
) -> bool:
"""
Check referential integrity between parent and child tables.
Args:
parent_df: Parent table DataFrame
child_df: Child table DataFrame
parent_key: Primary key column in parent
child_key: Foreign key column in child
Returns:
True if all child keys exist in parent, False otherwise
"""
passFile: ./src/data_quality_checker/connector/output_log.py
import sqlite3
from datetime import datetime
from typing import Dict, Any, Optional
from pathlib import Path
class DBConnector:
"""Manages SQLite database connection and logging."""
def __init__(self, db_path: Path):
"""
Initialize database connector.
Args:
db_path: Path to SQLite database file
"""
self.db_path = db_path
self._initialize_db()
def _initialize_db(self) -> None:
"""Create log table if it doesn't exist."""
pass
def log(
self,
check_type: str,
result: bool,
additional_params: Optional[Dict[str, Any]] = None
) -> None:
"""
Log validation result to database.
Args:
check_type: Type of validation (e.g., 'unique', 'not_null')
result: True if validation passed, False otherwise
additional_params: Additional context (column name, values, etc.)
"""
pass
def print_all_logs(self) -> None:
"""Print all logged validation results."""
passInstructions for LLM:
- Implement the functions based on the signatures and docstrings
- Use Polars DataFrame operations (no pandas)
- Ensure proper error handling
- Log each validation result via DBConnector
- Keep code simple and readable
Reference: See full working code
Required Reading: Pytest Usage Documentation
File: ./tests/conftest.py
import pytest
from pathlib import Path
import tempfile
from data_quality_checker.connector.output_log import DBConnector
@pytest.fixture(scope="session")
def temp_db():
"""Create temporary database for testing."""
with tempfile.NamedTemporaryFile(suffix='.db', delete=False) as tmp:
db_path = Path(tmp.name)
yield db_path
# Cleanup
db_path.unlink()
@pytest.fixture
def db_connector(temp_db):
"""Create DBConnector instance with temporary database."""
return DBConnector(temp_db)
@pytest.fixture
def mock_db_connector(mocker):
"""Mock DBConnector to test without actual DB writes."""
mock = mocker.MagicMock()
mock.log = mocker.MagicMock()
return mockTest Structure:
- Test DBConnector independently (using temp_db fixture)
- Test DataQualityChecker with mocked DBConnector (using mock_db_connector)
- Test all validation functions with edge cases
Run Tests:
uv run pytest tests/
uv run pytest tests/ --cov=src/data_quality_checker # with coverage-
Installation
pip install data-quality-checker
-
Quick Start Example
import polars as pl from data_quality_checker import DataQualityChecker, DBConnector # Setup db = DBConnector("validation_logs.db") checker = DataQualityChecker(db) # Load data df = pl.read_csv("data.csv") # Run checks checker.is_column_unique(df, "user_id") checker.is_column_not_null(df, "email") # View results db.print_all_logs()
-
Feature List
- Unique value validation
- Not-null validation
- Enum/accepted values validation
- Referential integrity validation
-
API Reference
- Full function signatures
- Parameter descriptions
- Return values
- Examples
-
Architecture Diagrams
- System Context (C4 Level 1)
- Container Diagram (C4 Level 2)
-
Development Guide
- How to contribute
- How to run tests
- How to build locally
-
License
- Use MIT License for public packages
LLM Prompt: "Generate a README with the above sections for the data-quality-checker library. Include code examples and make it beginner-friendly."
Reference: Example README
-
Create Account at test.pypi.org
-
Enable 2FA (Required for package creation)
- Go to Account Settings → Security
-
Create API Token
- Account Settings → API tokens → Add API token
- Scope: "Entire account"
- SAVE THE TOKEN - shown only once
-
Build & Upload
# Build distribution uv build # Upload to test PyPI uv publish --publish-url https://test.pypi.org/legacy/ # When prompted: # Username: __token__ # Password: <paste your API token>
-
Test Installation
uv pip install --index-url https://test.pypi.org/simple/ data-quality-checker
-
Verify
- Check your package page:
https://test.pypi.org/project/data-quality-checker/ - Test all functionality in a fresh environment
- Check your package page:
-
Create Account at pypi.org
-
Repeat 2FA & API Token Setup
-
Publish
uv publish # Username: __token__ # Password: <production API token>
-
Verify
pip install data-quality-checker
-
Problem Scoping
- Define: Who uses it? What does it do?
- Set clear constraints (input types, size limits, output format)
- Start small, iterate later
-
Design Architecture
- Draw C4 diagrams (System Context + Container)
- Define function signatures and classes
- Engineer leads, LLM implements
-
LLM-Guided Implementation
- Provide detailed type hints and docstrings
- Generate code with LLM
- Review and refine
-
Create Tests
- Unit tests for each component
- Integration tests for workflows
- Aim for >80% coverage
-
Publish & Document
- Comprehensive README
- Test on Test PyPI first
- Publish to PyPI when ready
- Identify 3 repetitive tasks in your current work
- Pick the simplest one and apply this framework
- Publish to PyPI within 2 weeks
- Add to resume and GitHub portfolio
When working on the data-quality-checker project or similar libraries:
- Follow the architecture - Don't deviate from the defined class structure
- Implement one component at a time - DBConnector first, then DataQualityChecker
- Write tests alongside code - Each function needs corresponding tests
- Keep it simple - Resist feature creep, stick to the defined scope
- Document as you go - Update README with each new feature
- Use type hints everywhere - They guide LLM generation and catch errors
- Test on Test PyPI first - Always validate before production publish
- Expertise = Building, not memorizing - Libraries demonstrate real value
- Scope ruthlessly - Better to ship something small and useful than nothing big and perfect
- Architecture first, code second - LLMs implement, humans design
- Tests prevent regression - Critical for library maintenance
- Publishing matters - It's not real until others can
pip installit
- Article: Start Data Engineering - Building Libraries
- Example Repo: data_quality_checker on GitHub
- Tools:
- Design Patterns: Refactoring Guru
- C4 Model: c4model.com