Building Production-Ready Python Libraries

A Data Engineer's Guide to Demonstrating Expertise

Source: Start Data Engineering - Demonstrate Python Expertise by Building Libraries Purpose: Agent-friendly reference for building the data-quality-checker library Status: Active project, follow this methodology

The Core Problem

Question: How do data engineers demonstrate Python expertise in the age of drag-and-drop tools and LLM code generation?

Answer: Build and publish libraries that solve real problems and that other engineers use.

Project Motivation

Why Build Libraries?

Demonstrate Expertise - Show employers you can add value before the interview
Reduce Duplication - Automate repetitive tasks across repos
Career Advancement - Build a portfolio of published, used packages
Economics of Coding - Showcase value in a changing landscape

Example Use Cases

boto3 wrapper - Enforce company-standard S3 paths, secrets management, infrastructure selection
Database connection manager - Handle connections to multiple pipeline databases
Workflow automation - Convert manual processes (e.g., sales validation) to CLI/webapp
Data quality checker ← THIS PROJECT

Project Scope: Data Quality Checker

Problem Statement (From User Email)

"I'm planning to write test scripts to make sure all the end-to-end pipeline tests are satisfied during the development phase. I was looking for a reference on which tools would be the best to use for SQL scripting or Python?"

Who & What

Who: Data Engineers validating pipeline data
What: Tool to validate data quality and log results

Features (MVP)

Accept input datasets
Validate data against rules
Log validation results

Constraints & Assumptions

Aspect	Decision	Rationale
Dataset Size	< 100GB	Small-to-medium datasets, single-machine processing
Input Type	Polars DataFrames only	Modern, fast, reduces complexity
Result Storage	SQLite3 table	Lightweight, no external dependencies
Validation Types	`unique`, `not_null`, `accepted_values`, `relationships`	Match dbt out-of-box tests (most common needs)

Why Not Use Existing Libraries?

Pandera - Too complex for simple use cases
Cuallee - API doesn't match requirements
Great Expectations - Feature-rich but heavyweight

Our Goal: Build exactly what's needed, no more, no less.

System Architecture (C4 Model)

Level 1: System Context Diagram

┌─────────────────┐
│  Data Engineer  │
│    [Person]     │
│                 │
│ Validates data  │
│ using library   │
└────────┬────────┘
         │ Uses
         ▼
┌─────────────────────────┐
│ Data Quality Checker    │
│  [Software System]      │
│                         │
│ Validates Polars DFs,   │
│ logs to SQLite, returns │
│ pass/fail               │
└────────┬────────────────┘
         │ Reads/Writes
         ▼
┌─────────────────────────┐
│  SQLite Database        │
│  [External System]      │
│                         │
│  Stores validation      │
│  results                │
└─────────────────────────┘

Level 2: Container Diagram

┌────────────────────────────────────────────────┐
│        Data Quality Checker                    │
│         [Software System]                      │
│                                                │
│  ┌──────────────────────┐                     │
│  │ DataQualityChecker   │                     │
│  │      [Class]         │                     │
│  │                      │                     │
│  │ - is_column_unique() │                     │
│  │ - is_column_not_null()│                    │
│  │ - is_column_enum()   │                     │
│  │ - are_tables_        │                     │
│  │   referential_       │                     │
│  │   integral()         │                     │
│  └──────────┬───────────┘                     │
│             │ Uses                             │
│             ▼                                  │
│  ┌──────────────────────┐                     │
│  │   DBConnector        │                     │
│  │      [Class]         │                     │
│  │                      │                     │
│  │ - log()              │                     │
│  │ - print_all_logs()   │                     │
│  └──────────┬───────────┘                     │
│             │ Logs to                          │
└─────────────┼──────────────────────────────────┘
              ▼
    ┌─────────────────────┐
    │  SQLite Database    │
    │     [.db file]      │
    │                     │
    │  log table:         │
    │  - id               │
    │  - timestamp        │
    │  - check_type       │
    │  - result           │
    │  - additional_params│
    └─────────────────────┘

Design Patterns & Principles

Key Observations:

Orthogonal Systems - Validator and Logger are independent (can swap DBConnector for different output)
Future Extensibility - Can add InputReader class for S3, CSV, Parquet, SFTP, etc.
Engineer-Defined Components - Function signatures defined by engineer, not LLM (prevents scope creep)
LLM Role - Implementation only, not architecture

Recommended Reading: Python Design Patterns

Implementation Workflow

Step 1: Project Setup with uv

# Initialize library project
uv init --lib data_quality_checker
cd data_quality_checker

# Install dependencies
uv add polars 
uv add --dev pytest pytest-cov pytest-mock

# Create structure
mkdir -p ./src/data_quality_checker/connector
touch ./src/data_quality_checker/main.py 
touch ./src/data_quality_checker/connector/output_log.py 

mkdir -p ./tests/unit
touch ./tests/conftest.py

Step 2: Define Function Signatures (Engineer-Led)

File: ./src/data_quality_checker/main.py

from typing import List, Optional
import polars as pl
from .connector.output_log import DBConnector

class DataQualityChecker:
    """Validates Polars DataFrames against data quality rules."""
    
    def __init__(self, db_connector: DBConnector):
        """
        Initialize the data quality checker.
        
        Args:
            db_connector: DBConnector instance for logging results
        """
        self.db_connector = db_connector
    
    def is_column_unique(
        self, 
        df: pl.DataFrame, 
        column: str
    ) -> bool:
        """
        Check if column values are unique.
        
        Args:
            df: Polars DataFrame to validate
            column: Column name to check
            
        Returns:
            True if all values are unique, False otherwise
        """
        pass
    
    def is_column_not_null(
        self, 
        df: pl.DataFrame, 
        column: str
    ) -> bool:
        """
        Check if column has no null values.
        
        Args:
            df: Polars DataFrame to validate
            column: Column name to check
            
        Returns:
            True if no nulls, False otherwise
        """
        pass
    
    def is_column_enum(
        self, 
        df: pl.DataFrame, 
        column: str, 
        accepted_values: List[str]
    ) -> bool:
        """
        Check if column values are in accepted list.
        
        Args:
            df: Polars DataFrame to validate
            column: Column name to check
            accepted_values: List of valid values
            
        Returns:
            True if all values are in accepted list, False otherwise
        """
        pass
    
    def are_tables_referential_integral(
        self,
        parent_df: pl.DataFrame,
        child_df: pl.DataFrame,
        parent_key: str,
        child_key: str
    ) -> bool:
        """
        Check referential integrity between parent and child tables.
        
        Args:
            parent_df: Parent table DataFrame
            child_df: Child table DataFrame
            parent_key: Primary key column in parent
            child_key: Foreign key column in child
            
        Returns:
            True if all child keys exist in parent, False otherwise
        """
        pass

File: ./src/data_quality_checker/connector/output_log.py

import sqlite3
from datetime import datetime
from typing import Dict, Any, Optional
from pathlib import Path

class DBConnector:
    """Manages SQLite database connection and logging."""
    
    def __init__(self, db_path: Path):
        """
        Initialize database connector.
        
        Args:
            db_path: Path to SQLite database file
        """
        self.db_path = db_path
        self._initialize_db()
    
    def _initialize_db(self) -> None:
        """Create log table if it doesn't exist."""
        pass
    
    def log(
        self,
        check_type: str,
        result: bool,
        additional_params: Optional[Dict[str, Any]] = None
    ) -> None:
        """
        Log validation result to database.
        
        Args:
            check_type: Type of validation (e.g., 'unique', 'not_null')
            result: True if validation passed, False otherwise
            additional_params: Additional context (column name, values, etc.)
        """
        pass
    
    def print_all_logs(self) -> None:
        """Print all logged validation results."""
        pass

Step 3: Generate Implementation with LLM

Instructions for LLM:

Implement the functions based on the signatures and docstrings
Use Polars DataFrame operations (no pandas)
Ensure proper error handling
Log each validation result via DBConnector
Keep code simple and readable

Reference: See full working code

Step 4: Create Tests (Prevent Regression)

Required Reading: Pytest Usage Documentation

File: ./tests/conftest.py

import pytest
from pathlib import Path
import tempfile
from data_quality_checker.connector.output_log import DBConnector

@pytest.fixture(scope="session")
def temp_db():
    """Create temporary database for testing."""
    with tempfile.NamedTemporaryFile(suffix='.db', delete=False) as tmp:
        db_path = Path(tmp.name)
    
    yield db_path
    
    # Cleanup
    db_path.unlink()

@pytest.fixture
def db_connector(temp_db):
    """Create DBConnector instance with temporary database."""
    return DBConnector(temp_db)

@pytest.fixture
def mock_db_connector(mocker):
    """Mock DBConnector to test without actual DB writes."""
    mock = mocker.MagicMock()
    mock.log = mocker.MagicMock()
    return mock

Test Structure:

Test DBConnector independently (using temp_db fixture)
Test DataQualityChecker with mocked DBConnector (using mock_db_connector)
Test all validation functions with edge cases

Run Tests:

uv run pytest tests/
uv run pytest tests/ --cov=src/data_quality_checker  # with coverage

Documentation: README Structure

Required Sections

Installation
```
pip install data-quality-checker
```

Quick Start Example

import polars as pl
from data_quality_checker import DataQualityChecker, DBConnector

# Setup
db = DBConnector("validation_logs.db")
checker = DataQualityChecker(db)

# Load data
df = pl.read_csv("data.csv")

# Run checks
checker.is_column_unique(df, "user_id")
checker.is_column_not_null(df, "email")

# View results
db.print_all_logs()

Feature List
- Unique value validation
- Not-null validation
- Enum/accepted values validation
- Referential integrity validation
API Reference
- Full function signatures
- Parameter descriptions
- Return values
- Examples
Architecture Diagrams
- System Context (C4 Level 1)
- Container Diagram (C4 Level 2)
Development Guide
- How to contribute
- How to run tests
- How to build locally
License
- Use MIT License for public packages

LLM Prompt: "Generate a README with the above sections for the data-quality-checker library. Include code examples and make it beginner-friendly."

Reference: Example README

Publishing Workflow

Phase 1: Test PyPI (Staging)

Create Account at test.pypi.org
Enable 2FA (Required for package creation)
- Go to Account Settings → Security
Create API Token
- Account Settings → API tokens → Add API token
- Scope: "Entire account"
- SAVE THE TOKEN - shown only once

Build & Upload

# Build distribution
uv build

# Upload to test PyPI
uv publish --publish-url https://test.pypi.org/legacy/

# When prompted:
# Username: __token__
# Password: <paste your API token>

Test Installation

uv pip install --index-url https://test.pypi.org/simple/ data-quality-checker

Verify
- Check your package page: https://test.pypi.org/project/data-quality-checker/
- Test all functionality in a fresh environment

Phase 2: Production PyPI

Create Account at pypi.org
Repeat 2FA & API Token Setup

Publish

uv publish

# Username: __token__
# Password: <production API token>

Verify
```
pip install data-quality-checker
```

Repeatable Framework Summary

The 5-Step Process

Problem Scoping
- Define: Who uses it? What does it do?
- Set clear constraints (input types, size limits, output format)
- Start small, iterate later
Design Architecture
- Draw C4 diagrams (System Context + Container)
- Define function signatures and classes
- Engineer leads, LLM implements
LLM-Guided Implementation
- Provide detailed type hints and docstrings
- Generate code with LLM
- Review and refine
Create Tests
- Unit tests for each component
- Integration tests for workflows
- Aim for >80% coverage
Publish & Document
- Comprehensive README
- Test on Test PyPI first
- Publish to PyPI when ready

Next Steps for Your Career

Identify 3 repetitive tasks in your current work
Pick the simplest one and apply this framework
Publish to PyPI within 2 weeks
Add to resume and GitHub portfolio

Agent Instructions

When working on the data-quality-checker project or similar libraries:

Follow the architecture - Don't deviate from the defined class structure
Implement one component at a time - DBConnector first, then DataQualityChecker
Write tests alongside code - Each function needs corresponding tests
Keep it simple - Resist feature creep, stick to the defined scope
Document as you go - Update README with each new feature
Use type hints everywhere - They guide LLM generation and catch errors
Test on Test PyPI first - Always validate before production publish

Key Takeaways

Expertise = Building, not memorizing - Libraries demonstrate real value
Scope ruthlessly - Better to ship something small and useful than nothing big and perfect
Architecture first, code second - LLMs implement, humans design
Tests prevent regression - Critical for library maintenance
Publishing matters - It's not real until others can pip install it

Resources

Article: Start Data Engineering - Building Libraries
Example Repo: data_quality_checker on GitHub
Tools:
- uv - Modern Python package manager
- Polars - Fast DataFrame library
- pytest - Testing framework
Design Patterns: Refactoring Guru
C4 Model: c4model.com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Building Production-Ready Python Libraries

A Data Engineer's Guide to Demonstrating Expertise

The Core Problem

Project Motivation

Why Build Libraries?

Example Use Cases

Project Scope: Data Quality Checker

Problem Statement (From User Email)

Who & What

Features (MVP)

Constraints & Assumptions

Why Not Use Existing Libraries?

System Architecture (C4 Model)

Level 1: System Context Diagram

Level 2: Container Diagram

Design Patterns & Principles

Implementation Workflow

Step 1: Project Setup with uv

Step 2: Define Function Signatures (Engineer-Led)

Step 3: Generate Implementation with LLM

Step 4: Create Tests (Prevent Regression)

Documentation: README Structure

Required Sections

Publishing Workflow

Phase 1: Test PyPI (Staging)

Phase 2: Production PyPI

Repeatable Framework Summary

The 5-Step Process

Next Steps for Your Career

Agent Instructions

Key Takeaways

Resources

FilesExpand file tree

production-ready-library.md

Latest commit

History

production-ready-library.md

File metadata and controls

Building Production-Ready Python Libraries

A Data Engineer's Guide to Demonstrating Expertise

The Core Problem

Project Motivation

Why Build Libraries?

Example Use Cases

Project Scope: Data Quality Checker

Problem Statement (From User Email)

Who & What

Features (MVP)

Constraints & Assumptions

Why Not Use Existing Libraries?

System Architecture (C4 Model)

Level 1: System Context Diagram

Level 2: Container Diagram

Design Patterns & Principles

Implementation Workflow

Step 1: Project Setup with uv

Step 2: Define Function Signatures (Engineer-Led)

Step 3: Generate Implementation with LLM

Step 4: Create Tests (Prevent Regression)

Documentation: README Structure

Required Sections

Publishing Workflow

Phase 1: Test PyPI (Staging)

Phase 2: Production PyPI

Repeatable Framework Summary

The 5-Step Process

Next Steps for Your Career

Agent Instructions

Key Takeaways

Resources