Welcome to life. You are a data engineering mentor. Write in this repo, in .md files, the commands and guidelines that will help you through your work; use the repo structure that suits you best.
# Data Engineering Mentor

A comprehensive reference guide for data engineering commands, patterns, and best practices.
```
docs/
├── sql/           # SQL commands, patterns, and optimization
├── python/        # Python & PySpark for data engineering
├── pipelines/     # ETL/ELT pipeline patterns
├── data-quality/  # Data quality patterns
├── cli/           # CLI tools and commands reference
└── architecture/  # Architecture patterns
```
## SQL Guide
Located at docs/sql/README.md, this guide includes:
- Deduplication patterns
- Incremental load patterns
- SCD Type 2 slowly changing dimensions
- Window functions (running totals, LAG/LEAD, percentiles)
- Performance optimization (indexing, partitioning)
- Data quality queries (null analysis, duplicates, referential integrity)
- Schema management and migration patterns
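To make the deduplication and window-function patterns above concrete, here is a minimal pandas sketch (the `user_id`, `email`, `loaded_at` columns are hypothetical) that keeps the latest row per key, the DataFrame counterpart of filtering `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY loaded_at DESC) = 1` in SQL:

```python
import pandas as pd

# Hypothetical events table: repeated loads produce duplicate user_ids
events = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 2],
    "email":     ["a@x.io", "a2@x.io", "b@x.io", "b2@x.io", "b3@x.io"],
    "loaded_at": pd.to_datetime([
        "2024-01-01", "2024-01-03", "2024-01-01", "2024-01-02", "2024-01-04",
    ]),
})

# Deduplicate: keep the most recent row per user_id
# (equivalent to ROW_NUMBER() OVER (PARTITION BY user_id
#  ORDER BY loaded_at DESC) = 1 in SQL)
latest = (
    events.sort_values("loaded_at")
          .drop_duplicates("user_id", keep="last")
          .reset_index(drop=True)
)
print(latest)
```

Sorting first and then `drop_duplicates(keep="last")` is the idiomatic pandas way to express this; in the warehouse itself you would use the window-function form.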
Task checklist:
- [x] Create repository structure with directories
- [x] Write main README with overview
- [ ] Create SQL commands and best practices guide
- [ ] Create Python/PySpark data engineering guide
- [ ] Create data quality patterns guide
- [ ] Create ETL/ELT pipeline patterns guide
- [ ] Create CLI tools and commands reference
Create a skill at `.claude/skills/data-engineering-mentor/SKILL.md`:
---
name: data-engineering-mentor
description: Data engineering expert providing guidance on data pipelines, quality checks, ETL design, and architecture. Use for design reviews, mentorship, and best practice questions.
argument-hint: [topic] or [code-to-review]
---
You are an experienced data engineering mentor with deep expertise in:
1. **Data Pipelines & ETL** - Architecture, orchestration (Airflow, dbt), quality
2. **Data Quality** - Validation design, anomaly detection, governance
3. **Database Design** - Schema design, partitioning, optimization
4. **Tools** - SQL, Python (pandas, PySpark, Polars), modern data stack
## Mentoring Approach
- Start with fundamentals before implementation
- Provide concrete, runnable examples
- Discuss tradeoffs and context
- Review code for data patterns and optimization
## Reference Documentation
Consult these project docs for patterns:
- SQL patterns: docs/sql/README.md
- Python patterns: docs/python/README.md
- Pipeline patterns: docs/pipelines/README.md
- Data quality: docs/data-quality/README.md
- CLI reference: docs/cli/README.md
- Architecture: docs/architecture/README.md

Create a subagent at `.claude/agents/data-engineering-mentor.md` if you need:
- Task isolation (doesn't clutter the main conversation)
- Persistent memory across sessions (`memory: user`)
- Restricted tool access
- A different model (opus for capability, haiku for speed)
```
.claude/
├── skills/
│   └── data-engineering-mentor/
│       ├── SKILL.md          # Main agent instructions
│       ├── docs/             # Supporting documentation
│       │   ├── data-quality-patterns.md
│       │   └── etl-best-practices.md
│       └── examples/         # Code examples
│           ├── validation.sql
│           └── quality-checks.py
└── agents/                   # Alternative: subagent approach
    └── data-engineering-mentor.md
```
## Python Guide (docs/python/README.md)
Content to include:
- Core data processing patterns (pandas, Polars, PySpark)
- File I/O patterns (parquet, JSON, CSV)
- Data validation and transformation patterns
- Performance optimization (vectorization, partitioning)
- Common utilities and helper functions
- Type hints best practices
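The topics above can be illustrated with one small sketch that combines type hints, a cheap validation gate, and a vectorized transformation (the `order_id`/`amount` columns and the `normalize_revenue` helper are illustrative, not a fixed schema):

```python
import pandas as pd

def normalize_revenue(df: pd.DataFrame, rate: float = 0.9) -> pd.DataFrame:
    """Validate required columns, then apply a vectorized transform.

    Column names ('order_id', 'amount') are hypothetical examples.
    """
    required = {"order_id", "amount"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    out = df.copy()
    # Vectorized: operate on the whole column, never row-by-row apply()
    out["net_amount"] = out["amount"] * rate
    return out

orders = pd.DataFrame({"order_id": [1, 2], "amount": [100.0, 250.0]})
result = normalize_revenue(orders)
```

Failing fast on a missing column turns a silent downstream `KeyError` into an explicit, testable contract at the function boundary.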
## Pipeline Patterns Guide (docs/pipelines/README.md)
Content to include:
- Airflow DAG patterns and best practices
- dbt patterns (models, tests, macros)
- Error handling and retry strategies
- Idempotency patterns
- Incremental processing patterns
- Pipeline testing strategies
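Of the patterns above, idempotency is the one most worth a concrete sketch. A minimal stdlib-only example (the `load_partition` helper and the `dt=` layout are hypothetical) of an idempotent partition load: the target is keyed by run date and fully overwritten, so re-running the same task converges to the same state instead of appending duplicates:

```python
import json
import tempfile
from pathlib import Path

def load_partition(base: Path, run_date: str, rows: list[dict]) -> Path:
    """Idempotent load: delete-and-rewrite a run_date-keyed partition.

    Writing to a temp file and renaming it into place makes the
    overwrite atomic on POSIX filesystems, so a crashed run never
    leaves a half-written partition behind.
    """
    target = base / f"dt={run_date}" / "part-000.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(rows))   # write to temp file first…
    tmp.replace(target)                # …then atomically rename into place
    return target

base = Path(tempfile.mkdtemp())
load_partition(base, "2024-01-01", [{"id": 1}])
load_partition(base, "2024-01-01", [{"id": 1}])  # safe to re-run
```

The same overwrite-by-partition idea is what `INSERT OVERWRITE` in Spark/Hive or an incremental dbt model with a delete+insert strategy gives you at warehouse scale.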
## Data Quality Guide (docs/data-quality/README.md)
Content to include:
- Great Expectations patterns and setup
- dbt testing patterns
- Validation strategies (schema, business logic, freshness)
- Data quality metrics and monitoring
- Anomaly detection approaches
- Quality reporting patterns
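The three validation strategies listed above (schema, business logic, freshness) can be sketched in plain pandas; the column names and the `run_quality_checks` helper are hypothetical, and a real suite belongs in a framework such as Great Expectations or dbt tests:

```python
from datetime import datetime, timedelta, timezone
import pandas as pd

def run_quality_checks(df: pd.DataFrame,
                       max_staleness: timedelta) -> dict[str, bool]:
    """Three illustrative check layers: schema, business logic, freshness."""
    now = datetime.now(timezone.utc)
    return {
        # Schema: required columns are present
        "schema":    {"event_id", "amount", "event_ts"} <= set(df.columns),
        # Business logic: amounts can never be negative
        "business":  bool((df["amount"] >= 0).all()),
        # Freshness: newest event is within the allowed staleness window
        "freshness": bool((now - df["event_ts"].max()) <= max_staleness),
    }

events = pd.DataFrame({
    "event_id": [1, 2],
    "amount":   [10.0, 5.0],
    "event_ts": [datetime.now(timezone.utc)] * 2,
})
results = run_quality_checks(events, max_staleness=timedelta(hours=1))
```

Returning named booleans rather than raising immediately lets the caller decide whether a failed check should block the pipeline or just emit a metric, which is the usual split between hard and soft quality gates.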
## CLI Reference (docs/cli/README.md)
Content to include:
- Database CLIs (psql, mysql, bq, snowsql)
- Cloud CLIs (aws, gcloud, az) for data ops
- Data file tools (jq, csvkit, parquet-tools)
- dbt CLI reference
- Docker commands for data infrastructure
- Shell scripting patterns for data ops
## Architecture Guide (docs/architecture/README.md)
Content to include:
- Data modeling (star schema, snowflake, data vault)
- Data lake/lakehouse (Delta Lake, Iceberg)
- Stream vs batch processing
- Data warehouse design patterns
- Data mesh concepts
- CDC patterns
- Schema evolution strategies
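As a concrete illustration of the CDC patterns above, here is a snapshot-comparison sketch in pandas (the `snapshot_diff` helper and the `id`/`val` columns are hypothetical). It classifies each key as insert, update, or delete by outer-joining two table snapshots; log-based CDC tools like Debezium avoid this full-table diff by reading the database's change log instead:

```python
import pandas as pd

def snapshot_diff(prev: pd.DataFrame, curr: pd.DataFrame,
                  key: str) -> pd.DataFrame:
    """Minimal snapshot-comparison CDC via an outer join."""
    merged = prev.merge(curr, on=key, how="outer",
                        suffixes=("_prev", "_curr"), indicator=True)
    # Keys only in prev were deleted; only in curr were inserted
    merged["op"] = merged["_merge"].map({
        "left_only": "delete", "right_only": "insert", "both": "update",
    })
    # Rows present in both snapshots only count as updates
    # when a tracked column actually changed
    unchanged = ((merged["_merge"] == "both")
                 & (merged["val_prev"] == merged["val_curr"]))
    return merged.loc[~unchanged, [key, "op"]].reset_index(drop=True)

prev = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})
curr = pd.DataFrame({"id": [2, 3, 4], "val": ["b", "C", "d"]})
changes = snapshot_diff(prev, curr, "id")
```

The output is exactly the change set an SCD Type 2 merge would consume: deletes close out current rows, inserts open new ones, and updates do both.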
- Create agent structure: `mkdir -p .claude/skills/data-engineering-mentor/{docs,examples}`
- Write SKILL.md: use the template above
- Fill remaining docs: complete the 5 empty sections
- Test the agent: `/data-engineering-mentor How should I design quality checks?`
- Iterate: refine based on usage
- Data Quality First - Validate early, validate often
- Idempotency - Operations should be safely repeatable
- Observability - Log, monitor, alert on everything
- Schema Evolution - Plan for change from day one
- Documentation - If it's not documented, it doesn't exist