- Curriculum Overview
- Learning Objectives
- Role Definition & Career Context
- Prerequisites & Preparation
- Curriculum Structure
- Detailed Module Breakdown
- Project-Based Learning
- Assessment Framework
- Study Plans
- Skills Matrix
- Technology Stack
- Environment Setup
- Learning Resources
- Career Advancement
This comprehensive curriculum transforms infrastructure engineers into specialized ML Platform Engineers who design, build, and maintain self-service platforms that enable data scientists and ML engineers to be productive at scale.
Core Focus Areas:
- Platform Architecture: Design multi-tenant, scalable ML platforms
- API Development: Build intuitive APIs and SDKs for ML workloads
- Feature Stores: Implement enterprise-grade feature management systems
- Workflow Orchestration: Create DAG-based ML pipeline systems
- Model Management: Build model registries with versioning and governance
- Developer Experience: Design exceptional tools and documentation
- Observability: Monitor and optimize platform performance
- Security & Governance: Implement authentication, authorization, and compliance
This curriculum emphasizes:
- Hands-On Projects: 80% practical implementation, 20% theory
- Progressive Complexity: Each project builds on previous knowledge
- Real-World Scenarios: Based on production platform patterns
- Best Practices: Industry-standard tools and architectural patterns
- Portfolio Development: Build demonstrable expertise
Part-Time (15-20 hours/week):
- Duration: 6-8 months
- Module study: 2-3 hours/week
- Project work: 12-17 hours/week
- Assessment: 1-2 hours/week
Full-Time (40 hours/week):
- Duration: 3-4 months
- Module study: 5-8 hours/week
- Project work: 30-35 hours/week
- Assessment: 2-3 hours/week
Total Estimated Hours: ~740 (roughly 100 for modules, 600 for projects, and 40 for the capstone)
By completing this curriculum, you will be able to:
- Design multi-tenant ML platforms with proper isolation and resource management
- Create API-first architectures that abstract infrastructure complexity
- Implement scalable platform services supporting hundreds of users
- Build extensible plugin systems for platform customization
- Design for high availability with proper failover and disaster recovery
- Design RESTful APIs following OpenAPI specifications
- Implement gRPC services for high-performance operations
- Build Python SDKs with excellent developer experience
- Create CLI tools for platform operations
- Write comprehensive API documentation with examples
- Implement feature stores with online and offline serving
- Build data pipelines for feature engineering at scale
- Design point-in-time correct data retrieval systems
- Implement feature versioning and lineage tracking
- Monitor data quality and detect feature drift
- Design DAG-based workflow systems for ML pipelines
- Implement task scheduling with dependency resolution
- Build distributed execution engines using Kubernetes
- Create retry and error handling mechanisms
- Monitor workflow performance and debug failures
- Build model registries with versioning and metadata
- Implement model deployment workflows and promotion gates
- Track model lineage from training to production
- Create A/B testing frameworks for model evaluation
- Monitor model performance and detect degradation
- Design intuitive APIs that ML practitioners love
- Create interactive documentation and tutorials
- Build self-service onboarding flows
- Implement feedback loops for continuous improvement
- Measure platform adoption and satisfaction
- Instrument services with metrics, logs, and traces
- Build monitoring dashboards for platform health
- Create alerting rules for proactive incident response
- Implement distributed tracing for debugging
- Define SLIs and SLOs for platform reliability
- Implement authentication (SSO, SAML, OIDC)
- Design RBAC systems with fine-grained permissions
- Ensure data privacy and regulatory compliance
- Build audit logging for governance requirements
- Implement secrets management securely
- Deploy platforms to Kubernetes clusters
- Implement CI/CD pipelines for platform services
- Manage infrastructure as code (Terraform/Pulumi)
- Perform capacity planning and cost optimization
- Handle incident response and on-call rotations
- Make architectural decisions with proper trade-off analysis
- Mentor junior engineers on platform best practices
- Write technical specifications and design documents
- Collaborate with stakeholders (data scientists, SREs, product)
- Drive platform adoption across the organization
An ML Platform Engineer specializes in building internal platforms that enable data scientists and ML engineers to train, deploy, and operate machine learning models efficiently at scale. This role sits at the intersection of:
- Infrastructure Engineering: Kubernetes, cloud, distributed systems
- Software Engineering: API design, SDK development, testing
- Data Engineering: Data pipelines, feature engineering, storage
- ML Engineering: Understanding ML workflows and requirements
- Platform Engineering: Developer experience, self-service tooling
- Build Self-Service Platforms: Create tools that reduce manual work for ML teams
- Maintain ML Infrastructure: Ensure platform reliability and performance
- Design APIs and SDKs: Provide excellent developer interfaces
- Implement Governance: Enforce security, compliance, and best practices
- Drive Platform Adoption: Work with users to improve experience
- Scale ML Operations: Support growth from 10 to 1000+ models in production
Junior AI Infrastructure Engineer (Level 0)
↓
AI Infrastructure Engineer (Level 1)
↓
Senior AI Infrastructure Engineer (Level 2)
↓
┌─────────────────┴─────────────────┐
│ │
ML Platform Engineer (Level 2.5A) MLOps Engineer (Level 2.5B)
│ │
└─────────────────┬─────────────────┘
↓
AI Infrastructure Architect (Level 3)
↓
Senior AI Infrastructure Architect (Level 4)
↓
Principal AI Infrastructure Architect (Level 5A)
This curriculum positions you at: ML Platform Engineer (Level 2.5A)
- ML Platform Engineer: $150,000 - $220,000
- Senior ML Platform Engineer: $180,000 - $280,000
- Staff ML Platform Engineer: $220,000 - $350,000
- Principal ML Platform Engineer: $280,000 - $450,000+
Compensation varies by location, company size, and experience; top-tier tech companies pay at the upper end of these ranges.
- Big Tech: Google, Meta, Amazon, Microsoft, Apple
- ML-First Companies: OpenAI, Anthropic, Cohere, Hugging Face
- Tech Unicorns: Uber, Airbnb, Netflix, Spotify, LinkedIn
- Enterprises: Banks, healthcare, retail, manufacturing with ML initiatives
- Startups: ML infrastructure companies, MLOps platforms
Before starting this curriculum, you should have:
- Python 3.11+: Advanced proficiency
  - Object-oriented programming
  - Async/await patterns
  - Type hints and mypy
  - Context managers and decorators
  - Testing with pytest
- Shell Scripting: Bash proficiency for automation
Assessment: Can you build a REST API with FastAPI from scratch?
- Core Concepts: Pods, Services, Deployments, ConfigMaps, Secrets
- Advanced Topics: StatefulSets, DaemonSets, Custom Resource Definitions
- Operations: kubectl, Helm, debugging failed pods
- Networking: Service mesh basics, ingress controllers
- Storage: PersistentVolumes, StorageClasses
Assessment: Can you deploy a multi-tier application to Kubernetes?
- SQL: PostgreSQL or MySQL
  - Complex queries with joins
  - Index optimization
  - Transaction management
- NoSQL: MongoDB or Redis
  - Document modeling
  - Query optimization
  - Caching strategies
Assessment: Can you design a database schema for a multi-tenant application?
- REST: HTTP methods, status codes, versioning
- API Design: Resource modeling, pagination, filtering
- Authentication: OAuth2, JWT, API keys
- Documentation: OpenAPI/Swagger
Assessment: Can you design a RESTful API for a CRUD application?
Choose at least one:
- AWS: EC2, S3, RDS, EKS, IAM
- GCP: Compute Engine, GCS, Cloud SQL, GKE, IAM
- Azure: VMs, Blob Storage, SQL Database, AKS, RBAC
Assessment: Can you provision infrastructure using cloud CLI or console?
- Git: Branching, merging, rebasing
- GitHub Actions or GitLab CI: Pipeline definition, jobs
- Docker: Building images, multi-stage builds
- Testing: Unit tests, integration tests, e2e tests
Assessment: Can you create a CI/CD pipeline that builds and deploys an app?
- Terraform or Pulumi: Basic resource provisioning
- Configuration Management: Understanding of declarative configs
Assessment: Can you define infrastructure for a simple web app?
- ML Workflows: Training, validation, testing, deployment
- Frameworks: Familiarity with PyTorch or TensorFlow
- Model Serving: Basic understanding of inference
- Data Processing: Feature engineering concepts
Assessment: Can you explain the ML lifecycle from data to production?
If you're missing any prerequisites, complete these first:
- Python: Python for DevOps
- Kubernetes: Kubernetes Up & Running
- Databases: Designing Data-Intensive Applications
- API Design: REST API Design Rulebook
- Cloud: Platform-specific training (AWS, GCP, Azure)
- ML Basics: Introduction to Machine Learning with Python
Test your readiness: assessments/quizzes/prerequisite-quiz.py
Passing Score: 70% (if you score lower, review prerequisites)
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1: FOUNDATIONS │
│ (Weeks 1-3, ~50 hours) │
│ │
│ Module 01: Platform Fundamentals │
│ Module 02: API Design for ML Platforms │
│ Module 03: Multi-Tenancy & Resource Management │
│ │
│ Goal: Understand platform engineering principles │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 2: CORE COMPETENCIES │
│ (Weeks 4-9, ~100 hours) │
│ │
│ Module 04: Feature Store Architecture │
│ Module 05: Workflow Orchestration │
│ Module 06: Model Management & Registry │
│ Module 07: Developer Experience & Tooling │
│ Module 08: Observability & Monitoring │
│ Module 09: Security & Governance │
│ │
│ Goal: Master ML platform components │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 3: HANDS-ON PROJECTS │
│ (Weeks 10-29, ~600 hours) │
│ │
│ Project 01: Self-Service ML Platform Core (120h) │
│ Project 02: Enterprise Feature Store (120h) │
│ Project 03: ML Workflow Orchestration (120h) │
│ Project 04: Model Registry & Management (120h) │
│ Project 05: Developer Portal & SDK (120h) │
│ │
│ Goal: Build production-quality platform components │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 4: ASSESSMENT & CERTIFICATION │
│ (Weeks 30-31, ~40 hours) │
│ │
│ Capstone Project: Complete ML Platform │
│ Portfolio Review │
│ Final Practical Exam │
│ │
│ Goal: Validate competency and build portfolio │
└─────────────────────────────────────────────────────────────────┘
- Lecture Notes: Conceptual foundations (~20% of time)
- Guided Exercises: Step-by-step practice (~20% of time)
- Hands-On Projects: Build real systems (~50% of time)
- Assessments: Quizzes and exams (~10% of time)
Monday-Wednesday (2-3 hours/day):
- Study module lecture notes
- Complete guided exercises
- Watch supplementary videos
Thursday-Sunday (3-5 hours/day):
- Work on project implementation
- Debug and refine code
- Write documentation
- Peer code review
Sunday Evening (1 hour):
- Complete module quiz
- Update progress tracker
- Plan next week's work
Duration: 8 hours | Week: 1
- Understand what ML platform engineering entails
- Learn platform thinking and abstraction design
- Study multi-tenancy architecture patterns
- Explore API-first platform development
- Analyze production ML platform examples
- Introduction to ML Platform Engineering (1.5 hours)
  - What is a platform vs infrastructure?
  - ML platform engineer responsibilities
  - Platform value proposition
  - Case studies: Uber Michelangelo, Netflix Metaflow
- Platform Thinking (2 hours)
  - Abstraction design principles
  - Self-service vs full-service
  - Developer experience (DX) fundamentals
  - Platform adoption strategies
- Multi-Tenancy Patterns (2 hours)
  - Namespace isolation in Kubernetes
  - Resource quota management
  - Security boundaries
  - Cost allocation models
- API-First Development (1.5 hours)
  - API design principles
  - Versioning strategies
  - Backward compatibility
  - Documentation requirements
- Platform Architecture Patterns (1 hour)
  - Microservices for platforms
  - Event-driven architectures
  - Plugin systems
  - Extension points
- Exercise 01: Design API for a simple resource provisioning system
- Exercise 02: Implement namespace isolation in Kubernetes
- Exercise 03: Create resource quota management for teams
- Exercise 04: Build a simple plugin system
- Quiz: 15 questions on platform concepts
- Practical: Design API for ML resource provisioning
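Module 01's exercises include building a simple plugin system. One minimal way to sketch an extension point is a registry that maps plugin names to factories; all names below (`PluginRegistry`, `S3ArtifactStore`) are illustrative, not part of any specific platform:

```python
from typing import Callable, Dict

class PluginRegistry:
    """Maps plugin names to factories; platform extension points look them up at runtime."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable] = {}

    def register(self, name: str) -> Callable:
        """Decorator that registers a plugin factory under `name`."""
        def decorator(factory: Callable) -> Callable:
            if name in self._plugins:
                raise ValueError(f"plugin {name!r} already registered")
            self._plugins[name] = factory
            return factory
        return decorator

    def create(self, name: str, **kwargs):
        """Instantiate a registered plugin, failing loudly on unknown names."""
        if name not in self._plugins:
            raise LookupError(f"unknown plugin {name!r}; available: {sorted(self._plugins)}")
        return self._plugins[name](**kwargs)

registry = PluginRegistry()

@registry.register("s3")
class S3ArtifactStore:
    """Hypothetical storage backend registered as a plugin."""
    def __init__(self, bucket: str = "ml-artifacts"):
        self.bucket = bucket

store = registry.create("s3", bucket="team-a-artifacts")
print(store.bucket)  # team-a-artifacts
```

The same shape scales from artifact stores to executors and schedulers: the platform core only knows the registry interface, and teams add backends without touching core code.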
Duration: 10 hours | Week: 2
- Design RESTful APIs following best practices
- Implement gRPC services for performance-critical operations
- Create comprehensive API documentation
- Build Python SDKs with excellent DX
- Handle API versioning and evolution
- RESTful API Design (3 hours)
  - Resource modeling for ML workloads
  - HTTP methods and status codes
  - Pagination, filtering, sorting
  - Error handling and validation
  - Rate limiting and throttling
- gRPC for ML Platforms (2 hours)
  - Protocol Buffers definition
  - Service definition best practices
  - Streaming RPCs
  - Error handling in gRPC
  - Performance considerations
- API Versioning (1.5 hours)
  - Versioning strategies (URL, header, content negotiation)
  - Backward compatibility
  - Deprecation policies
  - Migration strategies
- API Documentation (1.5 hours)
  - OpenAPI/Swagger specifications
  - Interactive documentation (Swagger UI, ReDoc)
  - Code examples and tutorials
  - Changelog maintenance
- SDK Development (2 hours)
  - SDK design principles
  - Type hints and IDE support
  - Error handling and retries
  - Testing SDK clients
  - Documentation and examples
- Exercise 01: Design RESTful API for training job management
- Exercise 02: Implement gRPC service for feature serving
- Exercise 03: Create OpenAPI specification
- Exercise 04: Build Python SDK with comprehensive tests
- Quiz: 15 questions on API design
- Practical: Build RESTful and gRPC APIs for model deployment
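The SDK topics in this module include error handling and retries; a common client-side pattern is exponential backoff with jitter. A stdlib-only sketch, where `flaky_submit` is a stand-in for a real API call that fails transiently:

```python
import time
import random
from functools import wraps

def retry(max_attempts: int = 3, base_delay: float = 0.5, retry_on=(ConnectionError,)):
    """Retry a flaky call with exponential backoff and full jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # budget exhausted; surface the error to the caller
                    # full jitter: sleep a random amount in [0, base_delay * 2**(attempt-1)]
                    time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
        return wrapper
    return decorator

calls = {"n": 0}

@retry(max_attempts=4, base_delay=0.01)
def flaky_submit():
    """Hypothetical SDK call: fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "job-123"

print(flaky_submit())  # job-123, after two transparent retries
```

Retrying only a named set of exception types matters: a 400-style validation error should fail fast, while connection resets and 5xx responses are worth retrying.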
Duration: 12 hours | Week: 3
- Implement secure multi-tenancy in Kubernetes
- Design resource quota and limit systems
- Build RBAC with fine-grained permissions
- Create cost allocation and chargeback systems
- Implement priority-based scheduling
- Multi-Tenancy in Kubernetes (3 hours)
  - Namespace design patterns
  - Network policies for isolation
  - Pod Security Standards
  - Resource quotas and limit ranges
  - Cross-tenant security
- Resource Management (2.5 hours)
  - CPU and memory quotas
  - GPU resource allocation
  - Storage quotas
  - Fair-share scheduling
  - Preemption policies
- Authentication & Authorization (3 hours)
  - Authentication methods (SSO, SAML, OIDC)
  - RBAC design for ML platforms
  - Custom roles and permissions
  - Service account management
  - API authentication (JWT, API keys)
- Cost Allocation (2 hours)
  - Resource usage tracking
  - Cost attribution models
  - Chargeback vs showback
  - Budget alerts and enforcement
  - Cost optimization strategies
- Priority Scheduling (1.5 hours)
  - Priority classes in Kubernetes
  - Queue management
  - Preemption policies
  - Fair queuing
  - SLA enforcement
- Exercise 01: Implement namespace isolation with network policies
- Exercise 02: Create resource quota system for teams
- Exercise 03: Build RBAC with custom roles
- Exercise 04: Implement cost tracking and allocation
- Quiz: 15 questions on multi-tenancy and resource management
- Practical: Build multi-tenant platform with quotas and RBAC
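A quota system like the one in Exercise 02 boils down to admission checks against per-team limits. A minimal in-memory sketch (field names are illustrative; in Kubernetes this role is played by ResourceQuota objects):

```python
from dataclasses import dataclass

@dataclass
class TeamQuota:
    """Tracks per-team resource limits and current usage."""
    cpu_limit: float          # cores
    memory_limit_gib: float
    gpu_limit: int
    cpu_used: float = 0.0
    memory_used_gib: float = 0.0
    gpu_used: int = 0

    def try_allocate(self, cpu: float, memory_gib: float, gpus: int = 0) -> bool:
        """Admit the request only if it fits within every remaining limit."""
        if (self.cpu_used + cpu > self.cpu_limit
                or self.memory_used_gib + memory_gib > self.memory_limit_gib
                or self.gpu_used + gpus > self.gpu_limit):
            return False
        self.cpu_used += cpu
        self.memory_used_gib += memory_gib
        self.gpu_used += gpus
        return True

team_a = TeamQuota(cpu_limit=32, memory_limit_gib=128, gpu_limit=4)
assert team_a.try_allocate(cpu=8, memory_gib=32, gpus=2)      # fits
assert not team_a.try_allocate(cpu=4, memory_gib=16, gpus=3)  # would exceed the GPU limit
```

The important property is that admission is all-or-nothing: a request that exceeds any single dimension is rejected before any counter is mutated.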
Duration: 12 hours | Week: 4
- Understand feature store concepts and use cases
- Implement online and offline feature stores
- Design point-in-time correct feature retrieval
- Build feature versioning and lineage tracking
- Create real-time feature serving pipelines
- Feature Store Fundamentals (2.5 hours)
  - What is a feature store?
  - Online vs offline stores
  - Feature consistency problem
  - Training-serving skew
  - Case studies: Uber Palette, Airbnb Zipline
- Online Feature Store (2.5 hours)
  - Low-latency requirements (<10ms)
  - Redis architecture for features
  - Feature materialization
  - Cache invalidation strategies
  - Multi-key batch retrieval
- Offline Feature Store (2 hours)
  - Batch feature retrieval
  - Point-in-time correct joins
  - Parquet/Avro storage
  - Partition strategies
  - Historical feature serving
- Feature Engineering Pipeline (2.5 hours)
  - Feature transformation DSL
  - Windowed aggregations
  - Feature validation
  - Backfilling features
  - Feature monitoring
- Feature Registry (2.5 hours)
  - Feature definition and registration
  - Versioning strategies
  - Lineage and provenance
  - Feature discovery
  - Metadata management
- Exercise 01: Build online feature store with Redis
- Exercise 02: Implement offline feature store with S3/Parquet
- Exercise 03: Create point-in-time correct retrieval
- Exercise 04: Build feature transformation pipeline
- Quiz: 15 questions on feature stores
- Practical: Build feature store with online/offline serving
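Point-in-time correctness (covered above and in Exercise 03) means a training example timestamped at t may only see feature values written at or before t, never future values that would leak the label. A minimal lookup sketch over an in-memory offline store (row layout is illustrative):

```python
from bisect import bisect_right

def point_in_time_lookup(feature_rows, entity_id, as_of):
    """Return the latest feature value for `entity_id` observed at or before `as_of`.

    feature_rows: list of (entity_id, timestamp, value), pre-sorted by timestamp.
    Serving only past values prevents label leakage and training-serving skew.
    """
    timestamps = [ts for eid, ts, _ in feature_rows if eid == entity_id]
    values = [v for eid, _, v in feature_rows if eid == entity_id]
    idx = bisect_right(timestamps, as_of)  # first index strictly after as_of
    return values[idx - 1] if idx else None

rows = [
    ("user-1", 100, {"txn_count_7d": 3}),
    ("user-1", 200, {"txn_count_7d": 5}),
    ("user-1", 300, {"txn_count_7d": 9}),
]
print(point_in_time_lookup(rows, "user-1", as_of=250))  # {'txn_count_7d': 5}
```

A production offline store does the same thing as a point-in-time join across partitioned Parquet data; the invariant is identical, only the execution engine changes.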
Duration: 12 hours | Week: 5
- Design DAG-based workflow systems
- Implement task dependency resolution
- Build distributed task execution on Kubernetes
- Create retry and error handling mechanisms
- Monitor workflow performance
- Workflow Orchestration Fundamentals (2 hours)
  - DAG concepts and design
  - Task operators and executors
  - Workflows vs pipelines
  - Orchestration patterns
  - Case studies: Airflow, Kubeflow, Metaflow
- DAG Definition and Management (2.5 hours)
  - Python SDK for workflows
  - Parameterized workflows
  - Dynamic DAG generation
  - Workflow templates
  - Versioning workflows
- Task Execution (2.5 hours)
  - Distributed execution on Kubernetes
  - Task queue management
  - Resource allocation per task
  - Parallel vs sequential execution
  - Executor patterns (Kubernetes, Celery)
- Scheduling and Triggers (2 hours)
  - Cron-based scheduling
  - Event-driven triggers
  - Backfilling workflows
  - External dependencies
  - Schedule management
- Error Handling and Monitoring (3 hours)
  - Retry policies with backoff
  - Dead letter queues
  - Alerting and notifications
  - Workflow debugging
  - Performance monitoring
- Exercise 01: Build DAG definition SDK
- Exercise 02: Implement Kubernetes-based executor
- Exercise 03: Create scheduling system
- Exercise 04: Build retry and error handling
- Quiz: 15 questions on workflow orchestration
- Practical: Build workflow orchestrator with DAG execution
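Dependency resolution for DAG execution can be prototyped directly on the standard library's `graphlib`: each `get_ready()` batch contains exactly the tasks whose upstreams have completed, so a real executor could run each batch in parallel. Task names below are made up for illustration:

```python
from graphlib import TopologicalSorter

# task -> set of upstream dependencies, as a workflow SDK might record them
dag = {
    "ingest": set(),
    "validate": {"ingest"},
    "features_a": {"validate"},
    "features_b": {"validate"},
    "train": {"features_a", "features_b"},
    "evaluate": {"train"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()  # raises CycleError here if the DAG has a cycle

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # all tasks whose dependencies are done
    batches.append(ready)               # a real executor would run each batch in parallel
    sorter.done(*ready)

print(batches)
# [['ingest'], ['validate'], ['features_a', 'features_b'], ['train'], ['evaluate']]
```

Note that `features_a` and `features_b` land in the same batch: the scheduler discovers parallelism from the dependency structure alone, which is exactly what the Kubernetes executor in Project 03 exploits.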
Duration: 10 hours | Week: 6
- Build model registry with versioning
- Implement model lifecycle management
- Track model lineage and provenance
- Create model deployment workflows
- Monitor model performance
- Model Registry Fundamentals (2 hours)
  - Model versioning strategies
  - Artifact storage (models, datasets, configs)
  - Metadata management
  - Model discovery
  - Case studies: MLflow, Seldon
- Model Lifecycle Management (2.5 hours)
  - Lifecycle stages (staging, production, archived)
  - Model promotion workflows
  - Approval gates
  - Model deprecation
  - Version compatibility
- Model Lineage Tracking (2 hours)
  - Training data lineage
  - Code versioning integration
  - Hyperparameter tracking
  - Experiment-to-production tracing
  - Reproducibility
- Model Deployment (2 hours)
  - Deployment strategies (blue-green, canary)
  - Model serving integration
  - A/B testing support
  - Traffic routing
  - Rollback mechanisms
- Model Monitoring (1.5 hours)
  - Performance metrics
  - Prediction logging
  - Model drift detection
  - Alerting on degradation
  - Retraining triggers
- Exercise 01: Build model registry with versioning
- Exercise 02: Implement lifecycle management
- Exercise 03: Create lineage tracking system
- Exercise 04: Build deployment workflow
- Quiz: 12 questions on model management
- Practical: Build model registry with deployment workflows
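Lifecycle stages and promotion gates (Exercise 02) reduce to a small state machine over model versions. A sketch with hypothetical stage names modeled on the staging/production/archived convention used by registries like MLflow:

```python
from dataclasses import dataclass

# legal stage transitions; anything else must be rejected or need an explicit override
ALLOWED = {
    "none": {"staging", "archived"},
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}

@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "none"

    def promote(self, target: str) -> None:
        """Move the version to `target`, enforcing the transition table above."""
        if target not in ALLOWED[self.stage]:
            raise ValueError(f"cannot move {self.name} v{self.version} "
                             f"from {self.stage!r} to {target!r}")
        self.stage = target

mv = ModelVersion("fraud-detector", version=3)
mv.promote("staging")
mv.promote("production")
print(mv.stage)  # production
```

Approval gates slot in naturally as extra checks inside `promote` (e.g. require a sign-off record before the staging-to-production edge is taken).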
Duration: 10 hours | Week: 7
- Design intuitive APIs for ML practitioners
- Build comprehensive Python SDKs
- Create CLI tools for platform operations
- Develop interactive documentation
- Measure and improve platform adoption
- Developer Experience Principles (2 hours)
  - DX vs UX
  - Reducing cognitive load
  - Convention over configuration
  - Sensible defaults
  - Progressive disclosure
- SDK Design (2.5 hours)
  - Pythonic API design
  - Type hints and IDE support
  - Error messages and debugging
  - Authentication handling
  - Async support
- CLI Tool Development (2 hours)
  - CLI design patterns (Click, Typer)
  - Command structure and arguments
  - Configuration management
  - Output formatting
  - Shell completions
- Documentation and Tutorials (2 hours)
  - Interactive documentation
  - Code examples and snippets
  - Getting started guides
  - Video tutorials
  - API playground
- Platform Adoption (1.5 hours)
  - Onboarding flows
  - Usage analytics
  - Feedback collection
  - Community building
  - Success metrics
- Exercise 01: Design SDK with excellent DX
- Exercise 02: Build CLI tool with rich output
- Exercise 03: Create interactive documentation
- Exercise 04: Implement usage analytics
- Quiz: 12 questions on DX and tooling
- Practical: Build SDK and CLI for platform
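The curriculum uses Click or Typer for CLI work, but the command structure itself can be sketched with only the standard library's argparse. The `mlctl` name and its subcommands below are made up for illustration:

```python
import argparse
import json

def build_parser() -> argparse.ArgumentParser:
    """Skeleton for a `mlctl jobs submit ...` style command tree."""
    parser = argparse.ArgumentParser(prog="mlctl", description="ML platform CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    jobs = sub.add_parser("jobs", help="manage training jobs")
    jobs_sub = jobs.add_subparsers(dest="action", required=True)

    submit = jobs_sub.add_parser("submit", help="submit a training job")
    submit.add_argument("--image", required=True, help="container image to run")
    submit.add_argument("--gpus", type=int, default=0, help="GPUs to request")
    submit.add_argument("-o", "--output", choices=("text", "json"), default="text")
    return parser

args = build_parser().parse_args(
    ["jobs", "submit", "--image", "trainer:1.2", "--gpus", "2", "-o", "json"])
print(json.dumps({"image": args.image, "gpus": args.gpus}))
# {"image": "trainer:1.2", "gpus": 2}
```

The DX lessons carry over regardless of framework: noun-verb command structure (`jobs submit`), machine-readable output behind a flag, and sensible defaults for everything optional.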
Duration: 12 hours | Week: 8
- Instrument services with metrics, logs, traces
- Build monitoring dashboards for platform health
- Create effective alerting rules
- Implement distributed tracing
- Define SLIs and SLOs for reliability
- Observability Fundamentals (2 hours)
  - Three pillars: metrics, logs, traces
  - Observability vs monitoring
  - Instrumentation strategies
  - Cardinality considerations
  - Case studies: Datadog, New Relic
- Metrics Collection (2.5 hours)
  - Prometheus architecture
  - Counter, gauge, histogram, summary
  - Service-level metrics
  - Platform metrics
  - Custom metrics
- Logging (2 hours)
  - Structured logging
  - Log aggregation (ELK, Loki)
  - Log levels and context
  - Correlation IDs
  - Log retention policies
- Distributed Tracing (2.5 hours)
  - OpenTelemetry fundamentals
  - Trace context propagation
  - Span design
  - Jaeger or Zipkin setup
  - Trace analysis
- Alerting and SLOs (3 hours)
  - Alert design principles
  - Alert fatigue prevention
  - SLI definition
  - SLO targets and error budgets
  - On-call runbooks
- Exercise 01: Instrument service with Prometheus
- Exercise 02: Implement structured logging
- Exercise 03: Set up distributed tracing
- Exercise 04: Define SLIs and SLOs
- Quiz: 15 questions on observability
- Practical: Build monitoring system with dashboards and alerts
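SLO targets and error budgets come down to simple arithmetic: a 99.9% availability SLO over one million requests allows 1,000 failures per window, and the budget is whatever fraction of that allowance remains unspent. A sketch:

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the window's error budget still unspent (negative once blown).

    slo: availability target, e.g. 0.999 allows 0.1% of requests to fail.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return 1.0 - failed_requests / allowed_failures

# 99.9% SLO over 1M requests allows 1,000 failures; 250 consumed leaves 75%
remaining = error_budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%}")  # 75%
```

Alerting on the *burn rate* of this budget (how fast the fraction is shrinking) is what keeps on-call pages actionable instead of noisy.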
Duration: 10 hours | Week: 9
- Implement authentication and authorization
- Design RBAC with fine-grained permissions
- Ensure regulatory compliance
- Build audit logging systems
- Manage secrets securely
- Authentication (2.5 hours)
  - SSO integration (SAML, OIDC)
  - JWT-based authentication
  - API key management
  - Service-to-service auth (mTLS)
  - Multi-factor authentication
- Authorization (2.5 hours)
  - RBAC design patterns
  - Attribute-based access control (ABAC)
  - Policy engines (OPA)
  - Permission inheritance
  - Principle of least privilege
- Data Privacy and Compliance (2 hours)
  - GDPR compliance
  - Data encryption (at rest, in transit)
  - PII handling
  - Data retention policies
  - Right to deletion
- Audit Logging (1.5 hours)
  - Audit event design
  - Immutable logs
  - Log retention and archival
  - Compliance reporting
  - Forensic analysis
- Secrets Management (1.5 hours)
  - HashiCorp Vault
  - Kubernetes secrets
  - Secret rotation
  - Encryption key management
  - Certificate management
- Exercise 01: Implement SSO with OIDC
- Exercise 02: Build RBAC system
- Exercise 03: Create audit logging
- Exercise 04: Set up secrets management
- Quiz: 15 questions on security and governance
- Practical: Build secure platform with RBAC and audit logging
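An RBAC check like the one in Exercise 02 is, at its core, a set lookup from roles to granted permissions. A sketch with made-up role and permission names (a real system would load the mapping from a policy store rather than hard-code it):

```python
# role -> granted permissions, using resource:action strings and a wildcard for admins
ROLE_PERMISSIONS = {
    "viewer": {"models:read", "features:read"},
    "data-scientist": {"models:read", "models:write", "features:read", "jobs:submit"},
    "platform-admin": {"*"},
}

def is_allowed(roles, permission: str) -> bool:
    """Grant if any of the user's roles carries the permission (or the wildcard)."""
    for role in roles:
        granted = ROLE_PERMISSIONS.get(role, set())
        if "*" in granted or permission in granted:
            return True
    return False

assert is_allowed({"data-scientist"}, "jobs:submit")
assert not is_allowed({"viewer"}, "models:write")
assert is_allowed({"viewer", "platform-admin"}, "models:delete")
```

Fine-grained systems layer scoping on top (per-project or per-namespace grants) and log every decision to the audit trail, but the union-over-roles evaluation stays the same.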
Duration: 4 weeks (120 hours) | Weeks: 10-13
Build a production-grade self-service ML platform that enables data scientists to provision compute resources, submit training jobs, and deploy models without direct infrastructure access.
- User & Team Management
  - User registration and profiles
  - Team creation and membership
  - SSO integration (SAML/OIDC)
  - Activity tracking
- Resource Provisioning
  - Jupyter notebook environments
  - GPU/CPU allocation
  - Storage volumes
  - Environment templates
- Training Job Management
  - Distributed training jobs
  - Job scheduling and queueing
  - Monitoring and logging
  - Hyperparameter tuning
- Model Deployment
  - REST/gRPC endpoints
  - Blue-green deployments
  - Autoscaling
  - Version management
- Resource Quotas
  - Per-team quotas
  - Priority scheduling
  - Cost tracking
  - Overage alerts
- Platform APIs
  - RESTful API
  - gRPC API for performance
  - WebSocket for real-time updates
  - Comprehensive documentation
- Backend: Python 3.11+, FastAPI, gRPC
- Database: PostgreSQL, Redis
- Infrastructure: Kubernetes, Helm
- Monitoring: Prometheus, Grafana
- Authentication: OAuth2, JWT
- Multi-tenant platform architecture
- API design and implementation
- Kubernetes operator development
- Resource management and quotas
- Platform observability
- Working platform API
- Kubernetes operator
- Multi-tenant resource management
- Documentation and SDK
- Test suite (>80% coverage)
Duration: 4 weeks (120 hours) | Weeks: 14-17
Build a production-grade feature store with online/offline serving, feature versioning, lineage tracking, and monitoring.
- Feature Registry
  - Feature definition and registration
  - Versioning and lineage
  - Discovery and search
  - Metadata management
- Offline Feature Store
  - Batch retrieval
  - Point-in-time correct joins
  - S3/Parquet storage
  - Backfilling
- Online Feature Store
  - Low-latency serving (<10ms)
  - Redis-based cache
  - Feature materialization
  - Multi-key retrieval
- Feature Transformation
  - Python SDK
  - Aggregations
  - Validation
  - Custom functions
- Data Ingestion
  - Batch ingestion
  - Streaming (Kafka)
  - Schema validation
  - Error handling
- Feature Monitoring
  - Drift detection
  - Quality metrics
  - Freshness monitoring
  - Alerting
- Backend: Python 3.11+, FastAPI
- Storage: Redis, S3, PostgreSQL
- Processing: Apache Spark
- Streaming: Kafka
- Monitoring: Prometheus
- Feature store architecture
- Online/offline store design
- Point-in-time correctness
- Real-time data pipelines
- Data quality monitoring
- Feature registry service
- Online store (Redis)
- Offline store (S3)
- Transformation SDK
- Monitoring dashboards
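Freshness is central to the online store: a value older than its TTL should be treated as a miss and recomputed rather than served stale. An in-memory stand-in for the Redis-backed store (class and method names are illustrative):

```python
import time
from typing import Optional

class OnlineFeatureStore:
    """In-memory stand-in for a Redis online store: features expire after a freshness TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data = {}  # entity_key -> (written_at, features)

    def put(self, entity_key: str, features: dict, now: Optional[float] = None) -> None:
        self._data[entity_key] = (time.time() if now is None else now, features)

    def get(self, entity_key: str, now: Optional[float] = None) -> Optional[dict]:
        """Return features only if fresher than the TTL; stale values count as misses."""
        record = self._data.get(entity_key)
        if record is None:
            return None
        written_at, features = record
        now = time.time() if now is None else now
        return features if now - written_at <= self.ttl else None

store = OnlineFeatureStore(ttl_seconds=300)
store.put("user:42", {"txn_count_7d": 5}, now=1000.0)
print(store.get("user:42", now=1200.0))  # {'txn_count_7d': 5} -- still fresh
print(store.get("user:42", now=1400.0))  # None -- stale, forces recomputation
```

With Redis the same behavior comes from key-level TTLs; injecting `now` as a parameter, as here, keeps the freshness logic unit-testable.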
Duration: 4 weeks (120 hours) | Weeks: 18-21
Build a comprehensive workflow orchestration system for ML pipelines with DAG execution, scheduling, and monitoring.
- Workflow Definition
  - Python SDK
  - Task operators
  - Dependencies and branching
  - Parameterization
  - Templates
- Scheduling
  - Cron-based schedules
  - Event-driven triggers
  - Manual execution
  - Backfilling
- Execution Management
  - Task queue
  - Parallel execution
  - Resource allocation
  - Retry logic
  - Cancellation
- Dependency Management
  - Task dependencies
  - Cross-DAG dependencies
  - External dependencies
  - Versioning
- Monitoring
  - Real-time status
  - Execution history
  - Gantt charts
  - Log aggregation
  - Analytics
- Error Handling
  - Retries with backoff
  - Dead letter queue
  - Alerting
  - Debugging tools
- Backend: Python 3.11+, FastAPI
- Execution: Kubernetes, Celery
- Database: PostgreSQL, Redis
- Monitoring: Prometheus, Grafana
- Frontend: React (optional)
- DAG-based workflow design
- Distributed task execution
- Scheduling algorithms
- Workflow monitoring
- Error handling patterns
- Workflow SDK
- DAG scheduler
- Kubernetes executor
- Monitoring UI
- Integration with platform
Duration: 4 weeks (120 hours) | Weeks: 22-25
Build a centralized model registry for versioning, metadata management, lifecycle tracking, and governance.
- Model Registry
  - Version management
  - Artifact storage
  - Metadata tracking
  - Discovery
- Lifecycle Management
  - Staging/production stages
  - Promotion workflows
  - Approval gates
  - Deprecation
- Lineage Tracking
  - Training data lineage
  - Code versioning
  - Hyperparameters
  - Reproducibility
- Model Deployment
  - Blue-green deployments
  - Canary releases
  - A/B testing
  - Rollback
- Model Monitoring
  - Performance metrics
  - Prediction logging
  - Drift detection
  - Alerts
- Backend: Python 3.11+, FastAPI
- Storage: S3, PostgreSQL
- Deployment: Kubernetes
- Monitoring: Prometheus
- ML: MLflow (extended)
- Model versioning
- Lifecycle management
- Lineage tracking
- Deployment strategies
- Model monitoring
- Model registry service
- Lifecycle workflows
- Lineage system
- Deployment engine
- Monitoring dashboards
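Drift detection in the monitoring component can start from something very simple: compare live feature or prediction statistics against a training-time reference. A deliberately crude sketch (production systems use proper statistical tests such as PSI or Kolmogorov-Smirnov; this only illustrates where the monitoring hook goes):

```python
from statistics import mean, stdev

def drift_score(reference, live) -> float:
    """Crude drift signal: shift of the live mean, measured in reference standard deviations."""
    ref_std = stdev(reference)
    if ref_std == 0:
        return float("inf")  # a constant reference makes any change look like drift
    return abs(mean(live) - mean(reference)) / ref_std

reference = [0.1, 0.2, 0.15, 0.25, 0.2]     # e.g. fraud scores at training time
assert drift_score(reference, [0.18, 0.2, 0.22]) < 1.0   # stable distribution
assert drift_score(reference, [0.7, 0.8, 0.75]) > 3.0    # clearly drifted -> alert
```

Whatever test is used, the registry side of the design stays the same: a score crosses a threshold, an alert fires, and (optionally) a retraining workflow is triggered.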
Duration: 4 weeks (120 hours) | Weeks: 26-29
Build a comprehensive developer portal with documentation, SDK, CLI, and tutorials.
- Python SDK
  - Platform client
  - Type hints
  - Async support
  - Error handling
- CLI Tool
  - Platform operations
  - Configuration management
  - Output formatting
  - Shell completions
- Developer Portal
  - Interactive documentation
  - API playground
  - Tutorials
  - Code examples
- Onboarding
  - Getting started guides
  - Video tutorials
  - Sample projects
  - Templates
- Analytics
  - Usage tracking
  - Adoption metrics
  - Feedback collection
  - Success metrics
- SDK: Python 3.11+, httpx
- CLI: Typer or Click
- Frontend: React, TypeScript
- Docs: Docusaurus or MkDocs
- Analytics: PostHog or Mixpanel
- SDK design
- CLI development
- Documentation best practices
- Developer experience
- Adoption metrics
- Python SDK
- CLI tool
- Developer portal
- Interactive tutorials
- Usage analytics
Format: Multiple choice, short answer, practical coding
Passing Score: 80%
Time: 30-45 minutes per quiz
Weight: 30% of final grade
Evaluation Criteria:
- Functional Completeness (40%)
  - All required features implemented
  - Features work as specified
  - Edge cases handled
- Code Quality (25%)
  - Clean, readable code
  - Proper error handling
  - Type hints used
  - No code smells
- Testing (15%)
  - Unit tests (>80% coverage)
  - Integration tests
  - Tests pass consistently
- Documentation (10%)
  - Comprehensive README
  - API documented
  - Architecture explained
- Best Practices (10%)
  - Security considerations
  - Performance optimization
  - Scalability design
Weight: 60% of final grade
Challenge: Design and implement a complete ML platform from scratch
Duration: 2 weeks
Weight: 10% of final grade
- Week 1-2: Modules 01-03 (Foundations)
- Week 3-4: Modules 04-06 (Core competencies, part 1)
- Week 5-6: Modules 07-09 (Core competencies, part 2)
- Week 7-10: Project 01 (Platform Core)
- Week 11-14: Project 02 (Feature Store)
- Week 15-18: Project 03 (Workflow Orchestration)
- Week 19-22: Project 04 (Model Registry)
- Week 23-26: Project 05 (Developer Portal)
- Week 27-28: Capstone Project
- Month 1: Modules 01-03
- Month 2: Modules 04-06
- Month 3: Modules 07-09 + start Project 01
- Month 4: Complete Project 01 + start Project 02
- Month 5: Complete Project 02 + start Project 03
- Month 6: Complete Project 03 + start Project 04
- Month 7: Complete Project 04 + start Project 05
- Month 8: Complete Project 05 + Capstone
By completing this curriculum, you will achieve:
| Skill Area | Proficiency Level |
|---|---|
| Python Programming | Advanced |
| Kubernetes | Advanced |
| API Design (REST, gRPC) | Expert |
| Multi-Tenancy | Advanced |
| Feature Stores | Expert |
| Workflow Orchestration | Expert |
| Model Management | Advanced |
| Developer Experience | Advanced |
| Observability | Advanced |
| Security & Governance | Advanced |
| System Design | Advanced |
| Technical Leadership | Intermediate |
- Language: Python 3.11+
- Web Framework: FastAPI, gRPC
- Databases: PostgreSQL, Redis, MongoDB
- Orchestration: Kubernetes, Helm
- Storage: S3 (AWS/MinIO), Parquet
- Streaming: Apache Kafka
- Processing: Apache Spark (optional)
- Monitoring: Prometheus, Grafana
- Logging: ELK or Loki
- Tracing: Jaeger or Zipkin
- IaC: Terraform, Pulumi
- CI/CD: GitHub Actions, GitLab CI
- Service Mesh: Istio, Linkerd
- ML Frameworks: PyTorch, TensorFlow
- Feature Store: Feast
- Workflow: Airflow, Kubeflow
- Model Registry: MLflow
Requirements:
- Docker Desktop
- Kubernetes (minikube or kind)
- Python 3.11+
- kubectl, Helm
- PostgreSQL
- Redis
Setup Steps: See resources/tools.md
Recommended:
- AWS: EKS, RDS, ElastiCache, S3
- GCP: GKE, Cloud SQL, Memorystore, GCS
- Azure: AKS, Azure Database, Azure Cache, Blob Storage
Estimated Cost: $50-100/month for development
- "Designing Machine Learning Systems" by Chip Huyen
- "Building Machine Learning Powered Applications" by Emmanuel Ameisen
- "Kubernetes Patterns" by Bilgin Ibryam & Roland Huß
- Kubernetes for Developers (Linux Foundation)
- Machine Learning Engineering for Production (Coursera)
- Advanced REST APIs (Udemy)
- Feast Documentation
- Kubeflow Documentation
- MLflow Documentation
Upon completing this curriculum, you'll be qualified for:
- ML Platform Engineer roles at tech companies
- Senior AI Infrastructure Engineer positions
- Transition to MLOps Engineer or ML Architect
- Consulting opportunities in ML infrastructure
Next Steps:
- Build portfolio showcasing projects
- Contribute to open-source ML infrastructure
- Write blog posts and give talks
- Apply to ML platform roles
Last Updated: 2025-10-18 | Version: 1.0.0