Skip to content

Latest commit

 

History

History
1562 lines (1166 loc) · 44.6 KB

File metadata and controls

1562 lines (1166 loc) · 44.6 KB

ML Platform Engineer - Complete Curriculum Guide

Table of Contents


Curriculum Overview

What You'll Learn

This comprehensive curriculum transforms infrastructure engineers into specialized ML Platform Engineers who design, build, and maintain self-service platforms that enable data scientists and ML engineers to be productive at scale.

Core Focus Areas:

  1. Platform Architecture: Design multi-tenant, scalable ML platforms
  2. API Development: Build intuitive APIs and SDKs for ML workloads
  3. Feature Stores: Implement enterprise-grade feature management systems
  4. Workflow Orchestration: Create DAG-based ML pipeline systems
  5. Model Management: Build model registries with versioning and governance
  6. Developer Experience: Design exceptional tools and documentation
  7. Observability: Monitor and optimize platform performance
  8. Security & Governance: Implement authentication, authorization, and compliance

Learning Approach

This curriculum emphasizes:

  • Hands-On Projects: 80% practical implementation, 20% theory
  • Progressive Complexity: Each project builds on previous knowledge
  • Real-World Scenarios: Based on production platform patterns
  • Best Practices: Industry-standard tools and architectural patterns
  • Portfolio Development: Build demonstrable expertise

Time Commitment

Part-Time (15-20 hours/week):

  • Duration: 6-8 months
  • Module study: 2-3 hours/week
  • Project work: 12-17 hours/week
  • Assessment: 1-2 hours/week

Full-Time (40 hours/week):

  • Duration: 3-4 months
  • Module study: 5-8 hours/week
  • Project work: 30-35 hours/week
  • Assessment: 2-3 hours/week

Total Estimated Hours: 600-700 hours


Learning Objectives

By completing this curriculum, you will be able to:

1. Platform Architecture & Design

  • Design multi-tenant ML platforms with proper isolation and resource management
  • Create API-first architectures that abstract infrastructure complexity
  • Implement scalable platform services supporting hundreds of users
  • Build extensible plugin systems for platform customization
  • Design for high availability with proper failover and disaster recovery

2. API & SDK Development

  • Design RESTful APIs following OpenAPI specifications
  • Implement gRPC services for high-performance operations
  • Build Python SDKs with excellent developer experience
  • Create CLI tools for platform operations
  • Write comprehensive API documentation with examples

3. Data Platform Engineering

  • Implement feature stores with online and offline serving
  • Build data pipelines for feature engineering at scale
  • Design point-in-time correct data retrieval systems
  • Implement feature versioning and lineage tracking
  • Monitor data quality and detect feature drift

4. Workflow Orchestration

  • Design DAG-based workflow systems for ML pipelines
  • Implement task scheduling with dependency resolution
  • Build distributed execution engines using Kubernetes
  • Create retry and error handling mechanisms
  • Monitor workflow performance and debug failures

5. Model Lifecycle Management

  • Build model registries with versioning and metadata
  • Implement model deployment workflows and promotion gates
  • Track model lineage from training to production
  • Create A/B testing frameworks for model evaluation
  • Monitor model performance and detect degradation

6. Developer Experience (DX)

  • Design intuitive APIs that ML practitioners love
  • Create interactive documentation and tutorials
  • Build self-service onboarding flows
  • Implement feedback loops for continuous improvement
  • Measure platform adoption and satisfaction

7. Platform Observability

  • Instrument services with metrics, logs, and traces
  • Build monitoring dashboards for platform health
  • Create alerting rules for proactive incident response
  • Implement distributed tracing for debugging
  • Define SLIs and SLOs for platform reliability

8. Security & Governance

  • Implement authentication (SSO, SAML, OIDC)
  • Design RBAC systems with fine-grained permissions
  • Ensure data privacy and regulatory compliance
  • Build audit logging for governance requirements
  • Implement secrets management securely

9. DevOps & Production Operations

  • Deploy platforms to Kubernetes clusters
  • Implement CI/CD pipelines for platform services
  • Manage infrastructure as code (Terraform/Pulumi)
  • Perform capacity planning and cost optimization
  • Handle incident response and on-call rotations

10. Technical Leadership

  • Make architectural decisions with proper trade-off analysis
  • Mentor junior engineers on platform best practices
  • Write technical specifications and design documents
  • Collaborate with stakeholders (data scientists, SREs, product)
  • Drive platform adoption across the organization

Role Definition & Career Context

What is an ML Platform Engineer?

An ML Platform Engineer specializes in building internal platforms that enable data scientists and ML engineers to train, deploy, and operate machine learning models efficiently at scale. This role sits at the intersection of:

  • Infrastructure Engineering: Kubernetes, cloud, distributed systems
  • Software Engineering: API design, SDK development, testing
  • Data Engineering: Data pipelines, feature engineering, storage
  • ML Engineering: Understanding ML workflows and requirements
  • Platform Engineering: Developer experience, self-service tooling

Key Responsibilities

  1. Build Self-Service Platforms: Create tools that reduce manual work for ML teams
  2. Maintain ML Infrastructure: Ensure platform reliability and performance
  3. Design APIs and SDKs: Provide excellent developer interfaces
  4. Implement Governance: Enforce security, compliance, and best practices
  5. Drive Platform Adoption: Work with users to improve experience
  6. Scale ML Operations: Support growth from 10 to 1000+ models in production

Career Progression

Junior AI Infrastructure Engineer (Level 0)
    ↓
AI Infrastructure Engineer (Level 1)
    ↓
Senior AI Infrastructure Engineer (Level 2)
    ↓
┌─────────────────┴─────────────────┐
│                                   │
ML Platform Engineer (Level 2.5A)   MLOps Engineer (Level 2.5B)
│                                   │
└─────────────────┬─────────────────┘
                  ↓
    AI Infrastructure Architect (Level 3)
                  ↓
    Senior AI Infrastructure Architect (Level 4)
                  ↓
    Principal AI Infrastructure Architect (Level 5A)

This curriculum positions you at: ML Platform Engineer (Level 2.5A)

Typical Salary Ranges (US, 2025)

  • ML Platform Engineer: $150,000 - $220,000
  • Senior ML Platform Engineer: $180,000 - $280,000
  • Staff ML Platform Engineer: $220,000 - $350,000
  • Principal ML Platform Engineer: $280,000 - $450,000+

Varies by location, company size, and experience. Top tech companies (FAANG) pay at upper end.

Companies Hiring ML Platform Engineers

  • Big Tech: Google, Meta, Amazon, Microsoft, Apple
  • ML-First Companies: OpenAI, Anthropic, Cohere, Hugging Face
  • Tech Unicorns: Uber, Airbnb, Netflix, Spotify, LinkedIn
  • Enterprises: Banks, healthcare, retail, manufacturing with ML initiatives
  • Startups: ML infrastructure companies, MLOps platforms

Prerequisites & Preparation

Required Knowledge

Before starting this curriculum, you should have:

1. Programming (Critical)

  • Python 3.11+: Advanced proficiency

    • Object-oriented programming
    • Async/await patterns
    • Type hints and mypy
    • Context managers and decorators
    • Testing with pytest
  • Shell Scripting: Bash proficiency for automation

Assessment: Can you build a REST API with FastAPI from scratch?

2. Kubernetes (Critical)

  • Core Concepts: Pods, Services, Deployments, ConfigMaps, Secrets
  • Advanced Topics: StatefulSets, DaemonSets, Custom Resource Definitions
  • Operations: kubectl, Helm, debugging failed pods
  • Networking: Service mesh basics, ingress controllers
  • Storage: PersistentVolumes, StorageClasses

Assessment: Can you deploy a multi-tier application to Kubernetes?

3. Databases (Important)

  • SQL: PostgreSQL or MySQL

    • Complex queries with joins
    • Index optimization
    • Transaction management
  • NoSQL: MongoDB or Redis

    • Document modeling
    • Query optimization
    • Caching strategies

Assessment: Can you design a database schema for a multi-tenant application?

4. API Development (Important)

  • REST: HTTP methods, status codes, versioning
  • API Design: Resource modeling, pagination, filtering
  • Authentication: OAuth2, JWT, API keys
  • Documentation: OpenAPI/Swagger

Assessment: Can you design a RESTful API for a CRUD application?

5. Cloud Platforms (Important)

Choose at least one:

  • AWS: EC2, S3, RDS, EKS, IAM
  • GCP: Compute Engine, GCS, Cloud SQL, GKE, IAM
  • Azure: VMs, Blob Storage, SQL Database, AKS, RBAC

Assessment: Can you provision infrastructure using cloud CLI or console?

6. CI/CD (Helpful)

  • Git: Branching, merging, rebasing
  • GitHub Actions or GitLab CI: Pipeline definition, jobs
  • Docker: Building images, multi-stage builds
  • Testing: Unit tests, integration tests, e2e tests

Assessment: Can you create a CI/CD pipeline that builds and deploys an app?

7. Infrastructure as Code (Helpful)

  • Terraform or Pulumi: Basic resource provisioning
  • Configuration Management: Understanding of declarative configs

Assessment: Can you define infrastructure for a simple web app?

8. ML Basics (Helpful)

  • ML Workflows: Training, validation, testing, deployment
  • Frameworks: Familiarity with PyTorch or TensorFlow
  • Model Serving: Basic understanding of inference
  • Data Processing: Feature engineering concepts

Assessment: Can you explain the ML lifecycle from data to production?

Recommended Preparation

If you're missing any prerequisites, complete these first:

  1. Python: Python for DevOps
  2. Kubernetes: Kubernetes Up & Running
  3. Databases: Designing Data-Intensive Applications
  4. API Design: REST API Design Rulebook
  5. Cloud: Platform-specific training (AWS, GCP, Azure)
  6. ML Basics: Introduction to Machine Learning with Python

Pre-Assessment Quiz

Test your readiness: assessments/quizzes/prerequisite-quiz.py

Passing Score: 70% (if you score lower, review prerequisites)


Curriculum Structure

Three-Phase Learning Model

┌─────────────────────────────────────────────────────────────────┐
│                    PHASE 1: FOUNDATIONS                         │
│                    (Weeks 1-3, ~50 hours)                       │
│                                                                 │
│  Module 01: Platform Fundamentals                              │
│  Module 02: API Design for ML Platforms                        │
│  Module 03: Multi-Tenancy & Resource Management                │
│                                                                 │
│  Goal: Understand platform engineering principles              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                 PHASE 2: CORE COMPETENCIES                      │
│                    (Weeks 4-9, ~100 hours)                      │
│                                                                 │
│  Module 04: Feature Store Architecture                         │
│  Module 05: Workflow Orchestration                             │
│  Module 06: Model Management & Registry                        │
│  Module 07: Developer Experience & Tooling                     │
│  Module 08: Observability & Monitoring                         │
│  Module 09: Security & Governance                              │
│                                                                 │
│  Goal: Master ML platform components                           │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                PHASE 3: HANDS-ON PROJECTS                       │
│                   (Weeks 10-25, ~500 hours)                     │
│                                                                 │
│  Project 01: Self-Service ML Platform Core (120h)              │
│  Project 02: Enterprise Feature Store (120h)                   │
│  Project 03: ML Workflow Orchestration (120h)                  │
│  Project 04: Model Registry & Management (120h)                │
│  Project 05: Developer Portal & SDK (120h)                     │
│                                                                 │
│  Goal: Build production-quality platform components            │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│              PHASE 4: ASSESSMENT & CERTIFICATION                │
│                    (Weeks 26-27, ~40 hours)                     │
│                                                                 │
│  Capstone Project: Complete ML Platform                        │
│  Portfolio Review                                              │
│  Final Practical Exam                                          │
│                                                                 │
│  Goal: Validate competency and build portfolio                 │
└─────────────────────────────────────────────────────────────────┘

Learning Methods

  1. Lecture Notes: Conceptual foundations (~20% of time)
  2. Guided Exercises: Step-by-step practice (~20% of time)
  3. Hands-On Projects: Build real systems (~50% of time)
  4. Assessments: Quizzes and exams (~10% of time)

Weekly Structure (Part-Time)

Monday-Wednesday (2-3 hours/day):

  • Study module lecture notes
  • Complete guided exercises
  • Watch supplementary videos

Thursday-Sunday (3-5 hours/day):

  • Work on project implementation
  • Debug and refine code
  • Write documentation
  • Peer code review

Sunday Evening (1 hour):

  • Complete module quiz
  • Update progress tracker
  • Plan next week's work

Detailed Module Breakdown

Module 01: Platform Fundamentals

Duration: 8 hours | Week: 1

Learning Objectives

  • Understand what ML platform engineering entails
  • Learn platform thinking and abstraction design
  • Study multi-tenancy architecture patterns
  • Explore API-first platform development
  • Analyze production ML platform examples

Topics Covered

  1. Introduction to ML Platform Engineering (1.5 hours)

    • What is a platform vs infrastructure?
    • ML platform engineer responsibilities
    • Platform value proposition
    • Case studies: Uber Michelangelo, Netflix Metaflow
  2. Platform Thinking (2 hours)

    • Abstraction design principles
    • Self-service vs full-service
    • Developer experience (DX) fundamentals
    • Platform adoption strategies
  3. Multi-Tenancy Patterns (2 hours)

    • Namespace isolation in Kubernetes
    • Resource quota management
    • Security boundaries
    • Cost allocation models
  4. API-First Development (1.5 hours)

    • API design principles
    • Versioning strategies
    • Backward compatibility
    • Documentation requirements
  5. Platform Architecture Patterns (1 hour)

    • Microservices for platforms
    • Event-driven architectures
    • Plugin systems
    • Extension points

Hands-On Exercises

  1. Exercise 01: Design API for a simple resource provisioning system
  2. Exercise 02: Implement namespace isolation in Kubernetes
  3. Exercise 03: Create resource quota management for teams
  4. Exercise 04: Build a simple plugin system

Reading Materials

Assessment

  • Quiz: 15 questions on platform concepts
  • Practical: Design API for ML resource provisioning

Module 02: API Design for ML Platforms

Duration: 10 hours | Week: 2

Learning Objectives

  • Design RESTful APIs following best practices
  • Implement gRPC services for performance-critical operations
  • Create comprehensive API documentation
  • Build Python SDKs with excellent DX
  • Handle API versioning and evolution

Topics Covered

  1. RESTful API Design (3 hours)

    • Resource modeling for ML workloads
    • HTTP methods and status codes
    • Pagination, filtering, sorting
    • Error handling and validation
    • Rate limiting and throttling
  2. gRPC for ML Platforms (2 hours)

    • Protocol Buffers definition
    • Service definition best practices
    • Streaming RPCs
    • Error handling in gRPC
    • Performance considerations
  3. API Versioning (1.5 hours)

    • Versioning strategies (URL, header, content negotiation)
    • Backward compatibility
    • Deprecation policies
    • Migration strategies
  4. API Documentation (1.5 hours)

    • OpenAPI/Swagger specifications
    • Interactive documentation (Swagger UI, ReDoc)
    • Code examples and tutorials
    • Changelog maintenance
  5. SDK Development (2 hours)

    • SDK design principles
    • Type hints and IDE support
    • Error handling and retries
    • Testing SDK clients
    • Documentation and examples

Hands-On Exercises

  1. Exercise 01: Design RESTful API for training job management
  2. Exercise 02: Implement gRPC service for feature serving
  3. Exercise 03: Create OpenAPI specification
  4. Exercise 04: Build Python SDK with comprehensive tests

Reading Materials

Assessment

  • Quiz: 15 questions on API design
  • Practical: Build RESTful and gRPC APIs for model deployment

Module 03: Multi-Tenancy & Resource Management

Duration: 12 hours | Week: 3

Learning Objectives

  • Implement secure multi-tenancy in Kubernetes
  • Design resource quota and limit systems
  • Build RBAC with fine-grained permissions
  • Create cost allocation and chargeback systems
  • Implement priority-based scheduling

Topics Covered

  1. Multi-Tenancy in Kubernetes (3 hours)

    • Namespace design patterns
    • Network policies for isolation
    • Pod security policies
    • Resource quotas and limit ranges
    • Cross-tenant security
  2. Resource Management (2.5 hours)

    • CPU and memory quotas
    • GPU resource allocation
    • Storage quotas
    • Fair-share scheduling
    • Preemption policies
  3. Authentication & Authorization (3 hours)

    • Authentication methods (SSO, SAML, OIDC)
    • RBAC design for ML platforms
    • Custom roles and permissions
    • Service account management
    • API authentication (JWT, API keys)
  4. Cost Allocation (2 hours)

    • Resource usage tracking
    • Cost attribution models
    • Chargeback vs showback
    • Budget alerts and enforcement
    • Cost optimization strategies
  5. Priority Scheduling (1.5 hours)

    • Priority classes in Kubernetes
    • Queue management
    • Preemption policies
    • Fair queuing
    • SLA enforcement

Hands-On Exercises

  1. Exercise 01: Implement namespace isolation with network policies
  2. Exercise 02: Create resource quota system for teams
  3. Exercise 03: Build RBAC with custom roles
  4. Exercise 04: Implement cost tracking and allocation

Reading Materials

Assessment

  • Quiz: 15 questions on multi-tenancy and resource management
  • Practical: Build multi-tenant platform with quotas and RBAC

Module 04: Feature Store Architecture

Duration: 12 hours | Week: 4

Learning Objectives

  • Understand feature store concepts and use cases
  • Implement online and offline feature stores
  • Design point-in-time correct feature retrieval
  • Build feature versioning and lineage tracking
  • Create real-time feature serving pipelines

Topics Covered

  1. Feature Store Fundamentals (2.5 hours)

    • What is a feature store?
    • Online vs offline stores
    • Feature consistency problem
    • Training-serving skew
    • Case studies: Uber Palette, Airbnb Zipline
  2. Online Feature Store (2.5 hours)

    • Low-latency requirements (<10ms)
    • Redis architecture for features
    • Feature materialization
    • Cache invalidation strategies
    • Multi-key batch retrieval
  3. Offline Feature Store (2 hours)

    • Batch feature retrieval
    • Point-in-time correct joins
    • Parquet/Avro storage
    • Partition strategies
    • Historical feature serving
  4. Feature Engineering Pipeline (2.5 hours)

    • Feature transformation DSL
    • Windowed aggregations
    • Feature validation
    • Backfilling features
    • Feature monitoring
  5. Feature Registry (2.5 hours)

    • Feature definition and registration
    • Versioning strategies
    • Lineage and provenance
    • Feature discovery
    • Metadata management

Hands-On Exercises

  1. Exercise 01: Build online feature store with Redis
  2. Exercise 02: Implement offline feature store with S3/Parquet
  3. Exercise 03: Create point-in-time correct retrieval
  4. Exercise 04: Build feature transformation pipeline

Reading Materials

Assessment

  • Quiz: 15 questions on feature stores
  • Practical: Build feature store with online/offline serving

Module 05: Workflow Orchestration

Duration: 12 hours | Week: 5

Learning Objectives

  • Design DAG-based workflow systems
  • Implement task dependency resolution
  • Build distributed task execution on Kubernetes
  • Create retry and error handling mechanisms
  • Monitor workflow performance

Topics Covered

  1. Workflow Orchestration Fundamentals (2 hours)

    • DAG concepts and design
    • Task operators and executors
    • Workflow vs pipelines
    • Orchestration patterns
    • Case studies: Airflow, Kubeflow, Metaflow
  2. DAG Definition and Management (2.5 hours)

    • Python SDK for workflows
    • Parameterized workflows
    • Dynamic DAG generation
    • Workflow templates
    • Versioning workflows
  3. Task Execution (2.5 hours)

    • Distributed execution on Kubernetes
    • Task queue management
    • Resource allocation per task
    • Parallel vs sequential execution
    • Executor patterns (Kubernetes, Celery)
  4. Scheduling and Triggers (2 hours)

    • Cron-based scheduling
    • Event-driven triggers
    • Backfilling workflows
    • External dependencies
    • Schedule management
  5. Error Handling and Monitoring (3 hours)

    • Retry policies with backoff
    • Dead letter queues
    • Alerting and notifications
    • Workflow debugging
    • Performance monitoring

Hands-On Exercises

  1. Exercise 01: Build DAG definition SDK
  2. Exercise 02: Implement Kubernetes-based executor
  3. Exercise 03: Create scheduling system
  4. Exercise 04: Build retry and error handling

Reading Materials

Assessment

  • Quiz: 15 questions on workflow orchestration
  • Practical: Build workflow orchestrator with DAG execution

Module 06: Model Management & Registry

Duration: 10 hours | Week: 6

Learning Objectives

  • Build model registry with versioning
  • Implement model lifecycle management
  • Track model lineage and provenance
  • Create model deployment workflows
  • Monitor model performance

Topics Covered

  1. Model Registry Fundamentals (2 hours)

    • Model versioning strategies
    • Artifact storage (models, datasets, configs)
    • Metadata management
    • Model discovery
    • Case studies: MLflow, Seldon
  2. Model Lifecycle Management (2.5 hours)

    • Lifecycle stages (staging, production, archived)
    • Model promotion workflows
    • Approval gates
    • Model deprecation
    • Version compatibility
  3. Model Lineage Tracking (2 hours)

    • Training data lineage
    • Code versioning integration
    • Hyperparameter tracking
    • Experiment-to-production tracing
    • Reproducibility
  4. Model Deployment (2 hours)

    • Deployment strategies (blue-green, canary)
    • Model serving integration
    • A/B testing support
    • Traffic routing
    • Rollback mechanisms
  5. Model Monitoring (1.5 hours)

    • Performance metrics
    • Prediction logging
    • Model drift detection
    • Alerting on degradation
    • Retraining triggers

Hands-On Exercises

  1. Exercise 01: Build model registry with versioning
  2. Exercise 02: Implement lifecycle management
  3. Exercise 03: Create lineage tracking system
  4. Exercise 04: Build deployment workflow

Reading Materials

Assessment

  • Quiz: 12 questions on model management
  • Practical: Build model registry with deployment workflows

Module 07: Developer Experience & Tooling

Duration: 10 hours | Week: 7

Learning Objectives

  • Design intuitive APIs for ML practitioners
  • Build comprehensive Python SDKs
  • Create CLI tools for platform operations
  • Develop interactive documentation
  • Measure and improve platform adoption

Topics Covered

  1. Developer Experience Principles (2 hours)

    • DX vs UX
    • Reducing cognitive load
    • Convention over configuration
    • Sensible defaults
    • Progressive disclosure
  2. SDK Design (2.5 hours)

    • Pythonic API design
    • Type hints and IDE support
    • Error messages and debugging
    • Authentication handling
    • Async support
  3. CLI Tool Development (2 hours)

    • CLI design patterns (Click, Typer)
    • Command structure and arguments
    • Configuration management
    • Output formatting
    • Shell completions
  4. Documentation and Tutorials (2 hours)

    • Interactive documentation
    • Code examples and snippets
    • Getting started guides
    • Video tutorials
    • API playground
  5. Platform Adoption (1.5 hours)

    • Onboarding flows
    • Usage analytics
    • Feedback collection
    • Community building
    • Success metrics

Hands-On Exercises

  1. Exercise 01: Design SDK with excellent DX
  2. Exercise 02: Build CLI tool with rich output
  3. Exercise 03: Create interactive documentation
  4. Exercise 04: Implement usage analytics

Reading Materials

Assessment

  • Quiz: 12 questions on DX and tooling
  • Practical: Build SDK and CLI for platform

Module 08: Observability & Monitoring

Duration: 12 hours | Week: 8

Learning Objectives

  • Instrument services with metrics, logs, traces
  • Build monitoring dashboards for platform health
  • Create effective alerting rules
  • Implement distributed tracing
  • Define SLIs and SLOs for reliability

Topics Covered

  1. Observability Fundamentals (2 hours)

    • Three pillars: metrics, logs, traces
    • Observability vs monitoring
    • Instrumentation strategies
    • Cardinality considerations
    • Case studies: Datadog, New Relic
  2. Metrics Collection (2.5 hours)

    • Prometheus architecture
    • Counter, gauge, histogram, summary
    • Service-level metrics
    • Platform metrics
    • Custom metrics
  3. Logging (2 hours)

    • Structured logging
    • Log aggregation (ELK, Loki)
    • Log levels and context
    • Correlation IDs
    • Log retention policies
  4. Distributed Tracing (2.5 hours)

    • OpenTelemetry fundamentals
    • Trace context propagation
    • Span design
    • Jaeger or Zipkin setup
    • Trace analysis
  5. Alerting and SLOs (3 hours)

    • Alert design principles
    • Alert fatigue prevention
    • SLI definition
    • SLO targets and error budgets
    • On-call runbooks

Hands-On Exercises

  1. Exercise 01: Instrument service with Prometheus
  2. Exercise 02: Implement structured logging
  3. Exercise 03: Set up distributed tracing
  4. Exercise 04: Define SLIs and SLOs

Reading Materials

Assessment

  • Quiz: 15 questions on observability
  • Practical: Build monitoring system with dashboards and alerts

Module 09: Security & Governance

Duration: 10 hours | Week: 9

Learning Objectives

  • Implement authentication and authorization
  • Design RBAC with fine-grained permissions
  • Ensure regulatory compliance
  • Build audit logging systems
  • Manage secrets securely

Topics Covered

  1. Authentication (2.5 hours)

    • SSO integration (SAML, OIDC)
    • JWT-based authentication
    • API key management
    • Service-to-service auth (mTLS)
    • Multi-factor authentication
  2. Authorization (2.5 hours)

    • RBAC design patterns
    • Attribute-based access control (ABAC)
    • Policy engines (OPA)
    • Permission inheritance
    • Principle of least privilege
  3. Data Privacy and Compliance (2 hours)

    • GDPR compliance
    • Data encryption (at-rest, in-transit)
    • PII handling
    • Data retention policies
    • Right to deletion
  4. Audit Logging (1.5 hours)

    • Audit event design
    • Immutable logs
    • Log retention and archival
    • Compliance reporting
    • Forensic analysis
  5. Secrets Management (1.5 hours)

    • HashiCorp Vault
    • Kubernetes secrets
    • Secret rotation
    • Encryption key management
    • Certificate management

Hands-On Exercises

  1. Exercise 01: Implement SSO with OIDC
  2. Exercise 02: Build RBAC system
  3. Exercise 03: Create audit logging
  4. Exercise 04: Set up secrets management

Reading Materials

Assessment

  • Quiz: 15 questions on security and governance
  • Practical: Build secure platform with RBAC and audit logging

Project-Based Learning

Project 01: Self-Service ML Platform Core

Duration: 4 weeks (120 hours) | Weeks: 10-13

Project Overview

Build a production-grade self-service ML platform that enables data scientists to provision compute resources, submit training jobs, and deploy models without direct infrastructure access.

Key Features

  1. User & Team Management

    • User registration and profiles
    • Team creation and membership
    • SSO integration (SAML/OIDC)
    • Activity tracking
  2. Resource Provisioning

    • Jupyter notebook environments
    • GPU/CPU allocation
    • Storage volumes
    • Environment templates
  3. Training Job Management

    • Distributed training jobs
    • Job scheduling and queueing
    • Monitoring and logging
    • Hyperparameter tuning
  4. Model Deployment

    • REST/gRPC endpoints
    • Blue-green deployments
    • Autoscaling
    • Version management
  5. Resource Quotas

    • Per-team quotas
    • Priority scheduling
    • Cost tracking
    • Overage alerts
  6. Platform APIs

    • RESTful API
    • gRPC API for performance
    • WebSocket for real-time updates
    • Comprehensive documentation

Technical Stack

  • Backend: Python 3.11+, FastAPI, gRPC
  • Database: PostgreSQL, Redis
  • Infrastructure: Kubernetes, Helm
  • Monitoring: Prometheus, Grafana
  • Authentication: OAuth2, JWT

Learning Outcomes

  • Multi-tenant platform architecture
  • API design and implementation
  • Kubernetes operator development
  • Resource management and quotas
  • Platform observability

Deliverables

  • Working platform API
  • Kubernetes operator
  • Multi-tenant resource management
  • Documentation and SDK
  • Test suite (>80% coverage)

Project 02: Enterprise Feature Store Implementation

Duration: 4 weeks (120 hours) | Weeks: 14-17

Project Overview

Build a production-grade feature store with online/offline serving, feature versioning, lineage tracking, and monitoring.

Key Features

  1. Feature Registry

    • Feature definition and registration
    • Versioning and lineage
    • Discovery and search
    • Metadata management
  2. Offline Feature Store

    • Batch retrieval
    • Point-in-time correct joins
    • S3/Parquet storage
    • Backfilling
  3. Online Feature Store

    • Low-latency serving (<10ms)
    • Redis-based cache
    • Feature materialization
    • Multi-key retrieval
  4. Feature Transformation

    • Python SDK
    • Aggregations
    • Validation
    • Custom functions
  5. Data Ingestion

    • Batch ingestion
    • Streaming (Kafka)
    • Schema validation
    • Error handling
  6. Feature Monitoring

    • Drift detection
    • Quality metrics
    • Freshness monitoring
    • Alerting

Technical Stack

  • Backend: Python 3.11+, FastAPI
  • Storage: Redis, S3, PostgreSQL
  • Processing: Apache Spark
  • Streaming: Kafka
  • Monitoring: Prometheus

Learning Outcomes

  • Feature store architecture
  • Online/offline store design
  • Point-in-time correctness
  • Real-time data pipelines
  • Data quality monitoring

Deliverables

  • Feature registry service
  • Online store (Redis)
  • Offline store (S3)
  • Transformation SDK
  • Monitoring dashboards

Project 03: ML Workflow Orchestration Platform

Duration: 4 weeks (120 hours) | Weeks: 18-21

Project Overview

Build a comprehensive workflow orchestration system for ML pipelines with DAG execution, scheduling, and monitoring.

Key Features

  1. Workflow Definition

    • Python SDK
    • Task operators
    • Dependencies and branching
    • Parameterization
    • Templates
  2. Scheduling

    • Cron-based schedules
    • Event-driven triggers
    • Manual execution
    • Backfilling
  3. Execution Management

    • Task queue
    • Parallel execution
    • Resource allocation
    • Retry logic
    • Cancellation
  4. Dependency Management

    • Task dependencies
    • Cross-DAG dependencies
    • External dependencies
    • Versioning
  5. Monitoring

    • Real-time status
    • Execution history
    • Gantt charts
    • Log aggregation
    • Analytics
  6. Error Handling

    • Retries with backoff
    • Dead letter queue
    • Alerting
    • Debugging tools

Technical Stack

  • Backend: Python 3.11+, FastAPI
  • Execution: Kubernetes, Celery
  • Database: PostgreSQL, Redis
  • Monitoring: Prometheus, Grafana
  • Frontend: React (optional)

Learning Outcomes

  • DAG-based workflow design
  • Distributed task execution
  • Scheduling algorithms
  • Workflow monitoring
  • Error handling patterns

Deliverables

  • Workflow SDK
  • DAG scheduler
  • Kubernetes executor
  • Monitoring UI
  • Integration with platform

Project 04: Model Registry & Management

Duration: 4 weeks (120 hours) | Weeks: 22-25

Project Overview

Build a centralized model registry for versioning, metadata management, lifecycle tracking, and governance.

Key Features

  1. Model Registry

    • Version management
    • Artifact storage
    • Metadata tracking
    • Discovery
  2. Lifecycle Management

    • Staging/production stages
    • Promotion workflows
    • Approval gates
    • Deprecation
  3. Lineage Tracking

    • Training data lineage
    • Code versioning
    • Hyperparameters
    • Reproducibility
  4. Model Deployment

    • Blue-green deployments
    • Canary releases
    • A/B testing
    • Rollback
  5. Model Monitoring

    • Performance metrics
    • Prediction logging
    • Drift detection
    • Alerts

Technical Stack

  • Backend: Python 3.11+, FastAPI
  • Storage: S3, PostgreSQL
  • Deployment: Kubernetes
  • Monitoring: Prometheus
  • ML: MLflow (extended)

Learning Outcomes

  • Model versioning
  • Lifecycle management
  • Lineage tracking
  • Deployment strategies
  • Model monitoring

Deliverables

  • Model registry service
  • Lifecycle workflows
  • Lineage system
  • Deployment engine
  • Monitoring dashboards

Project 05: Developer Portal & SDK

Duration: 4 weeks (120 hours) | Weeks: 26-29

Project Overview

Build a comprehensive developer portal with documentation, SDK, CLI, and tutorials.

Key Features

  1. Python SDK

    • Platform client
    • Type hints
    • Async support
    • Error handling
  2. CLI Tool

    • Platform operations
    • Configuration management
    • Output formatting
    • Shell completions
  3. Developer Portal

    • Interactive documentation
    • API playground
    • Tutorials
    • Code examples
  4. Onboarding

    • Getting started guides
    • Video tutorials
    • Sample projects
    • Templates
  5. Analytics

    • Usage tracking
    • Adoption metrics
    • Feedback collection
    • Success metrics

Technical Stack

  • SDK: Python 3.11+, httpx
  • CLI: Typer or Click
  • Frontend: React, TypeScript
  • Docs: Docusaurus or MkDocs
  • Analytics: PostHog or Mixpanel

Learning Outcomes

  • SDK design
  • CLI development
  • Documentation best practices
  • Developer experience
  • Adoption metrics

Deliverables

  • Python SDK
  • CLI tool
  • Developer portal
  • Interactive tutorials
  • Usage analytics

Assessment Framework

Module Assessments (9 total)

Format: Multiple choice, short answer, practical coding

Passing Score: 80%

Time: 30-45 minutes per quiz

Weight: 30% of final grade

Project Assessments (5 total)

Evaluation Criteria:

  1. Functional Completeness (40%)

    • All required features implemented
    • Features work as specified
    • Edge cases handled
  2. Code Quality (25%)

    • Clean, readable code
    • Proper error handling
    • Type hints used
    • No code smells
  3. Testing (15%)

    • Unit tests (>80% coverage)
    • Integration tests
    • Tests pass consistently
  4. Documentation (10%)

    • README comprehensive
    • API documented
    • Architecture explained
  5. Best Practices (10%)

    • Security considerations
    • Performance optimization
    • Scalability design

Weight: 60% of final grade

Capstone Project (1 total)

Challenge: Design and implement a complete ML platform from scratch

Duration: 2 weeks

Weight: 10% of final grade


Study Plans

Full-Time Study Plan (3-4 months)

Week 1-2: Modules 01-03 (Foundations) Week 3-4: Modules 04-06 (Core competencies part 1) Week 5-6: Modules 07-09 (Core competencies part 2) Week 7-10: Project 01 (Platform Core) Week 11-14: Project 02 (Feature Store) Week 15-18: Project 03 (Workflow Orchestration) Week 19-22: Project 04 (Model Registry) Week 23-26: Project 05 (Developer Portal) Week 27-28: Capstone Project

Part-Time Study Plan (6-8 months)

Month 1: Modules 01-03 Month 2: Modules 04-06 Month 3: Modules 07-09 + Start Project 01 Month 4: Complete Project 01 + Start Project 02 Month 5: Complete Project 02 + Start Project 03 Month 6: Complete Project 03 + Start Project 04 Month 7: Complete Project 04 + Start Project 05 Month 8: Complete Project 05 + Capstone


Skills Matrix

By completing this curriculum, you will achieve:

Skill Area Proficiency Level
Python Programming Advanced
Kubernetes Advanced
API Design (REST, gRPC) Expert
Multi-Tenancy Advanced
Feature Stores Expert
Workflow Orchestration Expert
Model Management Advanced
Developer Experience Advanced
Observability Advanced
Security & Governance Advanced
System Design Advanced
Technical Leadership Intermediate

Technology Stack

Core Technologies

  • Language: Python 3.11+
  • Web Framework: FastAPI, gRPC
  • Databases: PostgreSQL, Redis, MongoDB
  • Orchestration: Kubernetes, Helm
  • Storage: S3 (AWS/MinIO), Parquet
  • Streaming: Apache Kafka
  • Processing: Apache Spark (optional)
  • Monitoring: Prometheus, Grafana
  • Logging: ELK or Loki
  • Tracing: Jaeger or Zipkin

Optional Technologies

  • IaC: Terraform, Pulumi
  • CI/CD: GitHub Actions, GitLab CI
  • Service Mesh: Istio, Linkerd
  • ML Frameworks: PyTorch, TensorFlow
  • Feature Store: Feast
  • Workflow: Airflow, Kubeflow
  • Model Registry: MLflow

Environment Setup

Local Development

Requirements:

  • Docker Desktop
  • Kubernetes (minikube or kind)
  • Python 3.11+
  • kubectl, Helm
  • PostgreSQL
  • Redis

Setup Steps: See resources/tools.md

Cloud Development

Recommended:

  • AWS: EKS, RDS, ElastiCache, S3
  • GCP: GKE, Cloud SQL, Memorystore, GCS
  • Azure: AKS, Azure Database, Azure Cache, Blob Storage

Estimated Cost: $50-100/month for development


Learning Resources

Books

  • "Designing Machine Learning Systems" by Chip Huyen
  • "Building Machine Learning Powered Applications" by Emmanuel Ameisen
  • "Kubernetes Patterns" by Bilgin Ibryam & Roland Huß

Online Courses

  • Kubernetes for Developers (Linux Foundation)
  • Machine Learning Engineering for Production (Coursera)
  • Advanced REST APIs (Udemy)

Documentation

  • Feast Documentation
  • Kubeflow Documentation
  • MLflow Documentation

Career Advancement

Upon completing this curriculum, you'll be qualified for:

  • ML Platform Engineer roles at tech companies
  • Senior AI Infrastructure Engineer positions
  • Transition to MLOps Engineer or ML Architect
  • Consulting opportunities in ML infrastructure

Next Steps:

  • Build portfolio showcasing projects
  • Contribute to open-source ML infrastructure
  • Write blog posts and give talks
  • Apply to ML platform roles

Last Updated: 2025-10-18 | Version: 1.0.0