ML Platform Engineer - Complete Curriculum Guide

Curriculum Overview
Learning Objectives
Role Definition & Career Context
Prerequisites & Preparation
Curriculum Structure
Detailed Module Breakdown
Project-Based Learning
Assessment Framework
Study Plans
Skills Matrix
Technology Stack
Environment Setup
Learning Resources
Career Advancement

Curriculum Overview

What You'll Learn

This curriculum prepares infrastructure engineers to work as ML Platform Engineers: specialists who design, build, and maintain self-service platforms that let data scientists and ML engineers train, deploy, and operate models at scale.

Core Focus Areas:

Platform Architecture: Design multi-tenant, scalable ML platforms
API Development: Build intuitive APIs and SDKs for ML workloads
Feature Stores: Implement enterprise-grade feature management systems
Workflow Orchestration: Create DAG-based ML pipeline systems
Model Management: Build model registries with versioning and governance
Developer Experience: Design exceptional tools and documentation
Observability: Monitor and optimize platform performance
Security & Governance: Implement authentication, authorization, and compliance

Learning Approach

This curriculum emphasizes:

Hands-On Projects: 80% practical implementation, 20% theory
Progressive Complexity: Each project builds on previous knowledge
Real-World Scenarios: Based on production platform patterns
Best Practices: Industry-standard tools and architectural patterns
Portfolio Development: Build demonstrable expertise

Time Commitment

Part-Time (15-20 hours/week):

Duration: 6-8 months
Module study: 2-3 hours/week
Project work: 12-17 hours/week
Assessment: 1-2 hours/week

Full-Time (40 hours/week):

Duration: 3-4 months
Module study: 5-8 hours/week
Project work: 30-35 hours/week
Assessment: 2-3 hours/week

Total Estimated Hours: 600-700 hours

Learning Objectives

By completing this curriculum, you will be able to:

1. Platform Architecture & Design

Design multi-tenant ML platforms with proper isolation and resource management
Create API-first architectures that abstract infrastructure complexity
Implement scalable platform services supporting hundreds of users
Build extensible plugin systems for platform customization
Design for high availability with proper failover and disaster recovery

2. API & SDK Development

Design RESTful APIs following OpenAPI specifications
Implement gRPC services for high-performance operations
Build Python SDKs with excellent developer experience
Create CLI tools for platform operations
Write comprehensive API documentation with examples

3. Data Platform Engineering

Implement feature stores with online and offline serving
Build data pipelines for feature engineering at scale
Design point-in-time correct data retrieval systems
Implement feature versioning and lineage tracking
Monitor data quality and detect feature drift

4. Workflow Orchestration

Design DAG-based workflow systems for ML pipelines
Implement task scheduling with dependency resolution
Build distributed execution engines using Kubernetes
Create retry and error handling mechanisms
Monitor workflow performance and debug failures

5. Model Lifecycle Management

Build model registries with versioning and metadata
Implement model deployment workflows and promotion gates
Track model lineage from training to production
Create A/B testing frameworks for model evaluation
Monitor model performance and detect degradation

6. Developer Experience (DX)

Design intuitive APIs that ML practitioners love
Create interactive documentation and tutorials
Build self-service onboarding flows
Implement feedback loops for continuous improvement
Measure platform adoption and satisfaction

7. Platform Observability

Instrument services with metrics, logs, and traces
Build monitoring dashboards for platform health
Create alerting rules for proactive incident response
Implement distributed tracing for debugging
Define SLIs and SLOs for platform reliability

8. Security & Governance

Implement authentication with SSO (SAML, OIDC)
Design RBAC systems with fine-grained permissions
Ensure data privacy and regulatory compliance
Build audit logging for governance requirements
Implement secrets management securely

9. DevOps & Production Operations

Deploy platforms to Kubernetes clusters
Implement CI/CD pipelines for platform services
Manage infrastructure as code (Terraform/Pulumi)
Perform capacity planning and cost optimization
Handle incident response and on-call rotations

10. Technical Leadership

Make architectural decisions with proper trade-off analysis
Mentor junior engineers on platform best practices
Write technical specifications and design documents
Collaborate with stakeholders (data scientists, SREs, product)
Drive platform adoption across the organization

Role Definition & Career Context

What is an ML Platform Engineer?

An ML Platform Engineer specializes in building internal platforms that enable data scientists and ML engineers to train, deploy, and operate machine learning models efficiently at scale. This role sits at the intersection of:

Infrastructure Engineering: Kubernetes, cloud, distributed systems
Software Engineering: API design, SDK development, testing
Data Engineering: Data pipelines, feature engineering, storage
ML Engineering: Understanding ML workflows and requirements
Platform Engineering: Developer experience, self-service tooling

Key Responsibilities

Build Self-Service Platforms: Create tools that reduce manual work for ML teams
Maintain ML Infrastructure: Ensure platform reliability and performance
Design APIs and SDKs: Provide excellent developer interfaces
Implement Governance: Enforce security, compliance, and best practices
Drive Platform Adoption: Work with users to improve experience
Scale ML Operations: Support growth from 10 to 1000+ models in production

Career Progression

Junior AI Infrastructure Engineer (Level 0)
    ↓
AI Infrastructure Engineer (Level 1)
    ↓
Senior AI Infrastructure Engineer (Level 2)
    ↓
┌─────────────────┴─────────────────┐
│                                   │
ML Platform Engineer (Level 2.5A)   MLOps Engineer (Level 2.5B)
│                                   │
└─────────────────┬─────────────────┘
                  ↓
    AI Infrastructure Architect (Level 3)
                  ↓
    Senior AI Infrastructure Architect (Level 4)
                  ↓
    Principal AI Infrastructure Architect (Level 5A)

This curriculum positions you at: ML Platform Engineer (Level 2.5A)

Typical Salary Ranges (US, 2025)

ML Platform Engineer: $150,000 - $220,000
Senior ML Platform Engineer: $180,000 - $280,000
Staff ML Platform Engineer: $220,000 - $350,000
Principal ML Platform Engineer: $280,000 - $450,000+

Ranges vary by location, company size, and experience. Top tech companies (FAANG) pay at the upper end.

Companies Hiring ML Platform Engineers

Big Tech: Google, Meta, Amazon, Microsoft, Apple
ML-First Companies: OpenAI, Anthropic, Cohere, Hugging Face
Tech Unicorns: Uber, Airbnb, Netflix, Spotify, LinkedIn
Enterprises: Banks, healthcare, retail, manufacturing with ML initiatives
Startups: ML infrastructure companies, MLOps platforms

Prerequisites & Preparation

Required Knowledge

Before starting this curriculum, you should have:

1. Programming (Critical)

Python 3.11+: Advanced proficiency
- Object-oriented programming
- Async/await patterns
- Type hints and mypy
- Context managers and decorators
- Testing with pytest
Shell Scripting: Bash proficiency for automation

Assessment: Can you build a REST API with FastAPI from scratch?

2. Kubernetes (Critical)

Core Concepts: Pods, Services, Deployments, ConfigMaps, Secrets
Advanced Topics: StatefulSets, DaemonSets, Custom Resource Definitions
Operations: kubectl, Helm, debugging failed pods
Networking: Service mesh basics, ingress controllers
Storage: PersistentVolumes, StorageClasses

Assessment: Can you deploy a multi-tier application to Kubernetes?

3. Databases (Important)

SQL: PostgreSQL or MySQL
- Complex queries with joins
- Index optimization
- Transaction management
NoSQL: MongoDB or Redis
- Document modeling
- Query optimization
- Caching strategies

Assessment: Can you design a database schema for a multi-tenant application?

4. API Development (Important)

REST: HTTP methods, status codes, versioning
API Design: Resource modeling, pagination, filtering
Authentication: OAuth2, JWT, API keys
Documentation: OpenAPI/Swagger

Assessment: Can you design a RESTful API for a CRUD application?

5. Cloud Platforms (Important)

Choose at least one:

AWS: EC2, S3, RDS, EKS, IAM
GCP: Compute Engine, GCS, Cloud SQL, GKE, IAM
Azure: VMs, Blob Storage, SQL Database, AKS, RBAC

Assessment: Can you provision infrastructure using cloud CLI or console?

6. CI/CD (Helpful)

Git: Branching, merging, rebasing
GitHub Actions or GitLab CI: Pipeline definition, jobs
Docker: Building images, multi-stage builds
Testing: Unit tests, integration tests, e2e tests

Assessment: Can you create a CI/CD pipeline that builds and deploys an app?

7. Infrastructure as Code (Helpful)

Terraform or Pulumi: Basic resource provisioning
Configuration Management: Understanding of declarative configs

Assessment: Can you define infrastructure for a simple web app?

8. ML Basics (Helpful)

ML Workflows: Training, validation, testing, deployment
Frameworks: Familiarity with PyTorch or TensorFlow
Model Serving: Basic understanding of inference
Data Processing: Feature engineering concepts

Assessment: Can you explain the ML lifecycle from data to production?

Recommended Preparation

If you're missing any prerequisites, complete these first:

Python: Python for DevOps
Kubernetes: Kubernetes Up & Running
Databases: Designing Data-Intensive Applications
API Design: REST API Design Rulebook
Cloud: Platform-specific training (AWS, GCP, Azure)
ML Basics: Introduction to Machine Learning with Python

Pre-Assessment Quiz

Test your readiness: assessments/quizzes/prerequisite-quiz.py

Passing Score: 70% (if you score lower, review prerequisites)

Curriculum Structure

Four-Phase Learning Model

┌─────────────────────────────────────────────────────────────────┐
│                    PHASE 1: FOUNDATIONS                         │
│                    (Weeks 1-3, ~50 hours)                       │
│                                                                 │
│  Module 01: Platform Fundamentals                              │
│  Module 02: API Design for ML Platforms                        │
│  Module 03: Multi-Tenancy & Resource Management                │
│                                                                 │
│  Goal: Understand platform engineering principles              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                 PHASE 2: CORE COMPETENCIES                      │
│                    (Weeks 4-9, ~100 hours)                      │
│                                                                 │
│  Module 04: Feature Store Architecture                         │
│  Module 05: Workflow Orchestration                             │
│  Module 06: Model Management & Registry                        │
│  Module 07: Developer Experience & Tooling                     │
│  Module 08: Observability & Monitoring                         │
│  Module 09: Security & Governance                              │
│                                                                 │
│  Goal: Master ML platform components                           │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                PHASE 3: HANDS-ON PROJECTS                       │
│                   (Weeks 10-29, ~500 hours)                     │
│                                                                 │
│  Project 01: Self-Service ML Platform Core (120h)              │
│  Project 02: Enterprise Feature Store (120h)                   │
│  Project 03: ML Workflow Orchestration (120h)                  │
│  Project 04: Model Registry & Management (120h)                │
│  Project 05: Developer Portal & SDK (120h)                     │
│                                                                 │
│  Goal: Build production-quality platform components            │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│              PHASE 4: ASSESSMENT & CERTIFICATION                │
│                    (Weeks 30-31, ~40 hours)                     │
│                                                                 │
│  Capstone Project: Complete ML Platform                        │
│  Portfolio Review                                              │
│  Final Practical Exam                                          │
│                                                                 │
│  Goal: Validate competency and build portfolio                 │
└─────────────────────────────────────────────────────────────────┘

Learning Methods

Lecture Notes: Conceptual foundations (~20% of time)
Guided Exercises: Step-by-step practice (~20% of time)
Hands-On Projects: Build real systems (~50% of time)
Assessments: Quizzes and exams (~10% of time)

Weekly Structure (Part-Time)

Monday-Wednesday (about 1 hour/day):

Study module lecture notes
Complete guided exercises
Watch supplementary videos

Thursday-Sunday (3-4 hours/day):

Work on project implementation
Debug and refine code
Write documentation
Peer code review

Sunday Evening (1 hour):

Complete module quiz
Update progress tracker
Plan next week's work

Detailed Module Breakdown

Module 01: Platform Fundamentals

Duration: 8 hours | Week: 1

Learning Objectives

Understand what ML platform engineering entails
Learn platform thinking and abstraction design
Study multi-tenancy architecture patterns
Explore API-first platform development
Analyze production ML platform examples

Topics Covered

Introduction to ML Platform Engineering (1.5 hours)
- What is a platform vs infrastructure?
- ML platform engineer responsibilities
- Platform value proposition
- Case studies: Uber Michelangelo, Netflix Metaflow
Platform Thinking (2 hours)
- Abstraction design principles
- Self-service vs full-service
- Developer experience (DX) fundamentals
- Platform adoption strategies
Multi-Tenancy Patterns (2 hours)
- Namespace isolation in Kubernetes
- Resource quota management
- Security boundaries
- Cost allocation models
API-First Development (1.5 hours)
- API design principles
- Versioning strategies
- Backward compatibility
- Documentation requirements
Platform Architecture Patterns (1 hour)
- Microservices for platforms
- Event-driven architectures
- Plugin systems
- Extension points

Hands-On Exercises

Exercise 01: Design API for a simple resource provisioning system
Exercise 02: Implement namespace isolation in Kubernetes
Exercise 03: Create resource quota management for teams
Exercise 04: Build a simple plugin system

Reading Materials

Assessment

Quiz: 15 questions on platform concepts
Practical: Design API for ML resource provisioning

Module 02: API Design for ML Platforms

Duration: 10 hours | Week: 2

Learning Objectives

Design RESTful APIs following best practices
Implement gRPC services for performance-critical operations
Create comprehensive API documentation
Build Python SDKs with excellent DX
Handle API versioning and evolution

Topics Covered

RESTful API Design (3 hours)
- Resource modeling for ML workloads
- HTTP methods and status codes
- Pagination, filtering, sorting
- Error handling and validation
- Rate limiting and throttling
gRPC for ML Platforms (2 hours)
- Protocol Buffers definition
- Service definition best practices
- Streaming RPCs
- Error handling in gRPC
- Performance considerations
API Versioning (1.5 hours)
- Versioning strategies (URL, header, content negotiation)
- Backward compatibility
- Deprecation policies
- Migration strategies
API Documentation (1.5 hours)
- OpenAPI/Swagger specifications
- Interactive documentation (Swagger UI, ReDoc)
- Code examples and tutorials
- Changelog maintenance
SDK Development (2 hours)
- SDK design principles
- Type hints and IDE support
- Error handling and retries
- Testing SDK clients
- Documentation and examples
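
As an illustration of the pagination topic above, here is a minimal sketch of cursor-based pagination over an in-memory list. The function names and the response shape (`items`, `next_cursor`) are assumptions made for this example, not a required API contract.

```python
import base64
import json
from typing import Optional


def encode_cursor(offset: int) -> str:
    """Pack the next offset into an opaque, URL-safe token."""
    raw = json.dumps({"offset": offset}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> int:
    """Recover the offset from a token produced by encode_cursor."""
    raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return json.loads(raw)["offset"]


def paginate(items: list, limit: int = 2, cursor: Optional[str] = None) -> dict:
    """Return one page plus a cursor for the next page (None on the last page)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    start = decode_cursor(cursor) if cursor else 0
    page = items[start:start + limit]
    end = start + len(page)
    next_cursor = encode_cursor(end) if end < len(items) else None
    return {"items": page, "next_cursor": next_cursor}


jobs = ["job-1", "job-2", "job-3"]
first_page = paginate(jobs, limit=2)
second_page = paginate(jobs, limit=2, cursor=first_page["next_cursor"])
```

Keeping the cursor opaque lets the server change its paging strategy later (for example, to keyset pagination) without breaking clients. A real service would also validate or sign the cursor, since clients can tamper with it.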

Hands-On Exercises

Exercise 01: Design RESTful API for training job management
Exercise 02: Implement gRPC service for feature serving
Exercise 03: Create OpenAPI specification
Exercise 04: Build Python SDK with comprehensive tests

Reading Materials

Assessment

Quiz: 15 questions on API design
Practical: Build RESTful and gRPC APIs for model deployment

Module 03: Multi-Tenancy & Resource Management

Duration: 12 hours | Week: 3

Learning Objectives

Implement secure multi-tenancy in Kubernetes
Design resource quota and limit systems
Build RBAC with fine-grained permissions
Create cost allocation and chargeback systems
Implement priority-based scheduling

Topics Covered

Multi-Tenancy in Kubernetes (3 hours)
- Namespace design patterns
- Network policies for isolation
- Pod Security Standards and Pod Security Admission (PodSecurityPolicy was removed in Kubernetes 1.25)
- Resource quotas and limit ranges
- Cross-tenant security
Resource Management (2.5 hours)
- CPU and memory quotas
- GPU resource allocation
- Storage quotas
- Fair-share scheduling
- Preemption policies
Authentication & Authorization (3 hours)
- Authentication methods (SSO via SAML or OIDC)
- RBAC design for ML platforms
- Custom roles and permissions
- Service account management
- API authentication (JWT, API keys)
Cost Allocation (2 hours)
- Resource usage tracking
- Cost attribution models
- Chargeback vs showback
- Budget alerts and enforcement
- Cost optimization strategies
Priority Scheduling (1.5 hours)
- Priority classes in Kubernetes
- Queue management
- Preemption policies
- Fair queuing
- SLA enforcement
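
The quota topics above come down to an admission check: a request is accepted only if the team's current usage plus the request stays within its quota. The sketch below shows that check in plain Python. The `Resources` type and the `admit` function are hypothetical; in a real cluster this role is played by Kubernetes ResourceQuota objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """A bundle of resources; the units are assumptions for this example."""
    cpu_cores: float = 0.0
    memory_gib: float = 0.0
    gpus: int = 0

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(
            self.cpu_cores + other.cpu_cores,
            self.memory_gib + other.memory_gib,
            self.gpus + other.gpus,
        )

    def fits_within(self, limit: "Resources") -> bool:
        """True when every dimension is at or below the limit."""
        return (
            self.cpu_cores <= limit.cpu_cores
            and self.memory_gib <= limit.memory_gib
            and self.gpus <= limit.gpus
        )


def admit(request: Resources, in_use: Resources, quota: Resources) -> bool:
    """Accept the request only if usage after admission stays within quota."""
    return (in_use + request).fits_within(quota)


team_quota = Resources(cpu_cores=16, memory_gib=64, gpus=2)
team_usage = Resources(cpu_cores=12, memory_gib=32, gpus=1)
small_job = Resources(cpu_cores=4, memory_gib=16, gpus=1)
large_job = Resources(cpu_cores=4, memory_gib=16, gpus=2)
```

Here `small_job` exactly fills the remaining CPU and GPU headroom and is admitted, while `large_job` is rejected because it would need three GPUs in total.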

Hands-On Exercises

Exercise 01: Implement namespace isolation with network policies
Exercise 02: Create resource quota system for teams
Exercise 03: Build RBAC with custom roles
Exercise 04: Implement cost tracking and allocation

Reading Materials

Assessment

Quiz: 15 questions on multi-tenancy and resource management
Practical: Build multi-tenant platform with quotas and RBAC

Module 04: Feature Store Architecture

Duration: 12 hours | Week: 4

Learning Objectives

Understand feature store concepts and use cases
Implement online and offline feature stores
Design point-in-time correct feature retrieval
Build feature versioning and lineage tracking
Create real-time feature serving pipelines

Topics Covered

Feature Store Fundamentals (2.5 hours)
- What is a feature store?
- Online vs offline stores
- Feature consistency problem
- Training-serving skew
- Case studies: Uber Palette, Airbnb Zipline
Online Feature Store (2.5 hours)
- Low-latency requirements (<10ms)
- Redis architecture for features
- Feature materialization
- Cache invalidation strategies
- Multi-key batch retrieval
Offline Feature Store (2 hours)
- Batch feature retrieval
- Point-in-time correct joins
- Parquet/Avro storage
- Partition strategies
- Historical feature serving
Feature Engineering Pipeline (2.5 hours)
- Feature transformation DSL
- Windowed aggregations
- Feature validation
- Backfilling features
- Feature monitoring
Feature Registry (2.5 hours)
- Feature definition and registration
- Versioning strategies
- Lineage and provenance
- Feature discovery
- Metadata management
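
Point-in-time correctness, listed under the offline store above, means that a training row may only see feature values that were already known at the time of the labeled event. The sketch below shows the core lookup for a single entity using binary search. The function names and the `(timestamp, value)` layout are assumptions for this example; at scale the same logic is usually expressed as an as-of join in Spark or SQL.

```python
from bisect import bisect_right
from typing import Optional


def point_in_time_value(
    history: list[tuple[int, float]], event_ts: int
) -> Optional[float]:
    """Return the latest feature value with timestamp <= event_ts.

    `history` must be sorted by timestamp. Returns None when no value
    existed yet, which avoids leaking future data into training rows.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx else None


def build_training_rows(
    events: list[tuple[str, int]],
    feature_history: dict[str, list[tuple[int, float]]],
) -> list[tuple[str, int, Optional[float]]]:
    """Attach the point-in-time feature value to each (entity, timestamp) event."""
    return [
        (entity, ts, point_in_time_value(feature_history.get(entity, []), ts))
        for entity, ts in events
    ]


history = {"user-1": [(100, 1.0), (200, 2.0), (300, 3.0)]}
rows = build_training_rows([("user-1", 250), ("user-1", 50), ("user-2", 250)], history)
```

The event at timestamp 250 receives the value written at 200, not the later value written at 300. Events that precede all feature writes, or that belong to unknown entities, receive `None`.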

Hands-On Exercises

Exercise 01: Build online feature store with Redis
Exercise 02: Implement offline feature store with S3/Parquet
Exercise 03: Create point-in-time correct retrieval
Exercise 04: Build feature transformation pipeline

Reading Materials

Assessment

Quiz: 15 questions on feature stores
Practical: Build feature store with online/offline serving

Module 05: Workflow Orchestration

Duration: 12 hours | Week: 5

Learning Objectives

Design DAG-based workflow systems
Implement task dependency resolution
Build distributed task execution on Kubernetes
Create retry and error handling mechanisms
Monitor workflow performance

Topics Covered

Workflow Orchestration Fundamentals (2 hours)
- DAG concepts and design
- Task operators and executors
- Workflows vs pipelines
- Orchestration patterns
- Case studies: Airflow, Kubeflow, Metaflow
DAG Definition and Management (2.5 hours)
- Python SDK for workflows
- Parameterized workflows
- Dynamic DAG generation
- Workflow templates
- Versioning workflows
Task Execution (2.5 hours)
- Distributed execution on Kubernetes
- Task queue management
- Resource allocation per task
- Parallel vs sequential execution
- Executor patterns (Kubernetes, Celery)
Scheduling and Triggers (2 hours)
- Cron-based scheduling
- Event-driven triggers
- Backfilling workflows
- External dependencies
- Schedule management
Error Handling and Monitoring (3 hours)
- Retry policies with backoff
- Dead letter queues
- Alerting and notifications
- Workflow debugging
- Performance monitoring
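
Dependency resolution and parallel execution, covered in the topics above, can be sketched with the standard library's `graphlib` module. The example groups tasks into "waves" whose members have no unmet dependencies and could therefore run in parallel. The function name `execution_waves` and the sample pipeline are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; each wave depends only on earlier waves.

    `dependencies` maps a task name to the set of tasks it depends on.
    `prepare()` raises graphlib.CycleError if the graph is not a DAG.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


pipeline = {
    "extract": set(),
    "validate": {"extract"},
    "featurize": {"extract"},
    "train": {"validate", "featurize"},
    "evaluate": {"train"},
}
waves = execution_waves(pipeline)
```

In this pipeline `validate` and `featurize` land in the same wave because both depend only on `extract`. A real executor would mark tasks done individually as they finish rather than waiting for the whole wave, which `TopologicalSorter` also supports.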

Hands-On Exercises

Exercise 01: Build DAG definition SDK
Exercise 02: Implement Kubernetes-based executor
Exercise 03: Create scheduling system
Exercise 04: Build retry and error handling

Reading Materials

Assessment

Quiz: 15 questions on workflow orchestration
Practical: Build workflow orchestrator with DAG execution

Module 06: Model Management & Registry

Duration: 10 hours | Week: 6

Learning Objectives

Build model registry with versioning
Implement model lifecycle management
Track model lineage and provenance
Create model deployment workflows
Monitor model performance

Topics Covered

Model Registry Fundamentals (2 hours)
- Model versioning strategies
- Artifact storage (models, datasets, configs)
- Metadata management
- Model discovery
- Case studies: MLflow, Seldon
Model Lifecycle Management (2.5 hours)
- Lifecycle stages (staging, production, archived)
- Model promotion workflows
- Approval gates
- Model deprecation
- Version compatibility
Model Lineage Tracking (2 hours)
- Training data lineage
- Code versioning integration
- Hyperparameter tracking
- Experiment-to-production tracing
- Reproducibility
Model Deployment (2 hours)
- Deployment strategies (blue-green, canary)
- Model serving integration
- A/B testing support
- Traffic routing
- Rollback mechanisms
Model Monitoring (1.5 hours)
- Performance metrics
- Prediction logging
- Model drift detection
- Alerting on degradation
- Retraining triggers
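
The lifecycle stages and approval gates above can be modeled as a small state machine. This sketch is one possible design under stated assumptions: the `REGISTERED` starting stage and the specific allowed transitions are not defined by the curriculum and are chosen here for illustration.

```python
from enum import Enum


class Stage(Enum):
    REGISTERED = "registered"  # assumed initial stage, not from the curriculum
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"


# Assumed transition table: which target stages are reachable from each stage.
ALLOWED_TRANSITIONS = {
    Stage.REGISTERED: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


def transition(current: Stage, target: Stage, approved: bool = False) -> Stage:
    """Validate a stage change; promotion to production needs approval."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    if target is Stage.PRODUCTION and not approved:
        raise PermissionError("promotion to production requires approval")
    return target


stage = transition(Stage.REGISTERED, Stage.STAGING)
stage = transition(stage, Stage.PRODUCTION, approved=True)
```

Keeping the transition table as data makes the policy easy to audit and to change per organization, and it gives the approval gate a single enforcement point.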

Hands-On Exercises

Exercise 01: Build model registry with versioning
Exercise 02: Implement lifecycle management
Exercise 03: Create lineage tracking system
Exercise 04: Build deployment workflow

Reading Materials

Assessment

Quiz: 12 questions on model management
Practical: Build model registry with deployment workflows

Module 07: Developer Experience & Tooling

Duration: 10 hours | Week: 7

Learning Objectives

Design intuitive APIs for ML practitioners
Build comprehensive Python SDKs
Create CLI tools for platform operations
Develop interactive documentation
Measure and improve platform adoption

Topics Covered

Developer Experience Principles (2 hours)
- DX vs UX
- Reducing cognitive load
- Convention over configuration
- Sensible defaults
- Progressive disclosure
SDK Design (2.5 hours)
- Pythonic API design
- Type hints and IDE support
- Error messages and debugging
- Authentication handling
- Async support
CLI Tool Development (2 hours)
- CLI design patterns (Click, Typer)
- Command structure and arguments
- Configuration management
- Output formatting
- Shell completions
Documentation and Tutorials (2 hours)
- Interactive documentation
- Code examples and snippets
- Getting started guides
- Video tutorials
- API playground
Platform Adoption (1.5 hours)
- Onboarding flows
- Usage analytics
- Feedback collection
- Community building
- Success metrics

Hands-On Exercises

Exercise 01: Design SDK with excellent DX
Exercise 02: Build CLI tool with rich output
Exercise 03: Create interactive documentation
Exercise 04: Implement usage analytics

Reading Materials

Assessment

Quiz: 12 questions on DX and tooling
Practical: Build SDK and CLI for platform

Module 08: Observability & Monitoring

Duration: 12 hours | Week: 8

Learning Objectives

Instrument services with metrics, logs, traces
Build monitoring dashboards for platform health
Create effective alerting rules
Implement distributed tracing
Define SLIs and SLOs for reliability

Topics Covered

Observability Fundamentals (2 hours)
- Three pillars: metrics, logs, traces
- Observability vs monitoring
- Instrumentation strategies
- Cardinality considerations
- Case studies: Datadog, New Relic
Metrics Collection (2.5 hours)
- Prometheus architecture
- Counter, gauge, histogram, summary
- Service-level metrics
- Platform metrics
- Custom metrics
Logging (2 hours)
- Structured logging
- Log aggregation (ELK, Loki)
- Log levels and context
- Correlation IDs
- Log retention policies
Distributed Tracing (2.5 hours)
- OpenTelemetry fundamentals
- Trace context propagation
- Span design
- Jaeger or Zipkin setup
- Trace analysis
Alerting and SLOs (3 hours)
- Alert design principles
- Alert fatigue prevention
- SLI definition
- SLO targets and error budgets
- On-call runbooks
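
The relationship between an SLI, an SLO target, and the error budget is simple arithmetic, sketched below for a request-based availability SLI. The function names are hypothetical, and the example numbers are illustrative rather than recommended targets.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent.

    The budget is the number of failures the SLO tolerates:
    (1 - slo_target) * total_requests. The result is negative
    once the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Example: a 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
sli = availability_sli(1_000_000, 250)
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

With 250 failures against a budget of about 1,000, roughly three quarters of the budget remains. Teams commonly alert on how fast the budget is being consumed rather than on individual failures, which helps with the alert-fatigue topic above.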

Hands-On Exercises

Exercise 01: Instrument service with Prometheus
Exercise 02: Implement structured logging
Exercise 03: Set up distributed tracing
Exercise 04: Define SLIs and SLOs

Reading Materials

Assessment

Quiz: 15 questions on observability
Practical: Build monitoring system with dashboards and alerts

Module 09: Security & Governance

Duration: 10 hours | Week: 9

Learning Objectives

Implement authentication and authorization
Design RBAC with fine-grained permissions
Ensure regulatory compliance
Build audit logging systems
Manage secrets securely

Topics Covered

Authentication (2.5 hours)
- SSO integration (SAML, OIDC)
- JWT-based authentication
- API key management
- Service-to-service auth (mTLS)
- Multi-factor authentication
Authorization (2.5 hours)
- RBAC design patterns
- Attribute-based access control (ABAC)
- Policy engines (OPA)
- Permission inheritance
- Principle of least privilege
Data Privacy and Compliance (2 hours)
- GDPR compliance
- Data encryption (at-rest, in-transit)
- PII handling
- Data retention policies
- Right to deletion
Audit Logging (1.5 hours)
- Audit event design
- Immutable logs
- Log retention and archival
- Compliance reporting
- Forensic analysis
Secrets Management (1.5 hours)
- HashiCorp Vault
- Kubernetes secrets
- Secret rotation
- Encryption key management
- Certificate management
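
One common way to approach the "immutable logs" topic above is a hash chain, where each audit entry stores the hash of the previous entry so that later edits become detectable. The sketch below is a minimal illustration with hypothetical names. It provides tamper evidence only: true immutability also needs write-once storage and access controls.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder hash for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical event encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list[dict], event: dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append(
        {"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)}
    )


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_event(audit_log, {"actor": "alice", "action": "model.promote", "target": "fraud-v3"})
append_event(audit_log, {"actor": "bob", "action": "quota.update", "target": "team-a"})
```

Verification recomputes the chain from the first entry, so changing any stored event invalidates that entry and every entry after it.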

Hands-On Exercises

Exercise 01: Implement SSO with OIDC
Exercise 02: Build RBAC system
Exercise 03: Create audit logging
Exercise 04: Set up secrets management

Reading Materials

Assessment

Quiz: 15 questions on security and governance
Practical: Build secure platform with RBAC and audit logging

Project-Based Learning

Project 01: Self-Service ML Platform Core

Duration: 4 weeks (120 hours) | Weeks: 10-13

Project Overview

Build a production-grade self-service ML platform that enables data scientists to provision compute resources, submit training jobs, and deploy models without direct infrastructure access.

Key Features

User & Team Management
- User registration and profiles
- Team creation and membership
- SSO integration (SAML/OIDC)
- Activity tracking
Resource Provisioning
- Jupyter notebook environments
- GPU/CPU allocation
- Storage volumes
- Environment templates
Training Job Management
- Distributed training jobs
- Job scheduling and queueing
- Monitoring and logging
- Hyperparameter tuning
Model Deployment
- REST/gRPC endpoints
- Blue-green deployments
- Autoscaling
- Version management
Resource Quotas
- Per-team quotas
- Priority scheduling
- Cost tracking
- Overage alerts
Platform APIs
- RESTful API
- gRPC API for performance
- WebSocket for real-time updates
- Comprehensive documentation

Technical Stack

Backend: Python 3.11+, FastAPI, gRPC
Database: PostgreSQL, Redis
Infrastructure: Kubernetes, Helm
Monitoring: Prometheus, Grafana
Authentication: OAuth2, JWT

Learning Outcomes

Multi-tenant platform architecture
API design and implementation
Kubernetes operator development
Resource management and quotas
Platform observability

Deliverables

Working platform API
Kubernetes operator
Multi-tenant resource management
Documentation and SDK
Test suite (>80% coverage)

Project 02: Enterprise Feature Store Implementation

Duration: 4 weeks (120 hours) | Weeks: 14-17

Project Overview

Build a production-grade feature store with online/offline serving, feature versioning, lineage tracking, and monitoring.

Key Features

Feature Registry
- Feature definition and registration
- Versioning and lineage
- Discovery and search
- Metadata management
Offline Feature Store
- Batch retrieval
- Point-in-time correct joins
- S3/Parquet storage
- Backfilling
Online Feature Store
- Low-latency serving (<10ms)
- Redis-based cache
- Feature materialization
- Multi-key retrieval
Feature Transformation
- Python SDK
- Aggregations
- Validation
- Custom functions
Data Ingestion
- Batch ingestion
- Streaming (Kafka)
- Schema validation
- Error handling
Feature Monitoring
- Drift detection
- Quality metrics
- Freshness monitoring
- Alerting

Technical Stack

Backend: Python 3.11+, FastAPI
Storage: Redis, S3, PostgreSQL
Processing: Apache Spark
Streaming: Kafka
Monitoring: Prometheus

Learning Outcomes

Feature store architecture
Online/offline store design
Point-in-time correctness
Real-time data pipelines
Data quality monitoring

Deliverables

Feature registry service
Online store (Redis)
Offline store (S3)
Transformation SDK
Monitoring dashboards

Project 03: ML Workflow Orchestration Platform

Duration: 4 weeks (120 hours) | Weeks: 18-21

Project Overview

Build a comprehensive workflow orchestration system for ML pipelines with DAG execution, scheduling, and monitoring.

Key Features

Workflow Definition
- Python SDK
- Task operators
- Dependencies and branching
- Parameterization
- Templates
Scheduling
- Cron-based schedules
- Event-driven triggers
- Manual execution
- Backfilling
Execution Management
- Task queue
- Parallel execution
- Resource allocation
- Retry logic
- Cancellation
Dependency Management
- Task dependencies
- Cross-DAG dependencies
- External dependencies
- Versioning
Monitoring
- Real-time status
- Execution history
- Gantt charts
- Log aggregation
- Analytics
Error Handling
- Retries with backoff
- Dead letter queue
- Alerting
- Debugging tools
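
The "retries with backoff" feature above can be sketched as follows. This is a minimal example assuming exponential backoff with full jitter. The function names, the default values, and the injectable `sleep` parameter are choices made for this illustration, not part of the project specification.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Upper bound on the wait before each retry: base * 2**n, capped at `cap`.

    There is one fewer delay than attempts, since nothing follows the last try.
    """
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on any exception until attempts are exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the backoff bound so
            # that many failing tasks do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps tests fast, because a test can supply a no-op instead of actually waiting. A real orchestrator would typically retry only on errors known to be transient and send tasks that exhaust their attempts to the dead letter queue.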

Technical Stack

Backend: Python 3.11+, FastAPI
Execution: Kubernetes, Celery
Database: PostgreSQL, Redis
Monitoring: Prometheus, Grafana
Frontend: React (optional)

Learning Outcomes

DAG-based workflow design
Distributed task execution
Scheduling algorithms
Workflow monitoring
Error handling patterns

Deliverables

Workflow SDK
DAG scheduler
Kubernetes executor
Monitoring UI
Integration with platform

Project 04: Model Registry & Management

Duration: 4 weeks (120 hours) | Weeks: 22-25

Project Overview

Build a centralized model registry for versioning, metadata management, lifecycle tracking, and governance.

Key Features

Model Registry
- Version management
- Artifact storage
- Metadata tracking
- Discovery
Lifecycle Management
- Staging/production stages
- Promotion workflows
- Approval gates
- Deprecation
Lineage Tracking
- Training data lineage
- Code versioning
- Hyperparameters
- Reproducibility
Model Deployment
- Blue-green deployments
- Canary releases
- A/B testing
- Rollback
Model Monitoring
- Performance metrics
- Prediction logging
- Drift detection
- Alerts
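
Canary releases and A/B tests both need a way to split traffic between model versions. A common approach, sketched here under assumptions, is to hash a stable request key into a bucket so that the same caller always reaches the same version. The function name `route` and the labels `"canary"` and `"stable"` are hypothetical.

```python
import hashlib


def route(request_key: str, canary_percent: int) -> str:
    """Deterministically assign a request key to 'canary' or 'stable'.

    The key (for example a user ID) is hashed into one of 100 buckets.
    Buckets below `canary_percent` go to the canary version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


# The same key always gets the same answer for a given percentage.
assignment = route("user-42", canary_percent=10)
```

Because a key's bucket never changes, raising the percentage from 10 to 50 keeps every existing canary user on the canary and only adds new ones. Rolling back is a matter of setting the percentage to 0.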

Technical Stack

Backend: Python 3.11+, FastAPI
Storage: S3, PostgreSQL
Deployment: Kubernetes
Monitoring: Prometheus
ML: MLflow (extended)

Learning Outcomes

Model versioning
Lifecycle management
Lineage tracking
Deployment strategies
Model monitoring

Deliverables

Model registry service
Lifecycle workflows
Lineage system
Deployment engine
Monitoring dashboards

Project 05: Developer Portal & SDK

Duration: 4 weeks (120 hours) | Weeks: 26-29

Project Overview

Build a comprehensive developer portal with documentation, SDK, CLI, and tutorials.

Key Features

Python SDK
- Platform client
- Type hints
- Async support
- Error handling
CLI Tool
- Platform operations
- Configuration management
- Output formatting
- Shell completions
Developer Portal
- Interactive documentation
- API playground
- Tutorials
- Code examples
Onboarding
- Getting started guides
- Video tutorials
- Sample projects
- Templates
Analytics
- Usage tracking
- Adoption metrics
- Feedback collection
- Success metrics
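
To make the SDK items above (platform client, type hints, error handling) more concrete, here is a minimal sketch of a typed client with a small exception hierarchy. Everything named here is an assumption for illustration: the `/models/{name}` path, the `detail` field in error payloads, and the `PlatformClient`, `Model`, and `Transport` names. The transport is injected so that the sketch runs without a network. Async support is left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# A transport takes (method, path, json_body) and returns (status_code, json_body).
Transport = Callable[[str, str, Optional[dict]], tuple]


class PlatformError(Exception):
    """Base class for all errors raised by the (hypothetical) SDK."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


class NotFoundError(PlatformError):
    """Raised when the platform answers with HTTP 404."""


@dataclass(frozen=True)
class Model:
    name: str
    version: int


class PlatformClient:
    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def _request(self, method: str, path: str,
                 body: Optional[dict] = None) -> dict[str, Any]:
        status, payload = self._transport(method, path, body)
        # Map HTTP status codes to typed exceptions so callers can
        # catch specific failures instead of inspecting status codes.
        if status == 404:
            raise NotFoundError(status, payload.get("detail", "not found"))
        if status >= 400:
            raise PlatformError(status, payload.get("detail", "request failed"))
        return payload

    def get_model(self, name: str) -> Model:
        data = self._request("GET", f"/models/{name}")
        return Model(name=data["name"], version=int(data["version"]))


def fake_transport(method: str, path: str, body: Optional[dict]) -> tuple:
    """In-memory stand-in for an HTTP layer, used only in this example."""
    if path == "/models/churn":
        return 200, {"name": "churn", "version": 3}
    return 404, {"detail": "model not found"}


client = PlatformClient(fake_transport)
print(client.get_model("churn"))
```

In a real SDK the transport would wrap an HTTP library such as httpx. Injecting it keeps the client easy to unit test, which also supports the testing criteria in the assessment framework.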

Technical Stack

SDK: Python 3.11+, httpx
CLI: Typer or Click
Frontend: React, TypeScript
Docs: Docusaurus or MkDocs
Analytics: PostHog or Mixpanel

Learning Outcomes

SDK design
CLI development
Documentation best practices
Developer experience
Adoption metrics

Deliverables

Python SDK
CLI tool
Developer portal
Interactive tutorials
Usage analytics

Assessment Framework

Module Assessments (9 total)

Format: Multiple choice, short answer, practical coding

Passing Score: 80%

Time: 30-45 minutes per quiz

Weight: 30% of final grade

Project Assessments (5 total)

Evaluation Criteria:

Functional Completeness (40%)
- All required features implemented
- Features work as specified
- Edge cases handled
Code Quality (25%)
- Clean, readable code
- Proper error handling
- Type hints used
- No code smells
Testing (15%)
- Unit tests (>80% coverage)
- Integration tests
- Tests pass consistently
Documentation (10%)
- Comprehensive README
- API documented
- Architecture explained
Best Practices (10%)
- Security considerations
- Performance optimization
- Scalability design

Weight: 60% of final grade

Capstone Project (1 total)

Challenge: Design and implement a complete ML platform from scratch

Duration: 2 weeks

Weight: 10% of final grade
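
The weights above combine into a final grade as a weighted average. The sketch below shows the arithmetic under the assumption that each component is scored on a 0-100 scale. The function and dictionary names are illustrative, and the sample scores are made up.

```python
# Weights taken from the assessment framework above.
FINAL_WEIGHTS = {"modules": 0.30, "projects": 0.60, "capstone": 0.10}
PROJECT_RUBRIC = {
    "functional_completeness": 0.40,
    "code_quality": 0.25,
    "testing": 0.15,
    "documentation": 0.10,
    "best_practices": 0.10,
}


def weighted_score(weights: dict[str, float], scores: dict[str, float]) -> float:
    """Weighted average of `scores` (each 0-100) using `weights` that sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores: {sorted(missing)}")
    return sum(weights[key] * scores[key] for key in weights)


# Made-up example: one project scored against the rubric.
project = weighted_score(PROJECT_RUBRIC, {
    "functional_completeness": 90,
    "code_quality": 80,
    "testing": 70,
    "documentation": 100,
    "best_practices": 60,
})
# Made-up example: final grade from the three components.
final = weighted_score(FINAL_WEIGHTS, {"modules": 90, "projects": 80, "capstone": 70})
print(round(project, 2), round(final, 2))
```

Because projects carry 60% of the final grade, a ten-point change in the project average moves the final grade by six points, while the same change in the capstone moves it by one.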

Study Plans

Full-Time Study Plan (3-4 months)

Week 1-2: Modules 01-03 (Foundations)
Week 3-4: Modules 04-06 (Core competencies part 1)
Week 5-6: Modules 07-09 (Core competencies part 2)
Week 7-10: Project 01 (Platform Core)
Week 11-14: Project 02 (Feature Store)
Week 15-18: Project 03 (Workflow Orchestration)
Week 19-22: Project 04 (Model Registry)
Week 23-26: Project 05 (Developer Portal)
Week 27-28: Capstone Project

Part-Time Study Plan (6-8 months)

Month 1: Modules 01-03
Month 2: Modules 04-06
Month 3: Modules 07-09 + Start Project 01
Month 4: Complete Project 01 + Start Project 02
Month 5: Complete Project 02 + Start Project 03
Month 6: Complete Project 03 + Start Project 04
Month 7: Complete Project 04 + Start Project 05
Month 8: Complete Project 05 + Capstone

Skills Matrix

By completing this curriculum, you will achieve:

Skill Area	Proficiency Level
Python Programming	Advanced
Kubernetes	Advanced
API Design (REST, gRPC)	Expert
Multi-Tenancy	Advanced
Feature Stores	Expert
Workflow Orchestration	Expert
Model Management	Advanced
Developer Experience	Advanced
Observability	Advanced
Security & Governance	Advanced
System Design	Advanced
Technical Leadership	Intermediate

Technology Stack

Core Technologies

Language: Python 3.11+
Web Framework: FastAPI, gRPC
Databases: PostgreSQL, Redis, MongoDB
Orchestration: Kubernetes, Helm
Storage: S3 (AWS/MinIO), Parquet
Streaming: Apache Kafka
Processing: Apache Spark (optional)
Monitoring: Prometheus, Grafana
Logging: ELK or Loki
Tracing: Jaeger or Zipkin

Optional Technologies

IaC: Terraform, Pulumi
CI/CD: GitHub Actions, GitLab CI
Service Mesh: Istio, Linkerd
ML Frameworks: PyTorch, TensorFlow
Feature Store: Feast
Workflow: Airflow, Kubeflow
Model Registry: MLflow

Environment Setup

Local Development

Requirements:

Docker Desktop
Kubernetes (minikube or kind)
Python 3.11+
kubectl, Helm
PostgreSQL
Redis

Setup Steps: See resources/tools.md

Cloud Development

Recommended:

AWS: EKS, RDS, ElastiCache, S3
GCP: GKE, Cloud SQL, Memorystore, GCS
Azure: AKS, Azure Database, Azure Cache, Blob Storage

Estimated Cost: $50-100/month for development

Learning Resources

Books

"Designing Machine Learning Systems" by Chip Huyen
"Building Machine Learning Powered Applications" by Emmanuel Ameisen
"Kubernetes Patterns" by Bilgin Ibryam & Roland Huß

Online Courses

Kubernetes for Developers (Linux Foundation)
Machine Learning Engineering for Production (Coursera)
Advanced REST APIs (Udemy)

Documentation

Feast Documentation
Kubeflow Documentation
MLflow Documentation

Career Advancement

Upon completing this curriculum, you'll be qualified for:

ML Platform Engineer roles at tech companies
Senior AI Infrastructure Engineer positions
Transition to MLOps Engineer or ML Architect
Consulting opportunities in ML infrastructure

Next Steps:

Build portfolio showcasing projects
Contribute to open-source ML infrastructure
Write blog posts and give talks
Apply to ML platform roles

Last Updated: 2025-10-18 | Version: 1.0.0
