Curriculum Index - AI Infrastructure Engineer Solutions

Complete catalog of all exercises in the mid-level track

This index provides a comprehensive overview of all 26 exercises, their relationships, difficulty levels, time estimates, and skill alignments.

📊 Quick Statistics

Metric	Value
Total Modules	10
Total Exercises	26
Difficulty Range	Intermediate → Advanced
Total Learning Time	200-280 hours
Prerequisites	Junior Engineer track
Target Role Level	L4-L5 (Mid-Level)

Module Overview

Summary Table

Module	Code	Exercises	Total Hours	Primary Focus
Foundations	mod-101	3	18-24	Advanced tooling & frameworks
Cloud Computing	mod-102	3	24-30	Multi-cloud & cost optimization
Containerization	mod-103	3	20-26	Security & optimization
Kubernetes	mod-104	3	26-34	Advanced orchestration
Data Pipelines	mod-105	2	22-28	Streaming & workflow
MLOps	mod-106	3	24-32	Production ML lifecycle
GPU Computing	mod-107	3	28-36	GPU management & optimization
Monitoring	mod-108	2	20-26	Observability & ML metrics
Infrastructure as Code	mod-109	2	20-26	Terraform & Pulumi
LLM Infrastructure	mod-110	2	38-48	LLM serving & RAG

Complete Exercise Catalog

mod-101: Foundations

Advanced tools and frameworks for ML infrastructure development.

Exercise 04: Python Environment Manager

Difficulty: ⭐⭐⭐ Intermediate Time: 6-8 hours Prerequisites: Python, virtualenv, Docker

Description: Build an automated Python environment management tool that handles virtual environments, dependency management, and containerization for ML projects.

Key Technologies:

Python: subprocess, pathlib, typing
Environment: pyenv, Poetry, Conda
Containers: Docker, docker-compose

Learning Objectives:

Automate environment setup workflows
Integrate multiple environment managers
Build CLI tools with Click
Handle complex dependency resolution
Containerize Python environments

Deliverables:

CLI tool for environment management
Support for venv, Poetry, Conda
Docker integration
Configuration file support
Comprehensive test suite

Career Skills: Developer tooling, automation, dependency management

Exercise 05: ML Framework Benchmark

Difficulty: ⭐⭐⭐ Intermediate Time: 6-8 hours Prerequisites: PyTorch, TensorFlow, basic ML

Description: Create a comprehensive benchmarking suite to compare ML frameworks (PyTorch, TensorFlow, JAX) across different model architectures and hardware configurations.

Key Technologies:

Frameworks: PyTorch, TensorFlow, JAX
Benchmarking: timeit, memory_profiler
Visualization: Plotly, Matplotlib
Models: CNN, Transformer, LSTM

Learning Objectives:

Design fair benchmarking methodologies
Measure training and inference performance
Profile memory usage
Analyze GPU utilization
Generate comparative reports

Deliverables:

Benchmark suite for multiple frameworks
Performance metrics collection
Memory profiling
Interactive visualizations
Automated report generation

Career Skills: Performance optimization, framework evaluation, technical analysis

Exercise 06: FastAPI ML Template Generator

Difficulty: ⭐⭐⭐ Intermediate Time: 6-8 hours Prerequisites: FastAPI, Jinja2, project structure

Description: Build a code generation tool that creates standardized FastAPI-based ML serving projects with best practices, monitoring, and deployment configurations.

Key Technologies:

Framework: FastAPI
Templating: Jinja2
Project: Cookiecutter patterns
Standards: OpenAPI, Pydantic

Learning Objectives:

Implement code generation patterns
Design reusable project templates
Apply API best practices
Standardize ML serving patterns
Create scaffolding tools

Deliverables:

Template generator CLI
Multiple project templates
Configuration customization
Generated projects are deployable
Documentation generator

Career Skills: Code generation, standardization, API design, tooling

mod-102: Cloud Computing

Multi-cloud infrastructure management and optimization.

Exercise 01: Multi-Cloud Cost Analyzer

Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 8-10 hours Prerequisites: AWS/GCP/Azure APIs, cost concepts

Description: Build a comprehensive cost analysis tool that aggregates, analyzes, and visualizes spending across AWS, GCP, and Azure, with optimization recommendations.

Key Technologies:

Cloud APIs: Boto3, Google Cloud, Azure SDK
Analysis: Pandas, NumPy
Visualization: Plotly, Dash
Reporting: Jinja2

Learning Objectives:

Integrate multiple cloud provider APIs
Aggregate cost data across clouds
Identify cost optimization opportunities
Build interactive dashboards
Generate automated reports

Deliverables:

Multi-cloud cost aggregation
Cost anomaly detection
Optimization recommendations
Interactive dashboards
Scheduled reporting

Career Skills: Cloud cost management, FinOps, multi-cloud expertise, data analysis

Exercise 02: Cloud ML Infrastructure

Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 8-10 hours Prerequisites: Terraform, cloud platforms

Description: Deploy a complete ML infrastructure stack to multiple cloud providers using Terraform, including compute, storage, networking, and ML-specific services.

Key Technologies:

IaC: Terraform
Clouds: AWS (SageMaker), GCP (Vertex AI), Azure (Azure ML)
Networking: VPCs, subnets, security groups
Storage: S3, GCS, Blob Storage

Learning Objectives:

Write cloud-agnostic Terraform modules
Deploy to multiple cloud providers
Configure ML-specific services
Implement security best practices
Manage infrastructure state

Deliverables:

Terraform modules for each cloud
ML infrastructure deployment
Network configuration
Security setup (IAM, service accounts)
Documentation and runbooks

Career Skills: Multi-cloud deployment, Terraform expertise, ML infrastructure, security

Exercise 03: Disaster Recovery

Difficulty: ⭐⭐⭐⭐ Advanced Time: 8-12 hours Prerequisites: Cloud platforms, databases, backup strategies

Description: Implement a comprehensive disaster recovery solution for ML infrastructure, including backup automation, failover mechanisms, and recovery testing.

Key Technologies:

Backup: Cloud-native backup services
DR: Cross-region replication, failover
Testing: Chaos engineering, DR drills
Monitoring: Recovery time objectives (RTO/RPO)

Learning Objectives:

Design DR strategies for ML systems
Implement automated backups
Configure cross-region replication
Test recovery procedures
Measure and optimize RTO/RPO

Deliverables:

Automated backup system
Failover automation
DR testing framework
Recovery playbooks
RTO/RPO monitoring

Career Skills: Disaster recovery, business continuity, reliability engineering

mod-103: Containerization

Advanced container security, optimization, and registry management.

Exercise 04: Container Security

Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 7-9 hours Prerequisites: Docker, security concepts

Description: Build a comprehensive container security scanning and compliance tool that checks images for vulnerabilities, misconfigurations, and generates SBOMs.

Key Technologies:

Scanning: Trivy, Grype
Compliance: CIS Docker Benchmark
SBOM: Syft
Reporting: JSON, HTML reports

Learning Objectives:

Implement vulnerability scanning
Generate Software Bill of Materials
Check CIS compliance
Detect secrets in images
Create security scorecards

Deliverables:

Automated security scanner
Multi-tool integration
SBOM generation
Compliance reporting
CI/CD integration

Career Skills: Container security, compliance, supply chain security, DevSecOps

Exercise 05: Image Optimizer

Difficulty: ⭐⭐⭐ Intermediate Time: 6-8 hours Prerequisites: Docker, multi-stage builds

Description: Create a tool that analyzes and optimizes Docker images for size, layer count, and build time, with automated recommendations and refactoring.

Key Technologies:

Docker: Multi-stage builds, BuildKit
Analysis: Dive, docker-slim
Optimization: Layer caching, .dockerignore
Comparison: Size and performance metrics

Learning Objectives:

Analyze Docker image layers
Optimize image size
Reduce build times
Implement build caching
Generate optimization reports

Deliverables:

Image analysis tool
Optimization recommendations
Automated refactoring (optional)
Before/after comparisons
Best practices checker

Career Skills: Docker optimization, build engineering, efficiency

Exercise 06: Registry Manager

Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 7-9 hours Prerequisites: Container registries, image signing

Description: Build a private container registry management system with image signing, scanning, retention policies, and replication across multiple registries.

Key Technologies:

Registries: Harbor, ECR, GCR, ACR
Signing: Cosign, Notary
Policies: Retention, garbage collection
Replication: Cross-registry sync

Learning Objectives:

Deploy and manage Harbor registry
Implement image signing and verification
Configure retention policies
Set up cross-registry replication
Automate registry operations

Deliverables:

Registry deployment (Harbor)
Image signing pipeline
Retention policy automation
Replication configuration
Registry monitoring

Career Skills: Registry management, image signing, supply chain security

mod-104: Kubernetes

Advanced Kubernetes cluster management and patterns.

Exercise 04: K8s Cluster Autoscaler

Difficulty: ⭐⭐⭐⭐ Advanced Time: 9-11 hours Prerequisites: Kubernetes, HPA, metrics

Description: Implement intelligent autoscaling for Kubernetes clusters using HPA, VPA, and Cluster Autoscaler, with custom metrics and cost optimization.

Key Technologies:

Autoscaling: HPA, VPA, Cluster Autoscaler
Metrics: Prometheus, custom metrics
Optimization: Cost-aware scheduling
Monitoring: Grafana dashboards

Learning Objectives:

Configure Horizontal Pod Autoscaler
Implement Vertical Pod Autoscaler
Deploy Cluster Autoscaler
Use custom metrics for scaling
Optimize for cost and performance

Deliverables:

Complete autoscaling setup
Custom metrics integration
Cost optimization policies
Monitoring dashboards
Testing and validation

Career Skills: Kubernetes autoscaling, resource optimization, cost management

Exercise 05: Service Mesh Observability

Difficulty: ⭐⭐⭐⭐ Advanced Time: 8-11 hours Prerequisites: Kubernetes, service mesh concepts

Description: Deploy and configure a service mesh (Istio or Linkerd) for microservices observability, including distributed tracing, traffic management, and security.

Key Technologies:

Service Mesh: Istio or Linkerd
Tracing: Jaeger, Zipkin
Monitoring: Prometheus, Grafana
Traffic: Canary, blue-green

Learning Objectives:

Deploy service mesh to Kubernetes
Configure distributed tracing
Implement traffic management
Set up mutual TLS
Monitor service mesh metrics

Deliverables:

Service mesh deployment
Distributed tracing setup
Traffic management policies
mTLS configuration
Observability dashboards

Career Skills: Service mesh, distributed tracing, microservices, advanced networking

Exercise 06: K8s Operator Framework

Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 9-12 hours Prerequisites: Kubernetes, Python, CRDs

Description: Build custom Kubernetes operators using Kopf framework to automate ML model deployment, scaling, and lifecycle management.

Key Technologies:

Framework: Kopf (Python)
K8s: Custom Resource Definitions
Controllers: Reconciliation loops
API: Kubernetes Python client

Learning Objectives:

Design custom resource definitions
Implement operator reconciliation logic
Handle resource lifecycle events
Manage dependent resources
Test operators thoroughly

Deliverables:

Custom operator for ML models
CRD definitions
Reconciliation logic
Event handling
Operator testing

Career Skills: Kubernetes operators, CRDs, automation, platform engineering

mod-105: Data Pipelines

Real-time streaming and workflow orchestration.

Exercise 03: Streaming Pipeline Kafka

Difficulty: ⭐⭐⭐⭐ Advanced Time: 11-14 hours Prerequisites: Kafka, streaming concepts, Spark

Description: Build a real-time streaming data pipeline using Kafka for ingestion, Spark for processing, and various sinks for ML model training and inference.

Key Technologies:

Streaming: Apache Kafka
Processing: PySpark, Flink (optional)
Storage: S3, databases
Monitoring: Kafka metrics

Learning Objectives:

Design streaming architectures
Implement Kafka producers/consumers
Process streams with Spark
Handle backpressure and failures
Monitor pipeline health

Deliverables:

End-to-end streaming pipeline
Data quality checks
Exactly-once semantics
Failure recovery
Monitoring and alerting

Career Skills: Stream processing, Kafka, real-time systems, data engineering

Exercise 04: Workflow Orchestration Airflow

Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 11-14 hours Prerequisites: Airflow, DAGs, Python

Description: Create complex ML workflow orchestration using Apache Airflow, including data pipelines, training jobs, model deployment, and monitoring.

Key Technologies:

Orchestration: Apache Airflow
Tasks: PythonOperator, KubernetesPodOperator
Scheduling: Cron, sensors
Monitoring: Airflow UI, alerts

Learning Objectives:

Design DAG structures for ML workflows
Implement task dependencies
Use dynamic DAG generation
Configure sensors and triggers
Monitor workflow execution

Deliverables:

Complete ML workflow DAGs
Data validation tasks
Training and deployment tasks
Monitoring and alerting
SLA tracking

Career Skills: Workflow orchestration, Airflow, MLOps, data pipeline engineering

mod-106: MLOps

Production ML lifecycle management.

Exercise 04: Experiment Tracking MLflow

Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 8-10 hours Prerequisites: MLflow, ML frameworks, PostgreSQL

Description: Set up a production MLflow deployment for experiment tracking, model registry, and artifact storage, integrated with training pipelines.

Key Technologies:

MLflow: Tracking, Registry, Projects
Storage: S3, PostgreSQL
Integration: PyTorch, TensorFlow
Deployment: Docker, Kubernetes

Learning Objectives:

Deploy MLflow server infrastructure
Track experiments programmatically
Manage model lifecycle in registry
Integrate with training code
Automate model promotion

Deliverables:

MLflow server deployment
Experiment tracking integration
Model registry setup
Artifact storage (S3)
Model promotion workflows

Career Skills: Experiment tracking, MLOps, model management, ML infrastructure

Exercise 05: Model Monitoring Drift

Difficulty: ⭐⭐⭐⭐ Advanced Time: 8-11 hours Prerequisites: ML monitoring, statistics, Evidently

Description: Implement comprehensive model monitoring using Evidently to detect data drift, model drift, and performance degradation in production.

Key Technologies:

Monitoring: Evidently
Metrics: Custom ML metrics
Alerting: Prometheus, alertmanager
Dashboards: Grafana

Learning Objectives:

Implement drift detection algorithms
Monitor model performance metrics
Set up automated alerting
Create monitoring dashboards
Trigger retraining on drift

Deliverables:

Drift detection system
Performance monitoring
Automated alerts
Grafana dashboards
Retraining triggers

Career Skills: ML monitoring, drift detection, production ML, data science

Exercise 06: CI/CD ML Pipelines

Difficulty: ⭐⭐⭐⭐ Advanced Time: 8-11 hours Prerequisites: GitHub Actions, DVC, testing

Description: Build automated CI/CD pipelines for ML projects including data versioning, model testing, deployment, and rollback capabilities.

Key Technologies:

CI/CD: GitHub Actions, GitLab CI
Versioning: DVC, Git LFS
Testing: pytest, Great Expectations
Deployment: Kubernetes, Docker

Learning Objectives:

Design ML-specific CI/CD workflows
Version datasets with DVC
Implement model testing
Automate deployment pipelines
Handle rollbacks and monitoring

Deliverables:

Complete CI/CD workflows
Data versioning setup
Automated testing suite
Deployment automation
Rollback mechanisms

Career Skills: CI/CD, MLOps, automation, DevOps for ML

mod-107: GPU Computing

GPU cluster management and optimization.

Exercise 04: GPU Cluster Management

Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 9-12 hours Prerequisites: Kubernetes, NVIDIA GPUs, multi-tenancy

Description: Deploy and manage a multi-tenant GPU cluster using NVIDIA GPU Operator, with resource quotas, monitoring, and cost tracking.

Key Technologies:

GPUs: NVIDIA GPU Operator
Kubernetes: Device plugins, taints/tolerations
Monitoring: dcgm-exporter, Prometheus
Scheduling: Node selectors, affinity

Learning Objectives:

Deploy NVIDIA GPU Operator
Configure GPU device plugins
Implement multi-tenancy with quotas
Monitor GPU utilization
Track GPU costs per team

Deliverables:

GPU cluster deployment
Multi-tenancy configuration
Resource quotas and limits
GPU monitoring dashboards
Cost allocation system

Career Skills: GPU infrastructure, Kubernetes, multi-tenancy, cost management

Exercise 05: GPU Performance Optimization

Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 9-12 hours Prerequisites: CUDA, PyTorch, profiling

Description: Optimize GPU utilization for ML workloads through profiling, memory optimization, and kernel tuning, with comprehensive benchmarking.

Key Technologies:

Profiling: NVIDIA Nsight, PyTorch Profiler
Optimization: Mixed precision, gradient accumulation
Monitoring: GPU metrics, memory usage
Frameworks: PyTorch, TensorFlow

Learning Objectives:

Profile GPU workloads
Optimize memory usage
Implement mixed-precision training
Tune batch sizes and workers
Benchmark optimizations

Deliverables:

Profiling framework
Optimization recommendations
Memory optimization techniques
Performance benchmarks
Best practices documentation

Career Skills: GPU optimization, performance tuning, CUDA, ML acceleration

Exercise 06: Distributed GPU Training

Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 10-12 hours Prerequisites: Distributed systems, PyTorch DDP, Ray

Description: Implement distributed training across multiple GPUs and nodes using Ray, Horovod, and PyTorch DistributedDataParallel.

Key Technologies:

Distributed: Ray, Horovod
PyTorch: DistributedDataParallel (DDP)
Communication: NCCL, Gloo
Orchestration: Kubernetes

Learning Objectives:

Implement data parallelism
Configure distributed training
Optimize communication overhead
Handle fault tolerance
Scale training across nodes

Deliverables:

Distributed training setup
Multi-node configuration
Fault tolerance mechanisms
Performance optimization
Scaling benchmarks

Career Skills: Distributed training, Ray, PyTorch DDP, large-scale ML

mod-108: Monitoring & Observability

Production observability for ML systems.

Exercise 01: Observability Stack

Difficulty: ⭐⭐⭐⭐ Advanced Time: 10-13 hours Prerequisites: Prometheus, Grafana, Kubernetes

Description: Deploy a comprehensive observability stack with Prometheus, Grafana, Jaeger, and ELK for metrics, tracing, and logging of ML systems.

Key Technologies:

Metrics: Prometheus, Grafana
Tracing: Jaeger, OpenTelemetry
Logging: Elasticsearch, Fluentd, Kibana
Alerting: Alertmanager

Learning Objectives:

Deploy full observability stack
Configure metric collection
Set up distributed tracing
Implement centralized logging
Create comprehensive dashboards

Deliverables:

Complete observability stack
Custom metrics collection
Distributed tracing setup
Centralized logging
Alert rules and dashboards

Career Skills: Observability, SRE, monitoring, distributed systems

Exercise 02: ML Model Monitoring

Difficulty: ⭐⭐⭐⭐ Advanced Time: 10-13 hours Prerequisites: Prometheus, ML serving, statistics

Description: Implement ML-specific monitoring including prediction latency, model accuracy, feature distributions, and business metrics.

Key Technologies:

Monitoring: Prometheus, Grafana
ML Metrics: Custom exporters
Analysis: Statistical tests
Alerting: Smart thresholds

Learning Objectives:

Define ML-specific metrics
Implement custom exporters
Create ML dashboards
Set up intelligent alerting
Monitor model health

Deliverables:

ML metrics collection
Custom Prometheus exporters
Grafana dashboards for ML
Alerting rules
Model health scorecards

Career Skills: ML monitoring, metrics design, observability, SRE for ML

mod-109: Infrastructure as Code

Advanced IaC with Terraform and Pulumi.

Exercise 01: Terraform ML Infrastructure

Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 10-13 hours Prerequisites: Terraform, cloud platforms, modules

Description: Build reusable Terraform modules for deploying ML infrastructure across AWS, GCP, and Azure, with state management and testing.

Key Technologies:

IaC: Terraform 1.5+
Clouds: AWS, GCP, Azure
Testing: Terratest, tflint
State: Remote backends, locking

Learning Objectives:

Design reusable Terraform modules
Implement multi-cloud deployments
Manage state securely
Test infrastructure code
Version and publish modules

Deliverables:

ML infrastructure modules
Multi-cloud deployments
State management setup
Module testing
Documentation and examples

Career Skills: Terraform, IaC, multi-cloud, DevOps

Exercise 02: Pulumi Multi-Cloud ML

Difficulty: ⭐⭐⭐⭐ Advanced Time: 10-13 hours Prerequisites: Pulumi, Python, cloud platforms

Description: Implement cloud-agnostic ML infrastructure using Pulumi with Python, including component resources and automation API.

Key Technologies:

IaC: Pulumi with Python
Clouds: AWS, GCP, Azure
Automation: Pulumi Automation API
Testing: Python unit tests

Learning Objectives:

Write IaC in Python with Pulumi
Create component resources
Use Pulumi Automation API
Test infrastructure code
Implement GitOps workflows

Deliverables:

Pulumi programs for ML infra
Component resources
Automation API usage
Infrastructure testing
GitOps integration

Career Skills: Pulumi, programmatic IaC, Python, multi-cloud

mod-110: LLM Infrastructure

Production LLM serving and RAG systems.

Exercise 01: Production LLM Serving

Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 19-24 hours Prerequisites: LLMs, vLLM, GPU optimization

Description: Deploy optimized LLM serving infrastructure using vLLM or TensorRT-LLM with GPU optimization, batching, and production monitoring.

Key Technologies:

Serving: vLLM, TensorRT-LLM
API: FastAPI, streaming
GPUs: NVIDIA A100/H100 optimization
Monitoring: Custom LLM metrics

Learning Objectives:

Deploy optimized LLM serving
Implement dynamic batching
Configure GPU settings
Optimize inference latency
Monitor LLM performance and costs

Deliverables:

Production LLM serving API
vLLM/TensorRT deployment
GPU optimization
Streaming responses
Comprehensive monitoring

Career Skills: LLM serving, GPU optimization, GenAI infrastructure

Exercise 02: Production RAG System

Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 19-24 hours Prerequisites: LangChain, vector DBs, document processing

Description: Build a production RAG (Retrieval-Augmented Generation) system with document ingestion, vector storage, and optimized retrieval.

Key Technologies:

RAG: LangChain
Vector DB: ChromaDB, Pinecone
Processing: Unstructured, PyPDF
LLM: OpenAI API or self-hosted

Learning Objectives:

Design RAG architectures
Implement document ingestion
Optimize vector search
Tune retrieval parameters
Monitor RAG quality

Deliverables:

Complete RAG system
Document processing pipeline
Vector database setup
Retrieval optimization
Quality monitoring

Career Skills: RAG systems, vector databases, LLM applications, GenAI

Difficulty Progression

Difficulty Levels Explained

⭐⭐⭐ Intermediate: Requires solid foundation, introduces new concepts
⭐⭐⭐⭐ Intermediate-Advanced: Complex integrations, production considerations
⭐⭐⭐⭐⭐ Advanced: Cutting-edge technologies, complex architectures

Recommended Learning Order

Level 1 (Intermediate):
├── mod-101: All exercises (3)
├── mod-103: Exercise 05 (1)
└── Total: 4 exercises

Level 2 (Intermediate-Advanced):
├── mod-102: Exercises 01-02 (2)
├── mod-103: Exercises 04, 06 (2)
├── mod-105: Exercise 04 (1)
├── mod-106: Exercises 04, 06 (2)
└── Total: 7 exercises

Level 3 (Advanced):
├── mod-102: Exercise 03 (1)
├── mod-104: All exercises (3)
├── mod-105: Exercise 03 (1)
├── mod-106: Exercise 05 (1)
├── mod-107: All exercises (3)
├── mod-108: All exercises (2)
├── mod-109: All exercises (2)
├── mod-110: All exercises (2)
└── Total: 15 exercises

Skill Matrix

Technical Skills by Exercise

Exercise	Cloud	K8s	Docker	ML	GPU	IaC	Monitoring	Data Eng
101-04	○	○	●	○	○	○	○	○
101-05	○	○	○	●	●	○	○	○
101-06	○	○	○	●	○	○	○	○
102-01	●	○	○	○	○	○	○	○
102-02	●	○	○	●	○	●	○	○
102-03	●	○	○	○	○	○	●	●
103-04	○	○	●	○	○	○	○	○
103-05	○	○	●	○	○	○	○	○
103-06	●	●	●	○	○	○	○	○
104-04	●	●	○	○	○	○	●	○
104-05	○	●	○	○	○	○	●	○
104-06	○	●	○	●	○	○	○	○
105-03	○	●	●	○	○	○	●	●
105-04	○	●	○	●	○	○	●	●
106-04	●	●	●	●	○	○	○	○
106-05	○	○	○	●	○	○	●	○
106-06	○	●	●	●	○	○	○	○
107-04	●	●	○	○	●	○	●	○
107-05	○	○	○	●	●	○	●	○
107-06	○	●	○	●	●	○	○	○
108-01	○	●	●	○	○	○	●	○
108-02	○	●	○	●	○	○	●	○
109-01	●	●	○	●	●	●	○	○
109-02	●	●	○	●	○	●	○	○
110-01	●	●	●	●	●	○	●	○
110-02	●	●	○	●	●	○	●	●

Legend: ● Primary skill | ○ Secondary skill

Technology Coverage

By Category

Cloud Platforms:

AWS: 8 exercises
GCP: 8 exercises
Azure: 8 exercises
Multi-cloud: 6 exercises

Container & Orchestration:

Docker: 12 exercises
Kubernetes: 16 exercises
Helm: 8 exercises
Service Mesh: 1 exercise

ML Frameworks:

PyTorch: 12 exercises
TensorFlow: 4 exercises
JAX: 1 exercise
HuggingFace: 4 exercises

MLOps Tools:

MLflow: 2 exercises
Airflow: 1 exercise
DVC: 1 exercise
Evidently: 1 exercise

Data & Streaming:

Kafka: 1 exercise
Spark: 1 exercise
PostgreSQL: 4 exercises
Redis: 2 exercises

GPU & Acceleration:

CUDA: 3 exercises
NVIDIA Tools: 3 exercises
Distributed Training: 1 exercise

Monitoring:

Prometheus: 8 exercises
Grafana: 8 exercises
Jaeger: 1 exercise
ELK Stack: 1 exercise

Infrastructure as Code:

Terraform: 2 exercises
Pulumi: 1 exercise

LLM Infrastructure:

vLLM: 1 exercise
LangChain: 1 exercise
Vector DBs: 1 exercise

Learning Paths

Path 1: Complete Mastery (All 26 exercises)

Order: Sequential (mod-101 → mod-110) Duration: 200-280 hours Best for: Comprehensive skill building

Path 2: Fast Track MLOps (12 exercises)

Order: 101{4,5,6} → 106{4,5,6} → 108{2} → 102{2} → 107{4,5} → 109{1} Duration: 110-150 hours Best for: MLOps specialists

Path 3: Platform Engineering (13 exercises)

Order: 104{4,5,6} → 103{4,5,6} → 109{1,2} → 108{1,2} → 102{1,2} → 107{4} Duration: 130-180 hours Best for: Platform engineers

Path 4: LLM Specialist (12 exercises)

Order: 110{1,2} → 107{4,5,6} → 108{2} → 104{4,5} → 102{2} → 109{1} Duration: 140-190 hours Best for: GenAI infrastructure

Time Estimates

By Time Commitment

Full-Time Study (40 hrs/week):

Complete Path: 10-14 weeks
Fast Track: 6-8 weeks
Platform: 7-9 weeks
LLM: 7-10 weeks

Part-Time Study (20 hrs/week):

Complete Path: 20-28 weeks
Fast Track: 11-15 weeks
Platform: 13-18 weeks
LLM: 14-19 weeks

Casual Study (10 hrs/week):

Complete Path: 40-56 weeks
Fast Track: 22-30 weeks
Platform: 26-36 weeks
LLM: 28-38 weeks

Prerequisites Map

Module Dependencies

mod-101 (Foundations)
    └──> mod-102 (Cloud)
            └──> mod-103 (Containers)
                    └──> mod-104 (Kubernetes)
                            ├──> mod-105 (Data Pipelines)
                            ├──> mod-106 (MLOps)
                            ├──> mod-107 (GPU)
                            │       └──> mod-110 (LLM)
                            ├──> mod-108 (Monitoring)
                            └──> mod-109 (IaC)

Skills Prerequisites

Before Starting:

✅ Junior Engineer track completed
✅ Python OOP and async
✅ Docker fundamentals
✅ Basic Kubernetes
✅ Git proficiency
✅ Linux command line
✅ Basic ML concepts

Developed During:

Advanced Kubernetes patterns
Multi-cloud expertise
GPU infrastructure
Production MLOps
LLM deployment
Advanced monitoring
Infrastructure as Code mastery

Career Alignment

By Target Role

ML Infrastructure Engineer (L4-L5):

Priority: Complete Path or Fast Track MLOps
Key modules: 106, 107, 108, 109, 110
Certifications: AWS ML, GCP ML Engineer

Platform Engineer:

Priority: Platform Engineering Path
Key modules: 103, 104, 108, 109
Certifications: CKA, CKAD, Terraform

MLOps Engineer:

Priority: Fast Track MLOps Path
Key modules: 105, 106, 108
Certifications: AWS ML, MLOps practices

SRE - ML Systems:

Priority: Complete Path with focus on 108
Key modules: 104, 107, 108, 109
Certifications: CKA, Prometheus

LLM Infrastructure Specialist:

Priority: LLM Specialist Path
Key modules: 107, 110
Certifications: NVIDIA DLI, cloud certifications

Appendix: Quick Reference

Exercise Code Reference

Code	Exercise Name
101-04	Python Environment Manager
101-05	ML Framework Benchmark
101-06	FastAPI ML Template Generator
102-01	Multi-Cloud Cost Analyzer
102-02	Cloud ML Infrastructure
102-03	Disaster Recovery
103-04	Container Security
103-05	Image Optimizer
103-06	Registry Manager
104-04	K8s Cluster Autoscaler
104-05	Service Mesh Observability
104-06	K8s Operator Framework
105-03	Streaming Pipeline Kafka
105-04	Workflow Orchestration Airflow
106-04	Experiment Tracking MLflow
106-05	Model Monitoring Drift
106-06	CI/CD ML Pipelines
107-04	GPU Cluster Management
107-05	GPU Performance Optimization
107-06	Distributed GPU Training
108-01	Observability Stack
108-02	ML Model Monitoring
109-01	Terraform ML Infrastructure
109-02	Pulumi Multi-Cloud ML
110-01	Production LLM Serving
110-02	Production RAG System

Use this index to navigate the curriculum and plan your learning journey! 📚

Last Updated: October 25, 2025 Version: 1.0

FilesExpand file tree

CURRICULUM_INDEX.md

Latest commit

History

CURRICULUM_INDEX.md

File metadata and controls

Curriculum Index - AI Infrastructure Engineer Solutions

📊 Quick Statistics

📚 Table of Contents

Module Overview

Summary Table

Complete Exercise Catalog

mod-101: Foundations

Exercise 04: Python Environment Manager

Exercise 05: ML Framework Benchmark

Exercise 06: FastAPI ML Template Generator

mod-102: Cloud Computing

Exercise 01: Multi-Cloud Cost Analyzer

Exercise 02: Cloud ML Infrastructure

Exercise 03: Disaster Recovery

mod-103: Containerization

Exercise 04: Container Security

Exercise 05: Image Optimizer

Exercise 06: Registry Manager

mod-104: Kubernetes

Exercise 04: K8s Cluster Autoscaler

Exercise 05: Service Mesh Observability

Exercise 06: K8s Operator Framework

mod-105: Data Pipelines

Exercise 03: Streaming Pipeline Kafka

Exercise 04: Workflow Orchestration Airflow

mod-106: MLOps

Exercise 04: Experiment Tracking MLflow

Exercise 05: Model Monitoring Drift

Exercise 06: CI/CD ML Pipelines

mod-107: GPU Computing

Exercise 04: GPU Cluster Management

Exercise 05: GPU Performance Optimization

Exercise 06: Distributed GPU Training

mod-108: Monitoring & Observability

Exercise 01: Observability Stack

Exercise 02: ML Model Monitoring

mod-109: Infrastructure as Code

Exercise 01: Terraform ML Infrastructure

Exercise 02: Pulumi Multi-Cloud ML

mod-110: LLM Infrastructure

Exercise 01: Production LLM Serving

Exercise 02: Production RAG System

Difficulty Progression

Difficulty Levels Explained

Recommended Learning Order

Skill Matrix

Technical Skills by Exercise

Technology Coverage

By Category

Learning Paths

Path 1: Complete Mastery (All 26 exercises)

Path 2: Fast Track MLOps (12 exercises)

Path 3: Platform Engineering (13 exercises)

Path 4: LLM Specialist (12 exercises)

Time Estimates

By Time Commitment

Prerequisites Map

Module Dependencies

Skills Prerequisites

Career Alignment

By Target Role

Appendix: Quick Reference

Exercise Code Reference