Complete catalog of all exercises in the mid-level track
This index provides a comprehensive overview of all 26 exercises, their relationships, difficulty levels, time estimates, and skill alignments.
| Metric | Value |
|---|---|
| Total Modules | 10 |
| Total Exercises | 26 |
| Difficulty Range | Intermediate → Advanced |
| Total Learning Time | 200-280 hours |
| Prerequisites | Junior Engineer track |
| Target Role Level | L4-L5 (Mid-Level) |
- Module Overview
- Complete Exercise Catalog
- Difficulty Progression
- Skill Matrix
- Technology Coverage
- Learning Paths
- Time Estimates
- Prerequisites Map
- Career Alignment
| Module | Code | Exercises | Total Hours | Primary Focus |
|---|---|---|---|---|
| Foundations | mod-101 | 3 | 18-24 | Advanced tooling & frameworks |
| Cloud Computing | mod-102 | 3 | 24-30 | Multi-cloud & cost optimization |
| Containerization | mod-103 | 3 | 20-26 | Security & optimization |
| Kubernetes | mod-104 | 3 | 26-34 | Advanced orchestration |
| Data Pipelines | mod-105 | 2 | 22-28 | Streaming & workflow |
| MLOps | mod-106 | 3 | 24-32 | Production ML lifecycle |
| GPU Computing | mod-107 | 3 | 28-36 | GPU management & optimization |
| Monitoring | mod-108 | 2 | 20-26 | Observability & ML metrics |
| Infrastructure as Code | mod-109 | 2 | 20-26 | Terraform & Pulumi |
| LLM Infrastructure | mod-110 | 2 | 38-48 | LLM serving & RAG |
Advanced tools and frameworks for ML infrastructure development.
Difficulty: ⭐⭐⭐ Intermediate Time: 6-8 hours Prerequisites: Python, virtualenv, Docker
Description: Build an automated Python environment management tool that handles virtual environments, dependency management, and containerization for ML projects.
Key Technologies:
- Python:
subprocess,pathlib,typing - Environment:
pyenv,Poetry,Conda - Containers: Docker, docker-compose
Learning Objectives:
- Automate environment setup workflows
- Integrate multiple environment managers
- Build CLI tools with Click
- Handle complex dependency resolution
- Containerize Python environments
Deliverables:
- CLI tool for environment management
- Support for venv, Poetry, Conda
- Docker integration
- Configuration file support
- Comprehensive test suite
Career Skills: Developer tooling, automation, dependency management
Difficulty: ⭐⭐⭐ Intermediate Time: 6-8 hours Prerequisites: PyTorch, TensorFlow, basic ML
Description: Create a comprehensive benchmarking suite to compare ML frameworks (PyTorch, TensorFlow, JAX) across different model architectures and hardware configurations.
Key Technologies:
- Frameworks: PyTorch, TensorFlow, JAX
- Benchmarking:
timeit,memory_profiler - Visualization: Plotly, Matplotlib
- Models: CNN, Transformer, LSTM
Learning Objectives:
- Design fair benchmarking methodologies
- Measure training and inference performance
- Profile memory usage
- Analyze GPU utilization
- Generate comparative reports
Deliverables:
- Benchmark suite for multiple frameworks
- Performance metrics collection
- Memory profiling
- Interactive visualizations
- Automated report generation
Career Skills: Performance optimization, framework evaluation, technical analysis
Difficulty: ⭐⭐⭐ Intermediate Time: 6-8 hours Prerequisites: FastAPI, Jinja2, project structure
Description: Build a code generation tool that creates standardized FastAPI-based ML serving projects with best practices, monitoring, and deployment configurations.
Key Technologies:
- Framework: FastAPI
- Templating: Jinja2
- Project: Cookiecutter patterns
- Standards: OpenAPI, Pydantic
Learning Objectives:
- Implement code generation patterns
- Design reusable project templates
- Apply API best practices
- Standardize ML serving patterns
- Create scaffolding tools
Deliverables:
- Template generator CLI
- Multiple project templates
- Configuration customization
- Generated projects are deployable
- Documentation generator
Career Skills: Code generation, standardization, API design, tooling
Multi-cloud infrastructure management and optimization.
Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 8-10 hours Prerequisites: AWS/GCP/Azure APIs, cost concepts
Description: Build a comprehensive cost analysis tool that aggregates, analyzes, and visualizes spending across AWS, GCP, and Azure, with optimization recommendations.
Key Technologies:
- Cloud APIs: Boto3, Google Cloud, Azure SDK
- Analysis: Pandas, NumPy
- Visualization: Plotly, Dash
- Reporting: Jinja2
Learning Objectives:
- Integrate multiple cloud provider APIs
- Aggregate cost data across clouds
- Identify cost optimization opportunities
- Build interactive dashboards
- Generate automated reports
Deliverables:
- Multi-cloud cost aggregation
- Cost anomaly detection
- Optimization recommendations
- Interactive dashboards
- Scheduled reporting
Career Skills: Cloud cost management, FinOps, multi-cloud expertise, data analysis
Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 8-10 hours Prerequisites: Terraform, cloud platforms
Description: Deploy a complete ML infrastructure stack to multiple cloud providers using Terraform, including compute, storage, networking, and ML-specific services.
Key Technologies:
- IaC: Terraform
- Clouds: AWS (SageMaker), GCP (Vertex AI), Azure (Azure ML)
- Networking: VPCs, subnets, security groups
- Storage: S3, GCS, Blob Storage
Learning Objectives:
- Write cloud-agnostic Terraform modules
- Deploy to multiple cloud providers
- Configure ML-specific services
- Implement security best practices
- Manage infrastructure state
Deliverables:
- Terraform modules for each cloud
- ML infrastructure deployment
- Network configuration
- Security setup (IAM, service accounts)
- Documentation and runbooks
Career Skills: Multi-cloud deployment, Terraform expertise, ML infrastructure, security
Difficulty: ⭐⭐⭐⭐ Advanced Time: 8-12 hours Prerequisites: Cloud platforms, databases, backup strategies
Description: Implement a comprehensive disaster recovery solution for ML infrastructure, including backup automation, failover mechanisms, and recovery testing.
Key Technologies:
- Backup: Cloud-native backup services
- DR: Cross-region replication, failover
- Testing: Chaos engineering, DR drills
- Monitoring: Recovery time objectives (RTO/RPO)
Learning Objectives:
- Design DR strategies for ML systems
- Implement automated backups
- Configure cross-region replication
- Test recovery procedures
- Measure and optimize RTO/RPO
Deliverables:
- Automated backup system
- Failover automation
- DR testing framework
- Recovery playbooks
- RTO/RPO monitoring
Career Skills: Disaster recovery, business continuity, reliability engineering
Advanced container security, optimization, and registry management.
Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 7-9 hours Prerequisites: Docker, security concepts
Description: Build a comprehensive container security scanning and compliance tool that checks images for vulnerabilities, misconfigurations, and generates SBOMs.
Key Technologies:
- Scanning: Trivy, Grype
- Compliance: CIS Docker Benchmark
- SBOM: Syft
- Reporting: JSON, HTML reports
Learning Objectives:
- Implement vulnerability scanning
- Generate Software Bill of Materials
- Check CIS compliance
- Detect secrets in images
- Create security scorecards
Deliverables:
- Automated security scanner
- Multi-tool integration
- SBOM generation
- Compliance reporting
- CI/CD integration
Career Skills: Container security, compliance, supply chain security, DevSecOps
Difficulty: ⭐⭐⭐ Intermediate Time: 6-8 hours Prerequisites: Docker, multi-stage builds
Description: Create a tool that analyzes and optimizes Docker images for size, layer count, and build time, with automated recommendations and refactoring.
Key Technologies:
- Docker: Multi-stage builds, BuildKit
- Analysis: Dive, docker-slim
- Optimization: Layer caching, .dockerignore
- Comparison: Size and performance metrics
Learning Objectives:
- Analyze Docker image layers
- Optimize image size
- Reduce build times
- Implement build caching
- Generate optimization reports
Deliverables:
- Image analysis tool
- Optimization recommendations
- Automated refactoring (optional)
- Before/after comparisons
- Best practices checker
Career Skills: Docker optimization, build engineering, efficiency
Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 7-9 hours Prerequisites: Container registries, image signing
Description: Build a private container registry management system with image signing, scanning, retention policies, and replication across multiple registries.
Key Technologies:
- Registries: Harbor, ECR, GCR, ACR
- Signing: Cosign, Notary
- Policies: Retention, garbage collection
- Replication: Cross-registry sync
Learning Objectives:
- Deploy and manage Harbor registry
- Implement image signing and verification
- Configure retention policies
- Set up cross-registry replication
- Automate registry operations
Deliverables:
- Registry deployment (Harbor)
- Image signing pipeline
- Retention policy automation
- Replication configuration
- Registry monitoring
Career Skills: Registry management, image signing, supply chain security
Advanced Kubernetes cluster management and patterns.
Difficulty: ⭐⭐⭐⭐ Advanced Time: 9-11 hours Prerequisites: Kubernetes, HPA, metrics
Description: Implement intelligent autoscaling for Kubernetes clusters using HPA, VPA, and Cluster Autoscaler, with custom metrics and cost optimization.
Key Technologies:
- Autoscaling: HPA, VPA, Cluster Autoscaler
- Metrics: Prometheus, custom metrics
- Optimization: Cost-aware scheduling
- Monitoring: Grafana dashboards
Learning Objectives:
- Configure Horizontal Pod Autoscaler
- Implement Vertical Pod Autoscaler
- Deploy Cluster Autoscaler
- Use custom metrics for scaling
- Optimize for cost and performance
Deliverables:
- Complete autoscaling setup
- Custom metrics integration
- Cost optimization policies
- Monitoring dashboards
- Testing and validation
Career Skills: Kubernetes autoscaling, resource optimization, cost management
Difficulty: ⭐⭐⭐⭐ Advanced Time: 8-11 hours Prerequisites: Kubernetes, service mesh concepts
Description: Deploy and configure a service mesh (Istio or Linkerd) for microservices observability, including distributed tracing, traffic management, and security.
Key Technologies:
- Service Mesh: Istio or Linkerd
- Tracing: Jaeger, Zipkin
- Monitoring: Prometheus, Grafana
- Traffic: Canary, blue-green
Learning Objectives:
- Deploy service mesh to Kubernetes
- Configure distributed tracing
- Implement traffic management
- Set up mutual TLS
- Monitor service mesh metrics
Deliverables:
- Service mesh deployment
- Distributed tracing setup
- Traffic management policies
- mTLS configuration
- Observability dashboards
Career Skills: Service mesh, distributed tracing, microservices, advanced networking
Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 9-12 hours Prerequisites: Kubernetes, Python, CRDs
Description: Build custom Kubernetes operators using Kopf framework to automate ML model deployment, scaling, and lifecycle management.
Key Technologies:
- Framework: Kopf (Python)
- K8s: Custom Resource Definitions
- Controllers: Reconciliation loops
- API: Kubernetes Python client
Learning Objectives:
- Design custom resource definitions
- Implement operator reconciliation logic
- Handle resource lifecycle events
- Manage dependent resources
- Test operators thoroughly
Deliverables:
- Custom operator for ML models
- CRD definitions
- Reconciliation logic
- Event handling
- Operator testing
Career Skills: Kubernetes operators, CRDs, automation, platform engineering
Real-time streaming and workflow orchestration.
Difficulty: ⭐⭐⭐⭐ Advanced Time: 11-14 hours Prerequisites: Kafka, streaming concepts, Spark
Description: Build a real-time streaming data pipeline using Kafka for ingestion, Spark for processing, and various sinks for ML model training and inference.
Key Technologies:
- Streaming: Apache Kafka
- Processing: PySpark, Flink (optional)
- Storage: S3, databases
- Monitoring: Kafka metrics
Learning Objectives:
- Design streaming architectures
- Implement Kafka producers/consumers
- Process streams with Spark
- Handle backpressure and failures
- Monitor pipeline health
Deliverables:
- End-to-end streaming pipeline
- Data quality checks
- Exactly-once semantics
- Failure recovery
- Monitoring and alerting
Career Skills: Stream processing, Kafka, real-time systems, data engineering
Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 11-14 hours Prerequisites: Airflow, DAGs, Python
Description: Create complex ML workflow orchestration using Apache Airflow, including data pipelines, training jobs, model deployment, and monitoring.
Key Technologies:
- Orchestration: Apache Airflow
- Tasks: PythonOperator, KubernetesPodOperator
- Scheduling: Cron, sensors
- Monitoring: Airflow UI, alerts
Learning Objectives:
- Design DAG structures for ML workflows
- Implement task dependencies
- Use dynamic DAG generation
- Configure sensors and triggers
- Monitor workflow execution
Deliverables:
- Complete ML workflow DAGs
- Data validation tasks
- Training and deployment tasks
- Monitoring and alerting
- SLA tracking
Career Skills: Workflow orchestration, Airflow, MLOps, data pipeline engineering
Production ML lifecycle management.
Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 8-10 hours Prerequisites: MLflow, ML frameworks, PostgreSQL
Description: Set up a production MLflow deployment for experiment tracking, model registry, and artifact storage, integrated with training pipelines.
Key Technologies:
- MLflow: Tracking, Registry, Projects
- Storage: S3, PostgreSQL
- Integration: PyTorch, TensorFlow
- Deployment: Docker, Kubernetes
Learning Objectives:
- Deploy MLflow server infrastructure
- Track experiments programmatically
- Manage model lifecycle in registry
- Integrate with training code
- Automate model promotion
Deliverables:
- MLflow server deployment
- Experiment tracking integration
- Model registry setup
- Artifact storage (S3)
- Model promotion workflows
Career Skills: Experiment tracking, MLOps, model management, ML infrastructure
Difficulty: ⭐⭐⭐⭐ Advanced Time: 8-11 hours Prerequisites: ML monitoring, statistics, Evidently
Description: Implement comprehensive model monitoring using Evidently to detect data drift, model drift, and performance degradation in production.
Key Technologies:
- Monitoring: Evidently
- Metrics: Custom ML metrics
- Alerting: Prometheus, alertmanager
- Dashboards: Grafana
Learning Objectives:
- Implement drift detection algorithms
- Monitor model performance metrics
- Set up automated alerting
- Create monitoring dashboards
- Trigger retraining on drift
Deliverables:
- Drift detection system
- Performance monitoring
- Automated alerts
- Grafana dashboards
- Retraining triggers
Career Skills: ML monitoring, drift detection, production ML, data science
Difficulty: ⭐⭐⭐⭐ Advanced Time: 8-11 hours Prerequisites: GitHub Actions, DVC, testing
Description: Build automated CI/CD pipelines for ML projects including data versioning, model testing, deployment, and rollback capabilities.
Key Technologies:
- CI/CD: GitHub Actions, GitLab CI
- Versioning: DVC, Git LFS
- Testing: pytest, Great Expectations
- Deployment: Kubernetes, Docker
Learning Objectives:
- Design ML-specific CI/CD workflows
- Version datasets with DVC
- Implement model testing
- Automate deployment pipelines
- Handle rollbacks and monitoring
Deliverables:
- Complete CI/CD workflows
- Data versioning setup
- Automated testing suite
- Deployment automation
- Rollback mechanisms
Career Skills: CI/CD, MLOps, automation, DevOps for ML
GPU cluster management and optimization.
Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 9-12 hours Prerequisites: Kubernetes, NVIDIA GPUs, multi-tenancy
Description: Deploy and manage a multi-tenant GPU cluster using NVIDIA GPU Operator, with resource quotas, monitoring, and cost tracking.
Key Technologies:
- GPUs: NVIDIA GPU Operator
- Kubernetes: Device plugins, taints/tolerations
- Monitoring: dcgm-exporter, Prometheus
- Scheduling: Node selectors, affinity
Learning Objectives:
- Deploy NVIDIA GPU Operator
- Configure GPU device plugins
- Implement multi-tenancy with quotas
- Monitor GPU utilization
- Track GPU costs per team
Deliverables:
- GPU cluster deployment
- Multi-tenancy configuration
- Resource quotas and limits
- GPU monitoring dashboards
- Cost allocation system
Career Skills: GPU infrastructure, Kubernetes, multi-tenancy, cost management
Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 9-12 hours Prerequisites: CUDA, PyTorch, profiling
Description: Optimize GPU utilization for ML workloads through profiling, memory optimization, and kernel tuning, with comprehensive benchmarking.
Key Technologies:
- Profiling: NVIDIA Nsight, PyTorch Profiler
- Optimization: Mixed precision, gradient accumulation
- Monitoring: GPU metrics, memory usage
- Frameworks: PyTorch, TensorFlow
Learning Objectives:
- Profile GPU workloads
- Optimize memory usage
- Implement mixed-precision training
- Tune batch sizes and workers
- Benchmark optimizations
Deliverables:
- Profiling framework
- Optimization recommendations
- Memory optimization techniques
- Performance benchmarks
- Best practices documentation
Career Skills: GPU optimization, performance tuning, CUDA, ML acceleration
Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 10-12 hours Prerequisites: Distributed systems, PyTorch DDP, Ray
Description: Implement distributed training across multiple GPUs and nodes using Ray, Horovod, and PyTorch DistributedDataParallel.
Key Technologies:
- Distributed: Ray, Horovod
- PyTorch: DistributedDataParallel (DDP)
- Communication: NCCL, Gloo
- Orchestration: Kubernetes
Learning Objectives:
- Implement data parallelism
- Configure distributed training
- Optimize communication overhead
- Handle fault tolerance
- Scale training across nodes
Deliverables:
- Distributed training setup
- Multi-node configuration
- Fault tolerance mechanisms
- Performance optimization
- Scaling benchmarks
Career Skills: Distributed training, Ray, PyTorch DDP, large-scale ML
Production observability for ML systems.
Difficulty: ⭐⭐⭐⭐ Advanced Time: 10-13 hours Prerequisites: Prometheus, Grafana, Kubernetes
Description: Deploy a comprehensive observability stack with Prometheus, Grafana, Jaeger, and ELK for metrics, tracing, and logging of ML systems.
Key Technologies:
- Metrics: Prometheus, Grafana
- Tracing: Jaeger, OpenTelemetry
- Logging: Elasticsearch, Fluentd, Kibana
- Alerting: Alertmanager
Learning Objectives:
- Deploy full observability stack
- Configure metric collection
- Set up distributed tracing
- Implement centralized logging
- Create comprehensive dashboards
Deliverables:
- Complete observability stack
- Custom metrics collection
- Distributed tracing setup
- Centralized logging
- Alert rules and dashboards
Career Skills: Observability, SRE, monitoring, distributed systems
Difficulty: ⭐⭐⭐⭐ Advanced Time: 10-13 hours Prerequisites: Prometheus, ML serving, statistics
Description: Implement ML-specific monitoring including prediction latency, model accuracy, feature distributions, and business metrics.
Key Technologies:
- Monitoring: Prometheus, Grafana
- ML Metrics: Custom exporters
- Analysis: Statistical tests
- Alerting: Smart thresholds
Learning Objectives:
- Define ML-specific metrics
- Implement custom exporters
- Create ML dashboards
- Set up intelligent alerting
- Monitor model health
Deliverables:
- ML metrics collection
- Custom Prometheus exporters
- Grafana dashboards for ML
- Alerting rules
- Model health scorecards
Career Skills: ML monitoring, metrics design, observability, SRE for ML
Advanced IaC with Terraform and Pulumi.
Difficulty: ⭐⭐⭐⭐ Intermediate-Advanced Time: 10-13 hours Prerequisites: Terraform, cloud platforms, modules
Description: Build reusable Terraform modules for deploying ML infrastructure across AWS, GCP, and Azure, with state management and testing.
Key Technologies:
- IaC: Terraform 1.5+
- Clouds: AWS, GCP, Azure
- Testing: Terratest, tflint
- State: Remote backends, locking
Learning Objectives:
- Design reusable Terraform modules
- Implement multi-cloud deployments
- Manage state securely
- Test infrastructure code
- Version and publish modules
Deliverables:
- ML infrastructure modules
- Multi-cloud deployments
- State management setup
- Module testing
- Documentation and examples
Career Skills: Terraform, IaC, multi-cloud, DevOps
Difficulty: ⭐⭐⭐⭐ Advanced Time: 10-13 hours Prerequisites: Pulumi, Python, cloud platforms
Description: Implement cloud-agnostic ML infrastructure using Pulumi with Python, including component resources and automation API.
Key Technologies:
- IaC: Pulumi with Python
- Clouds: AWS, GCP, Azure
- Automation: Pulumi Automation API
- Testing: Python unit tests
Learning Objectives:
- Write IaC in Python with Pulumi
- Create component resources
- Use Pulumi Automation API
- Test infrastructure code
- Implement GitOps workflows
Deliverables:
- Pulumi programs for ML infra
- Component resources
- Automation API usage
- Infrastructure testing
- GitOps integration
Career Skills: Pulumi, programmatic IaC, Python, multi-cloud
Production LLM serving and RAG systems.
Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 19-24 hours Prerequisites: LLMs, vLLM, GPU optimization
Description: Deploy optimized LLM serving infrastructure using vLLM or TensorRT-LLM with GPU optimization, batching, and production monitoring.
Key Technologies:
- Serving: vLLM, TensorRT-LLM
- API: FastAPI, streaming
- GPUs: NVIDIA A100/H100 optimization
- Monitoring: Custom LLM metrics
Learning Objectives:
- Deploy optimized LLM serving
- Implement dynamic batching
- Configure GPU settings
- Optimize inference latency
- Monitor LLM performance and costs
Deliverables:
- Production LLM serving API
- vLLM/TensorRT deployment
- GPU optimization
- Streaming responses
- Comprehensive monitoring
Career Skills: LLM serving, GPU optimization, GenAI infrastructure
Difficulty: ⭐⭐⭐⭐⭐ Advanced Time: 19-24 hours Prerequisites: LangChain, vector DBs, document processing
Description: Build a production RAG (Retrieval-Augmented Generation) system with document ingestion, vector storage, and optimized retrieval.
Key Technologies:
- RAG: LangChain
- Vector DB: ChromaDB, Pinecone
- Processing: Unstructured, PyPDF
- LLM: OpenAI API or self-hosted
Learning Objectives:
- Design RAG architectures
- Implement document ingestion
- Optimize vector search
- Tune retrieval parameters
- Monitor RAG quality
Deliverables:
- Complete RAG system
- Document processing pipeline
- Vector database setup
- Retrieval optimization
- Quality monitoring
Career Skills: RAG systems, vector databases, LLM applications, GenAI
- ⭐⭐⭐ Intermediate: Requires solid foundation, introduces new concepts
- ⭐⭐⭐⭐ Intermediate-Advanced: Complex integrations, production considerations
- ⭐⭐⭐⭐⭐ Advanced: Cutting-edge technologies, complex architectures
Level 1 (Intermediate):
├── mod-101: All exercises (3)
├── mod-103: Exercise 05 (1)
└── Total: 4 exercises
Level 2 (Intermediate-Advanced):
├── mod-102: Exercises 01-02 (2)
├── mod-103: Exercises 04, 06 (2)
├── mod-105: Exercise 04 (1)
├── mod-106: Exercises 04, 06 (2)
└── Total: 7 exercises
Level 3 (Advanced):
├── mod-102: Exercise 03 (1)
├── mod-104: All exercises (3)
├── mod-105: Exercise 03 (1)
├── mod-106: Exercise 05 (1)
├── mod-107: All exercises (3)
├── mod-108: All exercises (2)
├── mod-109: All exercises (2)
├── mod-110: All exercises (2)
└── Total: 15 exercises
| Exercise | Cloud | K8s | Docker | ML | GPU | IaC | Monitoring | Data Eng |
|---|---|---|---|---|---|---|---|---|
| 101-04 | ○ | ○ | ● | ○ | ○ | ○ | ○ | ○ |
| 101-05 | ○ | ○ | ○ | ● | ● | ○ | ○ | ○ |
| 101-06 | ○ | ○ | ○ | ● | ○ | ○ | ○ | ○ |
| 102-01 | ● | ○ | ○ | ○ | ○ | ○ | ○ | ○ |
| 102-02 | ● | ○ | ○ | ● | ○ | ● | ○ | ○ |
| 102-03 | ● | ○ | ○ | ○ | ○ | ○ | ● | ● |
| 103-04 | ○ | ○ | ● | ○ | ○ | ○ | ○ | ○ |
| 103-05 | ○ | ○ | ● | ○ | ○ | ○ | ○ | ○ |
| 103-06 | ● | ● | ● | ○ | ○ | ○ | ○ | ○ |
| 104-04 | ● | ● | ○ | ○ | ○ | ○ | ● | ○ |
| 104-05 | ○ | ● | ○ | ○ | ○ | ○ | ● | ○ |
| 104-06 | ○ | ● | ○ | ● | ○ | ○ | ○ | ○ |
| 105-03 | ○ | ● | ● | ○ | ○ | ○ | ● | ● |
| 105-04 | ○ | ● | ○ | ● | ○ | ○ | ● | ● |
| 106-04 | ● | ● | ● | ● | ○ | ○ | ○ | ○ |
| 106-05 | ○ | ○ | ○ | ● | ○ | ○ | ● | ○ |
| 106-06 | ○ | ● | ● | ● | ○ | ○ | ○ | ○ |
| 107-04 | ● | ● | ○ | ○ | ● | ○ | ● | ○ |
| 107-05 | ○ | ○ | ○ | ● | ● | ○ | ● | ○ |
| 107-06 | ○ | ● | ○ | ● | ● | ○ | ○ | ○ |
| 108-01 | ○ | ● | ● | ○ | ○ | ○ | ● | ○ |
| 108-02 | ○ | ● | ○ | ● | ○ | ○ | ● | ○ |
| 109-01 | ● | ● | ○ | ● | ● | ● | ○ | ○ |
| 109-02 | ● | ● | ○ | ● | ○ | ● | ○ | ○ |
| 110-01 | ● | ● | ● | ● | ● | ○ | ● | ○ |
| 110-02 | ● | ● | ○ | ● | ● | ○ | ● | ● |
Legend: ● Primary skill | ○ Secondary skill
Cloud Platforms:
- AWS: 8 exercises
- GCP: 8 exercises
- Azure: 8 exercises
- Multi-cloud: 6 exercises
Container & Orchestration:
- Docker: 12 exercises
- Kubernetes: 16 exercises
- Helm: 8 exercises
- Service Mesh: 1 exercise
ML Frameworks:
- PyTorch: 12 exercises
- TensorFlow: 4 exercises
- JAX: 1 exercise
- HuggingFace: 4 exercises
MLOps Tools:
- MLflow: 2 exercises
- Airflow: 1 exercise
- DVC: 1 exercise
- Evidently: 1 exercise
Data & Streaming:
- Kafka: 1 exercise
- Spark: 1 exercise
- PostgreSQL: 4 exercises
- Redis: 2 exercises
GPU & Acceleration:
- CUDA: 3 exercises
- NVIDIA Tools: 3 exercises
- Distributed Training: 1 exercise
Monitoring:
- Prometheus: 8 exercises
- Grafana: 8 exercises
- Jaeger: 1 exercise
- ELK Stack: 1 exercise
Infrastructure as Code:
- Terraform: 2 exercises
- Pulumi: 1 exercise
LLM Infrastructure:
- vLLM: 1 exercise
- LangChain: 1 exercise
- Vector DBs: 1 exercise
Order: Sequential (mod-101 → mod-110) Duration: 200-280 hours Best for: Comprehensive skill building
Order: 101{4,5,6} → 106{4,5,6} → 108{2} → 102{2} → 107{4,5} → 109{1} Duration: 110-150 hours Best for: MLOps specialists
Order: 104{4,5,6} → 103{4,5,6} → 109{1,2} → 108{1,2} → 102{1,2} → 107{4} Duration: 130-180 hours Best for: Platform engineers
Order: 110{1,2} → 107{4,5,6} → 108{2} → 104{4,5} → 102{2} → 109{1} Duration: 140-190 hours Best for: GenAI infrastructure
Full-Time Study (40 hrs/week):
- Complete Path: 10-14 weeks
- Fast Track: 6-8 weeks
- Platform: 7-9 weeks
- LLM: 7-10 weeks
Part-Time Study (20 hrs/week):
- Complete Path: 20-28 weeks
- Fast Track: 11-15 weeks
- Platform: 13-18 weeks
- LLM: 14-19 weeks
Casual Study (10 hrs/week):
- Complete Path: 40-56 weeks
- Fast Track: 22-30 weeks
- Platform: 26-36 weeks
- LLM: 28-38 weeks
mod-101 (Foundations)
└──> mod-102 (Cloud)
└──> mod-103 (Containers)
└──> mod-104 (Kubernetes)
├──> mod-105 (Data Pipelines)
├──> mod-106 (MLOps)
├──> mod-107 (GPU)
│ └──> mod-110 (LLM)
├──> mod-108 (Monitoring)
└──> mod-109 (IaC)
Before Starting:
- ✅ Junior Engineer track completed
- ✅ Python OOP and async
- ✅ Docker fundamentals
- ✅ Basic Kubernetes
- ✅ Git proficiency
- ✅ Linux command line
- ✅ Basic ML concepts
Developed During:
- Advanced Kubernetes patterns
- Multi-cloud expertise
- GPU infrastructure
- Production MLOps
- LLM deployment
- Advanced monitoring
- Infrastructure as Code mastery
ML Infrastructure Engineer (L4-L5):
- Priority: Complete Path or Fast Track MLOps
- Key modules: 106, 107, 108, 109, 110
- Certifications: AWS ML, GCP ML Engineer
Platform Engineer:
- Priority: Platform Engineering Path
- Key modules: 103, 104, 108, 109
- Certifications: CKA, CKAD, Terraform
MLOps Engineer:
- Priority: Fast Track MLOps Path
- Key modules: 105, 106, 108
- Certifications: AWS ML, MLOps practices
SRE - ML Systems:
- Priority: Complete Path with focus on 108
- Key modules: 104, 107, 108, 109
- Certifications: CKA, Prometheus
LLM Infrastructure Specialist:
- Priority: LLM Specialist Path
- Key modules: 107, 110
- Certifications: NVIDIA DLI, cloud certifications
| Code | Exercise Name |
|---|---|
| 101-04 | Python Environment Manager |
| 101-05 | ML Framework Benchmark |
| 101-06 | FastAPI ML Template Generator |
| 102-01 | Multi-Cloud Cost Analyzer |
| 102-02 | Cloud ML Infrastructure |
| 102-03 | Disaster Recovery |
| 103-04 | Container Security |
| 103-05 | Image Optimizer |
| 103-06 | Registry Manager |
| 104-04 | K8s Cluster Autoscaler |
| 104-05 | Service Mesh Observability |
| 104-06 | K8s Operator Framework |
| 105-03 | Streaming Pipeline Kafka |
| 105-04 | Workflow Orchestration Airflow |
| 106-04 | Experiment Tracking MLflow |
| 106-05 | Model Monitoring Drift |
| 106-06 | CI/CD ML Pipelines |
| 107-04 | GPU Cluster Management |
| 107-05 | GPU Performance Optimization |
| 107-06 | Distributed GPU Training |
| 108-01 | Observability Stack |
| 108-02 | ML Model Monitoring |
| 109-01 | Terraform ML Infrastructure |
| 109-02 | Pulumi Multi-Cloud ML |
| 110-01 | Production LLM Serving |
| 110-02 | Production RAG System |
Use this index to navigate the curriculum and plan your learning journey! 📚
Last Updated: October 25, 2025 Version: 1.0