Complete, production-ready reference solutions for the AI Infrastructure Junior Engineer Learning Path (the ai-infra-junior-engineer-learning curriculum). This repository contains fully implemented code, comprehensive documentation, and step-by-step guides for all exercises.
Each solution includes:
- ✅ Complete, working code ready to run
- ✅ Step-by-step implementation guides
- ✅ Architecture documentation with diagrams
- ✅ Comprehensive test suites
- ✅ Docker & Kubernetes configurations
- ✅ Deployment scripts and automation
- ✅ Troubleshooting guides
- ✅ Production best practices
🎓 Capstone Project Solutions Added!
- 🚀 Project 01: Simple Model API - Flask + Docker + PyTorch serving
- ☸️ Project 02: Kubernetes Model Serving - K8s + HPA + Ingress
- 🔄 Project 03: ML Pipeline with Tracking - Airflow + MLflow + DVC
- 📊 Project 04: Monitoring & Alerting - Prometheus + Grafana + ELK
- 🏗️ Project 05: Production ML System - Complete CI/CD + Security + HA
Recently Added Exercise Solutions:
- 🤖 LLM Basics Exercise (Module 004) - Complete solution for running your first language model with Hugging Face Transformers
- ⚡ GPU Fundamentals Exercise (Module 004) - Full implementation of GPU-accelerated ML inference with PyTorch
- 🏗️ Terraform/IaC Exercise (Module 010) - Production-ready Infrastructure as Code with hands-on AWS deployment
- 🔄 Airflow Workflow Exercise (Module 009) - Complete ML pipeline orchestration with monitoring and alerting
New Documentation:
- 📋 Technology Versions Guide - Version specifications for all tools
- 🗺️ Curriculum Cross-Reference - Mapping to Engineer track
- 📈 Career Progression Guide - Complete career ladder
Important: These solutions are meant to be used AFTER attempting the exercises yourself. The intended workflow is:
- Try First: Attempt each exercise independently using the learning repository
- Compare: Review the solution to see different approaches
- Understand: Read the step-by-step guide to understand design decisions
- Improve: Identify gaps and refine your own implementation
Don't just copy code - understand the WHY behind each decision.
ai-infra-junior-engineer-solutions/
├── README.md (this file)
├── LEARNING_GUIDE.md (how to use this repository effectively)
├── modules/
│   ├── mod-004-ml-basics/ ✨ NEW
│   │   ├── exercise-04-llm-basics/
│   │   └── exercise-05-gpu-fundamentals/
│   ├── mod-005-docker/
│   │   ├── exercise-01-docker-basics/
│   │   │   ├── README.md
│   │   │   ├── STEP_BY_STEP.md
│   │   │   ├── src/ (complete code)
│   │   │   ├── tests/
│   │   │   ├── docker/
│   │   │   └── scripts/
│   │   ├── exercise-02-multi-stage-builds/
│   │   ├── exercise-03-docker-compose/
│   │   └── ...
│   ├── mod-006-kubernetes/
│   ├── mod-007-apis/
│   ├── mod-008-databases/
│   ├── mod-009-monitoring/
│   │   └── exercise-06-airflow-workflow-monitoring/ ✨ NEW
│   └── mod-010-cloud-platforms/
│       └── exercise-07-terraform-basics/ ✨ NEW
├── projects/ 🎓 NEW
│   ├── project-01-simple-model-api/
│   ├── project-02-kubernetes-serving/
│   ├── project-03-ml-pipeline-tracking/
│   ├── project-04-monitoring-alerting/
│   └── project-05-production-ml-capstone/
├── .github/
│   └── workflows/
│       ├── ci-cd.yml
│       └── docker-build.yml
├── guides/
│   ├── debugging-guide.md
│   ├── optimization-guide.md
│   ├── production-readiness-checklist.md
│   └── common-pitfalls.md
└── resources/
    ├── additional-reading.md
    ├── useful-tools.md
    └── community-resources.md
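The tree above spells out a full exercise layout only for exercise-01-docker-basics (README.md, STEP_BY_STEP.md, src/, tests/, docker/, scripts/). As a minimal sketch, assuming other exercises follow the same layout, a short Python check can list what an exercise directory lacks. The helper name `missing_entries` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Entries shown under exercise-01-docker-basics in the tree above.
# Assumption: other exercises use the same layout.
EXPECTED_FILES = ("README.md", "STEP_BY_STEP.md")
EXPECTED_DIRS = ("src", "tests", "docker", "scripts")


def missing_entries(exercise_dir):
    """Return the expected files and directories absent from exercise_dir."""
    root = Path(exercise_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

For example, `missing_entries("modules/mod-005-docker/exercise-01-docker-basics")` should return an empty list when the directory matches the layout shown.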
Ensure you have the following installed:
- Docker (20.10+): docker --version
- Docker Compose (2.0+): docker compose version
- Kubernetes (kubectl 1.25+): kubectl version --client
- Python (3.11+): python --version
- Node.js (18+): node --version
- AWS CLI (2.x): aws --version
- Terraform (1.5+): terraform --version
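Running these version commands by hand works, but the check can also be scripted. The sketch below is not part of the repository: the names `REQUIREMENTS`, `parse_version`, `meets_minimum`, `check_tool`, and `report` are hypothetical. It assumes the first dotted number in each command's output is the tool version, which may not hold for every release (older kubectl clients, for instance, print a more verbose format).

```python
import re
import shutil
import subprocess

# Commands and minimum versions taken from the prerequisites list above.
REQUIREMENTS = {
    "docker": (["docker", "--version"], (20, 10)),
    "docker compose": (["docker", "compose", "version"], (2, 0)),
    "kubectl": (["kubectl", "version", "--client"], (1, 25)),
    "python": (["python", "--version"], (3, 11)),
    "node": (["node", "--version"], (18,)),
    "aws": (["aws", "--version"], (2,)),
    "terraform": (["terraform", "--version"], (1, 5)),
}


def parse_version(text):
    """Return the first dotted number in text as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """Compare only as many version components as the minimum specifies."""
    return found is not None and found[: len(minimum)] >= minimum


def check_tool(command, minimum):
    """Return 'ok', 'missing', or 'too old or unrecognized' for one tool."""
    if shutil.which(command[0]) is None:
        return "missing"
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        return "missing"
    if result.returncode != 0:
        # For example, the docker CLI is present but the compose plugin is not.
        return "missing"
    found = parse_version(result.stdout + result.stderr)
    return "ok" if meets_minimum(found, minimum) else "too old or unrecognized"


def report():
    """Print one status line per prerequisite."""
    for name, (command, minimum) in REQUIREMENTS.items():
        print(f"{name}: {check_tool(command, minimum)}")
```

Calling `report()` prints one status line per tool. A result of "too old or unrecognized" is worth confirming by running the listed command directly.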
git clone https://github.com/ai-infra-curriculum/ai-infra-junior-engineer-solutions.git
cd ai-infra-junior-engineer-solutions

Each exercise has a scripts/ directory with automated setup:
# Navigate to an exercise
cd modules/mod-005-docker/exercise-01-docker-basics
# Run setup script
./scripts/setup.sh
# Run the application
./scripts/run.sh
# Run tests
./scripts/test.sh

| Exercise | Description | Complexity | Concepts |
|---|---|---|---|
| 01 | LLM Basics | ⭐⭐ Medium | Hugging Face Transformers, model loading, inference |
| 02 | GPU Fundamentals | ⭐⭐⭐ Hard | CUDA, PyTorch GPU acceleration, performance optimization |
Total Lines of Code: ~2,800
Estimated Completion: 8-12 hours

| Exercise | Description | Complexity | Concepts |
|---|---|---|---|
| 01 | Docker Basics | ⭐ Easy | Dockerfile, images, containers |
| 02 | Multi-Stage Builds | ⭐⭐ Medium | Build optimization, layer caching |
| 03 | Docker Compose | ⭐⭐ Medium | Multi-container apps, networking |
| 04 | ML Model Serving | ⭐⭐⭐ Hard | Flask API, model loading, optimization |
| 05 | Container Optimization | ⭐⭐⭐ Hard | Image size reduction, security scanning |
| 06 | Docker Networking | ⭐⭐ Medium | Bridge, overlay, host networking |
| 07 | Persistent Data | ⭐⭐ Medium | Volumes, bind mounts, data management |
Total Lines of Code: ~3,500
Estimated Completion: 12-16 hours

| Exercise | Description | Complexity | Concepts |
|---|---|---|---|
| 01 | Kubernetes Basics | ⭐⭐ Medium | Pods, deployments, services |
| 02 | ConfigMaps & Secrets | ⭐⭐ Medium | Configuration management |
| 03 | Persistent Volumes | ⭐⭐ Medium | StatefulSets, PVCs, storage classes |
| 04 | Ingress & Load Balancing | ⭐⭐⭐ Hard | NGINX Ingress, path-based routing |
| 05 | Autoscaling | ⭐⭐⭐ Hard | HPA, VPA, cluster autoscaler |
| 06 | Helm Charts | ⭐⭐⭐ Hard | Package management, templating |
| 07 | Production ML Deployment | ⭐⭐⭐⭐ Expert | Complete stack with monitoring |
Total Lines of Code: ~4,200
Estimated Completion: 14-18 hours

| Exercise | Description | Complexity | Concepts |
|---|---|---|---|
| 01 | REST API with Flask | ⭐⭐ Medium | REST principles, Flask routing |
| 02 | FastAPI ML Service | ⭐⭐ Medium | Async API, Pydantic validation |
| 03 | gRPC Service | ⭐⭐⭐ Hard | Protocol buffers, streaming |
| 04 | GraphQL API | ⭐⭐⭐ Hard | Schema design, resolvers |
| 05 | Production API | ⭐⭐⭐⭐ Expert | Auth, rate limiting, caching, docs |
Total Lines of Code: ~5,800
Estimated Completion: 16-20 hours

| Exercise | Description | Complexity | Concepts |
|---|---|---|---|
| 01 | SQL Fundamentals | ⭐ Easy | CRUD, joins, aggregations |
| 02 | PostgreSQL for ML | ⭐⭐ Medium | Schema design, indexes, performance |
| 03 | Database Migrations | ⭐⭐ Medium | Alembic, version control |
| 04 | NoSQL with MongoDB | ⭐⭐ Medium | Document storage, aggregation pipeline |
| 05 | Production Database | ⭐⭐⭐⭐ Expert | HA, replication, backup, monitoring |
Total Lines of Code: ~4,500
Estimated Completion: 14-18 hours

| Exercise | Description | Complexity | Concepts |
|---|---|---|---|
| 01 | Observability Foundations | ⭐⭐ Medium | Metrics, logs, traces |
| 02 | Prometheus Stack | ⭐⭐⭐ Hard | Prometheus, exporters, PromQL |
| 03 | Grafana Dashboards | ⭐⭐ Medium | Visualization, alerts |
| 04 | Logging with Loki | ⭐⭐⭐ Hard | Log aggregation, querying |
| 05 | Alerting & Incidents | ⭐⭐⭐⭐ Expert | Alertmanager, runbooks, postmortems |
| 06 | Airflow Workflow ✨ NEW | ⭐⭐⭐ Hard | Pipeline orchestration, DAGs, monitoring |
Total Lines of Code: ~14,500
Estimated Completion: 22-26 hours

| Exercise | Description | Complexity | Concepts |
|---|---|---|---|
| 01 | AWS Account & IAM | ⭐ Easy | IAM, MFA, tagging, budgets |
| 02 | Compute & Storage | ⭐⭐ Medium | EC2, S3, EBS, Spot instances |
| 03 | Networking & Security | ⭐⭐⭐ Hard | VPC, Security Groups, Terraform |
| 04 | Containerized Deployment | ⭐⭐⭐⭐ Expert | ECS, EKS, ECR, auto-scaling |
| 05 | SageMaker & Optimization | ⭐⭐⭐⭐ Expert | ML platform, cost optimization |
| 06 | Terraform IaC ✨ NEW | ⭐⭐⭐ Hard | Infrastructure as Code, modules, state management |
Total Lines of Code: ~9,400
Estimated Completion: 26-30 hours

| Project | Description | Complexity | Technologies |
|---|---|---|---|
| 01 | Simple Model API | ⭐⭐⭐ Hard | Flask, PyTorch, Docker, ResNet-50 |
| 02 | Kubernetes Model Serving | ⭐⭐⭐⭐ Expert | Kubernetes, HPA, Ingress, NGINX |
| 03 | ML Pipeline with Tracking | ⭐⭐⭐⭐ Expert | Airflow, MLflow, DVC, Great Expectations |
| 04 | Monitoring & Alerting | ⭐⭐⭐⭐ Expert | Prometheus, Grafana, ELK, Alertmanager |
| 05 | Production ML System | ⭐⭐⭐⭐⭐ Master | CI/CD, Security, HA, Canary, SLOs |
Total Documentation: ~3,500 lines across comprehensive SOLUTION_GUIDE.md files
Estimated Completion: 40-60 hours total
Portfolio Ready: Yes - production-grade implementations
Each capstone project includes:
- Complete source code with tests
- Production configurations (Docker, K8s, CI/CD)
- Comprehensive SOLUTION_GUIDE.md (500-900 lines)
- Architecture diagrams and design decisions
- Deployment automation and troubleshooting
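For planning purposes, the per-module and capstone time estimates quoted above can be added up. The Python sketch below does that arithmetic. The module labels are shorthand inferred from the directory tree rather than stated next to each table, and the totals cover only what this README lists.

```python
# (low, high) hour estimates copied from the "Estimated Completion" lines above.
MODULE_HOURS = {
    "ML basics": (8, 12),
    "Docker": (12, 16),
    "Kubernetes": (14, 18),
    "APIs": (16, 20),
    "Databases": (14, 18),
    "Monitoring": (22, 26),
    "Cloud platforms": (26, 30),
}
CAPSTONE_HOURS = (40, 60)


def total_hours(ranges):
    """Sum a collection of (low, high) ranges into one (low, high) pair."""
    ranges = list(ranges)
    return sum(low for low, _ in ranges), sum(high for _, high in ranges)


modules_total = total_hours(MODULE_HOURS.values())
overall_total = total_hours([modules_total, CAPSTONE_HOURS])
print(f"Modules: {modules_total[0]}-{modules_total[1]} hours")
print(f"Modules plus capstones: {overall_total[0]}-{overall_total[1]} hours")
```

This prints 112-140 hours for the modules and 152-200 hours once the capstone projects are included.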
- Python 3.11+ (ML, APIs, automation)
- SQL (PostgreSQL, MySQL)
- YAML (Kubernetes, Docker Compose)
- HCL (Terraform)
- Bash (scripting, automation)
- Flask / FastAPI (APIs)
- PyTorch / TensorFlow (ML models)
- SQLAlchemy (ORM)
- Pytest (testing)
- Docker & Docker Compose
- Kubernetes (Minikube, Kind, EKS)
- Helm (package management)
- Terraform (IaC)
- AWS (EC2, S3, ECS, EKS, SageMaker)
- Prometheus & Grafana (monitoring)
- Loki (logging)
- Attempt First: Try the exercise from the learning repository
- Get Stuck?: Review the STEP_BY_STEP.md for guidance
- Compare Solutions: See how your approach differs
- Run Tests: Ensure your solution passes all test cases
- Deploy: Use the deployment scripts to test in real environments
- Reference Implementation: Use as canonical examples
- Grading: Compare student submissions against solutions
- Discussion: Point out design decisions and trade-offs
- Extensions: Challenge students to improve upon solutions
- Skill Assessment: Use exercises as take-home assignments
- Code Review: Evaluate candidate solutions against references
- Interview Prep: Discuss architecture decisions in interviews
All solutions include comprehensive test suites:
# Run all tests for a module
cd modules/mod-007-apis
./scripts/test-all.sh
# Run tests for specific exercise
cd exercise-02-fastapi-ml-service
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html

Test Coverage Goals: 80%+ for all production code
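One way to check the 80% goal automatically is to read coverage.py's JSON report. The sketch below assumes a report generated with `coverage json` (or pytest-cov's `--cov-report=json`) at the default path `coverage.json`. The helper names are hypothetical and not part of the repository's scripts.

```python
import json
from pathlib import Path

COVERAGE_GOAL = 80.0  # the "80%+" goal stated above


def percent_covered(report):
    """Extract overall coverage from a parsed coverage.py JSON report."""
    return float(report["totals"]["percent_covered"])


def meets_goal(report, goal=COVERAGE_GOAL):
    """Return True when overall coverage is at or above the goal."""
    return percent_covered(report) >= goal


def load_report(path="coverage.json"):
    """Load a JSON coverage report; the default name is coverage.py's default."""
    return json.loads(Path(path).read_text())
```

With a report on disk, `meets_goal(load_report())` returns a boolean that a CI step could act on. pytest-cov's `--cov-fail-under=80` option is a simpler alternative when failing the test run directly is enough.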
Each exercise includes deployment configurations:
# Docker Compose
docker compose up -d
# Kubernetes (Minikube)
kubectl apply -f k8s/

# Terraform
cd terraform/
terraform init
terraform apply
# ECS/EKS
./scripts/deploy-aws.sh

Docker build fails:
# Clear cache and rebuild
docker builder prune
docker build --no-cache -t myapp .

Kubernetes pods not starting:
# Check events
kubectl describe pod <pod-name>
kubectl logs <pod-name>
# Check resources
kubectl top nodes
kubectl top pods

AWS credentials issues:
# Verify credentials
aws sts get-caller-identity
# Reconfigure
aws configure

See guides/debugging-guide.md for comprehensive troubleshooting.
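For the Kubernetes case, when a namespace holds many pods it can help to narrow down which ones to inspect before running kubectl describe pod on each. This is a minimal sketch, assuming the output of `kubectl get pods -o json` has been parsed into a dictionary. The function name and the sample data are hypothetical.

```python
import json

HEALTHY_PHASES = {"Running", "Succeeded"}


def pods_needing_attention(pod_list):
    """Return (name, phase) for pods whose phase is not Running or Succeeded.

    pod_list is the parsed output of `kubectl get pods -o json`.
    """
    flagged = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            flagged.append((pod["metadata"]["name"], phase))
    return flagged


# Illustrative sample only; real output carries many more fields.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "api-1"}, "status": {"phase": "Running"}},
  {"metadata": {"name": "api-2"}, "status": {"phase": "Pending"}}
]}
""")
print(pods_needing_attention(sample))
```

Phase alone does not catch every problem: a pod stuck in a crash loop can still report the Running phase, so container statuses and events from kubectl describe pod remain worth checking.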
We welcome contributions! See CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (git checkout -b feature/improve-exercise-05)
- Make your changes
- Add tests
- Submit a pull request
- Improve documentation
- Add more test cases
- Optimize Docker images
- Add new deployment targets (GCP, Azure)
- Create video walkthroughs
MIT License - see LICENSE file for details.
- Learning Repository: ai-infra-junior-engineer-learning
- Contributors: See CONTRIBUTORS.md
- Inspired By: Industry best practices from Google, Netflix, Uber ML teams
- Issues: GitHub Issues
- Email: ai-infra-curriculum@joshua-ferguson.com
- Organization: AI Infrastructure Curriculum
After completing all exercises with these solutions, you should be able to:
- ✅ Build and optimize Docker containers for ML workloads
- ✅ Deploy scalable applications on Kubernetes
- ✅ Design and implement production-grade REST/gRPC APIs
- ✅ Manage databases with proper schema design and migrations
- ✅ Implement comprehensive monitoring and alerting systems
- ✅ Deploy ML infrastructure on AWS with cost optimization
- ✅ Write Infrastructure as Code with Terraform
- ✅ Debug production issues effectively
- ✅ Pass Junior AI Infrastructure Engineer interviews
Happy Learning! 🚀
Last Updated: October 23, 2025
Version: 1.0.0