Welcome to the AI Infrastructure Architect Learning Repository! This comprehensive curriculum is designed to develop enterprise-scale AI infrastructure architecture skills. This is Level 3 in the AI Infrastructure career progression, building upon the Engineer and Senior Engineer foundations.
This curriculum transforms senior engineers into architects capable of:
- Designing end-to-end enterprise AI/ML platform architectures
- Creating technical strategy and multi-year roadmaps for AI infrastructure
- Establishing architectural patterns and standards across organizations
- Leading cross-functional architecture initiatives
- Architecting multi-cloud and hybrid AI deployment solutions
- Designing high-availability, fault-tolerant ML systems for mission-critical applications
- Creating comprehensive security and compliance architectures
- Designing cost-optimization strategies for large-scale AI workloads
- Completed: Senior AI Infrastructure Engineer level (or equivalent 5-8 years experience)
- Education: Bachelor's degree required; Master's degree in Computer Science, Data Science, or AI preferred
- Experience: 5-8 years as Senior AI/ML Infrastructure Engineer with proven track record of designing large-scale ML platforms
- Skills: Deep understanding of all senior engineer topics, enterprise architecture awareness, business acumen
- Advanced Kubernetes (operators, CRDs, multi-cluster)
- CUDA programming and GPU optimization
- Distributed training (Ray, Horovod)
- Multi-cloud architecture experience
- Advanced security and compliance knowledge
- MLOps platform design
- Infrastructure as Code mastery
- Stakeholder management experience
- Technical presentation skills
- Cross-functional collaboration
- Strategic thinking ability
ai-infra-architect-learning/
├── lessons/ # 10 comprehensive modules
│ ├── mod-301-enterprise-architecture/
│ ├── mod-302-multicloud-hybrid/
│ ├── mod-303-security-compliance/
│ ├── mod-304-cost-finops/
│ ├── mod-305-ha-dr/
│ ├── mod-306-enterprise-mlops/
│ ├── mod-307-data-architecture/
│ ├── mod-308-llm-rag/
│ ├── mod-309-arch-communication/
│ └── mod-310-emerging-tech/
├── projects/ # 5 comprehensive projects
│ ├── project-301-enterprise-mlops-platform/
│ ├── project-302-multicloud-infrastructure/
│ ├── project-303-llm-rag-platform/
│ ├── project-304-data-platform/
│ └── project-305-security-framework/
├── assessments/ # Quizzes and practical exams
│ ├── quizzes/
│ └── practical-exams/
├── resources/ # Additional learning materials
│ ├── reading-list.md
│ ├── tools.md
│ └── references.md
├── CURRICULUM.md # Detailed curriculum guide
└── CONTRIBUTING.md # Contribution guidelines
600 hours (approximately 10-12 months full-time or 20 months part-time)
| Module | Topic | Duration | Focus Area |
|---|---|---|---|
| 301 | Enterprise Architecture Fundamentals | 50 hrs | TOGAF, ADM, governance |
| 302 | Multi-Cloud and Hybrid Architecture | 60 hrs | Multi-cloud strategy, vendor management |
| 303 | Enterprise Security and Compliance | 55 hrs | Zero-trust, GDPR, HIPAA, SOC2 |
| 304 | Cost Optimization and FinOps | 45 hrs | TCO, cost allocation, optimization |
| 305 | High-Availability and Disaster Recovery | 50 hrs | 99.95%+ uptime, DR planning |
| 306 | Enterprise MLOps Platform Architecture | 55 hrs | Model governance, feature stores |
| 307 | Data Architecture and Engineering for AI | 50 hrs | Data lakehouse, governance |
| 308 | LLM Platform and RAG Architecture | 55 hrs | Enterprise LLM, RAG at scale |
| 309 | Architecture Communication and Leadership | 40 hrs | Executive comms, stakeholder mgmt |
| 310 | Emerging Technologies and Innovation | 40 hrs | Future tech, innovation frameworks |
| Project | Theme | Duration | Key Deliverables |
|---|---|---|---|
| 301 | Enterprise ML Platform Architecture | 80 hrs | Complete platform architecture, ADRs, reference architecture |
| 302 | Multi-Cloud AI Infrastructure | 100 hrs | Multi-cloud architecture, cost model, DR plan |
| 303 | LLM Platform with RAG | 90 hrs | LLM platform design, governance framework |
| 304 | Data Platform for AI | 85 hrs | Data lakehouse architecture, governance |
| 305 | Security and Compliance Framework | 70 hrs | Security architecture, compliance docs |
Total Project Time: 425 hours
As an AI Infrastructure Architect, you will:
- Define technical strategy and multi-year roadmaps
- Establish architectural patterns and standards
- Lead technology evaluation and vendor selection
- Design cost-optimization strategies
- Create governance frameworks
- Design end-to-end enterprise AI/ML platforms
- Architect multi-cloud and hybrid solutions
- Design high-availability, fault-tolerant systems
- Create security and compliance architectures
- Design scalable feature stores and data platforms
- Lead cross-functional architecture initiatives
- Collaborate with business leaders on requirements
- Mentor architects and senior engineers
- Drive architectural improvements across teams
- Lead architecture review boards
# Clone this repository
git clone https://github.com/ai-infra-curriculum/ai-infra-architect-learning.git
cd ai-infra-architect-learning
# Set up Python environment (Python 3.11+)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install architecture tools
# - TOGAF certification study materials
# - Architecture modeling tools (ArchiMate, draw.io)
# - Cloud architecture tools (AWS Well-Architected, GCP Architecture Framework)- Cloud Accounts: AWS, GCP, Azure (architect-level access)
- Architecture Tools: Draw.io or Lucidchart for diagrams
- TOGAF: Study materials for TOGAF 9 certification
- Kubernetes: EKS, GKE, or AKS clusters for labs
- Collaboration: Confluence or Notion for architecture documentation
- Architecture Frameworks: ArchiMate for enterprise architecture
- Cost Tools: CloudHealth, CloudCheckr, or native cloud cost tools
- Security: Cloud security posture management tools
- Monitoring: Enterprise observability platforms (DataDog, New Relic)
- Start with Module 301: Enterprise Architecture Fundamentals
- Complete modules sequentially: Each builds on previous knowledge
- Apply learning in projects: Start projects after completing relevant modules
- Take assessments: Verify understanding before progressing
- Build portfolio: Document architectures and decisions
- Discussions: GitHub Discussions for Q&A and collaboration
- Issues: Report problems or suggest improvements
- Slack/Discord: Join the AI Infrastructure community (link in repo)
Learn enterprise architecture frameworks, TOGAF, and governance processes.
- Objectives: Apply TOGAF ADM, create architecture documentation, lead governance
- Topics: Enterprise architecture, TOGAF, Zachman, governance, stakeholder management
- Project: Design reference architecture for AI platform
Master multi-cloud strategy and hybrid cloud architecture patterns.
- Objectives: Design multi-cloud ML architectures, optimize vendor selection
- Topics: Multi-cloud strategy, hybrid cloud, vendor selection, migration
- Project: Multi-cloud ML platform architecture
Design comprehensive security architectures for ML systems.
- Objectives: Architect zero-trust ML systems, implement compliance frameworks
- Topics: Zero-trust, GDPR, HIPAA, SOC2, data governance, privacy-preserving ML
- Project: Security architecture for regulated industry ML system
Master cost optimization and FinOps practices for AI infrastructure.
- Objectives: Design cost-optimized architectures, implement FinOps
- Topics: FinOps, TCO analysis, cost allocation, chargeback models
- Project: Cost-optimized ML platform with FinOps governance
Design HA/DR architectures for mission-critical ML systems.
- Objectives: Design 99.95%+ uptime systems, create DR plans
- Topics: HA patterns, DR planning, chaos engineering, self-healing systems
- Project: HA ML platform with comprehensive DR plan
Architect enterprise-scale MLOps platforms with governance.
- Objectives: Design MLOps platforms, create governance frameworks
- Topics: MLOps architecture, model lifecycle, feature stores, real-time serving
- Project: Enterprise MLOps platform architecture
Design data platforms supporting AI/ML workloads at scale.
- Objectives: Architect data lakehouses, implement governance
- Topics: Data architecture, lakehouse, streaming, governance, data quality
- Project: Real-time data platform for ML
Architect enterprise LLM platforms with RAG and governance.
- Objectives: Design LLM platforms, implement RAG at scale, create governance
- Topics: LLM architecture, RAG, vector databases, safety, cost optimization
- Project: Enterprise LLM platform with RAG
Master architecture communication and stakeholder management.
- Objectives: Present to executives, lead governance, build consensus
- Topics: Executive comms, visual communication, ADRs, stakeholder management
- Project: Executive architecture presentation
Evaluate emerging AI technologies and design innovation frameworks.
- Objectives: Assess emerging tech, create innovation frameworks, build roadmaps
- Topics: Emerging AI hardware, edge AI, quantum computing, responsible AI
- Project: Technology roadmap and innovation framework
Duration: 80 hours | Difficulty: High
Design a complete enterprise MLOps platform architecture including governance, security, and cost optimization for an organization with 100+ data scientists.
Key Deliverables:
- Comprehensive architecture documentation with diagrams
- Architecture Decision Records (ADRs) for key decisions
- Reference architecture artifacts
- Governance framework
- Cost model and optimization strategy
Duration: 100 hours | Difficulty: Very High
Architect a multi-cloud ML platform spanning AWS, GCP, and Azure with HA/DR, cost optimization, and compliance.
Key Deliverables:
- Multi-cloud architecture design
- Vendor selection framework
- HA/DR architecture and runbooks
- Cost model and optimization
- Migration roadmap
Duration: 90 hours | Difficulty: Very High
Design an enterprise LLM platform with RAG capabilities, governance, safety guardrails, and cost optimization.
Key Deliverables:
- LLM platform architecture
- RAG system design at scale
- Governance and safety framework
- Cost-performance optimization strategy
- Model selection framework
Duration: 85 hours | Difficulty: High
Architect a real-time data platform supporting ML workloads with data governance, quality, and lineage.
Key Deliverables:
- Data lakehouse architecture
- Real-time streaming architecture
- Data governance framework
- Data quality monitoring system
- Metadata and lineage tracking
Duration: 70 hours | Difficulty: High
Create a comprehensive security architecture for ML systems in a regulated industry (healthcare or finance).
Key Deliverables:
- Zero-trust security architecture
- Compliance framework (HIPAA or GDPR)
- Data governance and privacy architecture
- Security audit checklist
- Incident response playbooks
- Quizzes: 80% minimum passing score on all module quizzes
- Coverage: 15-20 questions per module covering key concepts
- Retakes: Unlimited retakes with randomized questions
- Completion: All 5 projects must be completed
- Quality: Architecture designs must meet acceptance criteria
- Documentation: Comprehensive architecture documentation required
- Peer Review: Optional peer review for feedback
- Architecture Documentation: ADRs, reference architectures, design docs
- Demonstrated Impact: Cost savings, efficiency improvements documented
- Thought Leadership: Optional blog posts, presentations
- Certifications: TOGAF 9 certification recommended
- TOGAF 9 Certified (Highest Priority)
- AWS Solutions Architect – Professional
- Google Cloud Professional Cloud Architect
- Microsoft Azure Solutions Architect Expert
- Certified Kubernetes Administrator (CKA)
- Certified Kubernetes Security Specialist (CKS)
- CISSP or cloud security certification
- FinOps Certified Practitioner
Upon completing this curriculum, you will be prepared for:
- AI Infrastructure Architect
- ML Platform Architect
- Cloud Architect (AI/ML)
- Principal ML Infrastructure Engineer
- Solutions Architect (AI/ML)
- Median: $210,000 USD
- Range: $165,000 - $350,000 USD (total compensation with equity)
- Top Companies: FAANG often exceeds $300,000
- Enterprise-scale AI platform architecture
- Multi-cloud and hybrid architecture design
- Security and compliance architecture
- Cost optimization and FinOps
- High-availability and disaster recovery
- Stakeholder management and communication
- Technical strategy and roadmapping
- Reading List: Recommended books, papers, and articles
- Tools: Enterprise architecture and AI infrastructure tools
- References: Documentation, standards, frameworks
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
- Fix typos or errors in content
- Add new exercises or examples
- Improve architecture diagrams
- Share your project solutions (anonymized)
- Suggest new topics or modules
- Provide feedback on curriculum
This curriculum is licensed under the MIT License. See LICENSE for details.
- Email: ai-infra-curriculum@joshua-ferguson.com
- Organization: github.com/ai-infra-curriculum
- Discussions: GitHub Discussions for Q&A
- Issues: GitHub Issues for bug reports
This curriculum was developed through research of real-world job requirements, industry best practices, and contributions from AI infrastructure practitioners at leading tech companies.
Ready to become an AI Infrastructure Architect? Start with Module 301 →