Skip to content

ai-infra-curriculum/ai-infra-architect-learning

Repository files navigation

AI Infrastructure Architect - Learning Repository

License: MIT Level: Architect Duration: 600 hours

Overview

Welcome to the AI Infrastructure Architect Learning Repository! This comprehensive curriculum is designed to develop enterprise-scale AI infrastructure architecture skills. This is Level 3 in the AI Infrastructure career progression, building upon the Engineer and Senior Engineer foundations.

What You'll Learn

This curriculum transforms senior engineers into architects capable of:

  • Designing end-to-end enterprise AI/ML platform architectures
  • Creating technical strategy and multi-year roadmaps for AI infrastructure
  • Establishing architectural patterns and standards across organizations
  • Leading cross-functional architecture initiatives
  • Architecting multi-cloud and hybrid AI deployment solutions
  • Designing high-availability, fault-tolerant ML systems for mission-critical applications
  • Creating comprehensive security and compliance architectures
  • Designing cost-optimization strategies for large-scale AI workloads

Prerequisites

Required Foundation

  • Completed: Senior AI Infrastructure Engineer level (or equivalent 5-8 years experience)
  • Education: Bachelor's degree required; Master's degree in Computer Science, Data Science, or AI preferred
  • Experience: 5-8 years as Senior AI/ML Infrastructure Engineer with proven track record of designing large-scale ML platforms
  • Skills: Deep understanding of all senior engineer topics, enterprise architecture awareness, business acumen

Technical Prerequisites

  • Advanced Kubernetes (operators, CRDs, multi-cluster)
  • CUDA programming and GPU optimization
  • Distributed training (Ray, Horovod)
  • Multi-cloud architecture experience
  • Advanced security and compliance knowledge
  • MLOps platform design
  • Infrastructure as Code mastery

Soft Skills Prerequisites

  • Stakeholder management experience
  • Technical presentation skills
  • Cross-functional collaboration
  • Strategic thinking ability

Repository Structure

ai-infra-architect-learning/
├── lessons/                 # 10 comprehensive modules
│   ├── mod-301-enterprise-architecture/
│   ├── mod-302-multicloud-hybrid/
│   ├── mod-303-security-compliance/
│   ├── mod-304-cost-finops/
│   ├── mod-305-ha-dr/
│   ├── mod-306-enterprise-mlops/
│   ├── mod-307-data-architecture/
│   ├── mod-308-llm-rag/
│   ├── mod-309-arch-communication/
│   └── mod-310-emerging-tech/
├── projects/                # 5 comprehensive projects
│   ├── project-301-enterprise-mlops-platform/
│   ├── project-302-multicloud-infrastructure/
│   ├── project-303-llm-rag-platform/
│   ├── project-304-data-platform/
│   └── project-305-security-framework/
├── assessments/             # Quizzes and practical exams
│   ├── quizzes/
│   └── practical-exams/
├── resources/               # Additional learning materials
│   ├── reading-list.md
│   ├── tools.md
│   └── references.md
├── CURRICULUM.md            # Detailed curriculum guide
└── CONTRIBUTING.md          # Contribution guidelines

Learning Path

Estimated Time to Completion

600 hours (approximately 10-12 months full-time or 20 months part-time)

Module Overview (10 modules)

Module Topic Duration Focus Area
301 Enterprise Architecture Fundamentals 50 hrs TOGAF, ADM, governance
302 Multi-Cloud and Hybrid Architecture 60 hrs Multi-cloud strategy, vendor management
303 Enterprise Security and Compliance 55 hrs Zero-trust, GDPR, HIPAA, SOC2
304 Cost Optimization and FinOps 45 hrs TCO, cost allocation, optimization
305 High-Availability and Disaster Recovery 50 hrs 99.95%+ uptime, DR planning
306 Enterprise MLOps Platform Architecture 55 hrs Model governance, feature stores
307 Data Architecture and Engineering for AI 50 hrs Data lakehouse, governance
308 LLM Platform and RAG Architecture 55 hrs Enterprise LLM, RAG at scale
309 Architecture Communication and Leadership 40 hrs Executive comms, stakeholder mgmt
310 Emerging Technologies and Innovation 40 hrs Future tech, innovation frameworks

Project Overview (5 projects)

Project Theme Duration Key Deliverables
301 Enterprise ML Platform Architecture 80 hrs Complete platform architecture, ADRs, reference architecture
302 Multi-Cloud AI Infrastructure 100 hrs Multi-cloud architecture, cost model, DR plan
303 LLM Platform with RAG 90 hrs LLM platform design, governance framework
304 Data Platform for AI 85 hrs Data lakehouse architecture, governance
305 Security and Compliance Framework 70 hrs Security architecture, compliance docs

Total Project Time: 425 hours

Architecture Responsibilities

As an AI Infrastructure Architect, you will:

Strategic

  • Define technical strategy and multi-year roadmaps
  • Establish architectural patterns and standards
  • Lead technology evaluation and vendor selection
  • Design cost-optimization strategies
  • Create governance frameworks

Technical

  • Design end-to-end enterprise AI/ML platforms
  • Architect multi-cloud and hybrid solutions
  • Design high-availability, fault-tolerant systems
  • Create security and compliance architectures
  • Design scalable feature stores and data platforms

Leadership

  • Lead cross-functional architecture initiatives
  • Collaborate with business leaders on requirements
  • Mentor architects and senior engineers
  • Drive architectural improvements across teams
  • Lead architecture review boards

Getting Started

1. Environment Setup

# Clone this repository
git clone https://github.com/ai-infra-curriculum/ai-infra-architect-learning.git
cd ai-infra-architect-learning

# Set up Python environment (Python 3.11+)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install architecture tools
# - TOGAF certification study materials
# - Architecture modeling tools (ArchiMate, draw.io)
# - Cloud architecture tools (AWS Well-Architected, GCP Architecture Framework)

2. Access Required Tools

Essential Tools

  • Cloud Accounts: AWS, GCP, Azure (architect-level access)
  • Architecture Tools: Draw.io or Lucidchart for diagrams
  • TOGAF: Study materials for TOGAF 9 certification
  • Kubernetes: EKS, GKE, or AKS clusters for labs
  • Collaboration: Confluence or Notion for architecture documentation

Recommended Tools

  • Architecture Frameworks: ArchiMate for enterprise architecture
  • Cost Tools: CloudHealth, CloudCheckr, or native cloud cost tools
  • Security: Cloud security posture management tools
  • Monitoring: Enterprise observability platforms (DataDog, New Relic)

3. Follow the Learning Path

  1. Start with Module 301: Enterprise Architecture Fundamentals
  2. Complete modules sequentially: Each builds on previous knowledge
  3. Apply learning in projects: Start projects after completing relevant modules
  4. Take assessments: Verify understanding before progressing
  5. Build portfolio: Document architectures and decisions

4. Join the Community

  • Discussions: GitHub Discussions for Q&A and collaboration
  • Issues: Report problems or suggest improvements
  • Slack/Discord: Join the AI Infrastructure community (link in repo)

Module Details

Module 301: Enterprise Architecture Fundamentals (50 hours)

Learn enterprise architecture frameworks, TOGAF, and governance processes.

  • Objectives: Apply TOGAF ADM, create architecture documentation, lead governance
  • Topics: Enterprise architecture, TOGAF, Zachman, governance, stakeholder management
  • Project: Design reference architecture for AI platform

Start Module 301 →

Module 302: Multi-Cloud and Hybrid Architecture (60 hours)

Master multi-cloud strategy and hybrid cloud architecture patterns.

  • Objectives: Design multi-cloud ML architectures, optimize vendor selection
  • Topics: Multi-cloud strategy, hybrid cloud, vendor selection, migration
  • Project: Multi-cloud ML platform architecture

Start Module 302 →

Module 303: Enterprise Security and Compliance (55 hours)

Design comprehensive security architectures for ML systems.

  • Objectives: Architect zero-trust ML systems, implement compliance frameworks
  • Topics: Zero-trust, GDPR, HIPAA, SOC2, data governance, privacy-preserving ML
  • Project: Security architecture for regulated industry ML system

Start Module 303 →

Module 304: Cost Optimization and FinOps (45 hours)

Master cost optimization and FinOps practices for AI infrastructure.

  • Objectives: Design cost-optimized architectures, implement FinOps
  • Topics: FinOps, TCO analysis, cost allocation, chargeback models
  • Project: Cost-optimized ML platform with FinOps governance

Start Module 304 →

Module 305: High-Availability and Disaster Recovery (50 hours)

Design HA/DR architectures for mission-critical ML systems.

  • Objectives: Design 99.95%+ uptime systems, create DR plans
  • Topics: HA patterns, DR planning, chaos engineering, self-healing systems
  • Project: HA ML platform with comprehensive DR plan

Start Module 305 →

Module 306: Enterprise MLOps Platform Architecture (55 hours)

Architect enterprise-scale MLOps platforms with governance.

  • Objectives: Design MLOps platforms, create governance frameworks
  • Topics: MLOps architecture, model lifecycle, feature stores, real-time serving
  • Project: Enterprise MLOps platform architecture

Start Module 306 →

Module 307: Data Architecture and Engineering for AI (50 hours)

Design data platforms supporting AI/ML workloads at scale.

  • Objectives: Architect data lakehouses, implement governance
  • Topics: Data architecture, lakehouse, streaming, governance, data quality
  • Project: Real-time data platform for ML

Start Module 307 →

Module 308: LLM Platform and RAG Architecture (55 hours)

Architect enterprise LLM platforms with RAG and governance.

  • Objectives: Design LLM platforms, implement RAG at scale, create governance
  • Topics: LLM architecture, RAG, vector databases, safety, cost optimization
  • Project: Enterprise LLM platform with RAG

Start Module 308 →

Module 309: Architecture Communication and Leadership (40 hours)

Master architecture communication and stakeholder management.

  • Objectives: Present to executives, lead governance, build consensus
  • Topics: Executive comms, visual communication, ADRs, stakeholder management
  • Project: Executive architecture presentation

Start Module 309 →

Module 310: Emerging Technologies and Innovation (40 hours)

Evaluate emerging AI technologies and design innovation frameworks.

  • Objectives: Assess emerging tech, create innovation frameworks, build roadmaps
  • Topics: Emerging AI hardware, edge AI, quantum computing, responsible AI
  • Project: Technology roadmap and innovation framework

Start Module 310 →

Projects

Project 301: Enterprise ML Platform Architecture

Duration: 80 hours | Difficulty: High

Design a complete enterprise MLOps platform architecture including governance, security, and cost optimization for an organization with 100+ data scientists.

Key Deliverables:

  • Comprehensive architecture documentation with diagrams
  • Architecture Decision Records (ADRs) for key decisions
  • Reference architecture artifacts
  • Governance framework
  • Cost model and optimization strategy

View Project Details →

Project 302: Multi-Cloud AI Infrastructure

Duration: 100 hours | Difficulty: Very High

Architect a multi-cloud ML platform spanning AWS, GCP, and Azure with HA/DR, cost optimization, and compliance.

Key Deliverables:

  • Multi-cloud architecture design
  • Vendor selection framework
  • HA/DR architecture and runbooks
  • Cost model and optimization
  • Migration roadmap

View Project Details →

Project 303: LLM Platform with RAG

Duration: 90 hours | Difficulty: Very High

Design an enterprise LLM platform with RAG capabilities, governance, safety guardrails, and cost optimization.

Key Deliverables:

  • LLM platform architecture
  • RAG system design at scale
  • Governance and safety framework
  • Cost-performance optimization strategy
  • Model selection framework

View Project Details →

Project 304: Data Platform for AI

Duration: 85 hours | Difficulty: High

Architect a real-time data platform supporting ML workloads with data governance, quality, and lineage.

Key Deliverables:

  • Data lakehouse architecture
  • Real-time streaming architecture
  • Data governance framework
  • Data quality monitoring system
  • Metadata and lineage tracking

View Project Details →

Project 305: Security and Compliance Framework

Duration: 70 hours | Difficulty: High

Create a comprehensive security architecture for ML systems in a regulated industry (healthcare or finance).

Key Deliverables:

  • Zero-trust security architecture
  • Compliance framework (HIPAA or GDPR)
  • Data governance and privacy architecture
  • Security audit checklist
  • Incident response playbooks

View Project Details →

Assessment Criteria

Knowledge Assessments

  • Quizzes: 80% minimum passing score on all module quizzes
  • Coverage: 15-20 questions per module covering key concepts
  • Retakes: Unlimited retakes with randomized questions

Practical Projects

  • Completion: All 5 projects must be completed
  • Quality: Architecture designs must meet acceptance criteria
  • Documentation: Comprehensive architecture documentation required
  • Peer Review: Optional peer review for feedback

Portfolio Development

  • Architecture Documentation: ADRs, reference architectures, design docs
  • Demonstrated Impact: Cost savings, efficiency improvements documented
  • Thought Leadership: Optional blog posts, presentations
  • Certifications: TOGAF 9 certification recommended

Recommended Certifications

Priority Certifications

  1. TOGAF 9 Certified (Highest Priority)
  2. AWS Solutions Architect – Professional
  3. Google Cloud Professional Cloud Architect
  4. Microsoft Azure Solutions Architect Expert

Supporting Certifications

  1. Certified Kubernetes Administrator (CKA)
  2. Certified Kubernetes Security Specialist (CKS)
  3. CISSP or cloud security certification
  4. FinOps Certified Practitioner

Career Outcomes

Upon completing this curriculum, you will be prepared for:

Roles

  • AI Infrastructure Architect
  • ML Platform Architect
  • Cloud Architect (AI/ML)
  • Principal ML Infrastructure Engineer
  • Solutions Architect (AI/ML)

Salary Range

  • Median: $210,000 USD
  • Range: $165,000 - $350,000 USD (total compensation with equity)
  • Top Companies: FAANG often exceeds $300,000

Skills Demonstrated

  • Enterprise-scale AI platform architecture
  • Multi-cloud and hybrid architecture design
  • Security and compliance architecture
  • Cost optimization and FinOps
  • High-availability and disaster recovery
  • Stakeholder management and communication
  • Technical strategy and roadmapping

Resources

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Ways to Contribute

  • Fix typos or errors in content
  • Add new exercises or examples
  • Improve architecture diagrams
  • Share your project solutions (anonymized)
  • Suggest new topics or modules
  • Provide feedback on curriculum

License

This curriculum is licensed under the MIT License. See LICENSE for details.

Contact

Acknowledgments

This curriculum was developed through research of real-world job requirements, industry best practices, and contributions from AI infrastructure practitioners at leading tech companies.


Ready to become an AI Infrastructure Architect? Start with Module 301 →

About

AI Infrastructure Architect Learning Track - System design and architecture patterns for ML

Topics

Resources

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages