AI Infrastructure Architect - Learning Repository

Overview

Welcome to the AI Infrastructure Architect Learning Repository! This comprehensive curriculum is designed to develop enterprise-scale AI infrastructure architecture skills. This is Level 3 in the AI Infrastructure career progression, building upon the Engineer and Senior Engineer foundations.

What You'll Learn

This curriculum transforms senior engineers into architects capable of:

Designing end-to-end enterprise AI/ML platform architectures
Creating technical strategy and multi-year roadmaps for AI infrastructure
Establishing architectural patterns and standards across organizations
Leading cross-functional architecture initiatives
Architecting multi-cloud and hybrid AI deployment solutions
Designing high-availability, fault-tolerant ML systems for mission-critical applications
Creating comprehensive security and compliance architectures
Designing cost-optimization strategies for large-scale AI workloads

Prerequisites

Required Foundation

Completed: Senior AI Infrastructure Engineer level (or equivalent 5-8 years experience)
Education: Bachelor's degree required; Master's degree in Computer Science, Data Science, or AI preferred
Experience: 5-8 years as Senior AI/ML Infrastructure Engineer with proven track record of designing large-scale ML platforms
Skills: Deep understanding of all senior engineer topics, enterprise architecture awareness, business acumen

Technical Prerequisites

Advanced Kubernetes (operators, CRDs, multi-cluster)
CUDA programming and GPU optimization
Distributed training (Ray, Horovod)
Multi-cloud architecture experience
Advanced security and compliance knowledge
MLOps platform design
Infrastructure as Code mastery

Soft Skills Prerequisites

Stakeholder management experience
Technical presentation skills
Cross-functional collaboration
Strategic thinking ability

Repository Structure

ai-infra-architect-learning/
├── lessons/                 # 10 comprehensive modules
│   ├── mod-301-enterprise-architecture/
│   ├── mod-302-multicloud-hybrid/
│   ├── mod-303-security-compliance/
│   ├── mod-304-cost-finops/
│   ├── mod-305-ha-dr/
│   ├── mod-306-enterprise-mlops/
│   ├── mod-307-data-architecture/
│   ├── mod-308-llm-rag/
│   ├── mod-309-arch-communication/
│   └── mod-310-emerging-tech/
├── projects/                # 5 comprehensive projects
│   ├── project-301-enterprise-mlops-platform/
│   ├── project-302-multicloud-infrastructure/
│   ├── project-303-llm-rag-platform/
│   ├── project-304-data-platform/
│   └── project-305-security-framework/
├── assessments/             # Quizzes and practical exams
│   ├── quizzes/
│   └── practical-exams/
├── resources/               # Additional learning materials
│   ├── reading-list.md
│   ├── tools.md
│   └── references.md
├── CURRICULUM.md            # Detailed curriculum guide
└── CONTRIBUTING.md          # Contribution guidelines

Learning Path

Estimated Time to Completion

600 hours (approximately 10-12 months full-time or 20 months part-time)

Module Overview (10 modules)

Module	Topic	Duration	Focus Area
301	Enterprise Architecture Fundamentals	50 hrs	TOGAF, ADM, governance
302	Multi-Cloud and Hybrid Architecture	60 hrs	Multi-cloud strategy, vendor management
303	Enterprise Security and Compliance	55 hrs	Zero-trust, GDPR, HIPAA, SOC2
304	Cost Optimization and FinOps	45 hrs	TCO, cost allocation, optimization
305	High-Availability and Disaster Recovery	50 hrs	99.95%+ uptime, DR planning
306	Enterprise MLOps Platform Architecture	55 hrs	Model governance, feature stores
307	Data Architecture and Engineering for AI	50 hrs	Data lakehouse, governance
308	LLM Platform and RAG Architecture	55 hrs	Enterprise LLM, RAG at scale
309	Architecture Communication and Leadership	40 hrs	Executive comms, stakeholder mgmt
310	Emerging Technologies and Innovation	40 hrs	Future tech, innovation frameworks

Project Overview (5 projects)

Project	Theme	Duration	Key Deliverables
301	Enterprise ML Platform Architecture	80 hrs	Complete platform architecture, ADRs, reference architecture
302	Multi-Cloud AI Infrastructure	100 hrs	Multi-cloud architecture, cost model, DR plan
303	LLM Platform with RAG	90 hrs	LLM platform design, governance framework
304	Data Platform for AI	85 hrs	Data lakehouse architecture, governance
305	Security and Compliance Framework	70 hrs	Security architecture, compliance docs

Total Project Time: 425 hours

Architecture Responsibilities

As an AI Infrastructure Architect, you will:

Strategic

Define technical strategy and multi-year roadmaps
Establish architectural patterns and standards
Lead technology evaluation and vendor selection
Design cost-optimization strategies
Create governance frameworks

Technical

Design end-to-end enterprise AI/ML platforms
Architect multi-cloud and hybrid solutions
Design high-availability, fault-tolerant systems
Create security and compliance architectures
Design scalable feature stores and data platforms

Leadership

Lead cross-functional architecture initiatives
Collaborate with business leaders on requirements
Mentor architects and senior engineers
Drive architectural improvements across teams
Lead architecture review boards

Getting Started

1. Environment Setup

# Clone this repository
git clone https://github.com/ai-infra-curriculum/ai-infra-architect-learning.git
cd ai-infra-architect-learning

# Set up Python environment (Python 3.11+)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install architecture tools
# - TOGAF certification study materials
# - Architecture modeling tools (ArchiMate, draw.io)
# - Cloud architecture tools (AWS Well-Architected, GCP Architecture Framework)

2. Access Required Tools

Essential Tools

Cloud Accounts: AWS, GCP, Azure (architect-level access)
Architecture Tools: Draw.io or Lucidchart for diagrams
TOGAF: Study materials for TOGAF 9 certification
Kubernetes: EKS, GKE, or AKS clusters for labs
Collaboration: Confluence or Notion for architecture documentation

Recommended Tools

Architecture Frameworks: ArchiMate for enterprise architecture
Cost Tools: CloudHealth, CloudCheckr, or native cloud cost tools
Security: Cloud security posture management tools
Monitoring: Enterprise observability platforms (DataDog, New Relic)

3. Follow the Learning Path

Start with Module 301: Enterprise Architecture Fundamentals
Complete modules sequentially: Each builds on previous knowledge
Apply learning in projects: Start projects after completing relevant modules
Take assessments: Verify understanding before progressing
Build portfolio: Document architectures and decisions

4. Join the Community

Discussions: GitHub Discussions for Q&A and collaboration
Issues: Report problems or suggest improvements
Slack/Discord: Join the AI Infrastructure community (link in repo)

Module Details

Module 301: Enterprise Architecture Fundamentals (50 hours)

Learn enterprise architecture frameworks, TOGAF, and governance processes.

Objectives: Apply TOGAF ADM, create architecture documentation, lead governance
Topics: Enterprise architecture, TOGAF, Zachman, governance, stakeholder management
Project: Design reference architecture for AI platform

Start Module 301 →

Module 302: Multi-Cloud and Hybrid Architecture (60 hours)

Master multi-cloud strategy and hybrid cloud architecture patterns.

Objectives: Design multi-cloud ML architectures, optimize vendor selection
Topics: Multi-cloud strategy, hybrid cloud, vendor selection, migration
Project: Multi-cloud ML platform architecture

Start Module 302 →

Module 303: Enterprise Security and Compliance (55 hours)

Design comprehensive security architectures for ML systems.

Objectives: Architect zero-trust ML systems, implement compliance frameworks
Topics: Zero-trust, GDPR, HIPAA, SOC2, data governance, privacy-preserving ML
Project: Security architecture for regulated industry ML system

Start Module 303 →

Module 304: Cost Optimization and FinOps (45 hours)

Master cost optimization and FinOps practices for AI infrastructure.

Objectives: Design cost-optimized architectures, implement FinOps
Topics: FinOps, TCO analysis, cost allocation, chargeback models
Project: Cost-optimized ML platform with FinOps governance

Start Module 304 →

Module 305: High-Availability and Disaster Recovery (50 hours)

Design HA/DR architectures for mission-critical ML systems.

Objectives: Design 99.95%+ uptime systems, create DR plans
Topics: HA patterns, DR planning, chaos engineering, self-healing systems
Project: HA ML platform with comprehensive DR plan

Start Module 305 →

Module 306: Enterprise MLOps Platform Architecture (55 hours)

Architect enterprise-scale MLOps platforms with governance.

Objectives: Design MLOps platforms, create governance frameworks
Topics: MLOps architecture, model lifecycle, feature stores, real-time serving
Project: Enterprise MLOps platform architecture

Start Module 306 →

Module 307: Data Architecture and Engineering for AI (50 hours)

Design data platforms supporting AI/ML workloads at scale.

Objectives: Architect data lakehouses, implement governance
Topics: Data architecture, lakehouse, streaming, governance, data quality
Project: Real-time data platform for ML

Start Module 307 →

Module 308: LLM Platform and RAG Architecture (55 hours)

Architect enterprise LLM platforms with RAG and governance.

Objectives: Design LLM platforms, implement RAG at scale, create governance
Topics: LLM architecture, RAG, vector databases, safety, cost optimization
Project: Enterprise LLM platform with RAG

Start Module 308 →

Module 309: Architecture Communication and Leadership (40 hours)

Master architecture communication and stakeholder management.

Objectives: Present to executives, lead governance, build consensus
Topics: Executive comms, visual communication, ADRs, stakeholder management
Project: Executive architecture presentation

Start Module 309 →

Module 310: Emerging Technologies and Innovation (40 hours)

Evaluate emerging AI technologies and design innovation frameworks.

Objectives: Assess emerging tech, create innovation frameworks, build roadmaps
Topics: Emerging AI hardware, edge AI, quantum computing, responsible AI
Project: Technology roadmap and innovation framework

Start Module 310 →

Projects

Project 301: Enterprise ML Platform Architecture

Duration: 80 hours | Difficulty: High

Design a complete enterprise MLOps platform architecture including governance, security, and cost optimization for an organization with 100+ data scientists.

Key Deliverables:

Comprehensive architecture documentation with diagrams
Architecture Decision Records (ADRs) for key decisions
Reference architecture artifacts
Governance framework
Cost model and optimization strategy

View Project Details →

Project 302: Multi-Cloud AI Infrastructure

Duration: 100 hours | Difficulty: Very High

Architect a multi-cloud ML platform spanning AWS, GCP, and Azure with HA/DR, cost optimization, and compliance.

Key Deliverables:

Multi-cloud architecture design
Vendor selection framework
HA/DR architecture and runbooks
Cost model and optimization
Migration roadmap

View Project Details →

Project 303: LLM Platform with RAG

Duration: 90 hours | Difficulty: Very High

Design an enterprise LLM platform with RAG capabilities, governance, safety guardrails, and cost optimization.

Key Deliverables:

LLM platform architecture
RAG system design at scale
Governance and safety framework
Cost-performance optimization strategy
Model selection framework

View Project Details →

Project 304: Data Platform for AI

Duration: 85 hours | Difficulty: High

Architect a real-time data platform supporting ML workloads with data governance, quality, and lineage.

Key Deliverables:

Data lakehouse architecture
Real-time streaming architecture
Data governance framework
Data quality monitoring system
Metadata and lineage tracking

View Project Details →

Project 305: Security and Compliance Framework

Duration: 70 hours | Difficulty: High

Create a comprehensive security architecture for ML systems in a regulated industry (healthcare or finance).

Key Deliverables:

Zero-trust security architecture
Compliance framework (HIPAA or GDPR)
Data governance and privacy architecture
Security audit checklist
Incident response playbooks

View Project Details →

Assessment Criteria

Knowledge Assessments

Quizzes: 80% minimum passing score on all module quizzes
Coverage: 15-20 questions per module covering key concepts
Retakes: Unlimited retakes with randomized questions

Practical Projects

Completion: All 5 projects must be completed
Quality: Architecture designs must meet acceptance criteria
Documentation: Comprehensive architecture documentation required
Peer Review: Optional peer review for feedback

Portfolio Development

Architecture Documentation: ADRs, reference architectures, design docs
Demonstrated Impact: Cost savings, efficiency improvements documented
Thought Leadership: Optional blog posts, presentations
Certifications: TOGAF 9 certification recommended

Recommended Certifications

Priority Certifications

TOGAF 9 Certified (Highest Priority)
AWS Solutions Architect – Professional
Google Cloud Professional Cloud Architect
Microsoft Azure Solutions Architect Expert

Supporting Certifications

Certified Kubernetes Administrator (CKA)
Certified Kubernetes Security Specialist (CKS)
CISSP or cloud security certification
FinOps Certified Practitioner

Career Outcomes

Upon completing this curriculum, you will be prepared for:

Roles

AI Infrastructure Architect
ML Platform Architect
Cloud Architect (AI/ML)
Principal ML Infrastructure Engineer
Solutions Architect (AI/ML)

Salary Range

Median: $210,000 USD
Range: $165,000 - $350,000 USD (total compensation with equity)
Top Companies: FAANG often exceeds $300,000

Skills Demonstrated

Enterprise-scale AI platform architecture
Multi-cloud and hybrid architecture design
Security and compliance architecture
Cost optimization and FinOps
High-availability and disaster recovery
Stakeholder management and communication
Technical strategy and roadmapping

Resources

Reading List: Recommended books, papers, and articles
Tools: Enterprise architecture and AI infrastructure tools
References: Documentation, standards, frameworks

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Ways to Contribute

Fix typos or errors in content
Add new exercises or examples
Improve architecture diagrams
Share your project solutions (anonymized)
Suggest new topics or modules
Provide feedback on curriculum

License

This curriculum is licensed under the MIT License. See LICENSE for details.

Contact

Email: ai-infra-curriculum@joshua-ferguson.com
Organization: github.com/ai-infra-curriculum
Discussions: GitHub Discussions for Q&A
Issues: GitHub Issues for bug reports

Acknowledgments

This curriculum was developed through research of real-world job requirements, industry best practices, and contributions from AI infrastructure practitioners at leading tech companies.

Ready to become an AI Infrastructure Architect? Start with Module 301 →

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
assessments/practical-exams		assessments/practical-exams
lessons		lessons
projects		projects
resources		resources
CONTRIBUTING.md		CONTRIBUTING.md
CURRICULUM.md		CURRICULUM.md
README.md		README.md
REPOSITORY-REPORT.md		REPOSITORY-REPORT.md
generate_structure.py		generate_structure.py

Folders and files

Latest commit

History

Repository files navigation

AI Infrastructure Architect - Learning Repository

Overview

What You'll Learn

Prerequisites

Required Foundation

Technical Prerequisites

Soft Skills Prerequisites

Repository Structure

Learning Path

Estimated Time to Completion

Module Overview (10 modules)

Project Overview (5 projects)

Architecture Responsibilities

Strategic

Technical

Leadership

Getting Started

1. Environment Setup

2. Access Required Tools

Essential Tools

Recommended Tools

3. Follow the Learning Path

4. Join the Community

Module Details

Module 301: Enterprise Architecture Fundamentals (50 hours)

Module 302: Multi-Cloud and Hybrid Architecture (60 hours)

Module 303: Enterprise Security and Compliance (55 hours)

Module 304: Cost Optimization and FinOps (45 hours)

Module 305: High-Availability and Disaster Recovery (50 hours)

Module 306: Enterprise MLOps Platform Architecture (55 hours)

Module 307: Data Architecture and Engineering for AI (50 hours)

Module 308: LLM Platform and RAG Architecture (55 hours)

Module 309: Architecture Communication and Leadership (40 hours)

Module 310: Emerging Technologies and Innovation (40 hours)

Projects

Project 301: Enterprise ML Platform Architecture

Project 302: Multi-Cloud AI Infrastructure

Project 303: LLM Platform with RAG

Project 304: Data Platform for AI

Project 305: Security and Compliance Framework

Assessment Criteria

Knowledge Assessments

Practical Projects

Portfolio Development

Recommended Certifications

Priority Certifications

Supporting Certifications

Career Outcomes

Roles

Salary Range

Skills Demonstrated

Resources

Contributing

Ways to Contribute

License

Contact

Acknowledgments

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages