AI Infrastructure Architect Curriculum Guide

Overview

This curriculum develops the skills needed to design AI infrastructure at enterprise scale. It is designed for senior engineers moving into architect roles and covers enterprise architecture, multi-cloud design, security, cost optimization, and strategic technology leadership.

Curriculum Philosophy

Learning Approach

  • Theory + Practice: Blend enterprise architecture frameworks with hands-on design
  • Progressive Complexity: Build from fundamentals to enterprise-scale systems
  • Real-World Focus: All content based on actual industry requirements
  • Portfolio Building: Each project creates artifacts for a professional portfolio

Assessment Approach

  • Continuous Evaluation: Quizzes after each module
  • Project-Based Assessment: Comprehensive architecture designs
  • Peer Review: Optional peer feedback on architecture decisions
  • Portfolio Development: Documentation and architecture artifacts

Learning Objectives

Upon completing this curriculum, you will be able to:

Strategic

  1. Define technical strategy and multi-year roadmaps for AI infrastructure
  2. Establish architectural patterns and standards across organizations
  3. Lead technology evaluation and vendor selection processes
  4. Design cost-optimization strategies for large-scale AI workloads
  5. Create governance frameworks for ML systems

Technical

  1. Design end-to-end enterprise AI/ML platform architectures
  2. Architect multi-cloud and hybrid AI deployment solutions
  3. Design high-availability, fault-tolerant ML systems (99.95%+ uptime)
  4. Create comprehensive security and compliance architectures
  5. Architect scalable feature stores and real-time data platforms
  6. Design enterprise LLM platforms with RAG capabilities
  7. Implement model lifecycle management and governance
  8. Design disaster recovery and business continuity strategies

Leadership

  1. Lead cross-functional architecture initiatives
  2. Collaborate effectively with business leaders on requirements
  3. Mentor architects and senior engineers across organizations
  4. Drive architectural improvements across teams
  5. Lead architecture review boards and governance processes
  6. Communicate architecture effectively to all stakeholders

Module Breakdown

Phase 1: Enterprise Architecture Foundations (Modules 301-302)

Module 301: Enterprise Architecture Fundamentals (50 hours)

Focus: TOGAF, ADM, governance, stakeholder management

Learning Outcomes:

  • Apply enterprise architecture frameworks (TOGAF ADM)
  • Design enterprise-scale system architectures
  • Create comprehensive architecture documentation
  • Lead architecture governance processes

Topics:

  • Introduction to enterprise architecture
  • TOGAF framework and Architecture Development Method (ADM)
  • Zachman Framework overview
  • Business architecture and value streams
  • Architecture viewpoints and perspectives
  • Architecture patterns for enterprise systems
  • Reference architectures and blueprints
  • Architecture governance and review boards
  • Architecture Decision Records (ADRs)
  • Stakeholder management for architects

Key Activities:

  • Apply TOGAF ADM to ML platform design
  • Create comprehensive architecture documentation
  • Design reference architecture for AI platform
  • Lead architecture review session
  • Develop architecture decision framework
  • Create stakeholder communication materials

Assessment:

  • Quiz: 15 questions on EA concepts and TOGAF
  • Practical: Design enterprise AI platform architecture
  • Deliverable: Architecture documentation with diagrams
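
Architecture Decision Records, listed in the topics above, are usually short, structured text files. The sketch below shows one possible shape, assuming the common title/status/context/decision/consequences layout. The `ADR` class, its field names, and the sample content are illustrative and not part of the curriculum materials.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ADR:
    """Hypothetical in-memory form of an Architecture Decision Record."""
    number: int
    title: str
    status: str  # e.g. "Proposed", "Accepted", "Superseded"
    context: str
    decision: str
    consequences: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, one section per heading."""
        lines = [
            f"ADR-{self.number:04d}: {self.title}",
            f"Status: {self.status}",
            "",
            "Context",
            self.context,
            "",
            "Decision",
            self.decision,
            "",
            "Consequences",
        ]
        lines += [f"- {c}" for c in self.consequences]
        return "\n".join(lines)


# Sample content is invented for illustration only.
example = ADR(
    number=1,
    title="Use a single model registry for all teams",
    status="Proposed",
    context="Teams currently track model versions in separate tools.",
    decision="Adopt one shared registry as the system of record.",
    consequences=[
        "Governance reviews read from one source.",
        "Teams must migrate existing model metadata.",
    ],
)

if __name__ == "__main__":
    print(example.render())
```

Keeping the record as structured data makes it easy to render the same decision in different formats for different stakeholders.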

Module 302: Multi-Cloud and Hybrid Architecture Design (60 hours)

Focus: Multi-cloud strategy, hybrid cloud, vendor management

Learning Outcomes:

  • Design comprehensive multi-cloud ML architectures
  • Architect hybrid cloud solutions
  • Optimize cloud vendor selection and strategy
  • Design cloud migration and integration patterns

Topics:

  • Multi-cloud strategy: when, why, and how
  • Cloud vendor comparison: AWS vs GCP vs Azure
  • Hybrid cloud architecture patterns
  • Cloud-agnostic design principles
  • Multi-cloud networking and connectivity
  • Data residency and sovereignty considerations
  • Multi-cloud cost optimization strategies
  • Vendor lock-in mitigation techniques
  • Cloud migration strategies and planning
  • Multi-cloud management and governance

Key Activities:

  • Design multi-cloud ML platform architecture
  • Create cloud vendor selection framework
  • Design hybrid cloud integration architecture
  • Develop multi-cloud cost model
  • Create cloud migration roadmap
  • Build multi-cloud governance framework

Assessment:

  • Quiz: 15 questions on multi-cloud concepts
  • Architecture design: Multi-cloud ML platform
  • Case study: Vendor selection analysis

Phase 2: Security, Cost, and Reliability (Modules 303-305)

Module 303: Enterprise Security and Compliance Architecture (55 hours)

Focus: Zero-trust, compliance frameworks, data governance

Learning Outcomes:

  • Design comprehensive security architectures for ML systems
  • Architect compliance frameworks for regulated industries
  • Implement zero-trust architectures at scale
  • Design data governance and privacy frameworks

Topics:

  • Enterprise security architecture principles
  • Zero-trust architecture for ML platforms
  • Identity and access management (IAM) at scale
  • Data encryption and key management
  • Compliance frameworks: GDPR, HIPAA, SOC 2, ISO 27001
  • AI-specific regulations: EU AI Act, US frameworks
  • Data governance and lineage
  • Privacy-preserving ML (federated learning, differential privacy)
  • Security audit and penetration testing
  • Incident response and disaster recovery planning

Key Activities:

  • Design zero-trust architecture for ML platform
  • Create compliance framework for healthcare ML
  • Design data governance architecture
  • Implement privacy-preserving ML architecture
  • Conduct security architecture review
  • Create incident response playbooks

Assessment:

  • Quiz: 18 questions on security and compliance
  • Architecture: Secure ML platform for regulated industry
  • Deliverable: Compliance framework documentation

Module 304: Cost Optimization and FinOps Architecture (45 hours)

Focus: TCO, cost allocation, FinOps practices

Learning Outcomes:

  • Design cost-optimized ML infrastructure
  • Implement FinOps practices and frameworks
  • Create cost allocation and chargeback models
  • Optimize total cost of ownership (TCO)

Topics:

  • FinOps principles and frameworks
  • Cloud cost optimization strategies
  • GPU cost optimization techniques
  • Reserved capacity vs spot vs on-demand strategies
  • Cost allocation and tagging strategies
  • Chargeback and showback models
  • TCO analysis for ML infrastructure
  • Cost monitoring and anomaly detection
  • Resource right-sizing and optimization
  • Cost governance and budgeting

Key Activities:

  • Conduct TCO analysis for ML platform
  • Design cost optimization architecture
  • Create cost allocation framework
  • Build cost monitoring and alerting
  • Implement automated cost optimization
  • Develop FinOps governance model

Assessment:

  • Quiz: 12 questions on FinOps
  • Practical: Cost-optimized architecture design
  • Case study: TCO analysis
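
As a rough illustration of the cost allocation and showback topics above, the sketch below sums spend per team tag and spreads untagged spend in proportion to each team's tagged spend. The line items, team names, and the proportional policy are all assumptions made for the example. Real allocation rules vary by organization.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (team tag, or None if untagged; cost in USD) -- invented sample data.
LineItem = Tuple[Optional[str], float]


def showback(items: List[LineItem]) -> Dict[str, float]:
    """Sum cost per team tag, then spread untagged cost in proportion
    to each team's tagged spend (one possible allocation policy)."""
    tagged: Dict[str, float] = defaultdict(float)
    untagged = 0.0
    for team, cost in items:
        if team is None:
            untagged += cost
        else:
            tagged[team] += cost
    total_tagged = sum(tagged.values())
    if total_tagged == 0:
        raise ValueError("no tagged spend to allocate against")
    return {
        team: round(cost + untagged * cost / total_tagged, 2)
        for team, cost in tagged.items()
    }


sample = [
    ("search", 600.0),
    ("ads", 300.0),
    ("ads", 100.0),
    (None, 200.0),  # shared or untagged resources
]

if __name__ == "__main__":
    for team, cost in sorted(showback(sample).items()):
        print(f"{team}: ${cost:,.2f}")
```

The same per-team totals could feed a chargeback model. The difference is whether teams are only shown the numbers or actually billed for them.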

Module 305: High-Availability and Disaster Recovery Architecture (50 hours)

Focus: 99.95%+ uptime, DR planning, chaos engineering

Learning Outcomes:

  • Design highly available ML systems (99.95%+ uptime)
  • Architect disaster recovery and business continuity solutions
  • Implement fault tolerance and resilience patterns
  • Design chaos engineering frameworks

Topics:

  • High-availability architecture patterns
  • Fault tolerance and resilience in ML systems
  • Disaster recovery planning and strategies
  • RPO and RTO requirements for ML systems
  • Multi-region active-active architectures
  • Backup and restore strategies for ML systems
  • Chaos engineering principles and practices
  • Failure mode and effects analysis (FMEA)
  • Circuit breakers and bulkheads
  • Health checks and self-healing systems

Key Activities:

  • Design HA architecture for mission-critical ML
  • Create DR plan and runbooks
  • Implement chaos engineering experiments
  • Design multi-region failover architecture
  • Conduct failure mode analysis
  • Build self-healing system capabilities

Assessment:

  • Quiz: 15 questions on HA/DR
  • Architecture: 99.99% uptime ML platform
  • Deliverable: DR plan and testing procedures
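
The uptime targets in this module translate directly into a yearly downtime budget. The sketch below does that arithmetic, assuming a 365-day year and counting every minute of unavailability against the budget. The function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability must be in (0, 100]")
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Under these assumptions, 99.95% allows roughly 263 minutes of downtime per year (about 4.4 hours), while 99.99% allows about 53 minutes.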

Phase 3: MLOps and Data Platforms (Modules 306-307)

Module 306: Enterprise MLOps Platform Architecture (55 hours)

Focus: Model governance, feature stores, real-time serving

Learning Outcomes:

  • Architect enterprise-scale MLOps platforms
  • Design model governance and lifecycle management
  • Create feature engineering platforms
  • Architect real-time ML serving systems

Topics:

  • Enterprise MLOps platform requirements
  • Model lifecycle management architecture
  • Model governance and compliance frameworks
  • Feature store architecture at scale
  • Real-time feature serving
  • Model registry and artifact management
  • Automated ML pipeline orchestration
  • Model monitoring and observability architecture
  • A/B testing and experimentation platforms
  • ML platform developer experience (DevEx)

Key Activities:

  • Design enterprise MLOps platform architecture
  • Create model governance framework
  • Design feature store for real-time serving
  • Architect experimentation platform
  • Build ML platform API and developer portal
  • Create MLOps maturity assessment

Assessment:

  • Quiz: 16 questions on MLOps architecture
  • Architecture: Enterprise ML platform
  • Deliverable: Governance framework

Module 307: Data Architecture and Engineering for AI (50 hours)

Focus: Data lakehouse, governance, real-time streaming

Learning Outcomes:

  • Design data lake and lakehouse architectures
  • Architect real-time streaming data platforms
  • Create data governance frameworks
  • Design data quality and validation systems

Topics:

  • Data architecture patterns: lake, warehouse, lakehouse, mesh
  • Real-time streaming architectures (Kafka, Kinesis, Pub/Sub)
  • Batch processing architectures (Spark, Flink)
  • Data governance and stewardship
  • Data catalog and metadata management
  • Data quality frameworks
  • Data lineage and provenance
  • Schema evolution and management
  • Data versioning strategies
  • Privacy and compliance in data architecture

Key Activities:

  • Design data lakehouse architecture for ML
  • Create real-time streaming data platform
  • Implement data governance framework
  • Design data quality monitoring system
  • Build data catalog and lineage tracking
  • Create data architecture documentation

Assessment:

  • Quiz: 15 questions on data architecture
  • Architecture: Enterprise data platform for AI
  • Deliverable: Data governance framework

Phase 4: LLM and Communication (Modules 308-309)

Module 308: LLM Platform and RAG Architecture (55 hours)

Focus: Enterprise LLM, RAG at scale, governance

Learning Outcomes:

  • Architect enterprise LLM platforms
  • Design RAG (Retrieval-Augmented Generation) systems at scale
  • Create LLM governance and safety frameworks
  • Optimize LLM infrastructure for cost and performance

Topics:

  • LLM platform architecture patterns
  • Model selection and evaluation frameworks
  • LLM inference optimization at scale
  • RAG architecture: indexing, retrieval, generation
  • Vector database architecture (Pinecone, Weaviate, Milvus)
  • LLM orchestration frameworks (LangChain, LlamaIndex)
  • Prompt engineering and management
  • LLM safety and guardrails
  • LLM observability and monitoring
  • Fine-tuning vs RAG vs prompt engineering tradeoffs

Key Activities:

  • Design enterprise LLM platform architecture
  • Architect scalable RAG system
  • Create LLM governance framework
  • Design vector database architecture
  • Build LLM cost optimization strategy
  • Implement LLM safety and monitoring

Assessment:

  • Quiz: 16 questions on LLM architecture
  • Architecture: Enterprise LLM platform with RAG
  • Case study: Cost-performance optimization
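
The retrieval and generation steps of the RAG topic above can be sketched in a few lines. This is a toy, in-memory version that assumes embeddings already exist and uses cosine similarity for ranking. The three-dimensional vectors, document ids, and helper names are invented for illustration, and a production system would use an embedding model and a vector database instead.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Vector, index: Dict[str, Vector],
             k: int = 2) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Assemble the retrieved text and the question into one prompt string."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy data: real embeddings come from an embedding model.
docs = {
    "gpu-quota": "GPU quota requests go through the platform team.",
    "dr-policy": "Model artifacts are replicated to a second region.",
    "onboarding": "New teams start with the shared feature store.",
}
index = {
    "gpu-quota": [0.9, 0.1, 0.0],
    "dr-policy": [0.1, 0.9, 0.1],
    "onboarding": [0.2, 0.2, 0.9],
}

if __name__ == "__main__":
    query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded user question
    hits = retrieve(query_vec, index, k=2)
    prompt = build_prompt("How do I request GPUs?", [docs[c] for c, _ in hits])
    print(prompt)
```

The prompt string would then be passed to an LLM for the generation step, which is left out here because it depends on the chosen model and provider.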

Module 309: Architecture Communication and Leadership (40 hours)

Focus: Executive communication, stakeholder management, ADRs

Learning Outcomes:

  • Communicate architecture effectively to stakeholders
  • Present to executive leadership and boards
  • Lead architecture governance processes
  • Build consensus across diverse stakeholder groups

Topics:

  • Architecture communication strategies
  • Executive presentation skills
  • Creating compelling architecture narratives
  • Visual communication and diagramming
  • Architecture documentation best practices
  • Leading architecture review boards
  • Building consensus and managing conflicts
  • Influencing without authority
  • Stakeholder analysis and management
  • Change management for architecture initiatives

Key Activities:

  • Create executive-level architecture presentation
  • Develop architecture vision and roadmap
  • Lead architecture review session
  • Build stakeholder engagement plan
  • Create compelling architecture diagrams
  • Facilitate architecture decision workshop

Assessment:

  • Presentation: Executive architecture briefing
  • Documentation: Architecture artifacts review
  • Scenario: Leadership exercises

Phase 5: Innovation and Future (Module 310)

Module 310: Emerging Technologies and Innovation (40 hours)

Focus: Future tech evaluation, innovation frameworks, roadmapping

Learning Outcomes:

  • Evaluate emerging AI technologies
  • Design innovation frameworks
  • Assess technology trends and impact
  • Create technology roadmaps

Topics:

  • Emerging AI hardware: TPUs, custom ASICs, neuromorphic
  • Edge AI and distributed intelligence
  • Federated learning architectures
  • Quantum computing for AI (awareness)
  • Green AI and sustainable computing
  • Responsible AI and ethics frameworks
  • Technology evaluation frameworks
  • Innovation management and R&D processes
  • Technology radar and trend analysis
  • Build vs buy vs partner decision frameworks

Key Activities:

  • Evaluate emerging technology for adoption
  • Create technology radar for AI infrastructure
  • Design innovation pilot program
  • Build technology evaluation framework
  • Develop multi-year technology roadmap
  • Assess responsible AI framework

Assessment:

  • Report: Technology evaluation
  • Framework: Innovation program design
  • Presentation: Technology roadmap

Project Integration

Project 301: Enterprise ML Platform Architecture (80 hours)

Prerequisites: Modules 301, 306

Deliverables: Platform architecture, ADRs, governance framework

Project 302: Multi-Cloud AI Infrastructure (100 hours)

Prerequisites: Modules 302, 305

Deliverables: Multi-cloud design, HA/DR plan, cost model

Project 303: LLM Platform with RAG (90 hours)

Prerequisites: Modules 308, 303

Deliverables: LLM architecture, governance, cost strategy

Project 304: Data Platform for AI (85 hours)

Prerequisites: Modules 307, 306

Deliverables: Lakehouse architecture, governance, lineage

Project 305: Security and Compliance Framework (70 hours)

Prerequisites: Modules 303, 309

Deliverables: Security architecture, compliance docs, playbooks

Assessment Criteria

Module Quizzes

  • Passing Score: 80% minimum
  • Format: Multiple choice, scenario-based questions
  • Retakes: Unlimited with randomized question pools
  • Time Limit: 45-60 minutes per quiz
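
To make the 80% passing score concrete, the sketch below computes the minimum number of correct answers for each quiz length listed in the module assessments. It assumes every question carries equal weight, which the guide does not state explicitly.

```python
PASSING_PERCENT = 80

# Question counts as listed in the module assessments (301-308).
QUIZ_QUESTIONS = {301: 15, 302: 15, 303: 18, 304: 12,
                  305: 15, 306: 16, 307: 15, 308: 16}


def min_correct(questions: int, passing_percent: int = PASSING_PERCENT) -> int:
    """Smallest number of correct answers that reaches the passing percent.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    return (questions * passing_percent + 99) // 100


if __name__ == "__main__":
    for module, count in sorted(QUIZ_QUESTIONS.items()):
        print(f"Module {module}: {min_correct(count)} of {count} correct to pass")
```

Under that assumption, a 15-question quiz needs 12 correct answers, a 16-question quiz needs 13, and the 18-question quiz needs 15.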

Project Assessments

Projects are evaluated on four weighted dimensions:

1. Architecture Quality (40%)

  • Completeness of architecture design
  • Appropriate use of patterns and practices
  • Scalability and performance considerations
  • Cost-effectiveness
  • Security and compliance

2. Documentation (30%)

  • Clarity and completeness of documentation
  • Quality of architecture diagrams
  • ADRs with clear rationale
  • Reference architectures
  • Stakeholder-appropriate communication

3. Strategic Thinking (20%)

  • Alignment with business objectives
  • Long-term vision and roadmap
  • Risk assessment and mitigation
  • Innovation and competitive advantage
  • Technology selection rationale

4. Leadership (10%)

  • Stakeholder management approach
  • Governance framework design
  • Consensus building strategy
  • Change management considerations
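
The four weights above combine into a single project score as a weighted average. The sketch below shows the arithmetic, assuming each dimension is scored on a 0-100 scale. The guide gives the weights but not the scale, so the scale and the sample scores are assumptions.

```python
from typing import Dict

# Weights in percent, as listed above.
WEIGHTS = {
    "architecture_quality": 40,
    "documentation": 30,
    "strategic_thinking": 20,
    "leadership": 10,
}


def project_score(scores: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed 0-100 each)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / 100


if __name__ == "__main__":
    sample = {
        "architecture_quality": 90,
        "documentation": 80,
        "strategic_thinking": 70,
        "leadership": 60,
    }
    print(project_score(sample))
```

Because architecture quality carries 40% of the weight, a ten-point change there moves the total four times as much as the same change in leadership.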

Portfolio Development

  • Architecture Artifacts: Minimum 5 comprehensive designs
  • ADRs: 10+ architecture decision records
  • Reference Architectures: 2+ reusable patterns
  • Presentations: Executive-level communication samples
  • Certifications: TOGAF 9 recommended

Skill Development Path

Technical Skills Progression

Entering (Senior Engineer):

  • Advanced Kubernetes and cloud
  • CUDA and GPU optimization
  • Distributed training
  • MLOps platforms

Developing (Through Curriculum):

  • Enterprise architecture frameworks
  • Multi-cloud architecture
  • Security and compliance
  • Cost optimization and FinOps
  • HA/DR architecture design
  • Strategic technology evaluation

Mastered (Architect):

  • TOGAF and enterprise architecture
  • Multi-cloud strategy and design
  • Security architecture
  • Cost optimization at scale
  • HA/DR for mission-critical systems
  • LLM platform architecture
  • Stakeholder management
  • Architecture communication

Soft Skills Progression

Entering:

  • Technical leadership
  • Mentorship
  • Cross-functional collaboration

Developing:

  • Executive communication
  • Strategic thinking
  • Business acumen
  • Stakeholder management
  • Consensus building

Mastered:

  • Executive presence
  • Architecture governance
  • Change management
  • Visionary thinking
  • Organizational influence

Recommended Learning Path

Full-Time Track (10-12 months)

  1. Months 1-2: Modules 301-302 + Project 301
  2. Months 3-4: Modules 303-304 + Project 305
  3. Months 5-6: Module 305 + Project 302
  4. Months 7-8: Modules 306-307 + Project 304
  5. Months 9-10: Module 308 + Project 303
  6. Months 11-12: Modules 309-310 + Portfolio polish
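
When planning against this track, it can help to check each project's listed prerequisites against the month in which those modules are scheduled. The sketch below is one way to do that. It assumes a project may run alongside modules scheduled in the same two-month block, and it encodes only the data given in this guide.

```python
from typing import Dict, List

# First month of each module's block in the full-time track.
MODULE_START: Dict[int, int] = {301: 1, 302: 1, 303: 3, 304: 3, 305: 5,
                                306: 7, 307: 7, 308: 9, 309: 11, 310: 11}

# First month of each project's block in the full-time track.
PROJECT_START: Dict[int, int] = {301: 1, 305: 3, 302: 5, 304: 7, 303: 9}

# Prerequisite modules as listed under Project Integration.
PREREQS: Dict[int, List[int]] = {301: [301, 306], 302: [302, 305],
                                 303: [308, 303], 304: [307, 306],
                                 305: [303, 309]}


def late_prerequisites(project: int) -> List[int]:
    """Prerequisite modules whose block starts after the project's block."""
    start = PROJECT_START[project]
    return [m for m in PREREQS[project] if MODULE_START[m] > start]


if __name__ == "__main__":
    for project in sorted(PROJECT_START):
        late = late_prerequisites(project)
        status = f"later modules: {late}" if late else "ok"
        print(f"Project {project}: {status}")
```

With the data as listed, the check reports Module 306 for Project 301 and Module 309 for Project 305. The other three projects come back empty.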

Part-Time Track (20 months)

  • Pace: 1 module per month, projects in parallel
  • Commitment: 15-20 hours per week
  • Milestones: Review progress every 3 months
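
The stated hours for modules and projects give a quick way to sanity-check a weekly commitment. The sketch below totals the hours listed in this guide and converts them to weeks. It counts only the listed hours and assumes no extra time for quizzes, review, or portfolio work.

```python
# Hours as listed in the module and project headings.
MODULE_HOURS = {301: 50, 302: 60, 303: 55, 304: 45, 305: 50,
                306: 55, 307: 50, 308: 55, 309: 40, 310: 40}
PROJECT_HOURS = {301: 80, 302: 100, 303: 90, 304: 85, 305: 70}

TOTAL_HOURS = sum(MODULE_HOURS.values()) + sum(PROJECT_HOURS.values())


def weeks_needed(hours_per_week: float) -> float:
    """Weeks required to cover the listed hours at a steady weekly pace."""
    if hours_per_week <= 0:
        raise ValueError("hours_per_week must be positive")
    return TOTAL_HOURS / hours_per_week


if __name__ == "__main__":
    print(f"Total listed hours: {TOTAL_HOURS}")
    for pace in (15, 20):
        print(f"{pace} h/week -> {weeks_needed(pace):.1f} weeks")
```

The listed hours total 925 (500 for modules, 425 for projects). At 15-20 hours per week that is roughly 46 to 62 weeks of study time, which fits inside the 20-month track.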

Self-Paced Track

  • Flexibility: Complete at your own pace
  • Recommended: Maintain a consistent study schedule
  • Support: Join the community for accountability

Success Metrics

Knowledge Acquisition

  • ✅ 80%+ on all module quizzes
  • ✅ Completion of all 10 modules
  • ✅ Understanding of TOGAF and EA frameworks

Practical Application

  • ✅ Completion of all 5 projects
  • ✅ Architecture designs meeting acceptance criteria
  • ✅ Portfolio of architecture artifacts

Professional Development

  • ✅ TOGAF 9 certification obtained
  • ✅ Cloud architect certifications (1-2)
  • ✅ Positive peer reviews on architectures
  • ✅ Ready for architect-level interviews

Career Advancement

  • ✅ Promotion to architect role OR
  • ✅ Job offer for architect position
  • ✅ Leading architecture initiatives
  • ✅ Mentoring other architects

Additional Resources

Books

See resources/reading-list.md for a comprehensive list.

Certifications

  • TOGAF 9 Certified (Priority)
  • AWS Solutions Architect – Professional
  • Google Cloud Professional Cloud Architect
  • Microsoft Azure Solutions Architect Expert

Communities

  • TOGAF Community: Open Group forums
  • Cloud Architecture: AWS, GCP, Azure communities
  • MLOps: Kubeflow, MLflow communities
  • FinOps Foundation: FinOps community

Next Steps

  1. Review Prerequisites: Ensure you meet all prerequisites
  2. Set Up Environment: Configure tools and accounts
  3. Start Module 301: Begin with enterprise architecture fundamentals
  4. Join Community: Connect with fellow learners
  5. Plan Your Journey: Choose full-time, part-time, or self-paced track

Questions? Open a discussion on GitHub or email ai-infra-curriculum@joshua-ferguson.com