This curriculum builds the skills needed to architect enterprise-scale AI infrastructure. Designed for senior engineers moving into architect roles, it covers enterprise architecture, multi-cloud design, security, cost optimization, and strategic technology leadership.
- Theory + Practice: Blend enterprise architecture frameworks with hands-on design
- Progressive Complexity: Build from fundamentals to enterprise-scale systems
- Real-World Focus: All content based on actual industry requirements
- Portfolio Building: Each project creates artifacts for a professional portfolio
- Continuous Evaluation: Quizzes after each module
- Project-Based Assessment: Comprehensive architecture designs
- Peer Review: Optional peer feedback on architecture decisions
- Portfolio Development: Documentation and architecture artifacts
Upon completing this curriculum, you will be able to:
- Define technical strategy and multi-year roadmaps for AI infrastructure
- Establish architectural patterns and standards across organizations
- Lead technology evaluation and vendor selection processes
- Design cost-optimization strategies for large-scale AI workloads
- Create governance frameworks for ML systems
- Design end-to-end enterprise AI/ML platform architectures
- Architect multi-cloud and hybrid AI deployment solutions
- Design high-availability, fault-tolerant ML systems (99.95%+ uptime)
- Create comprehensive security and compliance architectures
- Architect scalable feature stores and real-time data platforms
- Design enterprise LLM platforms with RAG capabilities
- Implement model lifecycle management and governance
- Design disaster recovery and business continuity strategies
- Lead cross-functional architecture initiatives
- Collaborate effectively with business leaders on requirements
- Mentor architects and senior engineers across organizations
- Drive architectural improvements across teams
- Lead architecture review boards and governance processes
- Communicate architecture effectively to all stakeholders
Focus: TOGAF, ADM, governance, stakeholder management
Learning Outcomes:
- Apply enterprise architecture frameworks (TOGAF ADM)
- Design enterprise-scale system architectures
- Create comprehensive architecture documentation
- Lead architecture governance processes
Topics:
- Introduction to enterprise architecture
- TOGAF framework and Architecture Development Method (ADM)
- Zachman Framework overview
- Business architecture and value streams
- Architecture viewpoints and perspectives
- Architecture patterns for enterprise systems
- Reference architectures and blueprints
- Architecture governance and review boards
- Architecture Decision Records (ADRs)
- Stakeholder management for architects
Key Activities:
- Apply TOGAF ADM to ML platform design
- Create comprehensive architecture documentation
- Design reference architecture for AI platform
- Lead architecture review session
- Develop architecture decision framework
- Create stakeholder communication materials
Assessment:
- Quiz: 15 questions on EA concepts and TOGAF
- Practical: Design enterprise AI platform architecture
- Deliverable: Architecture documentation with diagrams
Focus: Multi-cloud strategy, hybrid cloud, vendor management
Learning Outcomes:
- Design comprehensive multi-cloud ML architectures
- Architect hybrid cloud solutions
- Optimize cloud vendor selection and strategy
- Design cloud migration and integration patterns
Topics:
- Multi-cloud strategy: when, why, and how
- Cloud vendor comparison: AWS vs GCP vs Azure
- Hybrid cloud architecture patterns
- Cloud-agnostic design principles
- Multi-cloud networking and connectivity
- Data residency and sovereignty considerations
- Multi-cloud cost optimization strategies
- Vendor lock-in mitigation techniques
- Cloud migration strategies and planning
- Multi-cloud management and governance
Key Activities:
- Design multi-cloud ML platform architecture
- Create cloud vendor selection framework
- Design hybrid cloud integration architecture
- Develop multi-cloud cost model
- Create cloud migration roadmap
- Build multi-cloud governance framework
Assessment:
- Quiz: 15 questions on multi-cloud concepts
- Architecture design: Multi-cloud ML platform
- Case study: Vendor selection analysis
Focus: Zero-trust, compliance frameworks, data governance
Learning Outcomes:
- Design comprehensive security architectures for ML systems
- Architect compliance frameworks for regulated industries
- Implement zero-trust architectures at scale
- Design data governance and privacy frameworks
Topics:
- Enterprise security architecture principles
- Zero-trust architecture for ML platforms
- Identity and access management (IAM) at scale
- Data encryption and key management
- Compliance frameworks: GDPR, HIPAA, SOC 2, ISO 27001
- AI-specific regulations: EU AI Act, US frameworks
- Data governance and lineage
- Privacy-preserving ML (federated learning, differential privacy)
- Security audit and penetration testing
- Incident response and disaster recovery planning
Key Activities:
- Design zero-trust architecture for ML platform
- Create compliance framework for healthcare ML
- Design data governance architecture
- Implement privacy-preserving ML architecture
- Conduct security architecture review
- Create incident response playbooks
Assessment:
- Quiz: 18 questions on security and compliance
- Architecture: Secure ML platform for regulated industry
- Deliverable: Compliance framework documentation
Focus: TCO, cost allocation, FinOps practices
Learning Outcomes:
- Design cost-optimized ML infrastructure
- Implement FinOps practices and frameworks
- Create cost allocation and chargeback models
- Optimize total cost of ownership (TCO)
Topics:
- FinOps principles and frameworks
- Cloud cost optimization strategies
- GPU cost optimization techniques
- Reserved capacity vs spot vs on-demand strategies
- Cost allocation and tagging strategies
- Chargeback and showback models
- TCO analysis for ML infrastructure
- Cost monitoring and anomaly detection
- Resource right-sizing and optimization
- Cost governance and budgeting
Key Activities:
- Conduct TCO analysis for ML platform
- Design cost optimization architecture
- Create cost allocation framework
- Build cost monitoring and alerting
- Implement automated cost optimization
- Develop FinOps governance model
Assessment:
- Quiz: 12 questions on FinOps
- Practical: Cost-optimized architecture design
- Case study: TCO analysis
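
The reserved vs spot vs on-demand tradeoff listed in the topics above can be illustrated with a small blended-cost calculation. This is a minimal sketch, not a pricing model from the curriculum: the `PricingMix` class, the `monthly_gpu_cost` function, and every rate, discount, and overhead figure are hypothetical placeholders to be replaced with real vendor pricing.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingMix:
    """Share of GPU-hours bought under each purchasing model (must sum to 1)."""
    on_demand: float
    reserved: float
    spot: float


def monthly_gpu_cost(
    gpu_hours: float,
    on_demand_rate: float,
    mix: PricingMix,
    reserved_discount: float,
    spot_discount: float,
    spot_rework_overhead: float = 0.0,
) -> float:
    """Blended monthly cost for a workload split across purchasing models.

    All inputs are illustrative assumptions:
    - discounts are fractions off the on-demand rate (0.4 means 40% cheaper)
    - spot_rework_overhead is the extra fraction of spot hours spent redoing
      work lost to interruptions (0.1 means 10% more spot hours)
    """
    if not math.isclose(mix.on_demand + mix.reserved + mix.spot, 1.0):
        raise ValueError("mix shares must sum to 1")

    on_demand_cost = gpu_hours * mix.on_demand * on_demand_rate
    reserved_cost = gpu_hours * mix.reserved * on_demand_rate * (1 - reserved_discount)
    # Interrupted spot jobs repeat some work, so they consume extra hours.
    spot_hours = gpu_hours * mix.spot * (1 + spot_rework_overhead)
    spot_cost = spot_hours * on_demand_rate * (1 - spot_discount)
    return on_demand_cost + reserved_cost + spot_cost
```

With the made-up inputs of 10,000 GPU-hours at a rate of 2.00 per hour, a 20/50/30 on-demand/reserved/spot mix, a 40% reserved discount, a 70% spot discount, and 10% spot rework overhead, the blended cost comes to 11,980 against 20,000 for an all-on-demand baseline. The rework term is the one worth stress-testing: a workload that checkpoints poorly can lose much of the nominal spot discount.
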
Focus: 99.95%+ uptime, DR planning, chaos engineering
Learning Outcomes:
- Design highly available ML systems (99.95%+ uptime)
- Architect disaster recovery and business continuity solutions
- Implement fault tolerance and resilience patterns
- Design chaos engineering frameworks
Topics:
- High-availability architecture patterns
- Fault tolerance and resilience in ML systems
- Disaster recovery planning and strategies
- RPO and RTO requirements for ML systems
- Multi-region active-active architectures
- Backup and restore strategies for ML systems
- Chaos engineering principles and practices
- Failure mode and effects analysis (FMEA)
- Circuit breakers and bulkheads
- Health checks and self-healing systems
Key Activities:
- Design HA architecture for mission-critical ML
- Create DR plan and runbooks
- Implement chaos engineering experiments
- Design multi-region failover architecture
- Conduct failure mode analysis
- Build self-healing system capabilities
Assessment:
- Quiz: 15 questions on HA/DR
- Architecture: 99.95%+ uptime ML platform
- Deliverable: DR plan and testing procedures
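
An uptime target is easier to reason about once it is converted into a downtime budget. The sketch below shows that arithmetic. The function names are illustrative, the year is taken as 365 days, and the serial-chain formula assumes component failures are independent, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_budget_minutes(
    availability_pct: float, period_minutes: float = MINUTES_PER_YEAR
) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


def serial_availability(*component_pcts: float) -> float:
    """Availability (%) of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    result = 1.0
    for pct in component_pcts:
        result *= pct / 100
    return result * 100
```

At 99.95% the budget is about 262.8 minutes per year, or roughly 4.4 hours. Two 99.95% components in series yield about 99.90% combined, which is why a platform-level target usually requires each dependency to exceed it or to be deployed redundantly.
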
Focus: Model governance, feature stores, real-time serving
Learning Outcomes:
- Architect enterprise-scale MLOps platforms
- Design model governance and lifecycle management
- Create feature engineering platforms
- Architect real-time ML serving systems
Topics:
- Enterprise MLOps platform requirements
- Model lifecycle management architecture
- Model governance and compliance frameworks
- Feature store architecture at scale
- Real-time feature serving
- Model registry and artifact management
- Automated ML pipeline orchestration
- Model monitoring and observability architecture
- A/B testing and experimentation platforms
- ML platform developer experience (DevEx)
Key Activities:
- Design enterprise MLOps platform architecture
- Create model governance framework
- Design feature store for real-time serving
- Architect experimentation platform
- Build ML platform API and developer portal
- Create MLOps maturity assessment
Assessment:
- Quiz: 16 questions on MLOps architecture
- Architecture: Enterprise ML platform
- Deliverable: Governance framework
Focus: Data lakehouse, governance, real-time streaming
Learning Outcomes:
- Design data lake and lakehouse architectures
- Architect real-time streaming data platforms
- Create data governance frameworks
- Design data quality and validation systems
Topics:
- Data architecture patterns: lake, warehouse, lakehouse, mesh
- Real-time streaming architectures (Kafka, Kinesis, Pub/Sub)
- Batch processing architectures (Spark, Flink)
- Data governance and stewardship
- Data catalog and metadata management
- Data quality frameworks
- Data lineage and provenance
- Schema evolution and management
- Data versioning strategies
- Privacy and compliance in data architecture
Key Activities:
- Design data lakehouse architecture for ML
- Create real-time streaming data platform
- Implement data governance framework
- Design data quality monitoring system
- Build data catalog and lineage tracking
- Create data architecture documentation
Assessment:
- Quiz: 15 questions on data architecture
- Architecture: Enterprise data platform for AI
- Deliverable: Data governance framework
Focus: Enterprise LLM, RAG at scale, governance
Learning Outcomes:
- Architect enterprise LLM platforms
- Design RAG (Retrieval-Augmented Generation) systems at scale
- Create LLM governance and safety frameworks
- Optimize LLM infrastructure for cost and performance
Topics:
- LLM platform architecture patterns
- Model selection and evaluation frameworks
- LLM inference optimization at scale
- RAG architecture: indexing, retrieval, generation
- Vector database architecture (Pinecone, Weaviate, Milvus)
- LLM orchestration frameworks (LangChain, LlamaIndex)
- Prompt engineering and management
- LLM safety and guardrails
- LLM observability and monitoring
- Fine-tuning vs RAG vs prompt engineering tradeoffs
Key Activities:
- Design enterprise LLM platform architecture
- Architect scalable RAG system
- Create LLM governance framework
- Design vector database architecture
- Build LLM cost optimization strategy
- Implement LLM safety and monitoring
Assessment:
- Quiz: 16 questions on LLM architecture
- Architecture: Enterprise LLM platform with RAG
- Case study: Cost-performance optimization
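
The indexing, retrieval, and generation stages of RAG named in the topics above can be sketched end to end in a few lines. This is a toy under stated assumptions: word counts stand in for an embedding model, the in-memory `TinyIndex` stands in for a vector database, and `build_prompt` stops short of calling any LLM. None of these names come from the curriculum or from a real library.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyIndex:
    """In-memory stand-in for a vector database (hypothetical)."""

    def __init__(self) -> None:
        self._docs: List[Tuple[str, Counter]] = []

    def add(self, doc: str) -> None:
        # Indexing: store each passage alongside its vector.
        self._docs.append((doc, embed(doc)))

    def search(self, query: str, k: int = 2) -> List[str]:
        # Retrieval: rank every stored passage by similarity to the query.
        q = embed(query)
        ranked = sorted(self._docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, passages: List[str]) -> str:
    """Generation input: ground the model by placing retrieved passages in the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

The brute-force scan in `search` is linear in corpus size. At enterprise scale that step is where approximate nearest-neighbor indexes and the vector databases listed above come in, and it is usually the first component to need its own capacity and cost model.
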
Focus: Executive communication, stakeholder management, ADRs
Learning Outcomes:
- Communicate architecture effectively to stakeholders
- Present to executive leadership and boards
- Lead architecture governance processes
- Build consensus across diverse stakeholder groups
Topics:
- Architecture communication strategies
- Executive presentation skills
- Creating compelling architecture narratives
- Visual communication and diagramming
- Architecture documentation best practices
- Leading architecture review boards
- Building consensus and managing conflicts
- Influencing without authority
- Stakeholder analysis and management
- Change management for architecture initiatives
Key Activities:
- Create executive-level architecture presentation
- Develop architecture vision and roadmap
- Lead architecture review session
- Build stakeholder engagement plan
- Create compelling architecture diagrams
- Facilitate architecture decision workshop
Assessment:
- Presentation: Executive architecture briefing
- Documentation: Architecture artifacts review
- Scenario: Leadership exercises
Focus: Future tech evaluation, innovation frameworks, roadmapping
Learning Outcomes:
- Evaluate emerging AI technologies
- Design innovation frameworks
- Assess technology trends and impact
- Create technology roadmaps
Topics:
- Emerging AI hardware: TPUs, custom ASICs, neuromorphic
- Edge AI and distributed intelligence
- Federated learning architectures
- Quantum computing for AI (awareness)
- Green AI and sustainable computing
- Responsible AI and ethics frameworks
- Technology evaluation frameworks
- Innovation management and R&D processes
- Technology radar and trend analysis
- Build vs buy vs partner decision frameworks
Key Activities:
- Evaluate emerging technology for adoption
- Create technology radar for AI infrastructure
- Design innovation pilot program
- Build technology evaluation framework
- Develop multi-year technology roadmap
- Assess responsible AI framework
Assessment:
- Report: Technology evaluation
- Framework: Innovation program design
- Presentation: Technology roadmap
Prerequisites: Modules 301, 306; Deliverables: Platform architecture, ADRs, governance framework
Prerequisites: Modules 302, 305; Deliverables: Multi-cloud design, HA/DR plan, cost model
Prerequisites: Modules 308, 303; Deliverables: LLM architecture, governance, cost strategy
Prerequisites: Modules 307, 306; Deliverables: Lakehouse architecture, governance, lineage
Prerequisites: Modules 303, 309; Deliverables: Security architecture, compliance docs, playbooks
- Passing Score: 80% minimum
- Format: Multiple choice, scenario-based questions
- Retakes: Unlimited with randomized question pools
- Time Limit: 45-60 minutes per quiz
Evaluated on four dimensions:
- Completeness of architecture design
- Appropriate use of patterns and practices
- Scalability and performance considerations
- Cost-effectiveness
- Security and compliance
- Clarity and completeness of documentation
- Quality of architecture diagrams
- ADRs with clear rationale
- Reference architectures
- Stakeholder-appropriate communication
- Alignment with business objectives
- Long-term vision and roadmap
- Risk assessment and mitigation
- Innovation and competitive advantage
- Technology selection rationale
- Stakeholder management approach
- Governance framework design
- Consensus building strategy
- Change management considerations
- Architecture Artifacts: Minimum 5 comprehensive designs
- ADRs: 10+ architecture decision records
- Reference Architectures: 2+ reusable patterns
- Presentations: Executive-level communication samples
- Certifications: TOGAF 9 recommended
Entering (Senior Engineer):
- Advanced Kubernetes and cloud
- CUDA and GPU optimization
- Distributed training
- MLOps platforms
Developing (Through Curriculum):
- Enterprise architecture frameworks
- Multi-cloud architecture
- Security and compliance
- Cost optimization and FinOps
- HA/DR architecture design
- Strategic technology evaluation
Mastered (Architect):
- TOGAF and enterprise architecture
- Multi-cloud strategy and design
- Security architecture
- Cost optimization at scale
- HA/DR for mission-critical systems
- LLM platform architecture
- Stakeholder management
- Architecture communication
Entering:
- Technical leadership
- Mentorship
- Cross-functional collaboration
Developing:
- Executive communication
- Strategic thinking
- Business acumen
- Stakeholder management
- Consensus building
Mastered:
- Executive presence
- Architecture governance
- Change management
- Visionary thinking
- Organizational influence
- Months 1-2: Modules 301-302 + Project 301
- Months 3-4: Modules 303-304 + Project 305
- Months 5-6: Module 305 + Project 302
- Months 7-8: Modules 306-307 + Project 304
- Months 9-10: Module 308 + Project 303
- Months 11-12: Modules 309-310 + Portfolio polish
- Pace: 1 module per month, projects in parallel
- Commitment: 15-20 hours per week
- Milestones: Review progress every 3 months
- Flexibility: Complete at your own pace
- Recommended: Maintain consistent study schedule
- Support: Join community for accountability
- ✅ 80%+ on all module quizzes
- ✅ Completion of all 10 modules
- ✅ Understanding of TOGAF and EA frameworks
- ✅ Completion of all 5 projects
- ✅ Architecture designs meeting acceptance criteria
- ✅ Portfolio of architecture artifacts
- ✅ TOGAF 9 certification obtained
- ✅ Cloud architect certifications (1-2)
- ✅ Positive peer reviews on architectures
- ✅ Ready for architect-level interviews
- ✅ Promotion to architect role OR
- ✅ Job offer for architect position
- ✅ Leading architecture initiatives
- ✅ Mentoring other architects
See resources/reading-list.md for a comprehensive list
- TOGAF 9 Certified (Priority)
- AWS Solutions Architect – Professional
- Google Cloud Professional Cloud Architect
- Microsoft Azure Solutions Architect Expert
- TOGAF Community: Open Group forums
- Cloud Architecture: AWS, GCP, Azure communities
- MLOps: Kubeflow, MLflow communities
- FinOps Foundation: FinOps community
- Review Prerequisites: Ensure you meet all prerequisites
- Set Up Environment: Configure tools and accounts
- Start Module 301: Begin with enterprise architecture fundamentals
- Join Community: Connect with fellow learners
- Plan Your Journey: Choose a full-time, part-time, or self-paced track
Questions? Open a discussion on GitHub or email ai-infra-curriculum@joshua-ferguson.com