A comprehensive reading list for developing enterprise-scale AI infrastructure architecture skills.
-
TOGAF 9 Foundation Study Guide by The Open Group
- Focus: Enterprise architecture framework and ADM
- Level: Essential for architects
- ISBN: 978-9087537203
- Link: The Open Group
-
Software Architecture in Practice (4th Edition) by Len Bass, Paul Clements, Rick Kazman
- Focus: Architecture patterns, quality attributes, documentation
- Level: Advanced
- ISBN: 978-0136886099
-
Fundamentals of Software Architecture by Mark Richards, Neal Ford
- Focus: Modern architecture patterns and practices
- Level: Intermediate to Advanced
- ISBN: 978-1492043454
-
Building Evolutionary Architectures by Neal Ford, Rebecca Parsons, Patrick Kua
- Focus: Designing for change and evolution
- Level: Advanced
- ISBN: 978-1491986363
-
Cloud Architecture Patterns by Bill Wilder
- Focus: Scalability, reliability, and security in the cloud
- Level: Intermediate
- ISBN: 978-1449319779
-
Architecting the Cloud: Design Decisions for Cloud Computing Service Models by Michael J. Kavis
- Focus: Cloud strategy and decision-making
- Level: Advanced
- ISBN: 978-1118617618
-
Designing Machine Learning Systems by Chip Huyen
- Focus: End-to-end ML systems design
- Level: Advanced
- ISBN: 978-1098107963
- Essential: Highly recommended for ML architects
-
Machine Learning Engineering by Andriy Burkov
- Focus: Practical ML engineering and infrastructure
- Level: Intermediate to Advanced
- ISBN: 978-1999579579
-
Reliable Machine Learning by Cathy Chen, Niall Murphy, et al.
- Focus: SRE principles for ML systems
- Level: Advanced
- ISBN: 978-1098106225
-
Building Machine Learning Powered Applications by Emmanuel Ameisen
- Focus: Practical ML product development
- Level: Intermediate
- ISBN: 978-1492045113
-
Kubernetes in Action (2nd Edition) by Marko Lukša
- Focus: Comprehensive Kubernetes guide
- Level: Intermediate
- ISBN: 978-1617297724
-
Kubernetes Patterns by Bilgin Ibryam, Roland Huß
- Focus: Design patterns for cloud-native apps
- Level: Advanced
- ISBN: 978-1492050285
-
Cloud Native DevOps with Kubernetes by John Arundel, Justin Domingus
- Focus: DevOps practices on Kubernetes
- Level: Intermediate
- ISBN: 978-1492040767
-
Data Mesh by Zhamak Dehghani
- Focus: Decentralized data architecture
- Level: Advanced
- ISBN: 978-1492092391
- Relevance: Modern data architecture thinking
-
Designing Data-Intensive Applications by Martin Kleppmann
- Focus: Foundations of data systems
- Level: Advanced
- ISBN: 978-1449373320
- Essential: Must-read for data architects
-
Streaming Systems by Tyler Akidau, Slava Chernyak, Reuven Lax
- Focus: Real-time data processing
- Level: Advanced
- ISBN: 978-1491983874
-
Zero Trust Networks by Evan Gilman, Doug Barth
- Focus: Zero-trust security architecture
- Level: Advanced
- ISBN: 978-1491962190
-
Practical Cloud Security by Chris Dotson
- Focus: Cloud security best practices
- Level: Intermediate to Advanced
- ISBN: 978-1492037521
-
Cloud FinOps by J.R. Storment, Mike Fuller
- Focus: Financial operations for the cloud
- Level: Intermediate
- ISBN: 978-1492054627
- Essential: For cost-conscious architects
-
The Software Architect Elevator by Gregor Hohpe
- Focus: Navigating organizational levels as an architect
- Level: Advanced
- Essential: For architect-level communication
-
Talking with Tech Leads by Patrick Kua
- Focus: Technical leadership
- Level: Intermediate to Advanced
- ISBN: 978-1494745035
-
Hidden Technical Debt in Machine Learning Systems - Google (NIPS 2015)
- Essential understanding of ML systems complexity
-
Machine Learning: The High-Interest Credit Card of Technical Debt - Google (2014)
- ML technical debt and maintenance
-
Challenges in Deploying Machine Learning: A Survey of Case Studies - Cambridge (2020)
- Real-world ML deployment challenges
-
The Google File System - Google (SOSP 2003)
- Foundational distributed storage
-
MapReduce: Simplified Data Processing on Large Clusters - Google (OSDI 2004)
- Distributed data processing
-
The Chubby Lock Service for Loosely-Coupled Distributed Systems - Google (OSDI 2006)
- Distributed coordination
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022)
- LLM optimization techniques
-
Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM paper, 2023)
- Modern LLM serving
-
LoRA: Low-Rank Adaptation of Large Language Models (2021)
- Efficient LLM fine-tuning
-
AWS Well-Architected Framework - Amazon Web Services
- Cloud architecture best practices
-
Google Cloud Architecture Framework
- GCP architecture principles
-
Microsoft Azure Well-Architected Framework
- Azure architecture guidance
-
Uber's Michelangelo: ML Platform at Uber
-
Netflix's ML Infrastructure: Notebook to Production
-
Airbnb's ML Infrastructure: Bighead
-
Meta's ML Infrastructure: PyTorch at Scale
-
Martin Fowler's Architecture Blog
- Architecture patterns and practices
-
High Scalability Blog
- Architecture of large-scale systems
-
FinOps Foundation Resources
- FinOps frameworks and practices
-
The 10 Commandments of Cost Optimization - Corey Quinn
- Practical cost optimization
-
TOGAF 9 Foundation and Certified - The Open Group
- Priority: Highest for architects
- Duration: 40-80 hours of study
- Link: The Open Group
-
AWS Certified Solutions Architect – Professional
- Provider: Amazon Web Services
- Duration: 3-6 months preparation
- Link: AWS Certification
-
Google Cloud Professional Cloud Architect
- Provider: Google Cloud
- Duration: 3-6 months preparation
- Link: GCP Certification
-
Microsoft Certified: Azure Solutions Architect Expert
- Provider: Microsoft
- Duration: 3-6 months preparation
- Link: Azure Certification
-
Certified Kubernetes Administrator (CKA)
- Provider: CNCF / Linux Foundation
- Duration: 2-3 months preparation
- Link: CNCF Certification
-
Certified Kubernetes Security Specialist (CKS)
- Provider: CNCF / Linux Foundation
- Duration: 2-3 months preparation (requires CKA)
- Link: CNCF Certification
-
CISSP (Certified Information Systems Security Professional)
- Provider: ISC2
- Duration: 6-12 months preparation
- Link: ISC2 CISSP
-
FinOps Certified Practitioner
- Provider: FinOps Foundation
- Duration: 1-2 months preparation
- Link: FinOps Foundation
-
KubeCon + CloudNativeCon - CNCF
- Focus: Kubernetes and cloud-native
- Link: YouTube
-
AWS re:Invent - Amazon Web Services
- Focus: AWS architecture and services
- Link: YouTube
-
Google Cloud Next - Google Cloud
- Focus: GCP architecture and ML
- Link: YouTube
-
MLOps Community - MLOps Talks
- Focus: ML operations and infrastructure
- Link: YouTube
- TechWorld with Nana - DevOps and Kubernetes
- Tech Lead Journal - Architecture interviews
- InfoQ - Software architecture and engineering
- GOTO Conferences - Software architecture talks
-
Software Engineering Radio
- Focus: Software architecture and engineering
- Frequency: Weekly
- Link: SE Radio
-
The Changelog
- Focus: Open source and software development
- Frequency: Weekly
-
Google Cloud Podcast
- Focus: Cloud architecture and GCP
-
AWS Podcast
- Focus: Cloud architecture and AWS
-
Kubernetes Podcast from Google
- Focus: Kubernetes and cloud-native
- TOGAF Community - Open Group Forums
- Cloud Architecture - AWS Forums, GCP Community, Azure Community
- Kubernetes - discuss.kubernetes.io
- MLOps Community - mlops.community
- FinOps Foundation - finops.org/community
- r/enterprisearchitecture - Enterprise architecture discussions
- r/cloudarchitecture - Cloud architecture patterns
- r/kubernetes - Kubernetes and cloud-native
- r/MachineLearning - ML research and engineering
- r/devops - DevOps and infrastructure
- TOGAF 9.2 - Enterprise Architecture Framework
- Zachman Framework - Enterprise Architecture Framework
- ITIL 4 - IT Service Management
- NIST Cybersecurity Framework - Security standards
- ISO/IEC 27001 - Information security management
- GDPR - Data protection regulation (EU)
- HIPAA - Healthcare data protection (US)
- SOC 2 - Security and availability standards
- TOGAF 9 Foundation Study Guide
- Designing Machine Learning Systems (Chip Huyen)
- Software Architecture in Practice
- Cloud FinOps
- Reliable Machine Learning
- The Software Architect Elevator
- Research papers on emerging technologies
- Cloud provider whitepapers and case studies
- Industry and company engineering blogs
- Conference presentations
- Standards and framework updates
- Follow tech blogs from major companies (Google, Meta, Netflix, Uber, Airbnb)
- Subscribe to architecture newsletters
- Monitor GitHub trending repositories
- Read selected research papers
- Attend local meetups or webinars
- Review cloud provider updates
- Deep dive into one new technology area
- Update certifications or complete an online course
- Review and update personal architecture knowledge base
- Attend a major conference (KubeCon, re:Invent, etc.)
- Renew a certification or complete a new one
- Read 3-5 books from the reading list
- Publish or present on an architecture topic
Note: This reading list is continuously updated. Last updated: 2025-10-14
Suggestions: Have a book or resource to add? Open a pull request or issue!