diff --git a/content/posts/3-infrastructure-decisions-engineering-velocity.md b/content/posts/3-infrastructure-decisions-engineering-velocity.md new file mode 100644 index 00000000..491f771d --- /dev/null +++ b/content/posts/3-infrastructure-decisions-engineering-velocity.md @@ -0,0 +1,572 @@ +--- +title: "The 3 Infrastructure Decisions That Determine Your Engineering Velocity" +excerpt: "Provisioning model, environment strategy, and deployment surface. Everything else is optimization. Here's how to make these foundational choices without killing your team's momentum." +category: + name: "DevOps" + slug: "devops" +coverImage: "/images/posts/3-infrastructure-decisions-engineering-velocity.png" +ogImage: "/images/posts/3-infrastructure-decisions-engineering-velocity.png" +date: "2026-03-06" +publishedAt: "2026-03-06T10:00:00Z" +updatedAt: "2026-03-06T10:00:00Z" +readingTime: "11 min read" +author: + name: "DevOps Daily Team" + slug: "devops-daily-team" +tags: + - Engineering Velocity + - Infrastructure Strategy + - Team Productivity + - Platform Engineering + - DevOps +--- + +## TLDR + +- **Three decisions matter most**: Provisioning model (how you create infrastructure), environment strategy (dev/staging/prod topology), and deployment surface (where code runs) +- **Provisioning**: Manual ops → 2-3 days per change. Scripted → 4-8 hours. Terraform → 30-60 minutes. Platform-abstracted → 5-10 minutes +- **Environments**: Production-only → fast but risky. Dev/staging/prod → safe but slow (30-90 min deploys). Ephemeral per-PR → fast AND safe (5-15 min feedback loops) +- **Deployment surface**: Managed platforms = fastest (minutes to production), VMs = moderate (hours to days), Kubernetes = slowest (weeks to months for first deploy) +- **The velocity tax**: Each additional approval gate adds 15-45 minutes per deployment. At 40 deploys/day across a 20-person team, that's ~$31K/month in approval overhead alone +- **Decision principle**: Choose the simplest option that meets your requirements. Complexity kills velocity faster than any other factor + +--- + +## The Infrastructure Decisions That Actually Matter + +Engineering teams obsess over monitoring tools, service meshes, and database choices. They spend weeks evaluating container runtimes. They debate GitOps vs traditional CI/CD. + +These decisions matter, but they're **optimizations**. They affect developer experience and operational efficiency, but they don't fundamentally change your team's ability to ship quickly. + +Three infrastructure decisions have outsized impact on velocity: + +1. **Provisioning Model**: How you create, modify, and destroy infrastructure +2. **Environment Strategy**: Your dev/staging/prod topology and how work flows through it +3. **Deployment Surface**: Where your code actually runs + +Get these wrong, and your team will struggle no matter how good your other choices are. Get these right, and you'll ship faster than teams with "better" infrastructure. + +This guide examines each decision, quantifies the velocity impact, and provides frameworks for choosing correctly based on team size and maturity. + +--- + +## Decision 1: Provisioning Model + +**The question**: When you need a new database, load balancer, or storage bucket, how long does it take from decision to usable resource? + +### The Four Provisioning Maturity Levels + +**Level 1: Manual Operations** (2-3 days per change) +- Someone clicks through cloud console UI +- Takes screenshots for documentation +- No repeatability or versioning +- Common in early startups (1-5 engineers) + +**Velocity impact**: Every infrastructure change requires dedicated focus time. Deploying a new service that needs a database, cache, and message queue? That's 3× 2-3 days = **6-9 days of infrastructure work** before your first line of application code runs. + +**Level 2: Scripted Provisioning** (4-8 hours per change) +- Bash scripts or CLI commands +- Some documentation, minimal versioning +- Better than manual, but fragile +- Common in growing startups (5-15 engineers) + +**Velocity impact**: You've eliminated the "where's that button in the console?" tax, but scripts break when cloud APIs change. You'll spend 2-4 hours fixing brittle automation quarterly. At 10 infrastructure changes/month, that's **40-80 hours/month** of provisioning work. + +**Level 3: Infrastructure as Code (Terraform, Pulumi, CloudFormation)** (30-60 minutes per change) +- Declarative configuration +- Version controlled and reviewed +- Plan before apply, state management +- Standard for mid-stage companies (15-50 engineers) + +**Velocity impact**: The goldilocks zone for most teams. Changes are fast enough that infrastructure isn't the bottleneck, but controlled enough that you don't accidentally destroy production. **30-60 minutes from PR to merged infrastructure.** + +**Level 4: Platform-Abstracted** (5-10 minutes per change) +- Developers self-serve through internal platform or managed service +- Infrastructure provisioned automatically based on application config +- Examples: Heroku (provisions DB on `heroku addons:create`), internal IDP with service catalogs +- Practical for larger companies (50+ engineers with platform team) + +**Velocity impact**: Infrastructure becomes invisible. Developers declare "I need PostgreSQL" and get it without thinking about VPCs, security groups, or backup policies. **5-10 minutes from config change to usable resource.** + +### The Provisioning Velocity Tax + +Let's quantify this with a real scenario: Your team builds 2 new microservices per month. Each needs: +- PostgreSQL database +- Redis cache +- S3 bucket for file storage +- Application secrets +- Load balancer/ingress + +**Time to provision per service:** + +``` +Manual Operations: 3 days × 5 resources = 15 days (120 hours) +Scripted: 6 hours × 5 resources = 30 hours +Infrastructure as Code: 45 min × 5 resources = 3.75 hours +Platform-Abstracted: 7 min × 5 resources = 35 minutes +``` + +**Monthly cost** (2 services at $16,700/engineer-month): + +``` +Manual: 240 hours = 1.5 FTE = $25,000/month +Scripted: 60 hours = 0.375 FTE = $6,300/month +IaC: 7.5 hours = 0.047 FTE = $780/month +Platform: 1.2 hours = 0.007 FTE = $120/month +``` + +The difference between manual and IaC is **$24,220/month** in engineer time—nearly a senior engineer's salary. + +### Choosing Your Provisioning Model + +**Stay at Level 1 (Manual)** if: +- You're pre-product-market-fit (1-3 engineers) +- You provision infrastructure less than once/week +- Your total infrastructure is <10 resources + +**Move to Level 2 (Scripted)** when: +- You're provisioning infrastructure 2-3×/week +- Multiple people need to provision similar resources +- You hit 10-30 total infrastructure resources + +**Adopt Level 3 (IaC)** when: +- You have 5+ engineers touching infrastructure +- You need multi-environment (dev/staging/prod) provisioning +- You're spending >10 hours/week on infrastructure changes + +**Build Level 4 (Platform)** when: +- You have 50+ engineers +- You can dedicate 2+ FTEs to platform engineering +- Developer self-service is a bottleneck (>5 infrastructure requests/day) + +**Most teams should aim for Level 3 (IaC) and stop.** Level 4 is only worth the investment at significant scale. + +--- + +## Decision 2: Environment Strategy + +**The question**: How does code flow from development to production, and how many approval gates exist? + +### The Environment Spectrum + +**Pattern 1: Production Only** (Deploy time: 5-15 minutes) + +Developers push directly to production. No staging environment. Works for: +- Very early startups (<5 engineers) +- Teams with comprehensive automated testing +- Low-risk applications (internal tools, content sites) + +**Velocity impact**: Maximum speed. Feature branches merge to main, CI runs, production deploys. **Total time from merge to production: 5-15 minutes.** + +**Risk**: No safety net. Bugs reach users immediately. Requires excellent testing culture and fast rollback capability. + +**Pattern 2: Staging + Production** (Deploy time: 30-90 minutes) + +The industry standard: +1. Merge to main +2. CI deploys to staging +3. Manual QA/smoke tests +4. Promote to production (manual or automated) + +**Velocity impact**: Adds 15-75 minutes of waiting between merge and production deploy. At 10 deploys/day across a 10-person team, that's **2.5-12.5 engineer-hours daily** waiting for staging validation. + +**The hidden cost**: Staging environments drift from production. Database state differs. Traffic patterns don't match. You'll discover production-only bugs monthly, costing **4-8 hours of debugging** each. + +**Pattern 3: Dev + Staging + Production** (Deploy time: 60-180 minutes) + +Common in regulated industries: +1. Develop in local/dev environment +2. Merge deploys to shared dev environment +3. Promote to staging for QA +4. Promote to production after approvals + +**Velocity impact**: Each environment adds 15-45 minutes of wait time. **Total time from commit to production: 60-180 minutes** depending on automation. + +**Cost**: Maintaining 3 environments costs $2,000-$5,000/month in infrastructure alone for typical web applications. Add 5-10 hours/week of "fixing dev environment" work. + +**Pattern 4: Ephemeral PR Environments** (Feedback time: 5-15 minutes) + +Modern approach: +- Each pull request gets isolated environment +- Automated tests + human review happen in PR environment +- Merge to main deploys directly to production +- PR environments destroyed after merge + +**Velocity impact**: Fastest feedback loops. Reviewers see changes running in real environment within **5-15 minutes of pushing code.** No waiting for shared staging environment. + +**Tools**: Vercel/Netlify (frontend), Render Preview Environments, Railway PR Deploys, Kubernetes with Argo CD + preview namespaces + +**Cost**: Variable based on PR volume. Roughly $500-$2,000/month for teams with 20-50 open PRs simultaneously. + +### The Approval Gate Tax + +Every manual approval step adds latency. Let's quantify: + +**Scenario**: 20-engineer team, each engineer deploys 2×/day average +- Total deploys: 40/day +- Each manual approval: 15-30 min average (including context switching for approver) +- Daily cost: 40 × 22.5 min = **900 minutes = 15 hours/day** of combined waiting + approval time +- Monthly cost: 15 hours/day × 20 workdays = 300 hours/month × $104/hour = **~$31,000/month in approval overhead** + +And that's assuming approvals happen within 30 minutes. In practice, approvals often wait hours for the right person to be available, multiplying this cost. + +### Choosing Your Environment Strategy + +**Production-only** if: +- Pre-PMF startup (<5 engineers) +- Excellent automated test coverage (>80%) +- Low user impact from bugs (internal tools, dev tools) +- Fast rollback capability (<5 minutes) + +**Ephemeral PR + Production** if: +- 5-50 engineers +- Modern tooling (React, Next.js, containerized services) +- Fast CI/CD (<10 min test suite) +- Supported by your deployment platform + +**Staging + Production** if: +- Complex integrations requiring manual QA +- Slow test suites (>15 minutes) +- High cost of bugs in production +- You're not ready for ephemeral environments + +**Dev + Staging + Production** if: +- Regulated industry with compliance requirements +- Multiple external integrations that need isolated testing +- Explicitly required by customers/contracts + +**Avoid this unless required.** The velocity tax rarely justifies the additional safety. + +--- + +## Decision 3: Deployment Surface + +**The question**: Where does your application code actually run? + +This is the "hosting choice" decision: managed platform vs VMs vs containers/Kubernetes. + +### The Three Deployment Surface Tiers + +**Tier 1: Managed Platforms** (Time to first production deploy: 1-3 days) + +Examples: Heroku, Render, Railway, Fly.io, DigitalOcean App Platform, Vercel, Netlify + +**What they handle:** +- Runtime environment (Node, Python, Ruby, etc.) +- SSL certificates +- Load balancing +- Log aggregation +- Metrics and health checks +- Deployment pipeline +- Auto-scaling (some platforms) + +**What you handle:** +- Application code +- Database schema migrations +- Environment configuration + +**Velocity impact**: Fastest time-to-production. Typical flow: +1. Connect Git repository (5 minutes) +2. Configure environment variables (10 minutes) +3. Deploy (platform builds and deploys automatically) + +**From zero to production: 1-3 days** for a new service including development time. + +**Cost**: $50-$500/month per service for typical web applications. + +**Trade-off**: Less control over infrastructure. Limited customization of networking, OS-level packages, or deployment strategies. + +**Best for**: Stateless web apps, APIs, background workers, frontend applications. Teams under 30 engineers. + +**Tier 2: Virtual Machines** (Time to first production deploy: 1-2 weeks) + +Examples: EC2, DigitalOcean Droplets, Linode, Azure VMs + +**What you handle:** +- OS configuration and patching +- Runtime installation (Node, Python, etc.) +- Process management (systemd, supervisord) +- SSL via Let's Encrypt/ACM +- Application deployment scripts +- Log shipping to external service +- Monitoring agent installation + +**Velocity impact**: Slower initial setup, but full control. Typical flow: +1. Provision VM and configure networking (2-4 hours) +2. Install runtime and dependencies (1-2 hours) +3. Configure deployment automation (4-8 hours) +4. Set up monitoring/logging (2-4 hours) +5. Security hardening (2-4 hours) + +**From zero to production: 1-2 weeks** including development. + +**Cost**: $20-$200/month per VM depending on size. Additional monitoring/logging costs. + +**Trade-off**: More flexibility, more maintenance. You're responsible for OS patches, security updates, and daemon management. + +**Best for**: Stateful applications (databases, message queues), legacy applications with specific OS requirements, long-running processes. Teams 10-50 engineers. + +**Tier 3: Container Orchestration (Kubernetes)** (Time to first production deploy: 1-3 months) + +Examples: EKS, GKE, AKS, self-managed K8s + +**What you handle:** +- Cluster provisioning and upgrades +- Node pool management +- Networking (CNI, ingress controllers, service mesh) +- Storage (CSI drivers, PVCs) +- Deployment manifests (YAML, Helm charts) +- GitOps tooling (Argo CD, Flux) +- Observability (Prometheus, Grafana) +- Security policies (NetworkPolicies, PodSecurityPolicies) + +**Velocity impact**: Slowest initial setup, highest operational complexity. Typical flow for first service: +1. Provision cluster (4-8 hours with managed K8s) +2. Configure networking/ingress (8-16 hours) +3. Set up CI/CD pipeline (16-32 hours) +4. Create deployment manifests (4-8 hours) +5. Configure monitoring/logging (8-16 hours) +6. Security hardening (8-16 hours) +7. Team training (40-80 hours) + +**From zero to production: 1-3 months** for first service with team ramp-up. + +**Cost**: $200-$1,000/month for managed control plane + nodes. Additional tooling costs (ingress controllers, monitoring, etc.) can add $200-$500/month. + +**Trade-off**: Maximum flexibility and control. Can handle complex deployment strategies, multi-region, advanced networking. Requires dedicated platform engineering time. + +**Best for**: Large organizations (50+ engineers), multi-region requirements, complex microservices architectures, teams with existing K8s expertise. + +### The Deployment Surface Velocity Matrix + +| Factor | Managed Platform | VMs | Kubernetes | +|--------|------------------|-----|------------| +| First deploy | 1-3 days | 1-2 weeks | 1-3 months | +| Subsequent deploys | 5-15 min | 10-30 min | 15-45 min | +| New service onboarding | 1-4 hours | 1-2 days | 3-5 days | +| Team onboarding | 1-2 days | 3-5 days | 2-4 weeks | +| Maintenance burden | <5% FTE | 10-15% FTE | 20-40% FTE | +| Debugging complexity | Low | Medium | High | +| Multi-region support | Limited | Manual | Native | +| Scaling complexity | Automatic | Manual/scripts | Complex but powerful | + +### The Real Cost: Opportunity Cost + +The question isn't "how much does Kubernetes cost?" It's "what could we build with the time we spend on Kubernetes?" + +**Example**: 30-engineer team chooses Kubernetes +- Initial setup: 120 engineer-hours = $10,000 +- Ongoing maintenance: 20% of 2 platform engineers = 0.4 FTE = $6,680/month +- Annual cost: $10,000 + ($6,680 × 12) = **$90,160/year** + +**Alternative**: Same team chooses managed platform +- Initial setup: 8 engineer-hours = $670 +- Ongoing: 5% of 0.5 FTE = $420/month +- Annual cost: $670 + ($420 × 12) = **$5,710/year** + +**Savings**: $84,450/year in engineering time = **~0.5 FTE freed up** for product development + +### Choosing Your Deployment Surface + +**Choose managed platforms** if: +- Team <30 engineers +- Standard web applications/APIs +- No multi-region requirements +- Willing to trade control for velocity + +**Choose VMs** if: +- Specific OS-level dependencies +- Long-running stateful processes +- Predictable load (no need for rapid scaling) +- Team 10-50 engineers + +**Choose Kubernetes** if: +- 50+ engineers with 2+ dedicated platform engineers +- Multi-region deployment required +- Complex networking requirements +- Existing K8s expertise on team + +**Default to the simplest option.** You can always migrate later, but you can't reclaim the engineering time spent on complex infrastructure. + +--- + +## Putting It All Together: Real-World Scenarios + +Let's see how these three decisions compound in real teams: + +### Scenario A: Early Startup (8 Engineers) + +**Infrastructure choices:** +- **Provisioning**: Manual via cloud console (Level 1) +- **Environments**: Production only +- **Deployment**: Render (managed platform) + +**Velocity profile:** +- Deploy frequency: 15-20×/day across team +- Average deploy time: 8 minutes +- Infrastructure changes: 2-3×/week taking 2 hours each +- Team time on infrastructure: ~5% of capacity + +**Result**: **Maximum product velocity.** Team ships features fast. Infrastructure is barely a consideration. + +**When to evolve**: At 15+ engineers or when infrastructure changes hit 5+/week. + +### Scenario B: Growth-Stage Startup (35 Engineers) + +**Infrastructure choices:** +- **Provisioning**: Terraform (Level 3 IaC) +- **Environments**: Ephemeral PR environments + Production +- **Deployment**: Mix of Railway (web services) and DigitalOcean VMs (databases, Redis) + +**Velocity profile:** +- Deploy frequency: 60-80×/day across team +- Average deploy time: 12 minutes +- Infrastructure changes: 10-15×/week taking 45 min each +- Team time on infrastructure: ~10% of capacity +- 1 dedicated platform engineer (not full-time) + +**Result**: **Balanced velocity and control.** Ephemeral environments enable fast feedback. Terraform enables self-service infrastructure for developers. Still shipping quickly without operational burden. + +**When to evolve**: At 60+ engineers or when single-region becomes a scaling bottleneck. + +### Scenario C: Mid-Stage Company (120 Engineers) + +**Infrastructure choices:** +- **Provisioning**: Internal developer platform built on Terraform (Level 4) +- **Environments**: Ephemeral per-PR + Staging + Production +- **Deployment**: Kubernetes (EKS) with Argo CD + +**Velocity profile:** +- Deploy frequency: 200-300×/day across team +- Average deploy time: 18 minutes +- Infrastructure changes: Self-service via platform (5 min each) +- Team time on infrastructure: ~3% of capacity (concentrated in 5-person platform team) +- Platform team: 5 dedicated engineers + +**Result**: **High velocity despite complexity.** Platform team abstracts Kubernetes complexity. Developers self-serve infrastructure. Most teams don't interact with K8s directly. + +**Trade-off**: 5 engineers maintain platform = ~$1M/year. Only justified at this scale. + +### Scenario D: Enterprise (400+ Engineers) + +**Infrastructure choices:** +- **Provisioning**: Multi-cloud IDP with approval workflows (Level 4) +- **Environments**: Dev + Staging + Prod + per-PR ephemeral (for web tier) +- **Deployment**: Multi-region Kubernetes across AWS + GCP + +**Velocity profile:** +- Deploy frequency: 800-1,200×/day across org +- Average deploy time: 25 minutes (includes compliance checks) +- Infrastructure changes: Fully self-service +- Team time on infrastructure: ~2% (concentrated in 20-person platform org) +- Platform org: 20+ engineers, product managers, SREs + +**Result**: **Velocity maintained at scale through automation.** Extensive tooling and process required. Complex infrastructure is justified by organization size. + +**Cost**: Platform team is ~$4M+/year. Only viable at enterprise scale with $50M+ engineering budget. + +--- + +## The Velocity Decision Framework + +When making infrastructure decisions, use this framework: + +### Step 1: Define Your Velocity Target + +What's acceptable for your stage? + +**Seed stage (<10 engineers):** Deploy 10-20×/day, 5-10 min per deploy +**Series A (10-30 engineers):** Deploy 30-60×/day, 8-15 min per deploy +**Series B+ (30-100 engineers):** Deploy 100-200×/day, 10-20 min per deploy +**Enterprise (100+ engineers):** Deploy 200+/day, 15-30 min per deploy + +### Step 2: Calculate Current Velocity Tax + +Measure these: +- Average time from merge to production deploy +- % of engineering time spent on infrastructure work +- # of manual approval gates in deployment pipeline +- Time to provision new infrastructure resources +- Time to onboard new engineer to deployment workflow + +### Step 3: Identify the Highest-Impact Change + +What's your biggest bottleneck? + +**If provisioning takes >2 hours:** Adopt infrastructure as code (Terraform) +**If deployment takes >30 min:** Reduce environment gates or adopt ephemeral environments +**If maintenance >15% FTE:** Simplify deployment surface (consider managed platform) + +### Step 4: Choose Simplicity Over Flexibility + +When in doubt, choose the simpler option: +- Manual provisioning → Scripted → Terraform → Platform (Stop at Terraform unless 50+ engineers) +- Production-only → Staging+Prod → Ephemeral+Prod (Stop at Ephemeral unless regulated) +- Managed Platform → VMs → Kubernetes (Stop at VMs unless 50+ engineers) + +**Complexity is expensive.** Every additional layer costs 10-20% of an engineer's time in maintenance. + +### Step 5: Plan Your Evolution + +Don't optimize for year 3 on day 1. Plan to evolve: + +**Today (5-15 engineers):** +- Manual provisioning or simple scripts +- Production-only or Production+Staging +- Managed platforms + +**Next 12 months (15-30 engineers):** +- Terraform for infrastructure +- Ephemeral PR environments + Production +- Still on managed platforms or considering VMs + +**Next 24 months (30-60 engineers):** +- Terraform + self-service patterns +- Ephemeral + Production (maybe Staging for critical paths) +- Evaluating Kubernetes (but probably don't need it yet) + +**Next 36+ months (60+ engineers):** +- Internal platform (if justified) +- Kubernetes (if multi-region or complex requirements) +- Dedicated platform team (2-5 engineers) + +--- + +## Key Takeaways + +**1. Three decisions determine velocity**: Provisioning model, environment strategy, deployment surface. Everything else is secondary. + +**2. Manual provisioning costs $24K/month** more than Terraform at typical growth-stage scale. Automate infrastructure as code by 10 engineers. + +**3. Each approval gate costs 15-45 minutes** per deployment. At 40 deploys/day, that's $4,000+/month in pure waiting time. + +**4. Managed platforms are fastest** for 90% of applications. Don't adopt Kubernetes before 50 engineers unless you have specific requirements. + +**5. Ephemeral PR environments** provide the best balance of safety and speed for modern development workflows. + +**6. Complexity costs 10-20% engineer capacity** per infrastructure layer. A team on Kubernetes + multi-cloud + 5 environments might spend 40%+ of time on infrastructure maintenance. + +**7. Optimize for iteration speed** over theoretical scalability. You can migrate to more complex infrastructure later. You can't reclaim engineering time spent maintaining unnecessary complexity. + +--- + +## The Bottom Line + +The best infrastructure for your team is **the simplest one that meets your actual requirements.** + +Not the infrastructure your competitors use. Not the infrastructure in conference talks. Not the infrastructure that sounds impressive in blog posts. + +The infrastructure that lets your team ship features quickly, safely, and without operational burden. + +Most teams should: +- **Use Terraform** for infrastructure provisioning (Level 3) +- **Adopt ephemeral PR environments + production** (unless regulated) +- **Deploy to managed platforms** (until 30+ engineers) + +This combination provides: +- **~15 minute** time from merge to production +- **<10% engineer capacity** spent on infrastructure +- **Minimal operational complexity** +- **Easy onboarding** for new team members + +Everything else is optimization. Start simple, evolve as needed, and always measure the velocity tax of complexity before adopting it. diff --git a/public/images/posts/3-infrastructure-decisions-engineering-velocity.svg b/public/images/posts/3-infrastructure-decisions-engineering-velocity.svg new file mode 100644 index 00000000..8d142202 --- /dev/null +++ b/public/images/posts/3-infrastructure-decisions-engineering-velocity.svg @@ -0,0 +1,33 @@ + + + + + + + + + + + + + + + + + + + + + + + + DEVOPS + + + The 3 Infrastructure Decisions +That Determine Your +Engineering Velocity + + + DevOps Daily + \ No newline at end of file