Commit bb02b9b

feat: Add 18 interview questions across all tiers (fixes #784) (#787)
Junior tier (6):
- Cloud regions/availability zones
- Container orchestration basics
- Git workflow strategies
- Shell scripting fundamentals
- Configuration management basics
- YAML/JSON basics

Mid tier (6):
- Service mesh concepts
- Secrets management
- Database backup/recovery
- Log aggregation strategies
- Performance optimization
- GitOps principles

Senior tier (6):
- Multi-cloud architecture
- Compliance and governance
- Capacity planning
- Platform team scaling
- FinOps cost management
- Zero trust architecture

Also adds the skipOgImage flag to the validation script, since interview questions are rendered on tier pages, not individual pages.
1 parent 517538e commit bb02b9b

20 files changed: +886 −15 lines changed
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
{
  "id": "capacity-planning",
  "slug": "capacity-planning",
  "title": "Capacity Planning and Scaling",
  "question": "How do you approach capacity planning for a growing production system? What metrics and strategies do you use?",
  "answer": "Capacity planning ensures systems can handle current and future load. Process: 1) Establish baselines - current CPU, memory, disk, network utilization and request rates. 2) Understand growth patterns - historical trends, seasonality, planned campaigns. 3) Define headroom - typically 30-40% buffer for unexpected spikes. 4) Model scenarios - what happens at 2x, 5x, 10x traffic? 5) Identify bottlenecks - database connections, API rate limits, stateful components. 6) Plan scaling strategy - vertical vs horizontal, auto-scaling policies. 7) Load test regularly. Review capacity quarterly.",
  "explanation": "Capacity planning is both art and science. Too much capacity wastes money; too little causes outages. Cloud auto-scaling helps but doesn't solve everything - databases, third-party APIs, and stateful services often can't scale horizontally. Senior engineers must think about bottlenecks that aren't obvious and plan for Black Friday scenarios before they happen.",
  "category": "SRE",
  "difficulty": "advanced",
  "tier": "senior",
  "tags": [
    "capacity-planning",
    "scaling",
    "sre",
    "performance",
    "architecture"
  ],
  "codeExamples": [
    {
      "language": "yaml",
      "label": "Horizontal Pod Autoscaler",
      "code": "apiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: api-server\nspec:\n  scaleTargetRef:\n    apiVersion: apps/v1\n    kind: Deployment\n    name: api-server\n  minReplicas: 3\n  maxReplicas: 50\n  metrics:\n  - type: Resource\n    resource:\n      name: cpu\n      target:\n        type: Utilization\n        averageUtilization: 70\n  - type: Pods\n    pods:\n      metric:\n        name: requests_per_second\n      target:\n        type: AverageValue\n        averageValue: \"1000\"\n  behavior:\n    scaleUp:\n      stabilizationWindowSeconds: 60\n    scaleDown:\n      stabilizationWindowSeconds: 300"
    },
    {
      "language": "promql",
      "label": "Capacity analysis queries",
      "code": "# PromQL: CPU headroom percentage\n100 - (\n  avg(rate(container_cpu_usage_seconds_total{pod=~\"api-.*\"}[5m]))\n  /\n  avg(kube_pod_container_resource_limits{resource=\"cpu\"})\n) * 100\n\n# Current vs limit memory usage\nsum(container_memory_working_set_bytes{pod=~\"api-.*\"})\n/\nsum(kube_pod_container_resource_limits{resource=\"memory\"})\n\n# Request rate trend (7-day growth)\npredict_linear(rate(http_requests_total[1h])[7d:1h], 30 * 24 * 3600)"
    }
  ],
  "followUpQuestions": [
    "How do you handle capacity planning for stateful services like databases?",
    "What is the difference between scaling up and scaling out?",
    "How do you account for third-party API rate limits in capacity planning?"
  ],
  "commonMistakes": [
    "Only planning for average load, not peak load",
    "Forgetting about dependent services that may become bottlenecks",
    "Not accounting for the time it takes to scale (cold start, provisioning)"
  ],
  "relatedTopics": [
    "auto-scaling",
    "load-testing",
    "performance",
    "reliability"
  ]
}
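The capacity-planning answer above recommends a 30-40% headroom buffer and modeling 2x/5x/10x traffic scenarios. As a rough sketch of that arithmetic (all numbers illustrative, not from the source), the replica estimate looks like:

```python
import math

def replicas_needed(current_replicas: int,
                    current_utilization: float,
                    traffic_multiplier: float,
                    target_utilization: float = 0.65) -> int:
    """Estimate replicas that keep utilization at or below the target
    (a 0.65 target corresponds to roughly 35% headroom)."""
    projected_load = current_replicas * current_utilization * traffic_multiplier
    return math.ceil(projected_load / target_utilization)

if __name__ == "__main__":
    # 10 replicas at 50% CPU today; what does 2x traffic need at a 65% target?
    print(replicas_needed(10, 0.50, 2.0))   # -> 16
```

This is the steady-state math only; as the commonMistakes list notes, real plans must also cover peak (not average) load and the time it takes new capacity to come online.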
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
{
  "id": "cloud-regions-availability-zones",
  "slug": "cloud-regions-availability-zones",
  "title": "Cloud Regions and Availability Zones",
  "question": "What are cloud regions and availability zones? How do they affect application architecture?",
  "answer": "A region is a geographic area containing multiple data centers (e.g., us-east-1). Availability zones (AZs) are isolated data centers within a region, connected by low-latency links. Deploy across multiple AZs for high availability - if one AZ fails, your app continues running. Choose regions based on user proximity (latency), compliance requirements, and service availability.",
  "explanation": "Understanding regions and AZs is fundamental to designing resilient cloud applications. AWS has 30+ regions with 3-6 AZs each. Deploying across AZs provides fault tolerance with minimal latency penalty. Multi-region deployments add disaster recovery but increase complexity and cost.",
  "category": "Cloud",
  "difficulty": "beginner",
  "tier": "junior",
  "tags": [
    "cloud",
    "aws",
    "infrastructure",
    "availability",
    "fundamentals"
  ],
  "codeExamples": [
    {
      "language": "bash",
      "label": "AWS CLI region commands",
      "code": "# List all available regions\naws ec2 describe-regions --output table\n\n# List AZs in current region\naws ec2 describe-availability-zones --output table\n\n# Set default region\nexport AWS_DEFAULT_REGION=us-west-2\n\n# Run command in specific region\naws ec2 describe-instances --region eu-west-1"
    },
    {
      "language": "hcl",
      "label": "Multi-AZ deployment in Terraform",
      "code": "# Get available AZs\ndata \"aws_availability_zones\" \"available\" {\n  state = \"available\"\n}\n\n# Create subnets across AZs\nresource \"aws_subnet\" \"app\" {\n  count             = 2\n  vpc_id            = aws_vpc.main.id\n  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index)\n  availability_zone = data.aws_availability_zones.available.names[count.index]\n}"
    }
  ],
  "followUpQuestions": [
    "What is the difference between high availability and disaster recovery?",
    "How do you handle data replication across availability zones?",
    "What factors influence region selection for a new application?"
  ],
  "commonMistakes": [
    "Deploying everything in a single AZ, creating a single point of failure",
    "Choosing regions only based on cost without considering user latency",
    "Not accounting for data residency and compliance requirements"
  ],
  "relatedTopics": [
    "high-availability",
    "disaster-recovery",
    "load-balancing",
    "multi-region"
  ]
}
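The Terraform example above uses `cidrsubnet(var.vpc_cidr, 8, count.index)` to carve one subnet per AZ. A minimal sketch of the same arithmetic with Python's standard library, using an illustrative VPC CIDR and assumed AZ names:

```python
import ipaddress

def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Mirror Terraform's cidrsubnet(): take the netnum-th subnet after
    adding newbits to the prefix length."""
    net = ipaddress.ip_network(prefix)
    return str(list(net.subnets(prefixlen_diff=newbits))[netnum])

azs = ["us-west-2a", "us-west-2b"]   # assumed AZ names for illustration
for i, az in enumerate(azs):
    # 10.0.0.0/16 with 8 new bits yields /24 subnets: 10.0.0.0/24, 10.0.1.0/24, ...
    print(az, cidrsubnet("10.0.0.0/16", 8, i))
```

Spreading consecutive subnet indices across the AZ list is what gives each AZ its own failure-isolated slice of the VPC address space.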
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
{
  "id": "compliance-governance",
  "slug": "compliance-governance",
  "title": "Compliance and Governance in Cloud",
  "question": "How do you implement compliance and governance controls in a cloud-native environment?",
  "answer": "Cloud compliance involves policies, automation, and audit capabilities. Key components: 1) Policy-as-code (OPA/Gatekeeper, AWS SCPs) to enforce rules automatically. 2) Tagging standards for resource ownership/cost tracking. 3) Centralized logging and audit trails. 4) Network segmentation and least-privilege IAM. 5) Encryption at rest and in transit. 6) Regular compliance scanning (Prowler, ScoutSuite). 7) Change management processes. Frameworks to consider: SOC2, HIPAA, PCI-DSS, GDPR depending on industry. Shift compliance left - catch violations before deployment.",
  "explanation": "Compliance isn't just a checkbox exercise - it's about building secure, auditable systems. Cloud makes compliance both easier (APIs for everything) and harder (velocity of change). Modern approaches embed compliance into CI/CD: policy checks in pipelines, infrastructure scanning, and continuous monitoring. The goal is making compliance the default, not an afterthought.",
  "category": "Security",
  "difficulty": "advanced",
  "tier": "senior",
  "tags": [
    "compliance",
    "governance",
    "security",
    "policy",
    "cloud"
  ],
  "codeExamples": [
    {
      "language": "yaml",
      "label": "OPA Gatekeeper policy",
      "code": "apiVersion: constraints.gatekeeper.sh/v1beta1\nkind: K8sRequiredLabels\nmetadata:\n  name: require-team-label\nspec:\n  match:\n    kinds:\n    - apiGroups: [\"\"]\n      kinds: [\"Namespace\"]\n  parameters:\n    labels:\n    - key: team\n    - key: environment\n    - key: cost-center\n    message: \"All namespaces must have team, environment, and cost-center labels\""
    },
    {
      "language": "json",
      "label": "AWS Service Control Policy",
      "code": "{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Sid\": \"DenyNonApprovedRegions\",\n      \"Effect\": \"Deny\",\n      \"NotAction\": [\n        \"iam:*\",\n        \"organizations:*\",\n        \"support:*\"\n      ],\n      \"Resource\": \"*\",\n      \"Condition\": {\n        \"StringNotEquals\": {\n          \"aws:RequestedRegion\": [\n            \"us-east-1\",\n            \"eu-west-1\"\n          ]\n        }\n      }\n    },\n    {\n      \"Sid\": \"RequireEncryption\",\n      \"Effect\": \"Deny\",\n      \"Action\": \"s3:PutObject\",\n      \"Resource\": \"*\",\n      \"Condition\": {\n        \"Null\": {\n          \"s3:x-amz-server-side-encryption\": \"true\"\n        }\n      }\n    }\n  ]\n}"
    }
  ],
  "followUpQuestions": [
    "How do you handle compliance in a rapidly changing environment with frequent deployments?",
    "What is the difference between preventive and detective controls?",
    "How do you prepare for and conduct compliance audits?"
  ],
  "commonMistakes": [
    "Treating compliance as a one-time project instead of continuous process",
    "Relying solely on manual reviews that can't scale",
    "Not documenting exceptions and their justifications"
  ],
  "relatedTopics": [
    "policy-as-code",
    "opa",
    "soc2",
    "security"
  ]
}
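The Gatekeeper constraint above rejects namespaces missing `team`, `environment`, and `cost-center` labels at admission time. A toy "shift-left" version of the same check, runnable in a CI pipeline before a manifest ever reaches the cluster (label names come from the example; everything else is illustrative):

```python
# Required label keys, matching the Gatekeeper constraint parameters above.
REQUIRED_LABELS = {"team", "environment", "cost-center"}

def missing_labels(manifest: dict) -> set:
    """Return the set of required labels absent from a parsed manifest."""
    labels = manifest.get("metadata", {}).get("labels", {})
    return REQUIRED_LABELS - labels.keys()

# Hypothetical namespace manifest missing two of the three labels.
ns = {"kind": "Namespace",
      "metadata": {"name": "payments", "labels": {"team": "payments"}}}
print(sorted(missing_labels(ns)))   # -> ['cost-center', 'environment']
```

Running the same rule in the pipeline (preventive) and in the cluster (Gatekeeper) is one way to get both the preventive and detective controls the follow-up question contrasts.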
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
{
  "id": "configuration-management-basics",
  "slug": "configuration-management-basics",
  "title": "Configuration Management Basics",
  "question": "What is configuration management? Why is it important and what tools are commonly used?",
  "answer": "Configuration management is the practice of automating and standardizing system configurations across environments. It ensures consistency (all servers configured the same), enables version control of configs, provides audit trails, and reduces manual errors. Common tools: Ansible (agentless, YAML), Puppet (agent-based, DSL), Chef (agent-based, Ruby), and SaltStack. CM enables infrastructure as code and is essential for managing servers at scale.",
  "explanation": "Without configuration management, maintaining consistency across dozens or hundreds of servers becomes impossible. CM tools enforce desired state - if someone manually changes a config, the CM tool reverts it. This is crucial for security compliance and reproducible deployments. Ansible has become dominant due to its simplicity and agentless architecture.",
  "category": "Infrastructure",
  "difficulty": "beginner",
  "tier": "junior",
  "tags": [
    "ansible",
    "configuration-management",
    "iac",
    "automation",
    "fundamentals"
  ],
  "codeExamples": [
    {
      "language": "yaml",
      "label": "Simple Ansible playbook",
      "code": "---\n- name: Configure web servers\n  hosts: webservers\n  become: yes\n  tasks:\n    - name: Install nginx\n      apt:\n        name: nginx\n        state: present\n        update_cache: yes\n\n    - name: Start nginx service\n      service:\n        name: nginx\n        state: started\n        enabled: yes\n\n    - name: Copy config file\n      copy:\n        src: nginx.conf\n        dest: /etc/nginx/nginx.conf\n      notify: Restart nginx\n\n  handlers:\n    - name: Restart nginx\n      service:\n        name: nginx\n        state: restarted"
    },
    {
      "language": "bash",
      "label": "Running Ansible",
      "code": "# Check connectivity to hosts\nansible all -m ping\n\n# Run playbook\nansible-playbook site.yml\n\n# Run with specific inventory\nansible-playbook -i production site.yml\n\n# Dry run (check mode)\nansible-playbook site.yml --check\n\n# Limit to specific hosts\nansible-playbook site.yml --limit webserver1"
    }
  ],
  "followUpQuestions": [
    "What is idempotency and why is it important in configuration management?",
    "What is the difference between push and pull-based configuration management?",
    "When would you use configuration management vs. containers?"
  ],
  "commonMistakes": [
    "Not testing playbooks in staging before production",
    "Hardcoding environment-specific values instead of using variables",
    "Running as root when privilege escalation should be explicit"
  ],
  "relatedTopics": [
    "ansible",
    "infrastructure-as-code",
    "automation",
    "desired-state"
  ]
}
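The first follow-up question above asks about idempotency, the property that makes CM tools safe to rerun. A minimal sketch of a desired-state operation in that style, returning a `changed` flag the way an Ansible module does (the config lines are illustrative):

```python
def ensure_line(lines: list, wanted: str) -> tuple:
    """Ensure `wanted` is present in the config; return (new_lines, changed).
    Running it again against its own output is always a no-op."""
    if wanted in lines:
        return lines, False          # already in desired state
    return lines + [wanted], True    # converge and report the change

# First run converges the config; the second run changes nothing.
config, changed1 = ensure_line(["PermitRootLogin no"], "PasswordAuthentication no")
config, changed2 = ensure_line(config, "PasswordAuthentication no")
print(changed1, changed2)   # -> True False
```

Declaring the end state ("this line exists") rather than the action ("append this line") is what lets a playbook run repeatedly without drift.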
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
{
  "id": "container-orchestration-basics",
  "slug": "container-orchestration-basics",
  "title": "Container Orchestration Basics",
  "question": "What is container orchestration and why do we need it? Name some common orchestration platforms.",
  "answer": "Container orchestration automates container deployment, scaling, networking, and management. It handles: scheduling containers across hosts, load balancing traffic, auto-scaling based on demand, self-healing (restarting failed containers), rolling updates, and service discovery. Popular platforms: Kubernetes (most common), Docker Swarm (simpler), Amazon ECS, and Nomad.",
  "explanation": "Running a few containers manually is easy, but managing hundreds or thousands in production requires orchestration. Kubernetes has become the industry standard, supported by all major cloud providers. Understanding basic orchestration concepts is essential even if you primarily use managed services.",
  "category": "Kubernetes",
  "difficulty": "beginner",
  "tier": "junior",
  "tags": [
    "kubernetes",
    "containers",
    "orchestration",
    "docker",
    "fundamentals"
  ],
  "codeExamples": [
    {
      "language": "yaml",
      "label": "Simple Kubernetes deployment",
      "code": "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: web-app\nspec:\n  replicas: 3  # Run 3 instances\n  selector:\n    matchLabels:\n      app: web\n  template:\n    metadata:\n      labels:\n        app: web\n    spec:\n      containers:\n      - name: app\n        image: nginx:1.25\n        ports:\n        - containerPort: 80"
    },
    {
      "language": "bash",
      "label": "Basic kubectl commands",
      "code": "# View running pods\nkubectl get pods\n\n# View deployments\nkubectl get deployments\n\n# Scale a deployment\nkubectl scale deployment web-app --replicas=5\n\n# View pod logs\nkubectl logs pod-name\n\n# Describe a resource\nkubectl describe pod pod-name"
    }
  ],
  "followUpQuestions": [
    "What is the difference between a Pod and a Container in Kubernetes?",
    "How does Kubernetes know when a container is healthy?",
    "What happens when a container crashes in a Kubernetes pod?"
  ],
  "commonMistakes": [
    "Thinking Kubernetes is always necessary - simpler solutions may suffice",
    "Running stateful applications without understanding persistent storage",
    "Not setting resource limits, leading to noisy neighbor problems"
  ],
  "relatedTopics": [
    "docker",
    "kubernetes",
    "microservices",
    "scaling"
  ]
}
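The self-healing behavior the answer above lists boils down to a reconciliation loop: compare desired state (`replicas: 3` in the Deployment example) with observed state and act on the difference. A toy sketch of that control-loop idea, not the actual controller implementation:

```python
def reconcile(desired: int, running: int) -> str:
    """One reconciliation step: decide what action closes the gap
    between desired and observed replica counts."""
    if running < desired:
        return f"start {desired - running}"   # e.g. a pod crashed
    if running > desired:
        return f"stop {running - desired}"    # e.g. after a scale-down
    return "no-op"                            # state already converged

print(reconcile(3, 1))   # a pod crashed -> "start 2"
```

Kubernetes runs loops like this continuously, which is why a crashed container reappears without any operator action, the scenario the third follow-up question probes.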
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
{
  "id": "database-backup-recovery",
  "slug": "database-backup-recovery",
  "title": "Database Backup and Recovery",
  "question": "Describe database backup strategies and how you would design a recovery plan for production databases.",
  "answer": "Key backup types: 1) Full backups - complete database copy, resource-intensive. 2) Incremental - only changes since last backup. 3) Point-in-time recovery (PITR) - using transaction logs/WAL. Strategy: daily full backups + continuous WAL archiving for PITR. Store backups in separate region/account. Test restores regularly! Recovery plan: define RTO (Recovery Time Objective) and RPO (Recovery Point Objective), document restore procedures, automate where possible, and practice with chaos engineering.",
  "explanation": "Backups are worthless if you can't restore from them. Every organization has horror stories of corrupted backups or untested restore procedures. RTO defines how quickly you must recover, RPO defines maximum acceptable data loss. These requirements drive your backup strategy - if RPO is 5 minutes, you need continuous replication, not daily backups.",
  "category": "Infrastructure",
  "difficulty": "intermediate",
  "tier": "mid",
  "tags": [
    "database",
    "backup",
    "disaster-recovery",
    "postgres",
    "devops"
  ],
  "codeExamples": [
    {
      "language": "bash",
      "label": "PostgreSQL backup strategies",
      "code": "# Logical backup (SQL dump)\npg_dump -h localhost -U postgres mydb > backup.sql\n\n# Compressed backup\npg_dump -h localhost -U postgres -Fc mydb > backup.dump\n\n# Physical backup with pg_basebackup\npg_basebackup -h localhost -U repl_user -D /backups/base \\\n  -Fp -Xs -P\n\n# Restore from dump\npg_restore -h localhost -U postgres -d mydb backup.dump\n\n# Enable WAL archiving in postgresql.conf\n# archive_mode = on\n# archive_command = 'aws s3 cp %p s3://bucket/wal/%f'"
    },
    {
      "language": "yaml",
      "label": "Kubernetes CronJob for backups",
      "code": "apiVersion: batch/v1\nkind: CronJob\nmetadata:\n  name: postgres-backup\nspec:\n  schedule: \"0 2 * * *\"  # Daily at 2 AM\n  jobTemplate:\n    spec:\n      template:\n        spec:\n          containers:\n          - name: backup\n            image: postgres:15\n            command:\n            - /bin/sh\n            - -c\n            - |\n              pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \\\n                gzip | aws s3 cp - s3://backups/$(date +%Y%m%d).sql.gz\n          restartPolicy: OnFailure"
    }
  ],
  "followUpQuestions": [
    "How do you test that backups are actually restorable?",
    "What is the difference between RTO and RPO?",
    "How do you handle backups for databases with terabytes of data?"
  ],
  "commonMistakes": [
    "Never testing restore procedures until an actual disaster",
    "Storing backups in the same region/account as production",
    "Not encrypting backups containing sensitive data",
    "Ignoring backup retention policies and running out of storage"
  ],
  "relatedTopics": [
    "disaster-recovery",
    "postgres",
    "rto-rpo",
    "point-in-time-recovery"
  ]
}
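The answer above ties the backup schedule to RPO: the newest backup must never be older than the maximum data loss you can tolerate. A minimal freshness check along those lines, with illustrative timestamps (a real monitor would list the backup bucket instead of hardcoding dates):

```python
from datetime import datetime, timedelta

def rpo_violated(backup_times: list, rpo: timedelta, now: datetime) -> bool:
    """True if the newest backup is older than the RPO window allows."""
    if not backup_times:
        return True                      # no backups at all is a violation
    return now - max(backup_times) > rpo

# Daily backups at 02:00, checked at noon on Jan 2 (illustrative data).
now = datetime(2024, 1, 2, 12, 0)
backups = [datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 2, 2, 0)]
print(rpo_violated(backups, timedelta(hours=24), now))   # -> False
```

The same data against a 6-hour RPO would flag a violation, which is the explanation's point: a 5-minute RPO rules out daily dumps entirely and forces continuous WAL archiving or replication.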
