feat(resiliency-extension): add new resiliency extension #265
```markdown
## Question: Resiliency Extensions
Should resiliency extension rules be enforced for this project?
```
For a user to answer yes/no to this question, it would help if they understood what "resiliency extension" means. You could say something like "resilient systems are fault-tolerant, highly available, etc."
Done, added that additional information.
raj-jain-aws left a comment
The resilience extension references AWS-specific services (AZs, EC2, EKS, ECS), yet AI-DLC's methodology is cloud- and infrastructure-agnostic. Two paths forward: either generalize the extension to be provider-neutral, or scope it explicitly to AWS deployments via the opt-in prompt — e.g., 'If you are deploying to AWS and want to enable resilience patterns, you may activate this extension.'
Additionally, include A/B test results demonstrating the measurable outcomes this extension produces, so its effectiveness is independently verifiable.
@raj-jain-aws - Made the 3 requested changes:
Resiliency Extension for AI-DLC
Overview
Introduces a resiliency extensions framework to the AI-DLC workflow, enforcing reliability and disaster recovery best practices across all development phases. Rules are derived from the AWS Well-Architected Framework Reliability Pillar (WAR) and the Resilience best practices.
Changes
1. Resiliency Baseline Extension (`extensions/resiliency/baseline/resiliency-baseline.md`)
2. Extension Opt-In (`extensions/resiliency/baseline/resiliency-baseline.opt-in.md`)
3. CloudFormation Templates for Validation
   - `template-baseline.yaml`: Application without resiliency extension
   - `template-resilient.yaml`: Same application with resiliency rules enforced
4. Narrative Documentation
5. Resiliency Template Review Skill (`.kiro/skills/resiliency-template-review.md`)
Requirements Phase Flow
Context Impact
Context-optimized loading: full rules load only on opt-in; opt-out keeps footprint minimal.
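The opt-in gating described above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual loader: the function name `files_to_load` is hypothetical, and it assumes the short opt-in prompt is always loaded while the full rule file is added only on a "yes" answer. The two file paths are the ones listed under Changes.

```python
# Paths taken from this PR's change list.
OPT_IN_PROMPT = "extensions/resiliency/baseline/resiliency-baseline.opt-in.md"
FULL_RULES = "extensions/resiliency/baseline/resiliency-baseline.md"


def files_to_load(answer: str) -> list[str]:
    """Return the extension files to place in context for an opt-in answer.

    Assumption (not confirmed by the PR): the opt-in prompt is always
    loaded, and the full rule set is added only when the user says yes.
    """
    opted_in = answer.strip().lower() in {"y", "yes"}
    return [OPT_IN_PROMPT, FULL_RULES] if opted_in else [OPT_IN_PROMPT]
```

With this shape, an opt-out costs only the size of the opt-in prompt, which is the footprint behavior the note above describes.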
Validation Results
Comparative testing with baseline vs resilient templates using AWS Resilience Hub, CloudFormation deployment validation, and the template review skill.
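For reviewers who want a quick first pass before running Resilience Hub, one way to compare the two templates is to count resources by type and look at the difference. The sketch below assumes each template has already been parsed into a dictionary (parsing is left out, since CloudFormation short-form tags such as `!Ref` need a YAML loader that understands them). The helper name `resource_type_delta` and the toy templates are hypothetical and do not reflect the contents of the files in this PR.

```python
from collections import Counter


def resource_type_delta(baseline: dict, resilient: dict) -> dict[str, int]:
    """Return per-type resource counts of resilient minus baseline.

    Both arguments are CloudFormation templates already parsed into
    dictionaries. Only types whose counts differ are returned.
    """
    def count_types(template: dict) -> Counter:
        resources = template.get("Resources", {})
        return Counter(r.get("Type", "<missing>") for r in resources.values())

    base_counts = count_types(baseline)
    new_counts = count_types(resilient)
    return {
        rtype: new_counts[rtype] - base_counts[rtype]
        for rtype in sorted(set(base_counts) | set(new_counts))
        if new_counts[rtype] != base_counts[rtype]
    }


# Toy templates for illustration only; not the contents of the PR's files.
baseline = {"Resources": {"Table": {"Type": "AWS::DynamoDB::Table"}}}
resilient = {
    "Resources": {
        "Table": {"Type": "AWS::DynamoDB::Table"},
        "Alarm": {"Type": "AWS::CloudWatch::Alarm"},
    }
}
print(resource_type_delta(baseline, resilient))
```

A type-count diff only shows which kinds of resources were added or removed. It does not check property-level settings, so it complements rather than replaces the template review skill and the Resilience Hub assessment.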
A/B Test Results: Measurable Outcomes
Conducted an A/B test by building the same full-stack serverless application twice through AI-DLC: once with the resiliency extension disabled (Group A — baseline) and once with the extension enabled (Group B — resilient). Both groups used identical functional requirements; the only difference was whether the resiliency extension was opted in. Results were independently verified using the Resiliency Template Review Skill against the 15 rules and via AWS Resilience Hub assessment.
Independent verification: All metrics above are verifiable directly from the two CloudFormation templates included in this PR (`template-baseline.yaml` vs `template-resilient.yaml`) using the Resiliency Template Review Skill or by inspecting the templates manually. AWS Resilience Hub assessment is reproducible by deploying both templates and importing each into a Resilience Hub application with the same RTO/RPO policy.
Workflow-level outcomes:
Mitigating Failure Patterns from Recent AWS LSEs
The following maps recent AWS large-scale events (LSEs) to rules in this extension. Workloads built with these rules enforced could have seen reduced customer-side impact, though mitigation depends on workload-specific architecture. These mappings are not guarantees; they illustrate how the rules align with observed failure patterns.
October 2025: DynamoDB DNS Failure (US-EAST-1) — 15-hour outage affecting 70+ services, 1,000+ companies
May 2025: Thermal Event (US-EAST-1 AZ use1-az4) — Cooling failure disrupting EC2/EBS and 15+ services
October 2025: NLB Health Monitor Failure (US-EAST-1) — 15-hour DNS/service-discovery degradation
Common patterns: single-region deployments without failover (RESILIENCY-11, 13), missing or shallow health checks (RESILIENCY-06), inadequate observability (RESILIENCY-05). Workloads built with this extension have baseline protections against these failure modes.
Benefits
Coverage Note
Covers 11 of 13 WAR Reliability Pillar questions. REL 2 (Network Topology) and REL 3 (Service Architecture) are intentionally excluded — addressed by core AI-DLC workflow stages (Infrastructure Design and Application Design respectively).
Testing
Full-stack application deployment on AWS with Resilience Hub assessment against defined RTO/RPO policy targets.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.