feat(resiliency-extension): add new resiliency extension #265

Open
ambavaar wants to merge 3 commits into awslabs:main from ambavaar:feature/resiliency-extension

@ambavaar ambavaar commented May 14, 2026

Resiliency Extension for AI-DLC

Overview

Introduces a resiliency extension to the AI-DLC workflow that enforces reliability and disaster recovery best practices across all development phases. Rules are derived from the AWS Well-Architected Framework Reliability Pillar (WAR) and resilience best practices.

Changes

1. Resiliency Baseline Extension (extensions/resiliency/baseline/resiliency-baseline.md)

  • 15 rules organized across six pillars: Business Goals, Change Management & Automation, Integrated Observability, High Availability, Disaster Recovery, Continuous Improvement
  • Verification criteria for each rule; blocking findings for non-compliance
  • Maps to WAR Reliability questions REL 1 through REL 13 (11 of 13 covered)
  • RESILIENCY-02 includes embedded follow-up question capturing RTO/RPO goals before finalizing requirements

2. Extension Opt-In (extensions/resiliency/baseline/resiliency-baseline.opt-in.md)

  • Opt-in question during requirements phase (enforce rules Yes/No)
  • If user opts IN, RESILIENCY-02 triggers follow-up question on RTO/RPO and DR strategy (Backup & Restore, Pilot Light, Warm Standby, Active/Active)

3. CloudFormation Templates for Validation

  • template-baseline.yaml: Application without resiliency extension
  • template-resilient.yaml: Same application with resiliency rules enforced

4. Narrative Documentation

  • One-pager narrative document introducing the extension

5. Resiliency Template Review Skill (.kiro/skills/resiliency-template-review.md)

  • Activatable skill to review CloudFormation templates against the 15 rules
  • Side-by-side comparison (baseline vs resilient) with gap analysis and recommendations

Requirements Phase Flow

  1. Opt-in question — Should resiliency rules be enforced? (Yes / No)
  2. Follow-up question (only if opted IN) — Asked as part of RESILIENCY-02: captures RTO/RPO targets and DR strategy. Propagates to all downstream stages.
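
The two-step flow above can be sketched as follows. This is a minimal illustration, not the extension's actual mechanism: `RequirementsContext`, `run_requirements_flow`, the answer dictionary keys, and the choice of minutes as the unit are all hypothetical. Only the two file names and the four DR strategies come from this PR, and the sketch assumes the opt-in file is always loaded so the question can be asked.

```python
from dataclasses import dataclass, field
from typing import Optional

# DR strategies offered by the RESILIENCY-02 follow-up (from the PR description).
DR_STRATEGIES = ("Backup & Restore", "Pilot Light", "Warm Standby", "Active/Active")


@dataclass
class RequirementsContext:
    """Hypothetical record of what the requirements phase captured."""
    resiliency_enabled: bool = False
    rto_minutes: Optional[int] = None
    rpo_minutes: Optional[int] = None
    dr_strategy: Optional[str] = None
    loaded_files: list = field(default_factory=list)


def run_requirements_flow(opt_in: bool, follow_up: Optional[dict] = None) -> RequirementsContext:
    """Apply the opt-in answer, then the RESILIENCY-02 follow-up if opted in."""
    # Assumption: the small opt-in file is loaded either way to ask the question.
    ctx = RequirementsContext(loaded_files=["resiliency-baseline.opt-in.md"])
    if not opt_in:
        # Opted out: no follow-up is asked and the full rules stay unloaded.
        return ctx

    if follow_up is None:
        raise ValueError("RESILIENCY-02 follow-up answers are required after opt-in")
    if follow_up["dr_strategy"] not in DR_STRATEGIES:
        raise ValueError(f"unknown DR strategy: {follow_up['dr_strategy']}")

    ctx.resiliency_enabled = True
    ctx.loaded_files.append("resiliency-baseline.md")
    ctx.rto_minutes = follow_up["rto_minutes"]
    ctx.rpo_minutes = follow_up["rpo_minutes"]
    ctx.dr_strategy = follow_up["dr_strategy"]
    return ctx


enabled = run_requirements_flow(
    True, {"rto_minutes": 60, "rpo_minutes": 15, "dr_strategy": "Warm Standby"}
)
disabled = run_requirements_flow(False)
```

Downstream stages would then read the captured targets from the returned context rather than asking again.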

Context Impact

| File | Approx. Tokens |
|---|---|
| resiliency-baseline.md (rules) | ~4,200 |
| resiliency-baseline.opt-in.md | ~170 |
| Total when enabled | ~4,370 |

Context-optimized loading: full rules load only on opt-in; opt-out keeps footprint minimal.
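
As a quick check of the figures above, the enabled footprint is just the sum of the two files. The sketch below uses the approximate counts quoted in this PR; the helper names and the 200,000-token window are illustrative assumptions, not values from the extension.

```python
# Approximate token counts quoted in the Context Impact section.
APPROX_TOKENS = {
    "resiliency-baseline.md": 4200,
    "resiliency-baseline.opt-in.md": 170,
}


def enabled_footprint(tokens: dict) -> int:
    """Total approximate tokens added to context when the extension is enabled."""
    return sum(tokens.values())


def share_of_window(tokens: int, window: int) -> float:
    """Fraction of a context window consumed; the window size is caller-supplied."""
    return tokens / window


total = enabled_footprint(APPROX_TOKENS)
print(total)  # 4370

# 200_000 is an illustrative window size, not a figure from this PR.
print(f"{share_of_window(total, 200_000):.1%} of the window")
```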

Validation Results

Comparative testing with baseline vs resilient templates using AWS Resilience Hub, CloudFormation deployment validation, and the template review skill.

  • ✅ Resiliency requirements integrated into NFR requirements and infrastructure design
  • ✅ Observability (alarms, dashboards, tracing) generated by default when enabled
  • ✅ DR strategy and cross-region replication reflected in templates
  • ✅ Health check endpoints and monitoring in application design
  • ✅ Service quota awareness enforced during design phase
  • ✅ IaC templates aligned with defined RTO/RPO targets
  • ✅ Resiliency controls verified in application code and AWS resource configurations

A/B Test Results: Measurable Outcomes

Conducted an A/B test by building the same full-stack serverless application twice through AI-DLC: once with the resiliency extension disabled (Group A — baseline) and once with the extension enabled (Group B — resilient). Both groups used identical functional requirements; the only difference was whether the resiliency extension was opted in. Results were independently verified using the Resiliency Template Review Skill against the 15 rules and via AWS Resilience Hub assessment.

| Measurable Outcome | Group A (Baseline) | Group B (Resilient) | Delta |
|---|---|---|---|
| Resiliency rules compliant | 3 / 15 | 9 / 15 | +6 rules (+200%) |
| Resiliency rules non-compliant | 7 / 15 | 1 / 15 | -6 rules (-86%) |
| CloudWatch alarms configured | 1 | 5 | +4 alarms |
| Distributed tracing enabled | No | Yes (Lambda + API Gateway) | New capability |
| Dashboard configurations | 0 | 1 | +1 dashboard |
| Health check endpoints | 0 | 1 | +1 endpoint |
| Cross-region DR resources | 0 | 2 (DR bucket, replication role) | New capability |
| S3 versioning for point-in-time recovery | Disabled | Enabled | New capability |
| Reserved/provisioned concurrency for Lambda | Not configured | Configured | New capability |
| Documented RTO/RPO targets | None | RTO/RPO tags on all resources | New capability |
| RESILIENCY-02 follow-up question asked during requirements | No | Yes | Process improvement |
| WAR Reliability Pillar question coverage | 4 / 13 | 11 / 13 | +7 questions covered |
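
The Delta column for the numeric rows can be reproduced with simple arithmetic. The sketch below is illustrative only: `delta_and_pct` is a hypothetical helper, and the percent change is taken relative to the Group A value, which matches the +200% and -86% figures quoted above.

```python
from typing import Optional, Tuple


def delta_and_pct(baseline: int, resilient: int) -> Tuple[int, Optional[int]]:
    """Return (absolute delta, percent change relative to baseline).

    The percentage is None when the baseline is 0, since the relative
    change is undefined in that case.
    """
    delta = resilient - baseline
    pct = None if baseline == 0 else round(100 * delta / baseline)
    return delta, pct


# (Group A, Group B) counts copied from the numeric rows of the table.
COUNTS = {
    "Resiliency rules compliant": (3, 9),
    "Resiliency rules non-compliant": (7, 1),
    "CloudWatch alarms configured": (1, 5),
    "Dashboard configurations": (0, 1),
    "WAR Reliability Pillar question coverage": (4, 11),
}

for name, (group_a, group_b) in COUNTS.items():
    delta, pct = delta_and_pct(group_a, group_b)
    pct_text = "n/a" if pct is None else f"{pct:+d}%"
    print(f"{name}: {delta:+d} ({pct_text})")
```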

Independent verification: All metrics above are verifiable directly from the two CloudFormation templates included in this PR (template-baseline.yaml vs template-resilient.yaml) using the Resiliency Template Review Skill or by inspecting the templates manually. AWS Resilience Hub assessment is reproducible by deploying both templates and importing each into a Resilience Hub application with the same RTO/RPO policy.
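
For the manual-inspection route, one rough way to compare the two templates is to count resource types in each. The sketch below is an assumption-laden illustration, not the review skill: it scans template text with a regular expression instead of parsing YAML, the helper names are hypothetical, and the inline samples are made up rather than copied from `template-baseline.yaml` or `template-resilient.yaml`.

```python
import re
from collections import Counter

# Matches lines such as "    Type: AWS::CloudWatch::Alarm".
# Limitation: AWS-specific parameter types (e.g. AWS::EC2::KeyPair::KeyName)
# would also match, so treat the counts as a first-pass signal only.
RESOURCE_TYPE = re.compile(
    r"^[ \t]*Type:[ \t]*['\"]?(AWS::[A-Za-z0-9:]+)['\"]?[ \t]*$", re.MULTILINE
)


def count_resource_types(template_text: str) -> Counter:
    """Count occurrences of each AWS resource type in a template's text."""
    return Counter(RESOURCE_TYPE.findall(template_text))


def compare_templates(baseline_text: str, resilient_text: str) -> dict:
    """Map each resource type whose count differs to (baseline, resilient)."""
    a = count_resource_types(baseline_text)
    b = count_resource_types(resilient_text)
    return {t: (a[t], b[t]) for t in sorted(set(a) | set(b)) if a[t] != b[t]}


# Made-up sample templates for illustration only.
SAMPLE_BASELINE = """\
Resources:
  DataBucket:
    Type: AWS::S3::Bucket
  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
"""

SAMPLE_RESILIENT = """\
Resources:
  DataBucket:
    Type: AWS::S3::Bucket
  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
  ThrottleAlarm:
    Type: AWS::CloudWatch::Alarm
  OpsDashboard:
    Type: AWS::CloudWatch::Dashboard
"""

diff = compare_templates(SAMPLE_BASELINE, SAMPLE_RESILIENT)
print(diff)
# {'AWS::CloudWatch::Alarm': (1, 2), 'AWS::CloudWatch::Dashboard': (0, 1)}
```

To apply this to the real templates, read each file's text and pass both strings to `compare_templates`.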

Workflow-level outcomes:

  • Group A produced no resiliency-specific NFR requirements; Group B produced explicit RTO/RPO targets, DR strategy selection, and observability requirements that propagated to design and infrastructure stages
  • Group B's design and infrastructure artifacts referenced resiliency rules by ID, enabling traceability from requirement to implementation
  • Group B's stage completion summaries included a "Resiliency Compliance" section identifying compliant, non-compliant, and N/A rules at each stage
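
A "Resiliency Compliance" section of the kind described in the last bullet could be rendered roughly as below. This is a hypothetical sketch: the function name, output layout, and sample findings are assumptions for illustration and do not reproduce the actual stage summaries or the A/B results.

```python
from collections import defaultdict

# The three statuses named in the stage completion summaries.
STATUSES = ("compliant", "non-compliant", "N/A")


def render_compliance_section(stage: str, findings: dict) -> str:
    """Group rule IDs by status and render a plain-text summary section."""
    grouped = defaultdict(list)
    for rule_id, status in findings.items():
        if status not in STATUSES:
            raise ValueError(f"unknown status for {rule_id}: {status}")
        grouped[status].append(rule_id)

    lines = [f"Resiliency Compliance ({stage})"]
    for status in STATUSES:
        rules = ", ".join(sorted(grouped[status])) or "none"
        lines.append(f"  {status}: {rules}")
    return "\n".join(lines)


# Illustrative findings only; not taken from the test runs in this PR.
sample_findings = {
    "RESILIENCY-05": "non-compliant",
    "RESILIENCY-02": "compliant",
    "RESILIENCY-08": "N/A",
    "RESILIENCY-11": "compliant",
}
section = render_compliance_section("Infrastructure Design", sample_findings)
print(section)
```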

Mitigating Failure Patterns from Recent AWS LSEs

The following maps recent AWS large-scale events (LSEs) to rules in this extension. Workloads built with these rules enforced could have reduced customer-side impact, though mitigation depends on workload-specific architecture. These mappings are not guarantees; they illustrate how the rules align with observed failure patterns.

October 2025: DynamoDB DNS Failure (US-EAST-1) — 15-hour outage affecting 70+ services, 1,000+ companies

  • RESILIENCY-11 (DR Strategy): Multi-region workloads could shift traffic to unaffected regions
  • RESILIENCY-12 (Backup/Replication): Cross-region replication via Global Tables supports data availability
  • RESILIENCY-13 (Failover): Route 53 health-checked routing could redirect traffic
  • RESILIENCY-06 (Health Checks): Deep health checks could detect degradation sooner

May 2025: Thermal Event (US-EAST-1 AZ use1-az4) — Cooling failure disrupting EC2/EBS and 15+ services

  • RESILIENCY-08 (Multi-AZ): Multi-AZ workloads continue during single-AZ failure
  • RESILIENCY-09 (Auto-Scaling): Healthy AZs absorb traffic
  • RESILIENCY-06 (Health Checks): Load balancer health checks remove unhealthy instances
  • RESILIENCY-05 (Monitoring): Early detection via CloudWatch alarms

October 2025: NLB Health Monitor Failure (US-EAST-1) — 15-hour DNS/service-discovery degradation

  • RESILIENCY-10 (Circuit Breaking): Circuit breakers prevent cascading failures
  • RESILIENCY-11 (DR Strategy): Warm standby supports failover during regional degradation
  • RESILIENCY-07 (Resiliency Monitoring): Earlier warning signals via replication lag alarms
  • RESILIENCY-13 (Failover): Documented runbooks reduce time to recovery

Common patterns: single-region deployments without failover (RESILIENCY-11, 13), missing or shallow health checks (RESILIENCY-06), inadequate observability (RESILIENCY-05). Workloads built with this extension have baseline protections against these failure modes.
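
The event-to-rule mapping above can be tallied to see which rules are cited for more than one event. The sketch below only restates the lists in this section; the function name is hypothetical, and the tally counts citations in this PR description, not real-world effectiveness.

```python
from collections import Counter

# Rule IDs cited under each event in this section.
EVENT_RULES = {
    "October 2025: DynamoDB DNS Failure": [
        "RESILIENCY-11", "RESILIENCY-12", "RESILIENCY-13", "RESILIENCY-06",
    ],
    "May 2025: Thermal Event": [
        "RESILIENCY-08", "RESILIENCY-09", "RESILIENCY-06", "RESILIENCY-05",
    ],
    "October 2025: NLB Health Monitor Failure": [
        "RESILIENCY-10", "RESILIENCY-11", "RESILIENCY-07", "RESILIENCY-13",
    ],
}


def rules_cited_for_multiple_events(event_rules: dict) -> list:
    """Return rule IDs that appear under more than one event, sorted."""
    counts = Counter(rule for rules in event_rules.values() for rule in set(rules))
    return sorted(rule for rule, n in counts.items() if n > 1)


print(rules_cited_for_multiple_events(EVENT_RULES))
# ['RESILIENCY-06', 'RESILIENCY-11', 'RESILIENCY-13']
```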

Benefits

  • Enforces resiliency by default for production applications
  • Flexibility for PoCs/prototypes via opt-in/opt-out
  • Clear verification criteria at each workflow stage
  • Aligns with WAR Reliability Pillar and RRR
  • Captures RTO/RPO targets early (via RESILIENCY-02 follow-up)
  • Extensible for organization-specific requirements
  • Shifts resiliency left into design phase
  • Addresses failure patterns from recent AWS LSEs
  • Minimal context overhead (~4,370 tokens when enabled, zero when opted out)

Coverage Note

Covers 11 of 13 WAR Reliability Pillar questions. REL 2 (Network Topology) and REL 3 (Service Architecture) are intentionally excluded — addressed by core AI-DLC workflow stages (Infrastructure Design and Application Design respectively).

Testing

Full-stack application deployment on AWS with Resilience Hub assessment against defined RTO/RPO policy targets.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

@ambavaar ambavaar requested review from a team as code owners May 14, 2026 13:58
@github-actions github-actions Bot added the rules label May 14, 2026
@ambavaar ambavaar changed the title from "Feature/resiliency extension" to "Feat/resiliency extension" May 14, 2026
@ambavaar ambavaar changed the title from "Feat/resiliency extension" to "feat (resiliency extension) : Add new extension for resiliency" May 14, 2026
@ambavaar ambavaar changed the title from "feat (resiliency extension) : Add new extension for resiliency" to "feat(resiliency extension) : Add new extension for resiliency" May 15, 2026
@ambavaar ambavaar changed the title from "feat(resiliency extension) : Add new extension for resiliency" to "feat(resiliency-extension): add new resiliency extension" May 15, 2026

```markdown
## Question: Resiliency Extensions
Should resiliency extension rules be enforced for this project?
```
Contributor

For a user to answer yes/no to this question, it would help if they understood what "resiliency extension" means. You could add something like "resilient systems are fault-tolerant, highly available, etc."

Author

Done, added that additional information.

Contributor

@raj-jain-aws raj-jain-aws left a comment

The resilience extension references AWS-specific services (AZs, EC2, EKS, ECS), yet AI-DLC's methodology is cloud- and infrastructure-agnostic. Two paths forward: either generalize the extension to be provider-neutral, or scope it explicitly to AWS deployments via the opt-in prompt — e.g., 'If you are deploying to AWS and want to enable resilience patterns, you may activate this extension.'

Additionally, include A/B test results demonstrating the measurable outcomes this extension produces, so its effectiveness is independently verifiable.

@ambavaar
Author

@raj-jain-aws - Made the 3 requested changes:
1- Added a description to the opt-in question.
2- Made the extension generic and cloud-agnostic.
3- Added A/B test results. We built a simple app with and without the extension, which generated 2 CloudFormation templates. Then we compared the two using:
a. A skills file
b. AWS Resilience Hub, by uploading the 2 templates and running the assessment.
