anthropics · saurabh-oss · Apr 18, 2026 · Apr 19, 2026
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
@@ -38,7 +38,8 @@
         "./skills/slack-gif-creator",
         "./skills/theme-factory",
         "./skills/web-artifacts-builder",
-        "./skills/webapp-testing"
+        "./skills/webapp-testing",
+        "./skills/software-architecture-review"
       ]
     }
     ,

diff --git a/skills/software-architecture-review/SKILL.md b/skills/software-architecture-review/SKILL.md
@@ -0,0 +1,213 @@
+---
+name: software-architecture-review
+description: >
+  Performs structured software architecture reviews covering design patterns, quality attributes,
+  ADR (Architecture Decision Record) generation, anti-pattern detection, and scoring. Use this skill
+  whenever the user mentions architecture review, system design evaluation, tech stack assessment,
+  ADR creation, reviewing microservices/event-driven/layered/hexagonal designs, Gen AI or RAG
+  architecture review, or asks for architectural fitness scoring — even if they don't say
+  "architecture review" explicitly. Also trigger for questions like "is my design good?",
+  "what's wrong with my system design?", or "how should I structure my AI pipeline?"
+---
+
+# Software Architecture Review
+
+A skill for performing structured, expert-level software architecture reviews. Covers
+traditional enterprise systems, cloud-native architectures, and modern Gen AI / RAG-based systems.
+
+---
+
+## When to Use This Skill
+
+Trigger for any of these intents:
+- "Review my architecture / system design"
+- "What are the trade-offs of this design?"
+- "Generate an ADR for this decision"
+- "Is my RAG pipeline well-architected?"
+- "What anti-patterns exist in my design?"
+- "Score my architecture against quality attributes"
+- Diagrams, C4 models, or architecture descriptions shared for feedback
+
+---
+
+## Review Process
+
+Follow this four-phase process:
+
+### Phase 1 — Understand Context
+
+Gather or infer:
+1. **System type** — Web app, microservices, event-driven, monolith, Gen AI pipeline, RAG system, etc.
+2. **Quality priorities** — Ask the user to rank: Scalability, Security, Maintainability, Observability, Performance, Cost
+3. **Constraints** — Cloud provider, team size, compliance requirements (HIPAA, SOC2, etc.)
+4. **Maturity stage** — POC / MVP / Production / Legacy migration
+
+If the user has shared a diagram or description, extract answers directly from it before asking questions.
+
+---
+
+### Phase 2 — Structural Analysis
+
+Evaluate the architecture against these lenses:
+
+#### Design Patterns
+Check which architectural style is in use and whether it is applied correctly:
+
+| Style | Key Concerns to Evaluate |
+|---|---|
+| Microservices | Service boundaries, data ownership, inter-service contracts |
+| Event-Driven | Topic naming, consumer groups, event schema versioning |
+| Layered (N-Tier) | Layer isolation, cross-layer dependency leaks |
+| Hexagonal (Ports & Adapters) | Port definitions, adapter swap-ability |
+| RAG / Gen AI Pipeline | Chunking strategy, embedding model choice, retrieval accuracy, LLM prompt isolation |
+| CQRS / Event Sourcing | Read/write model separation, event store durability |
+
+#### Quality Attributes Assessment
+
+Score each attribute 1–5 (1 = critical gap, 5 = excellent):
+
+- **Scalability** — Can the system handle 10x load? Where is the bottleneck?
+- **Security** — Auth/AuthZ, secrets management, data-in-transit/at-rest encryption
+- **Observability** — Logging, tracing, metrics — is the system debuggable in prod?
+- **Maintainability** — Modularity, separation of concerns, test coverage design
+- **Resilience** — Circuit breakers, retries, graceful degradation
+- **Cost Efficiency** — Over-provisioned components, expensive API calls without caching
+
+---
+
+### Phase 3 — Anti-Pattern Detection
+
+Check for and flag the following:
+
+**Structural Anti-Patterns**
+- ❌ **Distributed Monolith** — Microservices that share a database or deploy together
+- ❌ **God Service / God Object** — One service/class doing everything
+- ❌ **Chatty Microservices** — Excessive synchronous inter-service calls
+- ❌ **Tight Coupling** — Components that cannot change independently
+- ❌ **Anemic Domain Model** — Domain objects with no behavior, all logic in services
+- ❌ **Spaghetti Integration** — Point-to-point integrations without an abstraction layer
+
+**Gen AI / RAG Specific Anti-Patterns**
+- ❌ **Naive Chunking** — Fixed-size chunking ignoring semantic boundaries
+- ❌ **Missing Retrieval Evaluation** — No feedback loop measuring retrieval relevance
+- ❌ **Prompt Injection Risk** — User input directly concatenated into system prompts
+- ❌ **LLM as Orchestrator Without Guardrails** — Agentic loops without human-in-the-loop or fallback
+- ❌ **Embedding Model Mismatch** — Query embedding model differs from document embedding model
+- ❌ **No Hallucination Mitigation** — No grounding check, citation tracking, or confidence thresholds
+- ❌ **Missing Responsible AI Layer** — No content filtering, bias checks, or audit logging
+
+---
+
+### Phase 4 — Output Generation
+
+Always produce the following output sections:
+
+#### Architecture Scorecard
+
+```
+## Architecture Scorecard: [System Name]
+
+| Quality Attribute | Score (1–5) | Key Finding |
+|---|---|---|
+| Scalability       | X/5         | ... |
+| Security          | X/5         | ... |
+| Observability     | X/5         | ... |
+| Maintainability   | X/5         | ... |
+| Resilience        | X/5         | ... |
+| Cost Efficiency   | X/5         | ... |
+| **Overall**       | **X/5**     | ... |
+```
+
+#### Findings Summary
+
+List findings as:
+- 🔴 **Critical** — Must fix before production
+- 🟡 **Warning** — Should fix in next sprint
+- 🟢 **Positive** — Well-designed aspect worth preserving
+
+#### Recommendations
+
+For each critical/warning finding, provide:
+1. **What** the issue is
+2. **Why** it matters (impact on quality attribute)
+3. **How** to fix it (concrete, actionable — not just "add caching")
+
+#### ADR Generation (if requested or if a major decision is identified)
+
+```markdown
+# ADR-[NUMBER]: [Title]
+
+**Status:** Proposed | Accepted | Deprecated | Superseded
+
+**Context:**
+[What is the problem or situation that prompted this decision?]
+
+**Decision:**
+[What was decided and why?]
+
+**Consequences:**
+- ✅ Positive: ...
+- ❌ Trade-off: ...
+- ⚠️ Risks: ...
+
+**Alternatives Considered:**
+| Option | Pros | Cons |
+|---|---|---|
+| Option A | ... | ... |
+| Option B | ... | ... |
+```
+
+---
+
+## Gen AI / RAG Architecture Review Module
+
+For RAG and Gen AI systems, additionally evaluate:
+
+### RAG Pipeline Checklist
+
+| Component | What to Check |
+|---|---|
+| **Data Ingestion** | Source diversity, update frequency, metadata preservation |
+| **Chunking Strategy** | Semantic vs. fixed-size, overlap, chunk size vs. context window |
+| **Embedding** | Model alignment (query vs. doc), dimensionality, update strategy |
+| **Vector Store** | Index type (HNSW, IVF), distance metric, filtering capability |
+| **Retrieval** | Top-K tuning, hybrid search (dense + sparse), re-ranking |
+| **Prompt Design** | System prompt isolation, context injection, few-shot examples |
+| **LLM Response** | Citation grounding, hallucination mitigation, temperature settings |
+| **Evaluation** | RAGAS or equivalent metrics (faithfulness, relevancy, context recall) |
+| **Responsible AI** | Content filters, audit logging, human-in-the-loop for high-stakes outputs |
+
+### Agentic / Multi-Agent Review (LangGraph, AutoGen, CrewAI)
+
+- Are agent roles and boundaries clearly defined?
+- Is there a supervisor or orchestration pattern?
+- Are there defined exit conditions to prevent infinite loops?
+- Is state management deterministic and recoverable?
+- Are tool calls sandboxed and permission-scoped?
+
+---
+
+## Examples
+
+**Example 1:**
+Input: "Here's my system — React frontend, Node.js BFF, three Python microservices sharing a PostgreSQL database, deployed on Kubernetes."
+Output: Scorecard highlighting Distributed Monolith anti-pattern (shared DB), recommendation to introduce service-specific schemas or migrate to event-driven data ownership, ADR for database-per-service.
+
+**Example 2:**
+Input: "Review my RAG pipeline: I chunk PDFs by 512 tokens, embed with OpenAI text-embedding-3-small, store in Pinecone, retrieve top-5, send to GPT-4o."
+Output: RAG checklist evaluation, flag missing hybrid search and re-ranking, flag no hallucination mitigation layer, score retrieval design, recommend adding RAGAS evaluation.
+
+**Example 3:**
+Input: "Generate an ADR for choosing Kafka over RabbitMQ for our event bus."
+Output: Full ADR document with context, decision rationale, trade-offs, and alternatives comparison table.
+
+---
+
+## Guidelines
+
+- Always explain **why** a finding matters, not just what is wrong — help the user build architectural intuition, not just fix a checklist.
+- Tailor the depth to the system's maturity — a POC needs different advice than a production system handling millions of requests.
+- When reviewing Gen AI systems, always check for Responsible AI coverage — this is a critical quality attribute often overlooked.
+- If the user shares a diagram (C4, sequence, ER), reference it directly in your findings.
+- If no architecture is shared yet, prompt with: "Could you share your architecture diagram, a description of the components, or a C4 model? Even a rough sketch helps."
+- Avoid generic advice like "add caching" — always specify *where*, *what type*, and *why*.
diff --git a/skills/software-architecture-review/evals/evals.json b/skills/software-architecture-review/evals/evals.json
@@ -0,0 +1,95 @@
+{
+  "skill_name": "software-architecture-review",
+  "evals": [
+    {
+      "id": 1,
+      "prompt": "I've got a system where the React frontend calls 6 different Python microservices directly over REST, and all of them share the same PostgreSQL database. It's getting really slow and hard to deploy. Can you review this architecture?",
+      "expected_output": "Scorecard identifying Distributed Monolith anti-pattern (shared DB), Chatty Microservices anti-pattern (direct frontend-to-service calls), recommendations for API Gateway pattern, database-per-service strategy, and an ADR for event-driven migration.",
+      "assertions": [
+        {
+          "name": "identifies-distributed-monolith",
+          "type": "contains_concept",
+          "description": "Response identifies the shared database as a Distributed Monolith anti-pattern",
+          "expected": "Flags shared PostgreSQL database across microservices as Distributed Monolith anti-pattern"
+        },
+        {
+          "name": "produces-scorecard",
+          "type": "format_check",
+          "description": "Response includes an Architecture Scorecard table with quality attribute scores",
+          "expected": "Markdown table with Scalability, Maintainability, Resilience scores and findings"
+        },
+        {
+          "name": "actionable-recommendations",
+          "type": "quality_check",
+          "description": "Recommendations are specific and actionable, not generic",
+          "expected": "Each recommendation includes what, why, and how to fix \u2014 not just 'add caching'"
+        },
+        {
+          "name": "rag-module-not-triggered",
+          "type": "scope_check",
+          "description": "Response does not invoke Gen AI / RAG review module for a non-AI system",
+          "expected": "RAG-specific checklist is not included in the output"
+        }
+      ],
+      "files": []
+    },
+    {
+      "id": 2,
+      "prompt": "Here's my RAG pipeline for a healthcare Q&A bot at my company: I chunk patient documents at 512 fixed tokens, embed them with text-embedding-3-small, store in Pinecone, retrieve top-3, stuff into a GPT-4o prompt and return the answer directly to clinicians. Can you review this?",
+      "expected_output": "RAG pipeline checklist evaluation flagging fixed-size chunking, missing re-ranking, no hallucination mitigation, missing Responsible AI layer for healthcare context, no RAGAS evaluation. Critical flags for prompt injection risk and missing human-in-the-loop for clinical decisions.",
+      "assertions": [
+        {
+          "name": "rag-checklist-triggered",
+          "type": "contains_concept",
+          "description": "Response uses the Gen AI / RAG review module and checklist",
+          "expected": "RAG pipeline checklist is evaluated with chunking, embedding, retrieval, prompt, response sections"
+        },
+        {
+          "name": "responsible-ai-flagged",
+          "type": "critical_check",
+          "description": "Response flags missing Responsible AI layer as critical for healthcare context",
+          "expected": "\ud83d\udd34 Critical finding for missing content filtering, audit logging, or human-in-the-loop in clinical context"
+        },
+        {
+          "name": "hallucination-mitigation-flagged",
+          "type": "contains_concept",
+          "description": "Response identifies missing hallucination mitigation as a risk",
+          "expected": "Flags no grounding check, citation tracking, or confidence thresholds"
+        },
+        {
+          "name": "chunking-anti-pattern",
+          "type": "contains_concept",
+          "description": "Response identifies fixed-size chunking as Naive Chunking anti-pattern",
+          "expected": "Recommends semantic chunking with overlap and boundary-aware splitting"
+        }
+      ],
+      "files": []
+    },
+    {
+      "id": 3,
+      "prompt": "My team is deciding between Kafka and Azure Service Bus for our event-driven order processing system. We're on Azure, team of 8 engineers, and we expect ~50k events/day with occasional spikes to 500k. Can you generate an ADR for this decision?",
+      "expected_output": "Full ADR document with context, decision (Azure Service Bus recommended for Azure-native team at this scale), consequences including positive and trade-offs, alternatives comparison table between Kafka and Azure Service Bus across latency, ops complexity, cost, and Azure integration.",
+      "assertions": [
+        {
+          "name": "adr-format-correct",
+          "type": "format_check",
+          "description": "Output follows the ADR template with all required sections",
+          "expected": "ADR includes Status, Context, Decision, Consequences (positive + trade-offs + risks), and Alternatives Considered table"
+        },
+        {
+          "name": "context-specific-recommendation",
+          "type": "quality_check",
+          "description": "ADR recommendation is tailored to the Azure context and team size, not generic",
+          "expected": "Decision accounts for Azure-native deployment, 8-engineer team operational burden, and 50k-500k event volume"
+        },
+        {
+          "name": "alternatives-table",
+          "type": "format_check",
+          "description": "Alternatives section includes a comparison table, not just prose",
+          "expected": "Markdown table comparing Kafka vs Azure Service Bus on multiple dimensions"
+        }
+      ],
+      "files": []
+    }
+  ]
+}
diff --git a/skills/software-architecture-review/evals/trigger-evals.json b/skills/software-architecture-review/evals/trigger-evals.json
@@ -0,0 +1,66 @@
+[
+  {
+    "query": "our backend is a monolith with 12 modules all tightly coupled, the team wants to migrate to microservices but i'm not sure if the boundaries are right \u2014 can you review the design?",
+    "should_trigger": true
+  },
+  {
+    "query": "just built a langgraph agent with 4 nodes but the loops never terminate properly and i have no idea if the state management is correct \u2014 can someone review this?",
+    "should_trigger": true
+  },
+  {
+    "query": "my rag pipeline chunks at 512 tokens, retrieves top-5, feeds into claude \u2014 getting hallucinations in prod. is there something architecturally wrong here?",
+    "should_trigger": true
+  },
+  {
+    "query": "is it bad that all my microservices share one postgres db? deployment has been a nightmare lately",
+    "should_trigger": true
+  },
+  {
+    "query": "need an ADR for choosing between REST and GraphQL for our new internal API, team is split 50/50",
+    "should_trigger": true
+  },
+  {
+    "query": "here's a c4 diagram of our event-driven system \u2014 can you score it against scalability and resilience attributes?",
+    "should_trigger": true
+  },
+  {
+    "query": "reviewing tech stack for new project: kafka for events, redis for cache, postgres for storage, kubernetes on gcp. good choices? any red flags?",
+    "should_trigger": true
+  },
+  {
+    "query": "our AI pipeline sends raw user input directly into the system prompt with no sanitization \u2014 someone said this is dangerous but i don't understand why",
+    "should_trigger": true
+  },
+  {
+    "query": "write me a python script that reads a csv and outputs a bar chart using matplotlib",
+    "should_trigger": false
+  },
+  {
+    "query": "how do i install kafka on ubuntu 22.04 step by step",
+    "should_trigger": false
+  },
+  {
+    "query": "can you review my pull request? i changed the auth middleware to use jwt instead of sessions",
+    "should_trigger": false
+  },
+  {
+    "query": "what's the difference between sql and nosql databases?",
+    "should_trigger": false
+  },
+  {
+    "query": "create a mcp server that connects to my postgres database and exposes a query tool",
+    "should_trigger": false
+  },
+  {
+    "query": "my react frontend is slow \u2014 can you look at the component re-rendering and suggest optimizations?",
+    "should_trigger": false
+  },
+  {
+    "query": "write unit tests for this user service class in python",
+    "should_trigger": false
+  },
+  {
+    "query": "i need to pick between openai and anthropic apis for my chatbot \u2014 which one is cheaper?",
+    "should_trigger": false
+  }
+]