diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index 03d8e71c4..64a284845 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -38,7 +38,8 @@ "./skills/slack-gif-creator", "./skills/theme-factory", "./skills/web-artifacts-builder", - "./skills/webapp-testing" + "./skills/webapp-testing", + "./skills/software-architecture-review" ] } , diff --git a/skills/software-architecture-review/SKILL.md b/skills/software-architecture-review/SKILL.md new file mode 100644 index 000000000..39a71e1f1 --- /dev/null +++ b/skills/software-architecture-review/SKILL.md @@ -0,0 +1,213 @@ +--- +name: software-architecture-review +description: > + Performs structured software architecture reviews covering design patterns, quality attributes, + ADR (Architecture Decision Record) generation, anti-pattern detection, and scoring. Use this skill + whenever the user mentions architecture review, system design evaluation, tech stack assessment, + ADR creation, reviewing microservices/event-driven/layered/hexagonal designs, Gen AI or RAG + architecture review, or asks for architectural fitness scoring — even if they don't say + "architecture review" explicitly. Also trigger for questions like "is my design good?", + "what's wrong with my system design?", or "how should I structure my AI pipeline?" +--- + +# Software Architecture Review + +A skill for performing structured, expert-level software architecture reviews. Covers +traditional enterprise systems, cloud-native architectures, and modern Gen AI / RAG-based systems. + +--- + +## When to Use This Skill + +Trigger for any of these intents: +- "Review my architecture / system design" +- "What are the trade-offs of this design?" +- "Generate an ADR for this decision" +- "Is my RAG pipeline well-architected?" +- "What anti-patterns exist in my design?" +- "Score my architecture against quality attributes" +- Diagrams, C4 models, or architecture descriptions shared for feedback + +--- + +## Review Process + +Follow this four-phase process: + +### Phase 1 — Understand Context + +Gather or infer: +1. **System type** — Web app, microservices, event-driven, monolith, Gen AI pipeline, RAG system, etc. +2. **Quality priorities** — Ask the user to rank: Scalability, Security, Maintainability, Observability, Performance, Cost +3. **Constraints** — Cloud provider, team size, compliance requirements (HIPAA, SOC2, etc.) +4. **Maturity stage** — POC / MVP / Production / Legacy migration + +If the user has shared a diagram or description, extract answers directly from it before asking questions. + +--- + +### Phase 2 — Structural Analysis + +Evaluate the architecture against these lenses: + +#### Design Patterns +Check which architectural style is in use and whether it is applied correctly: + +| Style | Key Concerns to Evaluate | +|---|---| +| Microservices | Service boundaries, data ownership, inter-service contracts | +| Event-Driven | Topic naming, consumer groups, event schema versioning | +| Layered (N-Tier) | Layer isolation, cross-layer dependency leaks | +| Hexagonal (Ports & Adapters) | Port definitions, adapter swap-ability | +| RAG / Gen AI Pipeline | Chunking strategy, embedding model choice, retrieval accuracy, LLM prompt isolation | +| CQRS / Event Sourcing | Read/write model separation, event store durability | + +#### Quality Attributes Assessment + +Score each attribute 1–5 (1 = critical gap, 5 = excellent): + +- **Scalability** — Can the system handle 10x load? Where is the bottleneck? +- **Security** — Auth/AuthZ, secrets management, data-in-transit/at-rest encryption +- **Observability** — Logging, tracing, metrics — is the system debuggable in prod? +- **Maintainability** — Modularity, separation of concerns, test coverage design +- **Resilience** — Circuit breakers, retries, graceful degradation +- **Cost Efficiency** — Over-provisioned components, expensive API calls without caching + +--- + +### Phase 3 — Anti-Pattern Detection + +Check for and flag the following: + +**Structural Anti-Patterns** +- ❌ **Distributed Monolith** — Microservices that share a database or deploy together +- ❌ **God Service / God Object** — One service/class doing everything +- ❌ **Chatty Microservices** — Excessive synchronous inter-service calls +- ❌ **Tight Coupling** — Components that cannot change independently +- ❌ **Anemic Domain Model** — Domain objects with no behavior, all logic in services +- ❌ **Spaghetti Integration** — Point-to-point integrations without an abstraction layer + +**Gen AI / RAG Specific Anti-Patterns** +- ❌ **Naive Chunking** — Fixed-size chunking ignoring semantic boundaries +- ❌ **Missing Retrieval Evaluation** — No feedback loop measuring retrieval relevance +- ❌ **Prompt Injection Risk** — User input directly concatenated into system prompts +- ❌ **LLM as Orchestrator Without Guardrails** — Agentic loops without human-in-the-loop or fallback +- ❌ **Embedding Model Mismatch** — Query embedding model differs from document embedding model +- ❌ **No Hallucination Mitigation** — No grounding check, citation tracking, or confidence thresholds +- ❌ **Missing Responsible AI Layer** — No content filtering, bias checks, or audit logging + +--- + +### Phase 4 — Output Generation + +Always produce the following output sections: + +#### Architecture Scorecard + +``` +## Architecture Scorecard: [System Name] + +| Quality Attribute | Score (1–5) | Key Finding | +|---|---|---| +| Scalability | X/5 | ... | +| Security | X/5 | ... | +| Observability | X/5 | ... | +| Maintainability | X/5 | ... | +| Resilience | X/5 | ... | +| Cost Efficiency | X/5 | ... | +| **Overall** | **X/5** | ... | +``` + +#### Findings Summary + +List findings as: +- 🔴 **Critical** — Must fix before production +- 🟡 **Warning** — Should fix in next sprint +- 🟢 **Positive** — Well-designed aspect worth preserving + +#### Recommendations + +For each critical/warning finding, provide: +1. **What** the issue is +2. **Why** it matters (impact on quality attribute) +3. **How** to fix it (concrete, actionable — not just "add caching") + +#### ADR Generation (if requested or if a major decision is identified) + +```markdown +# ADR-[NUMBER]: [Title] + +**Status:** Proposed | Accepted | Deprecated | Superseded + +**Context:** +[What is the problem or situation that prompted this decision?] + +**Decision:** +[What was decided and why?] + +**Consequences:** +- ✅ Positive: ... +- ❌ Trade-off: ... +- ⚠️ Risks: ... + +**Alternatives Considered:** +| Option | Pros | Cons | +|---|---|---| +| Option A | ... | ... | +| Option B | ... | ... | +``` + +--- + +## Gen AI / RAG Architecture Review Module + +For RAG and Gen AI systems, additionally evaluate: + +### RAG Pipeline Checklist + +| Component | What to Check | +|---|---| +| **Data Ingestion** | Source diversity, update frequency, metadata preservation | +| **Chunking Strategy** | Semantic vs. fixed-size, overlap, chunk size vs. context window | +| **Embedding** | Model alignment (query vs. doc), dimensionality, update strategy | +| **Vector Store** | Index type (HNSW, IVF), distance metric, filtering capability | +| **Retrieval** | Top-K tuning, hybrid search (dense + sparse), re-ranking | +| **Prompt Design** | System prompt isolation, context injection, few-shot examples | +| **LLM Response** | Citation grounding, hallucination mitigation, temperature settings | +| **Evaluation** | RAGAS or equivalent metrics (faithfulness, relevancy, context recall) | +| **Responsible AI** | Content filters, audit logging, human-in-the-loop for high-stakes outputs | + +### Agentic / Multi-Agent Review (LangGraph, AutoGen, CrewAI) + +- Are agent roles and boundaries clearly defined? +- Is there a supervisor or orchestration pattern? +- Are there defined exit conditions to prevent infinite loops? +- Is state management deterministic and recoverable? +- Are tool calls sandboxed and permission-scoped? + +--- + +## Examples + +**Example 1:** +Input: "Here's my system — React frontend, Node.js BFF, three Python microservices sharing a PostgreSQL database, deployed on Kubernetes." +Output: Scorecard highlighting Distributed Monolith anti-pattern (shared DB), recommendation to introduce service-specific schemas or migrate to event-driven data ownership, ADR for database-per-service. + +**Example 2:** +Input: "Review my RAG pipeline: I chunk PDFs by 512 tokens, embed with OpenAI text-embedding-3-small, store in Pinecone, retrieve top-5, send to GPT-4o." +Output: RAG checklist evaluation, flag missing hybrid search and re-ranking, flag no hallucination mitigation layer, score retrieval design, recommend adding RAGAS evaluation. + +**Example 3:** +Input: "Generate an ADR for choosing Kafka over RabbitMQ for our event bus." +Output: Full ADR document with context, decision rationale, trade-offs, and alternatives comparison table. + +--- + +## Guidelines + +- Always explain **why** a finding matters, not just what is wrong — help the user build architectural intuition, not just fix a checklist. +- Tailor the depth to the system's maturity — a POC needs different advice than a production system handling millions of requests. +- When reviewing Gen AI systems, always check for Responsible AI coverage — this is a critical quality attribute often overlooked. +- If the user shares a diagram (C4, sequence, ER), reference it directly in your findings. +- If no architecture is shared yet, prompt with: "Could you share your architecture diagram, a description of the components, or a C4 model? Even a rough sketch helps." +- Avoid generic advice like "add caching" — always specify *where*, *what type*, and *why*. diff --git a/skills/software-architecture-review/evals/evals.json b/skills/software-architecture-review/evals/evals.json new file mode 100644 index 000000000..42821a610 --- /dev/null +++ b/skills/software-architecture-review/evals/evals.json @@ -0,0 +1,95 @@ +{ + "skill_name": "software-architecture-review", + "evals": [ + { + "id": 1, + "prompt": "I've got a system where the React frontend calls 6 different Python microservices directly over REST, and all of them share the same PostgreSQL database. It's getting really slow and hard to deploy. Can you review this architecture?", + "expected_output": "Scorecard identifying Distributed Monolith anti-pattern (shared DB), Chatty Microservices anti-pattern (direct frontend-to-service calls), recommendations for API Gateway pattern, database-per-service strategy, and an ADR for event-driven migration.", + "assertions": [ + { + "name": "identifies-distributed-monolith", + "type": "contains_concept", + "description": "Response identifies the shared database as a Distributed Monolith anti-pattern", + "expected": "Flags shared PostgreSQL database across microservices as Distributed Monolith anti-pattern" + }, + { + "name": "produces-scorecard", + "type": "format_check", + "description": "Response includes an Architecture Scorecard table with quality attribute scores", + "expected": "Markdown table with Scalability, Maintainability, Resilience scores and findings" + }, + { + "name": "actionable-recommendations", + "type": "quality_check", + "description": "Recommendations are specific and actionable, not generic", + "expected": "Each recommendation includes what, why, and how to fix \u2014 not just 'add caching'" + }, + { + "name": "rag-module-not-triggered", + "type": "scope_check", + "description": "Response does not invoke Gen AI / RAG review module for a non-AI system", + "expected": "RAG-specific checklist is not included in the output" + } + ], + "files": [] + }, + { + "id": 2, + "prompt": "Here's my RAG pipeline for a healthcare Q&A bot at my company: I chunk patient documents at 512 fixed tokens, embed them with text-embedding-3-small, store in Pinecone, retrieve top-3, stuff into a GPT-4o prompt and return the answer directly to clinicians. Can you review this?", + "expected_output": "RAG pipeline checklist evaluation flagging fixed-size chunking, missing re-ranking, no hallucination mitigation, missing Responsible AI layer for healthcare context, no RAGAS evaluation. Critical flags for prompt injection risk and missing human-in-the-loop for clinical decisions.", + "assertions": [ + { + "name": "rag-checklist-triggered", + "type": "contains_concept", + "description": "Response uses the Gen AI / RAG review module and checklist", + "expected": "RAG pipeline checklist is evaluated with chunking, embedding, retrieval, prompt, response sections" + }, + { + "name": "responsible-ai-flagged", + "type": "critical_check", + "description": "Response flags missing Responsible AI layer as critical for healthcare context", + "expected": "\ud83d\udd34 Critical finding for missing content filtering, audit logging, or human-in-the-loop in clinical context" + }, + { + "name": "hallucination-mitigation-flagged", + "type": "contains_concept", + "description": "Response identifies missing hallucination mitigation as a risk", + "expected": "Flags no grounding check, citation tracking, or confidence thresholds" + }, + { + "name": "chunking-anti-pattern", + "type": "contains_concept", + "description": "Response identifies fixed-size chunking as Naive Chunking anti-pattern", + "expected": "Recommends semantic chunking with overlap and boundary-aware splitting" + } + ], + "files": [] + }, + { + "id": 3, + "prompt": "My team is deciding between Kafka and Azure Service Bus for our event-driven order processing system. We're on Azure, team of 8 engineers, and we expect ~50k events/day with occasional spikes to 500k. Can you generate an ADR for this decision?", + "expected_output": "Full ADR document with context, decision (Azure Service Bus recommended for Azure-native team at this scale), consequences including positive and trade-offs, alternatives comparison table between Kafka and Azure Service Bus across latency, ops complexity, cost, and Azure integration.", + "assertions": [ + { + "name": "adr-format-correct", + "type": "format_check", + "description": "Output follows the ADR template with all required sections", + "expected": "ADR includes Status, Context, Decision, Consequences (positive + trade-offs + risks), and Alternatives Considered table" + }, + { + "name": "context-specific-recommendation", + "type": "quality_check", + "description": "ADR recommendation is tailored to the Azure context and team size, not generic", + "expected": "Decision accounts for Azure-native deployment, 8-engineer team operational burden, and 50k-500k event volume" + }, + { + "name": "alternatives-table", + "type": "format_check", + "description": "Alternatives section includes a comparison table, not just prose", + "expected": "Markdown table comparing Kafka vs Azure Service Bus on multiple dimensions" + } + ], + "files": [] + } + ] +} \ No newline at end of file diff --git a/skills/software-architecture-review/evals/trigger-evals.json b/skills/software-architecture-review/evals/trigger-evals.json new file mode 100644 index 000000000..9988a246b --- /dev/null +++ b/skills/software-architecture-review/evals/trigger-evals.json @@ -0,0 +1,66 @@ +[ + { + "query": "our backend is a monolith with 12 modules all tightly coupled, the team wants to migrate to microservices but i'm not sure if the boundaries are right \u2014 can you review the design?", + "should_trigger": true + }, + { + "query": "just built a langgraph agent with 4 nodes but the loops never terminate properly and i have no idea if the state management is correct \u2014 can someone review this?", + "should_trigger": true + }, + { + "query": "my rag pipeline chunks at 512 tokens, retrieves top-5, feeds into claude \u2014 getting hallucinations in prod. is there something architecturally wrong here?", + "should_trigger": true + }, + { + "query": "is it bad that all my microservices share one postgres db? deployment has been a nightmare lately", + "should_trigger": true + }, + { + "query": "need an ADR for choosing between REST and GraphQL for our new internal API, team is split 50/50", + "should_trigger": true + }, + { + "query": "here's a c4 diagram of our event-driven system \u2014 can you score it against scalability and resilience attributes?", + "should_trigger": true + }, + { + "query": "reviewing tech stack for new project: kafka for events, redis for cache, postgres for storage, kubernetes on gcp. good choices? any red flags?", + "should_trigger": true + }, + { + "query": "our AI pipeline sends raw user input directly into the system prompt with no sanitization \u2014 someone said this is dangerous but i don't understand why", + "should_trigger": true + }, + { + "query": "write me a python script that reads a csv and outputs a bar chart using matplotlib", + "should_trigger": false + }, + { + "query": "how do i install kafka on ubuntu 22.04 step by step", + "should_trigger": false + }, + { + "query": "can you review my pull request? i changed the auth middleware to use jwt instead of sessions", + "should_trigger": false + }, + { + "query": "what's the difference between sql and nosql databases?", + "should_trigger": false + }, + { + "query": "create a mcp server that connects to my postgres database and exposes a query tool", + "should_trigger": false + }, + { + "query": "my react frontend is slow \u2014 can you look at the component re-rendering and suggest optimizations?", + "should_trigger": false + }, + { + "query": "write unit tests for this user service class in python", + "should_trigger": false + }, + { + "query": "i need to pick between openai and anthropic apis for my chatbot \u2014 which one is cheaper?", + "should_trigger": false + } +] \ No newline at end of file