Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .claude-plugin/marketplace.json
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,8 @@
"./skills/slack-gif-creator",
"./skills/theme-factory",
"./skills/web-artifacts-builder",
"./skills/webapp-testing"
"./skills/webapp-testing",
"./skills/software-architecture-review"
]
}
,
Expand Down
213 changes: 213 additions & 0 deletions skills/software-architecture-review/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,213 @@
---
name: software-architecture-review
description: >
Performs structured software architecture reviews covering design patterns, quality attributes,
ADR (Architecture Decision Record) generation, anti-pattern detection, and scoring. Use this skill
whenever the user mentions architecture review, system design evaluation, tech stack assessment,
ADR creation, reviewing microservices/event-driven/layered/hexagonal designs, Gen AI or RAG
architecture review, or asks for architectural fitness scoring — even if they don't say
"architecture review" explicitly. Also trigger for questions like "is my design good?",
"what's wrong with my system design?", or "how should I structure my AI pipeline?"
---

# Software Architecture Review

A skill for performing structured, expert-level software architecture reviews. Covers
traditional enterprise systems, cloud-native architectures, and modern Gen AI / RAG-based systems.

---

## When to Use This Skill

Trigger for any of these intents:
- "Review my architecture / system design"
- "What are the trade-offs of this design?"
- "Generate an ADR for this decision"
- "Is my RAG pipeline well-architected?"
- "What anti-patterns exist in my design?"
- "Score my architecture against quality attributes"
- Diagrams, C4 models, or architecture descriptions shared for feedback

---

## Review Process

Follow this four-phase process:

### Phase 1 — Understand Context

Gather or infer:
1. **System type** — Web app, microservices, event-driven, monolith, Gen AI pipeline, RAG system, etc.
2. **Quality priorities** — Ask the user to rank: Scalability, Security, Maintainability, Observability, Performance, Cost
3. **Constraints** — Cloud provider, team size, compliance requirements (HIPAA, SOC2, etc.)
4. **Maturity stage** — POC / MVP / Production / Legacy migration

If the user has shared a diagram or description, extract answers directly from it before asking questions.

---

### Phase 2 — Structural Analysis

Evaluate the architecture against these lenses:

#### Design Patterns
Check which architectural style is in use and whether it is applied correctly:

| Style | Key Concerns to Evaluate |
|---|---|
| Microservices | Service boundaries, data ownership, inter-service contracts |
| Event-Driven | Topic naming, consumer groups, event schema versioning |
| Layered (N-Tier) | Layer isolation, cross-layer dependency leaks |
| Hexagonal (Ports & Adapters) | Port definitions, adapter swap-ability |
| RAG / Gen AI Pipeline | Chunking strategy, embedding model choice, retrieval accuracy, LLM prompt isolation |
| CQRS / Event Sourcing | Read/write model separation, event store durability |

#### Quality Attributes Assessment

Score each attribute 1–5 (1 = critical gap, 5 = excellent):

- **Scalability** — Can the system handle 10x load? Where is the bottleneck?
- **Security** — Auth/AuthZ, secrets management, data-in-transit/at-rest encryption
- **Observability** — Logging, tracing, metrics — is the system debuggable in prod?
- **Maintainability** — Modularity, separation of concerns, test coverage design
- **Resilience** — Circuit breakers, retries, graceful degradation
- **Cost Efficiency** — Over-provisioned components, expensive API calls without caching

---

### Phase 3 — Anti-Pattern Detection

Check for and flag the following:

**Structural Anti-Patterns**
- ❌ **Distributed Monolith** — Microservices that share a database or deploy together
- ❌ **God Service / God Object** — One service/class doing everything
- ❌ **Chatty Microservices** — Excessive synchronous inter-service calls
- ❌ **Tight Coupling** — Components that cannot change independently
- ❌ **Anemic Domain Model** — Domain objects with no behavior, all logic in services
- ❌ **Spaghetti Integration** — Point-to-point integrations without an abstraction layer

**Gen AI / RAG Specific Anti-Patterns**
- ❌ **Naive Chunking** — Fixed-size chunking ignoring semantic boundaries
- ❌ **Missing Retrieval Evaluation** — No feedback loop measuring retrieval relevance
- ❌ **Prompt Injection Risk** — User input directly concatenated into system prompts
- ❌ **LLM as Orchestrator Without Guardrails** — Agentic loops without human-in-the-loop or fallback
- ❌ **Embedding Model Mismatch** — Query embedding model differs from document embedding model
- ❌ **No Hallucination Mitigation** — No grounding check, citation tracking, or confidence thresholds
- ❌ **Missing Responsible AI Layer** — No content filtering, bias checks, or audit logging

---

### Phase 4 — Output Generation

Always produce the following output sections:

#### Architecture Scorecard

```
## Architecture Scorecard: [System Name]

| Quality Attribute | Score (1–5) | Key Finding |
|---|---|---|
| Scalability | X/5 | ... |
| Security | X/5 | ... |
| Observability | X/5 | ... |
| Maintainability | X/5 | ... |
| Resilience | X/5 | ... |
| Cost Efficiency | X/5 | ... |
| **Overall** | **X/5** | ... |
```

#### Findings Summary

List findings as:
- 🔴 **Critical** — Must fix before production
- 🟡 **Warning** — Should fix in next sprint
- 🟢 **Positive** — Well-designed aspect worth preserving

#### Recommendations

For each critical/warning finding, provide:
1. **What** the issue is
2. **Why** it matters (impact on quality attribute)
3. **How** to fix it (concrete, actionable — not just "add caching")

#### ADR Generation (if requested or if a major decision is identified)

```markdown
# ADR-[NUMBER]: [Title]

**Status:** Proposed | Accepted | Deprecated | Superseded

**Context:**
[What is the problem or situation that prompted this decision?]

**Decision:**
[What was decided and why?]

**Consequences:**
- ✅ Positive: ...
- ❌ Trade-off: ...
- ⚠️ Risks: ...

**Alternatives Considered:**
| Option | Pros | Cons |
|---|---|---|
| Option A | ... | ... |
| Option B | ... | ... |
```

---

## Gen AI / RAG Architecture Review Module

For RAG and Gen AI systems, additionally evaluate:

### RAG Pipeline Checklist

| Component | What to Check |
|---|---|
| **Data Ingestion** | Source diversity, update frequency, metadata preservation |
| **Chunking Strategy** | Semantic vs. fixed-size, overlap, chunk size vs. context window |
| **Embedding** | Model alignment (query vs. doc), dimensionality, update strategy |
| **Vector Store** | Index type (HNSW, IVF), distance metric, filtering capability |
| **Retrieval** | Top-K tuning, hybrid search (dense + sparse), re-ranking |
| **Prompt Design** | System prompt isolation, context injection, few-shot examples |
| **LLM Response** | Citation grounding, hallucination mitigation, temperature settings |
| **Evaluation** | RAGAS or equivalent metrics (faithfulness, relevancy, context recall) |
| **Responsible AI** | Content filters, audit logging, human-in-the-loop for high-stakes outputs |

### Agentic / Multi-Agent Review (LangGraph, AutoGen, CrewAI)

- Are agent roles and boundaries clearly defined?
- Is there a supervisor or orchestration pattern?
- Are there defined exit conditions to prevent infinite loops?
- Is state management deterministic and recoverable?
- Are tool calls sandboxed and permission-scoped?

---

## Examples

**Example 1:**
Input: "Here's my system — React frontend, Node.js BFF, three Python microservices sharing a PostgreSQL database, deployed on Kubernetes."
Output: Scorecard highlighting Distributed Monolith anti-pattern (shared DB), recommendation to introduce service-specific schemas or migrate to event-driven data ownership, ADR for database-per-service.

**Example 2:**
Input: "Review my RAG pipeline: I chunk PDFs by 512 tokens, embed with OpenAI text-embedding-3-small, store in Pinecone, retrieve top-5, send to GPT-4o."
Output: RAG checklist evaluation, flag missing hybrid search and re-ranking, flag no hallucination mitigation layer, score retrieval design, recommend adding RAGAS evaluation.

**Example 3:**
Input: "Generate an ADR for choosing Kafka over RabbitMQ for our event bus."
Output: Full ADR document with context, decision rationale, trade-offs, and alternatives comparison table.

---

## Guidelines

- Always explain **why** a finding matters, not just what is wrong — help the user build architectural intuition, not just fix a checklist.
- Tailor the depth to the system's maturity — a POC needs different advice than a production system handling millions of requests.
- When reviewing Gen AI systems, always check for Responsible AI coverage — this is a critical quality attribute often overlooked.
- If the user shares a diagram (C4, sequence, ER), reference it directly in your findings.
- If no architecture is shared yet, prompt with: "Could you share your architecture diagram, a description of the components, or a C4 model? Even a rough sketch helps."
- Avoid generic advice like "add caching" — always specify *where*, *what type*, and *why*.
95 changes: 95 additions & 0 deletions skills/software-architecture-review/evals/evals.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
{
"skill_name": "software-architecture-review",
"evals": [
{
"id": 1,
"prompt": "I've got a system where the React frontend calls 6 different Python microservices directly over REST, and all of them share the same PostgreSQL database. It's getting really slow and hard to deploy. Can you review this architecture?",
"expected_output": "Scorecard identifying Distributed Monolith anti-pattern (shared DB), Chatty Microservices anti-pattern (direct frontend-to-service calls), recommendations for API Gateway pattern, database-per-service strategy, and an ADR for event-driven migration.",
"assertions": [
{
"name": "identifies-distributed-monolith",
"type": "contains_concept",
"description": "Response identifies the shared database as a Distributed Monolith anti-pattern",
"expected": "Flags shared PostgreSQL database across microservices as Distributed Monolith anti-pattern"
},
{
"name": "produces-scorecard",
"type": "format_check",
"description": "Response includes an Architecture Scorecard table with quality attribute scores",
"expected": "Markdown table with Scalability, Maintainability, Resilience scores and findings"
},
{
"name": "actionable-recommendations",
"type": "quality_check",
"description": "Recommendations are specific and actionable, not generic",
"expected": "Each recommendation includes what, why, and how to fix \u2014 not just 'add caching'"
},
{
"name": "rag-module-not-triggered",
"type": "scope_check",
"description": "Response does not invoke Gen AI / RAG review module for a non-AI system",
"expected": "RAG-specific checklist is not included in the output"
}
],
"files": []
},
{
"id": 2,
"prompt": "Here's my RAG pipeline for a healthcare Q&A bot at my company: I chunk patient documents at 512 fixed tokens, embed them with text-embedding-3-small, store in Pinecone, retrieve top-3, stuff into a GPT-4o prompt and return the answer directly to clinicians. Can you review this?",
"expected_output": "RAG pipeline checklist evaluation flagging fixed-size chunking, missing re-ranking, no hallucination mitigation, missing Responsible AI layer for healthcare context, no RAGAS evaluation. Critical flags for prompt injection risk and missing human-in-the-loop for clinical decisions.",
"assertions": [
{
"name": "rag-checklist-triggered",
"type": "contains_concept",
"description": "Response uses the Gen AI / RAG review module and checklist",
"expected": "RAG pipeline checklist is evaluated with chunking, embedding, retrieval, prompt, response sections"
},
{
"name": "responsible-ai-flagged",
"type": "critical_check",
"description": "Response flags missing Responsible AI layer as critical for healthcare context",
"expected": "\ud83d\udd34 Critical finding for missing content filtering, audit logging, or human-in-the-loop in clinical context"
},
{
"name": "hallucination-mitigation-flagged",
"type": "contains_concept",
"description": "Response identifies missing hallucination mitigation as a risk",
"expected": "Flags no grounding check, citation tracking, or confidence thresholds"
},
{
"name": "chunking-anti-pattern",
"type": "contains_concept",
"description": "Response identifies fixed-size chunking as Naive Chunking anti-pattern",
"expected": "Recommends semantic chunking with overlap and boundary-aware splitting"
}
],
"files": []
},
{
"id": 3,
"prompt": "My team is deciding between Kafka and Azure Service Bus for our event-driven order processing system. We're on Azure, team of 8 engineers, and we expect ~50k events/day with occasional spikes to 500k. Can you generate an ADR for this decision?",
"expected_output": "Full ADR document with context, decision (Azure Service Bus recommended for Azure-native team at this scale), consequences including positive and trade-offs, alternatives comparison table between Kafka and Azure Service Bus across latency, ops complexity, cost, and Azure integration.",
"assertions": [
{
"name": "adr-format-correct",
"type": "format_check",
"description": "Output follows the ADR template with all required sections",
"expected": "ADR includes Status, Context, Decision, Consequences (positive + trade-offs + risks), and Alternatives Considered table"
},
{
"name": "context-specific-recommendation",
"type": "quality_check",
"description": "ADR recommendation is tailored to the Azure context and team size, not generic",
"expected": "Decision accounts for Azure-native deployment, 8-engineer team operational burden, and 50k-500k event volume"
},
{
"name": "alternatives-table",
"type": "format_check",
"description": "Alternatives section includes a comparison table, not just prose",
"expected": "Markdown table comparing Kafka vs Azure Service Bus on multiple dimensions"
}
],
"files": []
}
]
}
66 changes: 66 additions & 0 deletions skills/software-architecture-review/evals/trigger-evals.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
[
{
"query": "our backend is a monolith with 12 modules all tightly coupled, the team wants to migrate to microservices but i'm not sure if the boundaries are right \u2014 can you review the design?",
"should_trigger": true
},
{
"query": "just built a langgraph agent with 4 nodes but the loops never terminate properly and i have no idea if the state management is correct \u2014 can someone review this?",
"should_trigger": true
},
{
"query": "my rag pipeline chunks at 512 tokens, retrieves top-5, feeds into claude \u2014 getting hallucinations in prod. is there something architecturally wrong here?",
"should_trigger": true
},
{
"query": "is it bad that all my microservices share one postgres db? deployment has been a nightmare lately",
"should_trigger": true
},
{
"query": "need an ADR for choosing between REST and GraphQL for our new internal API, team is split 50/50",
"should_trigger": true
},
{
"query": "here's a c4 diagram of our event-driven system \u2014 can you score it against scalability and resilience attributes?",
"should_trigger": true
},
{
"query": "reviewing tech stack for new project: kafka for events, redis for cache, postgres for storage, kubernetes on gcp. good choices? any red flags?",
"should_trigger": true
},
{
"query": "our AI pipeline sends raw user input directly into the system prompt with no sanitization \u2014 someone said this is dangerous but i don't understand why",
"should_trigger": true
},
{
"query": "write me a python script that reads a csv and outputs a bar chart using matplotlib",
"should_trigger": false
},
{
"query": "how do i install kafka on ubuntu 22.04 step by step",
"should_trigger": false
},
{
"query": "can you review my pull request? i changed the auth middleware to use jwt instead of sessions",
"should_trigger": false
},
{
"query": "what's the difference between sql and nosql databases?",
"should_trigger": false
},
{
"query": "create a mcp server that connects to my postgres database and exposes a query tool",
"should_trigger": false
},
{
"query": "my react frontend is slow \u2014 can you look at the component re-rendering and suggest optimizations?",
"should_trigger": false
},
{
"query": "write unit tests for this user service class in python",
"should_trigger": false
},
{
"query": "i need to pick between openai and anthropic apis for my chatbot \u2014 which one is cheaper?",
"should_trigger": false
}
]