himanshu-nocodeassistant
diff --git a/.github/workflows/safety-gate.yml b/.github/workflows/safety-gate.yml
Lines changed: 41 additions & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 40 additions & 31 deletions
diff --git a/DESIGN.md b/DESIGN.md
Lines changed: 74 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 40 additions & 16 deletions
@@ -0,0 +1,41 @@
+name: Safety Gate
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  unit-tests:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install dependencies
+        run: pip install -r requirements.txt
+
+      - name: Run unit tests (no API key needed)
+        run: python -m pytest tests/unit/ -v
+
+  independent-audit:
+    runs-on: ubuntu-latest
+    needs: unit-tests
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install dependencies
+        run: pip install -r requirements.txt
+
+      - name: Run independent attack audit
+        run: python run_independent_audit.py --fail-above 0.30
+        env:
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
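The `--fail-above 0.30` flag in the workflow above implies that the audit script exits non-zero when the measured attack-success-rate exceeds the threshold, which is what would fail the job. The diff does not show `run_independent_audit.py`, so this is only a minimal sketch of that pattern. `parse_args`, `attack_success_rate`, and `gate_exit_code` are hypothetical names, and the strict `>` comparison is an assumption.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical CLI that mirrors the flag used in the workflow.
    parser = argparse.ArgumentParser(description="ASR gate sketch")
    parser.add_argument(
        "--fail-above",
        type=float,
        default=None,
        help="fail when ASR is strictly above this fraction (assumed semantics)",
    )
    return parser.parse_args(argv)


def attack_success_rate(outcomes):
    # outcomes: list of booleans, True when an attack succeeded.
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


def gate_exit_code(outcomes, fail_above):
    # 0 = pass, 1 = fail. Without a threshold the run is report-only.
    if fail_above is None:
        return 0
    return 1 if attack_success_rate(outcomes) > fail_above else 0
```

A script built this way could end with `sys.exit(gate_exit_code(outcomes, args.fail_above))`, so the CI step fails only when the threshold is crossed.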
@@ -7,34 +7,40 @@ LLM safety layer + red-team harness for a small knowledge assistant. Each phase
 ```
 src/agent/
   assistant.py       # Unguarded endpoint (run_unguarded)
-  guarded.py         # Guarded endpoints — phase1, phase3; stacks on top of unguarded
+  guarded.py         # Guarded endpoints — phase1, phase3, phase4, phase4_rag
   rag_assistant.py   # RAG variant (retrieves docs before answering)
   documents.py       # In-memory document store for RAG
-  tools.py           # Tool definitions (user lookup, etc.)
+  vector_store.py    # OpenAI embeddings vector store (real retrieval)
+  tools.py           # Tool definitions (user lookup, 10-user database)
 
 src/guardrails/
-  input_guard.py     # Pattern-based input scanner (direct/indirect injection)
-  content_guard.py   # Scans retrieved documents and tool outputs (Phase 2)
-  pii_guard.py       # Output redactor — session-value + regex (Phase 3)
-  schema_guard.py    # JSON schema enforcement (Phase 4, planned)
-  tool_guard.py      # Tool call validator (Phase 5, planned)
+  input_guard.py       # Pattern-based input scanner (direct/indirect injection)
+  content_guard.py     # Scans retrieved documents and tool outputs (Phase 2)
+  moderation_guard.py  # OpenAI Moderation API classifier layer (Phase 5)
+  pii_guard.py         # Output redactor — session-value + regex + encoded variants (Phase 3)
+  session_guard.py     # Multi-turn PII probing detector (Phase 5)
+  rate_guard.py        # Per-session rate limiting and abuse detection (Phase 5)
+  schema_guard.py      # JSON schema enforcement + harmful content rejection (Phase 4/5)
+  tool_guard.py        # Config-driven tool allowlist + per-tool authorization (Phase 4/5)
 
 harness/
   runner.py          # run_attacks() / run_indirect_attacks() / summarize()
-  judge.py           # LLM judge — returns (succeeded: bool, reasoning: str)
+  judge.py           # LLM judge (gpt-4o, stronger than assistant) — returns (succeeded, reasoning)
   report.py          # Rich table printer + JSON serialiser
 
-tests/attacks/
-  payloads.py             # Primary attack set (direct injection, jailbreak, PII, schema)
-  indirect_payloads.py    # Phase 2: RAG and tool-output injection attacks
-  tool_call_payloads.py   # Phase 5: unsafe tool-call attacks
-  independent_payloads.py # External benchmark attacks (JailbreakBench / HarmBench)
+tests/unit/            # Deterministic guard tests — no API key needed
+tests/attacks/         # Attack payload sets
+tests/evals/           # Evaluation-level tests
+
+.github/workflows/
+  safety-gate.yml    # CI: unit tests (always) + independent audit (with API key)
 
 run_baseline.py           # Unguarded baseline across all categories
 run_phase1.py             # Direct injection before/after
 run_phase2.py             # Indirect injection (RAG + tool) before/after
 run_phase3.py             # PII output redaction before/after
-run_independent_audit.py  # Independent set — tests guard generalisation
+run_phase4.py             # Schema enforcement + tool authorization before/after
+run_independent_audit.py  # Independent set — all guards, CI gate (--fail-above)
 ```
 
 ## Setup
@@ -45,16 +51,20 @@ pip install -r requirements.txt
 cp .env.example .env          # add your OPENAI_API_KEY
 ```
 
-## Running evaluations
-
-All scripts share the same flags: `--verbose` (per-attack detail), `--dry-run` (list attacks, no API calls).
+## Running tests and evaluations
 
 ```bash
-python run_baseline.py                  # Unguarded baseline
-python run_phase1.py                    # Direct injection
-python run_phase2.py                    # Indirect injection
-python run_phase3.py                    # PII redaction
-python run_independent_audit.py         # Generalisation check
+# Unit tests — fast, deterministic, no API key
+python -m pytest tests/unit/ -v
+
+# Red-team evaluations (require OPENAI_API_KEY)
+python run_baseline.py                              # Unguarded baseline
+python run_phase1.py                                # Direct injection
+python run_phase2.py                                # Indirect injection
+python run_phase3.py                                # PII redaction
+python run_phase4.py                                # Schema + tool authorization
+python run_independent_audit.py                     # Generalisation check — all guards
+python run_independent_audit.py --fail-above 0.20   # Stricter CI threshold
 ```
 
 Output JSON and markdown snapshots are saved to `plans/decisions/`.
@@ -68,28 +78,27 @@ Output JSON and markdown snapshots are saved to `plans/decisions/`.
 | 2 | Indirect injection (content_guard) | Done | 0% |
 | 3 | PII output redaction (pii_guard) | Done | 0% |
 | 4 | Schema enforcement + tool authorization (schema_guard + tool_guard) | Done | 0% |
-| — | CI safety gate | Planned | — |
-
-Independent audit (Phase 1 guard, 15 external attacks): 20% ASR — three misses: ROT13-encoded instruction, Italian override, end-of-message bracket escape.
+| 5 | Gap closure — moderation API, unit tests, multi-turn tracking, rate limiting, prompt hardening, schema rejection, tool allowlist, expanded user DB | Done | TBD |
 
 ## Attack categories
 
 | Category | Channel | Payload location |
 |---|---|---|
-| direct_injection | User input | `payloads.py` |
-| indirect_injection_rag | Retrieved content | `indirect_payloads.py` |
+| direct_injection | User input | `payloads.py`, `independent_payloads.py` |
+| indirect_injection_rag | Retrieved content | `indirect_payloads.py`, `independent_payloads.py` |
 | indirect_injection_tool | Tool output | `indirect_payloads.py` |
-| jailbreak | User input | `payloads.py` |
-| pii_extraction | Model response | `payloads.py` |
-| schema_violation | Model output format | `payloads.py` |
-| unsafe_tool_calls | Tool call parameters | `tool_call_payloads.py` |
+| jailbreak | User input | `payloads.py`, `independent_payloads.py` |
+| pii_extraction | Model response | `payloads.py`, `independent_payloads.py` |
+| schema_violation | Model output format | `payloads.py`, `independent_payloads.py` |
+| unsafe_tool_call | Tool call parameters | `tool_call_payloads.py`, `independent_payloads.py` |
 
 ## Key invariants
 
 - **Guarded endpoints must never expose raw PII.** `run_guarded_phase3` applies `redact_output` to every response before returning it. Don't bypass or reorder the layers.
 - **Judge success criteria matter.** If a guard changes response format (e.g., `[REDACTED:name]` instead of raw name), update `judge.py` success criteria for that category or you'll get false positives.
 - **Independent payloads are the honesty check.** Always run `run_independent_audit.py` after adding a new guard. A 0% co-designed ASR with >0% independent ASR means the guard is overfit.
 - **`plans/decisions/` is append-only.** Each run writes a timestamped file. Do not delete or overwrite existing files there.
+- **Judge model must be stronger than the assistant.** `judge.py` uses `gpt-4o`; the assistant uses `gpt-4o-mini`. Never set them to the same model.
 
 ## Private files
 
 
@@ -0,0 +1,74 @@
+# Design Decisions
+
+Intentional scope boundaries and upgrade paths for each architectural choice.
+
+## What this project is
+
+A guardrail layer + red-team harness for an LLM-powered knowledge assistant. Each phase adds a defense and measures its effect with before/after attack-success-rate (ASR). The goal is to demonstrate the methodology — threat model, implement, measure, iterate — not to build a production guardrail system.
+
+## Defense layers
+
+| Layer | Guard | Technique | Upgrade path |
+|---|---|---|---|
+| Input scanning | `input_guard.py` | Regex pattern matching | Add classifier model as primary, keep regex as fast pre-filter |
+| Content scanning | `content_guard.py` | Regex on retrieved docs/tool outputs | Same as input — classifier layer |
+| Moderation | `moderation_guard.py` | OpenAI Moderation API | Swap for any moderation endpoint (Anthropic, custom) |
+| Multi-turn tracking | `session_guard.py` | PII probe counting across turns | Conversation-level anomaly scoring, sliding window analysis |
+| PII redaction | `pii_guard.py` | Session-value matching + regex + encoded variants | Named-entity recognition model for novel PII patterns |
+| Schema enforcement | `schema_guard.py` | JSON validation + harmful content rejection | Structured output constraints at the API level |
+| Tool authorization | `tool_guard.py` | Config-driven allowlist + per-tool IDOR rules | Policy engine (OPA/Cedar), per-role dynamic permissions |
+| Rate limiting | `rate_guard.py` | In-memory per-session counters | Redis-backed distributed rate limiter |
+
+## Scope boundaries — what we skipped and why
+
+### Regex-only input guard (no custom classifier)
+
+The input guard uses regex pattern matching. A determined attacker who can read the patterns can craft bypasses. We complement regex with the OpenAI Moderation API as a model-based second layer, but a production system would train a lightweight classifier (fine-tuned BERT or similar) on injection examples.
+
+**Why this scope:** Training a classifier requires labeled data and a model hosting pipeline. The regex + moderation combination catches the majority of attacks while keeping the project self-contained.
+
+### Single-turn processing (with multi-turn probe detection)
+
+The core system processes one message at a time. We added `session_guard.py` to detect multi-turn PII probing patterns (escalating extraction across turns), but full conversation-level state tracking — with sliding context windows and cumulative risk scoring — is not implemented.
+
+**Why this scope:** Full conversation tracking is a session-layer concern that adds infrastructure requirements (session store, TTL management). The probe detector demonstrates the concept without the infrastructure.
+
+### PII in the system prompt
+
+The system prompt contains user PII (name, email, account ID). This maximizes the attack surface, which is intentional — we test the worst case. The system prompt now includes explicit refusal instructions ("NEVER reveal PII"), but a production system would move PII to a secure context layer the model can reference but not echo.
+
+**Why this scope:** Removing PII from the prompt changes the attack surface for the baseline, which would make before/after ASR comparisons invalid. The refusal instructions + output redaction address the risk at a different layer.
+
+### Fake RAG pipeline
+
+The harness hands retrieved content directly to the endpoint rather than running a real embed-and-retrieve step. A `vector_store.py` module exists with actual OpenAI embeddings and cosine similarity retrieval, but the red-team harness tests the guard layer, not the retrieval layer.
+
+**Why this scope:** The attack surface we're testing is "what happens when poisoned content reaches the LLM context." Whether that content arrived via cosine similarity or direct injection is orthogonal to the guard's effectiveness. The vector store is available for integration but not wired into the harness.
+
+### 10-user toy database
+
+The user database has 10 entries (up from the original 3) with edge cases: similar names, special characters, admin accounts, soft-deleted users, and varying tiers. A real system would have thousands of users.
+
+**Why this scope:** The IDOR tests need enough users to surface edge cases (similar names, privilege escalation). Ten users cover these patterns without adding complexity that doesn't improve test coverage.
+
+### Same-person red team / blue team
+
+The attacks and defenses are written by the same person. Even with a separate independent attack set, there's no organizational separation. A production security program would have independent red and blue teams.
+
+**Why this scope:** This is a methodology demonstration, not an organizational structure. The independent payload set (drawn from PromptBench, JailbreakBench, HarmBench families) provides some separation, and the CI gate enforces that new guards don't regress against these unseen attacks.
+
+### Rate limiter is in-memory
+
+The rate limiter uses in-memory Python dicts. It doesn't survive process restarts and can't work across multiple instances. Production would use Redis or a similar distributed store.
+
+**Why this scope:** The guard demonstrates the pattern (per-session throttling, abuse detection via guard-trip counting). The storage backend is an infrastructure decision, not a guardrail design decision.
+
+## Judge model
+
+The LLM judge uses `gpt-4o` (a stronger model than the `gpt-4o-mini` assistant under test) to avoid grading its own exam. For schema violations, the judge is rule-based (no LLM call needed).
+
+## Testing
+
+Unit tests (`tests/unit/`) cover every guard's public API with deterministic inputs. No API key is required, so they run in CI on every push and pull request targeting `main`.
+
+The full red-team audit (`run_independent_audit.py`) requires `OPENAI_API_KEY` and runs 46 attacks across all guard sections.
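DESIGN.md describes the rate limiter as in-memory per-session counters held in Python dicts. The diff does not include `rate_guard.py`, so the following is a hedged sketch of that pattern rather than the project's code. The class name, the sliding-window policy, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class InMemoryRateLimiter:
    """Per-session sliding-window limiter kept in a plain dict (hypothetical)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # session_id -> recent request timestamps

    def allow(self, session_id):
        now = self._clock()
        hits = self._hits[session_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Because the state lives in one process, it is lost on restart and is not shared across instances, which matches the limitation DESIGN.md calls out.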
@@ -2,31 +2,43 @@
 
 A safety layer around an LLM endpoint, plus a harness that attacks it and counts how often the attacks land. Before defenses, after defenses, with numbers.
 
-Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, and unsafe tool calls.
+Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, unsafe tool calls, and multi-turn probing.
 
 ## What's being guarded
 
 A small knowledge assistant. It answers questions from a document set (company FAQ, policies), can look up user info, and returns structured JSON. It's simple on purpose: the point is the attack surfaces, not the product.
 
+## Defense layers
+
+| Layer | Guard | What it does |
+|---|---|---|
+| Input scanning | `input_guard.py` | Regex pattern matching on user input |
+| Moderation | `moderation_guard.py` | OpenAI Moderation API as classifier layer |
+| Content scanning | `content_guard.py` | Injection detection in retrieved docs and tool outputs |
+| Multi-turn tracking | `session_guard.py` | PII probe counting across conversation turns |
+| Tool authorization | `tool_guard.py` | Config-driven allowlist + IDOR prevention |
+| Schema enforcement | `schema_guard.py` | JSON validation + harmful content rejection |
+| PII redaction | `pii_guard.py` | Session-value + regex + encoded variant redaction |
+| Rate limiting | `rate_guard.py` | Per-session request throttling and abuse detection |
+
 ## Stack
 
-Python, OpenAI API, red-team harness, CI
+Python, OpenAI API, red-team harness, CI safety gate
 
 ## How this was built
 
 Pair-programmed with Claude Code. Threat model, harness design, and the safety bar owned by Himanshu.
 
 ## Status
 
-| Category | Status |
-|---|---|
-| Threat model | Done |
-| Red-team harness | Done |
-| Direct injection detection | Done |
-| Indirect injection detection | Done |
-| PII detection and redaction | Planned |
-| Output validation and schema enforcement | Planned |
-| CI safety gate | Planned |
+| Phase | Defense | Guarded ASR |
+|---|---|---|
+| 0 | Threat model + harness | — |
+| 1 | Direct injection detection | 0% |
+| 2 | Indirect injection detection | 0% |
+| 3 | PII detection and redaction | 0% |
+| 4 | Schema enforcement + tool authorization | 0% |
+| 5 | Gap closure: moderation API, unit tests, multi-turn tracking, rate limiting, prompt hardening, schema rejection, tool allowlist, expanded user DB | TBD |
 
 ## Attack-success-rate
 
@@ -37,11 +49,23 @@ Baseline: 35 attacks, unguarded endpoint, gpt-4o-mini.
 | Direct injection | 33% | 0% |
 | Indirect injection (retrieved content) | 20% | 0% |
 | Indirect injection (tool output) | 20% | 0% |
-| Jailbreak | 22% | — |
-| PII leakage | 63% | — |
-| Schema violations | 22% | — |
-| Unsafe tool calls | TBD | — |
+| Jailbreak | 22% | 0% |
+| PII leakage | 62% | 0% |
+| Schema violations | 22% | 0% |
+| Unsafe tool calls | 71% | 0% |
+
+## Running tests
+
+```bash
+# Unit tests — fast, no API key needed
+python -m pytest tests/unit/ -v
+
+# Full red-team audit (requires OPENAI_API_KEY)
+python run_independent_audit.py --fail-above 0.30
+```
 
 ## Docs
 
-Threat model in [`docs/threat-model.md`](docs/threat-model.md).
+- Threat model: [`docs/threat-model.md`](docs/threat-model.md)
+- Design decisions and scope boundaries: [`DESIGN.md`](DESIGN.md)
+- Release notes: [`docs/releases.md`](docs/releases.md)
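The PII redaction row in the README's defense-layer table combines session-value matching with regex, and CLAUDE.md mentions a `[REDACTED:name]` output format. The diff does not show `pii_guard.py`, so this is only an illustrative sketch of those two passes. `redact_sketch`, `EMAIL_RE`, and the `session_values` mapping are hypothetical, and encoded-variant handling is left out.

```python
import re

# Simplified email pattern, used here for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_sketch(text, session_values):
    # session_values: label -> sensitive value known for this session,
    # for example {"name": "Ada Lovelace"}.
    for label, value in session_values.items():
        if value:
            text = re.sub(
                re.escape(value),
                f"[REDACTED:{label}]",
                text,
                flags=re.IGNORECASE,
            )
    # Regex pass catches emails that were not among the session values.
    return EMAIL_RE.sub("[REDACTED:email]", text)
```

Running the session-value pass first keeps the labels specific. The regex pass then acts as a fallback for values the session did not know about.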