You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
-**Guarded endpoints must never expose raw PII.**`run_guarded_phase3` applies `redact_output` to every response before returning it. Don't bypass or reorder the layers.
90
98
-**Judge success criteria matter.** If a guard changes response format (e.g., `[REDACTED:name]` instead of raw name), update `judge.py` success criteria for that category or you'll get false positives.
91
99
-**Independent payloads are the honesty check.** Always run `run_independent_audit.py` after adding a new guard. A 0% co-designed ASR with >0% independent ASR means the guard is overfit.
92
100
-**`plans/decisions/` is append-only.** Each run writes a timestamped file. Do not delete or overwrite existing files there.
101
+
-**Judge model must be stronger than the assistant.**`judge.py` uses `gpt-4o`; the assistant uses `gpt-4o-mini`. Never set them to the same model.
Intentional scope boundaries and upgrade paths for each architectural choice.
4
+
5
+
## What this project is
6
+
7
+
A guardrail layer + red-team harness for an LLM-powered knowledge assistant. Each phase adds a defense and measures its effect with before/after attack-success-rate (ASR). The goal is to demonstrate the methodology — threat model, implement, measure, iterate — not to build a production guardrail system.
8
+
9
+
## Defense layers
10
+
11
+
| Layer | Guard | Technique | Upgrade path |
12
+
|---|---|---|---|
13
+
| Input scanning |`input_guard.py`| Regex pattern matching | Add classifier model as primary, keep regex as fast pre-filter |
14
+
| Content scanning |`content_guard.py`| Regex on retrieved docs/tool outputs | Same as input — classifier layer |
15
+
| Moderation |`moderation_guard.py`| OpenAI Moderation API | Swap for any moderation endpoint (Anthropic, custom) |
The input guard uses regex pattern matching. A determined attacker who can read the patterns can craft bypasses. We complement regex with the OpenAI Moderation API as a model-based second layer, but a production system would train a lightweight classifier (fine-tuned BERT or similar) on injection examples.
27
+
28
+
**Why this scope:** Training a classifier requires labeled data and a model hosting pipeline. The regex + moderation combination catches the majority of attacks while keeping the project self-contained.
The core system processes one message at a time. We added `session_guard.py` to detect multi-turn PII probing patterns (escalating extraction across turns), but full conversation-level state tracking — with sliding context windows and cumulative risk scoring — is not implemented.
33
+
34
+
**Why this scope:** Full conversation tracking is a session-layer concern that adds infrastructure requirements (session store, TTL management). The probe detector demonstrates the concept without the infrastructure.
35
+
36
+
### PII in the system prompt
37
+
38
+
The system prompt contains user PII (name, email, account ID). This maximizes the attack surface, which is intentional — we test the worst case. The system prompt now includes explicit refusal instructions ("NEVER reveal PII"), but a production system would move PII to a secure context layer the model can reference but not echo.
39
+
40
+
**Why this scope:** Removing PII from the prompt changes the attack surface for the baseline, which would make before/after ASR comparisons invalid. The refusal instructions + output redaction address the risk at a different layer.
41
+
42
+
### Fake RAG pipeline
43
+
44
+
The harness hands retrieved content directly to the endpoint rather than running a real embed-and-retrieve step. A `vector_store.py` module exists with actual OpenAI embeddings and cosine similarity retrieval, but the red-team harness tests the guard layer, not the retrieval layer.
45
+
46
+
**Why this scope:** The attack surface we're testing is "what happens when poisoned content reaches the LLM context." Whether that content arrived via cosine similarity or direct injection is orthogonal to the guard's effectiveness. The vector store is available for integration but not wired into the harness.
47
+
48
+
### 10-user toy database
49
+
50
+
The user database has 10 entries (up from the original 3) with edge cases: similar names, special characters, admin accounts, soft-deleted users, and varying tiers. A real system would have thousands of users.
51
+
52
+
**Why this scope:** The IDOR tests need enough users to surface edge cases (similar names, privilege escalation). 10 users covers these patterns without adding complexity that doesn't improve test coverage.
53
+
54
+
### Same-person red team / blue team
55
+
56
+
The attacks and defenses are written by the same person. Even with a separate independent attack set, there's no organizational separation. A production security program would have independent red and blue teams.
57
+
58
+
**Why this scope:** This is a methodology demonstration, not an organizational structure. The independent payload set (drawn from PromptBench, JailbreakBench, HarmBench families) provides some separation, and the CI gate enforces that new guards don't regress against these unseen attacks.
59
+
60
+
### Rate limiter is in-memory
61
+
62
+
The rate limiter uses in-memory Python dicts. It doesn't survive process restarts and can't work across multiple instances. Production would use Redis or a similar distributed store.
63
+
64
+
**Why this scope:** The guard demonstrates the pattern (per-session throttling, abuse detection via guard-trip counting). The storage backend is an infrastructure decision, not a guardrail design decision.
65
+
66
+
## Judge model
67
+
68
+
The LLM judge uses `gpt-4o` (a stronger model than the `gpt-4o-mini` assistant under test) to avoid grading its own exam. For schema violations, the judge is rule-based (no LLM call needed).
69
+
70
+
## Testing
71
+
72
+
Unit tests (`tests/unit/`) cover every guard's public API with deterministic inputs. No API key required — they run in CI on every PR.
73
+
74
+
The full red-team audit (`run_independent_audit.py`) requires `OPENAI_API_KEY` and runs 46 attacks across all guard sections.
Copy file name to clipboardExpand all lines: README.md
+40-16Lines changed: 40 additions & 16 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -2,31 +2,43 @@
2
2
3
3
A safety layer around an LLM endpoint, plus a harness that attacks it and counts how often the attacks land. Before defenses, after defenses, with numbers.
4
4
5
-
Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, and unsafe tool calls.
5
+
Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, unsafe tool calls, and multi-turn probing.
6
6
7
7
## What's being guarded
8
8
9
9
A small knowledge assistant. It answers questions from a document set (company FAQ, policies), can look up user info, and returns structured JSON. It's simple on purpose: the point is the attack surfaces, not the product.
10
10
11
+
## Defense layers
12
+
13
+
| Layer | Guard | What it does |
14
+
|---|---|---|
15
+
| Input scanning |`input_guard.py`| Regex pattern matching on user input |
16
+
| Moderation |`moderation_guard.py`| OpenAI Moderation API as classifier layer |
17
+
| Content scanning |`content_guard.py`| Injection detection in retrieved docs and tool outputs |
0 commit comments