Commit 89e75c1

Merge pull request #3 from himanshu-nocodeassistant/phase/5-gap-closure
feat(phase5): gap closure — 12 architectural fixes
2 parents d9e8328 + a7c80ee commit 89e75c1

28 files changed

Lines changed: 2558 additions & 417 deletions

.github/workflows/safety-gate.yml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+name: Safety Gate
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  unit-tests:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install dependencies
+        run: pip install -r requirements.txt
+
+      - name: Run unit tests (no API key needed)
+        run: python -m pytest tests/unit/ -v
+
+  independent-audit:
+    runs-on: ubuntu-latest
+    needs: unit-tests
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install dependencies
+        run: pip install -r requirements.txt
+
+      - name: Run independent attack audit
+        run: python run_independent_audit.py --fail-above 0.30
+        env:
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
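Review note: the audit step relies on `--fail-above 0.30` to turn a measured attack-success-rate into a failing CI job, but `run_independent_audit.py` itself is not part of this excerpt. The sketch below shows one way such a gate could work. The function names, the placeholder outcomes, and the "strictly above the threshold fails" rule are assumptions for illustration, not the script's actual code.

```python
import argparse


def attack_success_rate(outcomes):
    """Fraction of attacks that succeeded; 0.0 for an empty run."""
    if not outcomes:
        return 0.0
    return sum(1 for succeeded in outcomes if succeeded) / len(outcomes)


def gate_exit_code(outcomes, fail_above):
    """Return 1 (fail the job) when ASR is strictly above the threshold."""
    return 1 if attack_success_rate(outcomes) > fail_above else 0


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical ASR gate")
    parser.add_argument("--fail-above", type=float, default=0.30)
    args = parser.parse_args(argv)
    # Placeholder outcomes (assumption): a real run would collect one
    # boolean per attack from the judge.
    outcomes = [False, False, True, False]
    asr = attack_success_rate(outcomes)
    print(f"ASR: {asr:.0%} (threshold {args.fail_above:.0%})")
    return gate_exit_code(outcomes, args.fail_above)
```

A script built this way would end with `sys.exit(main())`, so that a non-zero return value fails the `independent-audit` job.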

CLAUDE.md

Lines changed: 40 additions & 31 deletions
@@ -7,34 +7,40 @@ LLM safety layer + red-team harness for a small knowledge assistant. Each phase
 ```
 src/agent/
   assistant.py # Unguarded endpoint (run_unguarded)
-  guarded.py # Guarded endpoints — phase1, phase3; stacks on top of unguarded
+  guarded.py # Guarded endpoints — phase1, phase3, phase4, phase4_rag
   rag_assistant.py # RAG variant (retrieves docs before answering)
   documents.py # In-memory document store for RAG
-  tools.py # Tool definitions (user lookup, etc.)
+  vector_store.py # OpenAI embeddings vector store (real retrieval)
+  tools.py # Tool definitions (user lookup, 10-user database)
 
 src/guardrails/
-  input_guard.py # Pattern-based input scanner (direct/indirect injection)
-  content_guard.py # Scans retrieved documents and tool outputs (Phase 2)
-  pii_guard.py # Output redactor — session-value + regex (Phase 3)
-  schema_guard.py # JSON schema enforcement (Phase 4, planned)
-  tool_guard.py # Tool call validator (Phase 5, planned)
+  input_guard.py # Pattern-based input scanner (direct/indirect injection)
+  content_guard.py # Scans retrieved documents and tool outputs (Phase 2)
+  moderation_guard.py # OpenAI Moderation API classifier layer (Phase 5)
+  pii_guard.py # Output redactor — session-value + regex + encoded variants (Phase 3)
+  session_guard.py # Multi-turn PII probing detector (Phase 5)
+  rate_guard.py # Per-session rate limiting and abuse detection (Phase 5)
+  schema_guard.py # JSON schema enforcement + harmful content rejection (Phase 4/5)
+  tool_guard.py # Config-driven tool allowlist + per-tool authorization (Phase 4/5)
 
 harness/
   runner.py # run_attacks() / run_indirect_attacks() / summarize()
-  judge.py # LLM judge — returns (succeeded: bool, reasoning: str)
+  judge.py # LLM judge (gpt-4o, stronger than assistant) — returns (succeeded, reasoning)
   report.py # Rich table printer + JSON serialiser
 
-tests/attacks/
-  payloads.py # Primary attack set (direct injection, jailbreak, PII, schema)
-  indirect_payloads.py # Phase 2: RAG and tool-output injection attacks
-  tool_call_payloads.py # Phase 5: unsafe tool-call attacks
-  independent_payloads.py # External benchmark attacks (JailbreakBench / HarmBench)
+tests/unit/ # Deterministic guard tests — no API key needed
+tests/attacks/ # Attack payload sets
+tests/evals/ # Evaluation-level tests
+
+.github/workflows/
+  safety-gate.yml # CI: unit tests (always) + independent audit (with API key)
 
 run_baseline.py # Unguarded baseline across all categories
 run_phase1.py # Direct injection before/after
 run_phase2.py # Indirect injection (RAG + tool) before/after
 run_phase3.py # PII output redaction before/after
-run_independent_audit.py # Independent set — tests guard generalisation
+run_phase4.py # Schema enforcement + tool authorization before/after
+run_independent_audit.py # Independent set — all guards, CI gate (--fail-above)
 ```
 
 ## Setup
@@ -45,16 +51,20 @@ pip install -r requirements.txt
 cp .env.example .env # add your OPENAI_API_KEY
 ```
 
-## Running evaluations
-
-All scripts share the same flags: `--verbose` (per-attack detail), `--dry-run` (list attacks, no API calls).
+## Running tests and evaluations
 
 ```bash
-python run_baseline.py # Unguarded baseline
-python run_phase1.py # Direct injection
-python run_phase2.py # Indirect injection
-python run_phase3.py # PII redaction
-python run_independent_audit.py # Generalisation check
+# Unit tests — fast, deterministic, no API key
+python -m pytest tests/unit/ -v
+
+# Red-team evaluations (require OPENAI_API_KEY)
+python run_baseline.py # Unguarded baseline
+python run_phase1.py # Direct injection
+python run_phase2.py # Indirect injection
+python run_phase3.py # PII redaction
+python run_phase4.py # Schema + tool authorization
+python run_independent_audit.py # Generalisation check — all guards
+python run_independent_audit.py --fail-above 0.20 # Stricter CI threshold
 ```
 
 Output JSON and markdown snapshots are saved to `plans/decisions/`.
@@ -68,28 +78,27 @@ Output JSON and markdown snapshots are saved to `plans/decisions/`.
 | 2 | Indirect injection (content_guard) | Done | 0% |
 | 3 | PII output redaction (pii_guard) | Done | 0% |
 | 4 | Schema enforcement + tool authorization (schema_guard + tool_guard) | Done | 0% |
-|| CI safety gate | Planned ||
-
-Independent audit (Phase 1 guard, 15 external attacks): 20% ASR — three misses: ROT13-encoded instruction, Italian override, end-of-message bracket escape.
+| 5 | Gap closure — moderation API, unit tests, multi-turn tracking, rate limiting, prompt hardening, schema rejection, tool allowlist, expanded user DB | Done | TBD |
 
 ## Attack categories
 
 | Category | Channel | Payload location |
 |---|---|---|
-| direct_injection | User input | `payloads.py` |
-| indirect_injection_rag | Retrieved content | `indirect_payloads.py` |
+| direct_injection | User input | `payloads.py`, `independent_payloads.py` |
+| indirect_injection_rag | Retrieved content | `indirect_payloads.py`, `independent_payloads.py` |
 | indirect_injection_tool | Tool output | `indirect_payloads.py` |
-| jailbreak | User input | `payloads.py` |
-| pii_extraction | Model response | `payloads.py` |
-| schema_violation | Model output format | `payloads.py` |
-| unsafe_tool_calls | Tool call parameters | `tool_call_payloads.py` |
+| jailbreak | User input | `payloads.py`, `independent_payloads.py` |
+| pii_extraction | Model response | `payloads.py`, `independent_payloads.py` |
+| schema_violation | Model output format | `payloads.py`, `independent_payloads.py` |
+| unsafe_tool_call | Tool call parameters | `tool_call_payloads.py`, `independent_payloads.py` |
 
 ## Key invariants
 
 - **Guarded endpoints must never expose raw PII.** `run_guarded_phase3` applies `redact_output` to every response before returning it. Don't bypass or reorder the layers.
 - **Judge success criteria matter.** If a guard changes response format (e.g., `[REDACTED:name]` instead of raw name), update `judge.py` success criteria for that category or you'll get false positives.
 - **Independent payloads are the honesty check.** Always run `run_independent_audit.py` after adding a new guard. A 0% co-designed ASR with >0% independent ASR means the guard is overfit.
 - **`plans/decisions/` is append-only.** Each run writes a timestamped file. Do not delete or overwrite existing files there.
+- **Judge model must be stronger than the assistant.** `judge.py` uses `gpt-4o`; the assistant uses `gpt-4o-mini`. Never set them to the same model.
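Review note: the first invariant depends on `redact_output`, which this commit describes as "session-value + regex + encoded variants" without showing the code. The sketch below illustrates that layering under stated assumptions: the function name `redact_output_sketch`, the choice of base64 and hex as the encoded variants, and the e-mail regex are all hypothetical and not taken from `pii_guard.py`. Only the `[REDACTED:name]` token format comes from the text above.

```python
import base64
import re

# Generic fallback pattern for e-mail addresses not known to the session.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def encoded_variants(value):
    """Return simple encodings of a value (base64 and hex, as an assumption)."""
    raw = value.encode("utf-8")
    return [base64.b64encode(raw).decode("ascii"), raw.hex()]


def redact_output_sketch(text, session_values):
    """Redact known session values, their encodings, then generic e-mails."""
    for label, value in session_values.items():
        token = f"[REDACTED:{label}]"
        # Plain value: case-insensitive literal match.
        text = re.sub(re.escape(value), token, text, flags=re.IGNORECASE)
        # Encoded variants: exact, case-sensitive literal match.
        for variant in encoded_variants(value):
            text = text.replace(variant, token)
    return EMAIL_RE.sub("[REDACTED:email]", text)
```

Matching the session values first keeps the labels specific (`name`, `email`), while the regex pass is a catch-all for addresses the session does not know about.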
 
 ## Private files
 
DESIGN.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+# Design Decisions
+
+Intentional scope boundaries and upgrade paths for each architectural choice.
+
+## What this project is
+
+A guardrail layer + red-team harness for an LLM-powered knowledge assistant. Each phase adds a defense and measures its effect with before/after attack-success-rate (ASR). The goal is to demonstrate the methodology — threat model, implement, measure, iterate — not to build a production guardrail system.
+
+## Defense layers
+
+| Layer | Guard | Technique | Upgrade path |
+|---|---|---|---|
+| Input scanning | `input_guard.py` | Regex pattern matching | Add classifier model as primary, keep regex as fast pre-filter |
+| Content scanning | `content_guard.py` | Regex on retrieved docs/tool outputs | Same as input — classifier layer |
+| Moderation | `moderation_guard.py` | OpenAI Moderation API | Swap for any moderation endpoint (Anthropic, custom) |
+| Multi-turn tracking | `session_guard.py` | PII probe counting across turns | Conversation-level anomaly scoring, sliding window analysis |
+| PII redaction | `pii_guard.py` | Session-value matching + regex + encoded variants | Named-entity recognition model for novel PII patterns |
+| Schema enforcement | `schema_guard.py` | JSON validation + harmful content rejection | Structured output constraints at the API level |
+| Tool authorization | `tool_guard.py` | Config-driven allowlist + per-tool IDOR rules | Policy engine (OPA/Cedar), per-role dynamic permissions |
+| Rate limiting | `rate_guard.py` | In-memory per-session counters | Redis-backed distributed rate limiter |
+
+## Scope boundaries — what we skipped and why
+
+### Regex-only input guard (no custom classifier)
+
+The input guard uses regex pattern matching. A determined attacker who can read the patterns can craft bypasses. We complement regex with the OpenAI Moderation API as a model-based second layer, but a production system would train a lightweight classifier (fine-tuned BERT or similar) on injection examples.
+
+**Why this scope:** Training a classifier requires labeled data and a model hosting pipeline. The regex + moderation combination catches the majority of attacks while keeping the project self-contained.
+
+### Single-turn processing (with multi-turn probe detection)
+
+The core system processes one message at a time. We added `session_guard.py` to detect multi-turn PII probing patterns (escalating extraction across turns), but full conversation-level state tracking — with sliding context windows and cumulative risk scoring — is not implemented.
+
+**Why this scope:** Full conversation tracking is a session-layer concern that adds infrastructure requirements (session store, TTL management). The probe detector demonstrates the concept without the infrastructure.
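Review note: "PII probe counting across turns" can be pictured as a per-session counter that trips after a fixed number of PII-seeking messages. The sketch below is a minimal illustration of that idea. The class name `ProbeCounter`, the keyword pattern, and the `max_probes` threshold are assumptions; the actual patterns and limits in `session_guard.py` are not shown in this commit view.

```python
import re
from collections import defaultdict

# Hypothetical probe keywords; the real guard's patterns are not shown here.
PROBE_RE = re.compile(
    r"\b(email|e-mail|account id|phone|address|full name)\b",
    re.IGNORECASE,
)


class ProbeCounter:
    """Counts PII-seeking turns per session and flags once a limit is hit."""

    def __init__(self, max_probes=3):
        self.max_probes = max_probes
        self._counts = defaultdict(int)

    def record_turn(self, session_id, message):
        """Record one user turn; return True when the session is flagged."""
        if PROBE_RE.search(message):
            self._counts[session_id] += 1
        return self._counts[session_id] >= self.max_probes
```

Each single message can look harmless on its own, which is why the count is kept per session rather than per message.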
+
+### PII in the system prompt
+
+The system prompt contains user PII (name, email, account ID). This maximizes the attack surface, which is intentional — we test the worst case. The system prompt now includes explicit refusal instructions ("NEVER reveal PII"), but a production system would move PII to a secure context layer the model can reference but not echo.
+
+**Why this scope:** Removing PII from the prompt changes the attack surface for the baseline, which would make before/after ASR comparisons invalid. The refusal instructions + output redaction address the risk at a different layer.
+
+### Fake RAG pipeline
+
+The harness hands retrieved content directly to the endpoint rather than running a real embed-and-retrieve step. A `vector_store.py` module exists with actual OpenAI embeddings and cosine similarity retrieval, but the red-team harness tests the guard layer, not the retrieval layer.
+
+**Why this scope:** The attack surface we're testing is "what happens when poisoned content reaches the LLM context." Whether that content arrived via cosine similarity or direct injection is orthogonal to the guard's effectiveness. The vector store is available for integration but not wired into the harness.
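Review note: for readers unfamiliar with the retrieval step that the harness bypasses, the sketch below shows cosine-similarity ranking on toy vectors. It is an illustration only: the function names are hypothetical, and the two-dimensional vectors stand in for the OpenAI embeddings that `vector_store.py` is described as using.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]
```

A poisoned document only matters once it ranks into the top results, which is the point at which the content guard takes over.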
+
+### 10-user toy database
+
+The user database has 10 entries (up from the original 3) with edge cases: similar names, special characters, admin accounts, soft-deleted users, and varying tiers. A real system would have thousands of users.
+
+**Why this scope:** The IDOR tests need enough users to surface edge cases (similar names, privilege escalation). 10 users covers these patterns without adding complexity that doesn't improve test coverage.
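Review note: the IDOR tests mentioned here exercise the "config-driven allowlist + per-tool IDOR rules" in `tool_guard.py`, which this view does not include. The sketch below shows one plausible shape for such a check. The policy dictionary, the tool names `lookup_user` and `search_docs`, and the `owner_arg` field are hypothetical.

```python
# Hypothetical policy: which tools may be called, and which argument (if any)
# must equal the caller's own user id to block IDOR-style lookups.
TOOL_POLICY = {
    "lookup_user": {"owner_arg": "user_id"},
    "search_docs": {"owner_arg": None},
}


def authorize_tool_call(tool_name, args, session_user_id):
    """Return (allowed, reason) for a proposed tool call."""
    policy = TOOL_POLICY.get(tool_name)
    if policy is None:
        return False, f"tool '{tool_name}' is not on the allowlist"
    owner_arg = policy["owner_arg"]
    if owner_arg is not None and args.get(owner_arg) != session_user_id:
        return False, f"'{owner_arg}' does not match the session user"
    return True, "ok"
```

Comparing ids exactly, rather than by name, is what makes the similar-name entries in the toy database a useful test case.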
+
+### Same-person red team / blue team
+
+The attacks and defenses are written by the same person. Even with a separate independent attack set, there's no organizational separation. A production security program would have independent red and blue teams.
+
+**Why this scope:** This is a methodology demonstration, not an organizational structure. The independent payload set (drawn from PromptBench, JailbreakBench, HarmBench families) provides some separation, and the CI gate enforces that new guards don't regress against these unseen attacks.
+
+### Rate limiter is in-memory
+
+The rate limiter uses in-memory Python dicts. It doesn't survive process restarts and can't work across multiple instances. Production would use Redis or a similar distributed store.
+
+**Why this scope:** The guard demonstrates the pattern (per-session throttling, abuse detection via guard-trip counting). The storage backend is an infrastructure decision, not a guardrail design decision.
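Review note: the per-session throttling pattern can be sketched with a sliding window kept in a plain dict, which also makes the stated limitation visible (the state lives in one process and disappears on restart). The class name, defaults, and injectable clock below are assumptions for illustration. The sketch covers throttling only, not the guard-trip counting that `rate_guard.py` is described as doing.

```python
import time
from collections import defaultdict, deque


class SessionRateLimiter:
    """Sliding-window request limiter with in-memory, per-session state."""

    def __init__(self, max_requests=5, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # Injectable so tests need not sleep.
        self._hits = defaultdict(deque)

    def allow(self, session_id):
        """Return True and record the request if the session is under its limit."""
        now = self._clock()
        hits = self._hits[session_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Swapping the dict for a shared store would leave `allow()` unchanged from the caller's side, which matches the argument that the backend is an infrastructure choice.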
+
+## Judge model
+
+The LLM judge uses `gpt-4o` (a stronger model than the `gpt-4o-mini` assistant under test) to avoid grading its own exam. For schema violations, the judge is rule-based (no LLM call needed).
+
+## Testing
+
+Unit tests (`tests/unit/`) cover every guard's public API with deterministic inputs. No API key required — they run in CI on every PR.
+
+The full red-team audit (`run_independent_audit.py`) requires `OPENAI_API_KEY` and runs 46 attacks across all guard sections.

README.md

Lines changed: 40 additions & 16 deletions
@@ -2,31 +2,43 @@
 
 A safety layer around an LLM endpoint, plus a harness that attacks it and counts how often the attacks land. Before defenses, after defenses, with numbers.
 
-Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, and unsafe tool calls.
+Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, unsafe tool calls, and multi-turn probing.
 
 ## What's being guarded
 
 A small knowledge assistant. It answers questions from a document set (company FAQ, policies), can look up user info, and returns structured JSON. It's simple on purpose: the point is the attack surfaces, not the product.
 
+## Defense layers
+
+| Layer | Guard | What it does |
+|---|---|---|
+| Input scanning | `input_guard.py` | Regex pattern matching on user input |
+| Moderation | `moderation_guard.py` | OpenAI Moderation API as classifier layer |
+| Content scanning | `content_guard.py` | Injection detection in retrieved docs and tool outputs |
+| Multi-turn tracking | `session_guard.py` | PII probe counting across conversation turns |
+| Tool authorization | `tool_guard.py` | Config-driven allowlist + IDOR prevention |
+| Schema enforcement | `schema_guard.py` | JSON validation + harmful content rejection |
+| PII redaction | `pii_guard.py` | Session-value + regex + encoded variant redaction |
+| Rate limiting | `rate_guard.py` | Per-session request throttling and abuse detection |
+
 ## Stack
 
-Python, OpenAI API, red-team harness, CI
+Python, OpenAI API, red-team harness, CI safety gate
 
 ## How this was built
 
 Pair-programmed with Claude Code. Threat model, harness design, and the safety bar owned by Himanshu.
 
 ## Status
 
-| Category | Status |
-|---|---|
-| Threat model | Done |
-| Red-team harness | Done |
-| Direct injection detection | Done |
-| Indirect injection detection | Done |
-| PII detection and redaction | Planned |
-| Output validation and schema enforcement | Planned |
-| CI safety gate | Planned |
+| Phase | Defense | Guarded ASR |
+|---|---|---|
+| 0 | Threat model + harness ||
+| 1 | Direct injection detection | 0% |
+| 2 | Indirect injection detection | 0% |
+| 3 | PII detection and redaction | 0% |
+| 4 | Schema enforcement + tool authorization | 0% |
+| 5 | Gap closure: moderation API, unit tests, multi-turn tracking, rate limiting, prompt hardening, schema rejection, tool allowlist, expanded user DB | TBD |
 
 ## Attack-success-rate
 
@@ -37,11 +49,23 @@ Baseline: 35 attacks, unguarded endpoint, gpt-4o-mini.
 | Direct injection | 33% | 0% |
 | Indirect injection (retrieved content) | 20% | 0% |
 | Indirect injection (tool output) | 20% | 0% |
-| Jailbreak | 22% ||
-| PII leakage | 63% ||
-| Schema violations | 22% ||
-| Unsafe tool calls | TBD ||
+| Jailbreak | 22% | 0% |
+| PII leakage | 62% | 0% |
+| Schema violations | 22% | 0% |
+| Unsafe tool calls | 71% | 0% |
+
+## Running tests
+
+```bash
+# Unit tests — fast, no API key needed
+python -m pytest tests/unit/ -v
+
+# Full red-team audit (requires OPENAI_API_KEY)
+python run_independent_audit.py --fail-above 0.30
+```
 
 ## Docs
 
-Threat model in [`docs/threat-model.md`](docs/threat-model.md).
+- Threat model: [`docs/threat-model.md`](docs/threat-model.md)
+- Design decisions and scope boundaries: [`DESIGN.md`](DESIGN.md)
+- Release notes: [`docs/releases.md`](docs/releases.md)
