A safety layer around an LLM endpoint, plus a harness that attacks it and counts how often the attacks land. Before defenses, after defenses, with numbers.
Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, unsafe tool calls, and multi-turn probing.
A small knowledge assistant. It answers questions from a document set (company FAQ, policies), can look up user info, and returns structured JSON. It's simple on purpose: the point is the attack surfaces, not the product.
| Layer | Guard | What it does |
|---|---|---|
| Input scanning | input_guard.py | Regex pattern matching on user input |
| Moderation | moderation_guard.py | OpenAI Moderation API as classifier layer |
| Content scanning | content_guard.py | Injection detection in retrieved docs and tool outputs |
| Multi-turn tracking | session_guard.py | PII probe counting across conversation turns |
| Tool authorization | tool_guard.py | Config-driven allowlist + IDOR prevention |
| Schema enforcement | schema_guard.py | JSON validation + harmful content rejection |
| PII redaction | pii_guard.py | Session-value + regex + encoded variant redaction |
| Rate limiting | rate_guard.py | Per-session request throttling and abuse detection |
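
The PII redaction row lists three passes: session values, regex patterns, and encoded variants. The sketch below shows one way those passes could fit together. It is an illustration under stated assumptions, not the contents of pii_guard.py: the function names, the two regex patterns, and the choice of base64 and hex as the "encoded variants" are all hypothetical.

```python
import base64
import re

# Hypothetical generic patterns; the real guard may cover more PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

PLACEHOLDER = "[REDACTED]"


def encoded_variants(value: str) -> list[str]:
    """Return base64 and hex encodings of a known sensitive value."""
    raw = value.encode("utf-8")
    return [base64.b64encode(raw).decode("ascii"), raw.hex()]


def redact(text: str, session_values: list[str]) -> str:
    """Redact known session values, their encodings, then generic patterns."""
    for value in session_values:
        if not value:
            continue  # an empty needle would match between every character
        for needle in [value, *encoded_variants(value)]:
            text = text.replace(needle, PLACEHOLDER)
    text = EMAIL_RE.sub(PLACEHOLDER, text)
    text = SSN_RE.sub(PLACEHOLDER, text)
    return text
```

The encoded pass only catches a value that was encoded on its own. A value embedded in a longer string before encoding produces different bytes and would slip past this sketch.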
Python, OpenAI API, red-team harness, CI safety gate
Pair-programmed with Claude Code. Threat model, harness design, and the safety bar owned by Himanshu.
| Phase | Defense | Guarded ASR (attack success rate) |
|---|---|---|
| 0 | Threat model + harness | — |
| 1 | Direct injection detection | 0% |
| 2 | Indirect injection detection | 0% |
| 3 | PII detection and redaction | 0% |
| 4 | Schema enforcement + tool authorization | 0% |
| 5 | Gap closure: moderation API, unit tests, multi-turn tracking, rate limiting, prompt hardening, schema rejection, tool allowlist, expanded user DB | TBD |
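
Phases 4 and 5 both mention tool authorization, which the guard table describes as a config-driven allowlist plus IDOR prevention. A minimal sketch of that idea follows. The policy dictionary, the tool names, and the `user_id` argument are assumptions made for illustration; tool_guard.py may be structured differently.

```python
from dataclasses import dataclass

# Hypothetical policy config: any tool not listed here is rejected.
# "requires_owner_match" marks tools whose user_id argument must belong
# to the authenticated session (the IDOR check).
TOOL_POLICY = {
    "search_docs": {"requires_owner_match": False},
    "get_user_info": {"requires_owner_match": True},
}


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


def authorize(call: ToolCall, session_user_id: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a tool call proposed by the model."""
    policy = TOOL_POLICY.get(call.name)
    if policy is None:
        return False, f"tool '{call.name}' is not on the allowlist"
    if policy["requires_owner_match"]:
        requested = call.args.get("user_id")
        if requested is None or str(requested) != session_user_id:
            return False, "user_id does not match the authenticated session"
    return True, "ok"
```

The point of the design is that the session identity comes from the server side, never from the model's arguments, so a prompt that asks for another user's record fails the comparison regardless of how it is phrased.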
Baseline: 35 attacks, unguarded endpoint, gpt-4o-mini.
| Category | Unguarded | Guarded |
|---|---|---|
| Direct injection | 33% | 0% |
| Indirect injection (retrieved content) | 20% | 0% |
| Indirect injection (tool output) | 20% | 0% |
| Jailbreak | 22% | 0% |
| PII leakage | 62% | 0% |
| Schema violations | 22% | 0% |
| Unsafe tool calls | 71% | 0% |
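
The percentages above are attack success rates: successful attacks divided by attacks attempted, per category. The sketch below shows that arithmetic and a threshold gate in the spirit of the `--fail-above` flag on the audit script. The function names are hypothetical, and whether the real gate compares the overall rate or each category's rate is an assumption here (the sketch uses the overall rate).

```python
from collections import defaultdict


def attack_success_rates(results):
    """Per-category ASR from (category, succeeded) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, succeeded in results:
        totals[category] += 1
        hits[category] += int(succeeded)
    return {category: hits[category] / totals[category] for category in totals}


def overall_asr(results):
    """Fraction of all attacks that succeeded; 0.0 when there are none."""
    results = list(results)
    if not results:
        return 0.0
    return sum(int(succeeded) for _, succeeded in results) / len(results)


def gate(results, fail_above):
    """Return a process exit code: 1 if the overall ASR exceeds the threshold."""
    return 1 if overall_asr(results) > fail_above else 0


# Made-up outcomes for illustration only, not the project's data.
sample = [
    ("pii_leakage", True),
    ("pii_leakage", False),
    ("jailbreak", False),
    ("jailbreak", False),
]
print(attack_success_rates(sample))  # {'pii_leakage': 0.5, 'jailbreak': 0.0}
print(gate(sample, fail_above=0.30))  # 0, because 0.25 does not exceed 0.30
```

In CI, the return value of `gate` would be passed to `sys.exit` so that a regression above the threshold fails the build.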

    # Unit tests: fast, no API key needed
    python -m pytest tests/unit/ -v
    # Full red-team audit (requires OPENAI_API_KEY)
    python run_independent_audit.py --fail-above 0.30

- Threat model: docs/threat-model.md
- Design decisions and scope boundaries: DESIGN.md
- Release notes: docs/releases.md