Guardrails & Red-Team Harness

A safety layer around an LLM endpoint, plus a harness that attacks it and counts how often the attacks land. Before defenses, after defenses, with numbers.

Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, unsafe tool calls, and multi-turn probing.

What's being guarded

A small knowledge assistant. It answers questions from a document set (company FAQ, policies), can look up user info, and returns structured JSON. It's simple on purpose: the point is the attack surfaces, not the product.

Defense layers

Layer	Guard	What it does
Input scanning	`input_guard.py`	Regex pattern matching on user input
Moderation	`moderation_guard.py`	OpenAI Moderation API as classifier layer
Content scanning	`content_guard.py`	Injection detection in retrieved docs and tool outputs
Multi-turn tracking	`session_guard.py`	PII probe counting across conversation turns
Tool authorization	`tool_guard.py`	Config-driven allowlist + IDOR prevention
Schema enforcement	`schema_guard.py`	JSON validation + harmful content rejection
PII redaction	`pii_guard.py`	Session-value + regex + encoded variant redaction
Rate limiting	`rate_guard.py`	Per-session request throttling and abuse detection

Stack

Python, OpenAI API, red-team harness, CI safety gate

How this was built

Pair-programmed with Claude Code. Threat model, harness design, and the safety bar owned by Himanshu.

Status

Phase	Defense	Guarded ASR
0	Threat model + harness	—
1	Direct injection detection	0%
2	Indirect injection detection	0%
3	PII detection and redaction	0%
4	Schema enforcement + tool authorization	0%
5	Gap closure: moderation API, unit tests, multi-turn tracking, rate limiting, prompt hardening, schema rejection, tool allowlist, expanded user DB	TBD

Attack-success-rate

Baseline: 35 attacks, unguarded endpoint, gpt-4o-mini.

Category	Unguarded	Guarded
Direct injection	33%	0%
Indirect injection (retrieved content)	20%	0%
Indirect injection (tool output)	20%	0%
Jailbreak	22%	0%
PII leakage	62%	0%
Schema violations	22%	0%
Unsafe tool calls	71%	0%

Running tests

# Unit tests — fast, no API key needed
python -m pytest tests/unit/ -v

# Full red-team audit (requires OPENAI_API_KEY)
python run_independent_audit.py --fail-above 0.30

Docs

Threat model: docs/threat-model.md
Design decisions and scope boundaries: DESIGN.md
Release notes: docs/releases.md

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
docs		docs
harness		harness
plans/decisions		plans/decisions
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
DESIGN.md		DESIGN.md
README.md		README.md
requirements.txt		requirements.txt
run_baseline.py		run_baseline.py
run_independent_audit.py		run_independent_audit.py
run_phase1.py		run_phase1.py
run_phase2.py		run_phase2.py
run_phase3.py		run_phase3.py
run_phase4.py		run_phase4.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Guardrails & Red-Team Harness

What's being guarded

Defense layers

Stack

How this was built

Status

Attack-success-rate

Running tests

Docs

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Guardrails & Red-Team Harness

What's being guarded

Defense layers

Stack

How this was built

Status

Attack-success-rate

Running tests

Docs

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages