Skip to content

himanshu-nocodeassistant/guardrails

Repository files navigation

Guardrails & Red-Team Harness

A safety layer around an LLM endpoint, plus a harness that attacks it and counts how often the attacks land. Before defenses, after defenses, with numbers.

Most guardrails projects stop at input validation. This one also tests indirect injection (instructions hiding in retrieved documents and tool outputs), PII leaking in responses, schema violations, unsafe tool calls, and multi-turn probing.

What's being guarded

A small knowledge assistant. It answers questions from a document set (company FAQ, policies), can look up user info, and returns structured JSON. It's simple on purpose: the point is the attack surfaces, not the product.

Defense layers

Layer Guard What it does
Input scanning input_guard.py Regex pattern matching on user input
Moderation moderation_guard.py OpenAI Moderation API as classifier layer
Content scanning content_guard.py Injection detection in retrieved docs and tool outputs
Multi-turn tracking session_guard.py PII probe counting across conversation turns
Tool authorization tool_guard.py Config-driven allowlist + IDOR prevention
Schema enforcement schema_guard.py JSON validation + harmful content rejection
PII redaction pii_guard.py Session-value + regex + encoded variant redaction
Rate limiting rate_guard.py Per-session request throttling and abuse detection

Stack

Python, OpenAI API, red-team harness, CI safety gate

How this was built

Pair-programmed with Claude Code. Threat model, harness design, and the safety bar owned by Himanshu.

Status

Phase Defense Guarded ASR
0 Threat model + harness
1 Direct injection detection 0%
2 Indirect injection detection 0%
3 PII detection and redaction 0%
4 Schema enforcement + tool authorization 0%
5 Gap closure: moderation API, unit tests, multi-turn tracking, rate limiting, prompt hardening, schema rejection, tool allowlist, expanded user DB TBD

Attack-success-rate

Baseline: 35 attacks, unguarded endpoint, gpt-4o-mini.

Category Unguarded Guarded
Direct injection 33% 0%
Indirect injection (retrieved content) 20% 0%
Indirect injection (tool output) 20% 0%
Jailbreak 22% 0%
PII leakage 62% 0%
Schema violations 22% 0%
Unsafe tool calls 71% 0%

Running tests

# Unit tests — fast, no API key needed
python -m pytest tests/unit/ -v

# Full red-team audit (requires OPENAI_API_KEY)
python run_independent_audit.py --fail-above 0.30

Docs

About

Safety layer around an LLM endpoint with a red-team harness that measures attack-success-rate before and after each defense

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages