Awesome Jailbreak Guardrails for Large Models

Introduction

This repository collects research papers, articles, and resources on jailbreak guardrails for Large Models, i.e., large language models (LLMs), multimodal large language models (MLLMs), and AI agents. Jailbreak guardrails are defense techniques that detect and filter jailbreak attempts and other unauthorized or harmful behavior, helping these systems operate safely and ethically.
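As a rough illustration of the detect-and-filter idea, a guardrail typically wraps a model call with input-side and output-side safety checks. The sketch below is not drawn from any specific paper in this list; the `generate` and `is_harmful` callables are hypothetical placeholders for an underlying model call and a safety detector (e.g., a trained classifier or a moderation API).

```python
# Minimal sketch of a jailbreak guardrail pipeline (illustrative only).
# `generate` and `is_harmful` are hypothetical placeholders, not a real API.
from typing import Callable


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],      # underlying LLM call
    is_harmful: Callable[[str], bool],   # input/output safety detector
    refusal: str = "Sorry, I can't help with that.",
) -> str:
    """Screen the prompt before generation and the response after it."""
    # Input guardrail: block jailbreak attempts before they reach the model.
    if is_harmful(prompt):
        return refusal
    response = generate(prompt)
    # Output guardrail: filter unsafe content the model still produced.
    if is_harmful(response):
        return refusal
    return response
```

Many of the guardrails surveyed below refine one or both of these stages, for example with stronger detectors, streaming checks, or agent-level policy enforcement.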

Survey Papers

LLMs' Jailbreak Guardrails

MLLMs' Jailbreak Guardrails

Agents' Jailbreak Guardrails

Benchmarks/Datasets

Acknowledgement