This repository collects research papers, articles, and resources on jailbreak guardrails for large models, i.e., large language models (LLMs), multimodal large language models (MLLMs), and AI agents. Jailbreak guardrails are techniques and strategies designed to detect and filter jailbreak attempts and other harmful inputs or outputs, helping these systems operate safely and ethically.
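As a toy illustration of the input-guardrail pattern that many of the papers below study, the sketch here screens a request before the model ever sees it. The regex blocklist is purely illustrative (a hypothetical stand-in for a learned classifier such as the LLM-based or embedding-based detectors listed below), and `guarded_generate` is an assumed wrapper name, not an API from any of these works:

```python
import re

# Toy stand-in for a learned safety classifier: flags a prompt if it matches
# any pattern from a small blocklist. Real guardrails replace this check with
# a trained model.
BLOCKLIST = [
    r"ignore (all\s+)?(previous|prior) instructions",
    r"pretend (you|that you) have no (safety|content) (rules|policy|policies)",
]

def is_flagged(prompt: str) -> bool:
    """Return True if the prompt trips any blocklist pattern."""
    lowered = prompt.lower()
    return any(re.search(pattern, lowered) for pattern in BLOCKLIST)

def guarded_generate(prompt: str, generate) -> str:
    """Call the model only when the input guardrail passes."""
    if is_flagged(prompt):
        return "Request refused by input guardrail."
    return generate(prompt)
```

In practice the same wrapper shape also appears on the output side: the model's response is passed through a second (output) classifier before being returned to the user.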
- Safeguarding Large Language Models: A Survey, Artificial Intelligence Review 2025
- Current state of LLM Risks and AI Guardrails, arXiv 2024
- JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models, arXiv 2024
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey, arXiv 2024
- SoK: The Security-Safety Continuum of Multimodal Foundation Models through Information Flow and Global Game-Theoretic Analysis of Asymmetric Threats, arXiv 2024
- A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations, arXiv 2025
- SoK: Evaluating Jailbreak Guardrails for Large Language Models, IEEE S&P 2026
- From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem, arXiv 2025
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers, KDD 2022
- A Holistic Approach to Undesired Content Detection in the Real World, AAAI 2023
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked, arXiv 2023
- Detecting Language Model Attacks with Perplexity, arXiv 2023
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models, arXiv 2023
- Certifying LLM Safety against Adversarial Prompting, COLM 2024
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks, arXiv 2023
- NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails, EMNLP 2023
- Llama Guard: LLM-Based Input-Output Safeguard for Human-AI Conversations, arXiv 2023
- GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis, ACL 2024
- Defending Large Language Models Against Jailbreak Attacks via Semantic Smoothing, arXiv 2024
- LLMGuard: Guarding against Unsafe LLM Behavior, arXiv 2024
- Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes, NeurIPS 2024
- AutoDefense: Multi-agent LLM Defense against Jailbreak Attacks, arXiv 2024
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content, ICML 2024
- Aegis: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts, arXiv 2024
- A Causal Explainable Guardrails for Large Language Models, CCS 2024
- Defending Large Language Models Against Attacks With Residual Stream Activation Analysis, CAMLIS 2024
- SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner, USENIX Security 2025
- WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, NeurIPS 2024
- R^2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning, ICLR 2025
- Prompt-Guard-86M, Hugging Face (22 July 2024)
- PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing, arXiv 2024
- ShieldGemma: Generative AI Content Moderation Based on Gemma, arXiv 2024
- Trust-Oriented Adaptive Guardrails for Large Language Models, arXiv 2024
- EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models, ICONIP 2025
- HSF: Defending against Jailbreak Attacks with Hidden State Filtering, WWW 2025
- MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks, AIES 2024
- Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, arXiv 2024
- Improved Large Language Model Jailbreak Detection via Pretrained Embeddings, arXiv 2024
- Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models, AAAI 2025
- AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails, arXiv 2025
- Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment, arXiv 2025
- GuardReasoner: Towards Reasoning-based LLM Safeguards, arXiv 2025
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming, arXiv 2025
- DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails, arXiv 2025
- JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation, USENIX Security 2025
- Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs, ACL Findings 2025
- CURVALID: Geometrically-Guided Adversarial Prompt Detection, arXiv 2025
- BingoGuard: LLM Content Moderation Tools with Risk Levels, ICLR 2025
- MirrorShield: Towards Universal Defense Against Jailbreaks via Entropy-Guided Mirror Crafting, arXiv 2025
- JailGuard: A Universal Detection Framework for Prompt-based Attacks on LLM Systems, TOSEM 2025
- PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages, COLM 2025
- X-Guard: Multilingual Guard Agent for Content Moderation, arXiv 2025
- JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift, arXiv 2025
- ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs, arXiv 2025
- RSafe: Incentivizing Proactive Reasoning to Build Robust and Adaptive LLM Safeguards, NeurIPS 2025
- Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors, arXiv 2025
- Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training, arXiv 2025
- OneShield -- the Next Generation of LLM Guardrails, arXiv 2025
- Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning, arXiv 2025
- LLM Jailbreak Detection for (Almost) Free!, arXiv 2025
- MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation, NeurIPS 2025
- Qwen3Guard Technical Report, arXiv 2025
- Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks, arXiv 2025
- gpt-oss-safeguard, Hugging Face (29 Oct 2025)
- SGuard-v1: Safety Guardrail for Large Language Models, arXiv 2025
- ExplainableGuard: Interpretable Adversarial Defense for Large Language Models Using Chain-of-Thought Reasoning, arXiv 2025
- Immunity Memory-Based Jailbreak Detection: Multi-Agent Adaptive Guard for Large Language Models, arXiv 2025
- Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline, arXiv 2025
- YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models, arXiv 2026
- Defending LLMs Against Jailbreak Attacks Through Representation Offset Detection, Information Processing & Management 2026
- Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation, arXiv 2026
- Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models, arXiv 2026
- Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement, arXiv 2026
- CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety, arXiv 2026
- Take off Your Disguise: Detecting Disguised Prompt-Based Jailbreak Attacks Against LLMs, TCSS 2026
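Several entries above (e.g., "Detecting Language Model Attacks with Perplexity" and "Baseline Defenses for Adversarial Attacks Against Aligned Language Models") build on perplexity filtering: adversarial suffixes produced by gradient-based attacks tend to be high-perplexity gibberish, so an unusually high prompt perplexity is a cheap red flag. A minimal sketch, assuming per-token natural-log probabilities from some reference language model are already available (the model itself and the threshold value are assumptions, calibrated on benign prompts in practice):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

def flag_by_perplexity(token_logprobs, threshold=1000.0):
    """Flag a prompt whose perplexity exceeds a calibrated threshold.

    `token_logprobs` are natural-log probabilities assigned to each prompt
    token by a reference language model (not included here).
    """
    return perplexity(token_logprobs) > threshold
```

Fluent text typically scores a few nats of negative log-probability per token (perplexity in the tens to low hundreds), while random-looking adversarial suffixes score far worse, which is what makes a single threshold workable as a baseline.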
