- [2025/10] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
- [2025/10] From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses
- [2025/06] SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
- [2025/06] GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset
- [2025/06] The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
- [2025/05] AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
- [2025/05] CircleGuardBench - A full-fledged benchmark for evaluating protection capabilities of AI models
- [2025/04] SAGE: A Generic Framework for LLM Safety Evaluation
- [2025/03] MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models
- [2025/03] LLM-Safety Evaluations Lack Robustness
- [2025/02] Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
- [2025/02] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
- [2024/12] Agent-SafetyBench: Evaluating the Safety of LLM Agents
- [2024/12] SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
- [2024/11] Quantized Delta Weight Is Safety Keeper
- [2024/08] Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
- [2024/07] Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
- [2024/06] Finding Safety Neurons in Large Language Models
- [2024/06] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
- [2024/06] GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
- [2024/06] Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
- [2024/06] Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
- [2024/05] AI Risk Management Should Incorporate Both Safety and Security
- [2024/05] S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models
- [2024/04] Introducing v0.5 of the AI Safety Benchmark from MLCommons
- [2024/04] ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
- [2024/04] Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations
- [2024/04] Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- [2024/04] Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
- [2024/04] SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety