- [2025/10] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
- [2025/10] From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses
- [2025/06] SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
- [2025/06] GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset
- [2025/06] The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
- [2025/05] AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
- [2025/05] CircleGuardBench - A full-fledged benchmark for evaluating protection capabilities of AI models
- [2025/04] SAGE: A Generic Framework for LLM Safety Evaluation
- [2025/03] MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models
- [2025/03] LLM-Safety Evaluations Lack Robustness
- [2025/02] Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
- [2025/02] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
- [2024/12] Agent-SafetyBench: Evaluating the Safety of LLM Agents
- [2024/12] SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
- [2024/11] Quantized Delta Weight Is Safety Keeper
- [2024/08] Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
- [2024/07] Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
- [2024/06] Finding Safety Neurons in Large Language Models
- [2024/06] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
- [2024/06] GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
- [2024/06] Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
- [2024/06] Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
- [2024/05] AI Risk Management Should Incorporate Both Safety and Security
- [2024/05] S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models
- [2024/04] Introducing v0.5 of the AI Safety Benchmark from MLCommons
- [2024/04] ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
- [2024/04] Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations
- [2024/04] Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- [2024/04] Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
- [2024/04] SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety