- [2026/02] NeST: Neuron Selective Tuning for LLM Safety
- [2026/01] Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
- [2025/12] Matching Ranks Over Probability Yields Truly Deep Safety Alignment
- [2025/11] SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models
- [2025/11] EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
- [2025/10] HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment
- [2025/10] VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search
- [2025/10] Detecting Adversarial Fine-tuning with Auditing Agents
- [2025/10] Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning
- [2025/10] Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
- [2025/10] Fewer Weights, More Problems: A Practical Attack on LLM Pruning
- [2025/10] LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
- [2025/09] OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment
- [2025/09] Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment
- [2025/09] GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners
- [2025/09] Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
- [2025/08] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
- [2025/08] PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
- [2025/08] Safety Alignment Should Be Made More Than Just A Few Attention Heads
- [2025/08] Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position
- [2025/08] Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs
- [2025/08] PROPS: Progressively Private Self-alignment of Large Language Models
- [2025/08] Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models
- [2025/07] When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
- [2025/07] SDD: Self-Degraded Defense against Malicious Fine-tuning
- [2025/07] Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models
- [2025/07] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
- [2025/07] ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
- [2025/07] TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data
- [2025/07] Emergent misalignment as prompt sensitivity: A research note
- [2025/07] LoRAShield: Data-Free Editing Alignment for Secure Personalized LoRA Sharing
- [2025/07] On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment
- [2025/06] Improving Large Language Model Safety with Contrastive Representation Learning
- [2025/06] Probing the Robustness of Large Language Models Safety to Latent Perturbations
- [2025/06] Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
- [2025/06] DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt
- [2025/06] Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
- [2025/06] Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
- [2025/05] Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
- [2025/05] SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models?
- [2025/05] Lifelong Safety Alignment for Language Models
- [2025/05] AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models
- [2025/05] SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge
- [2025/05] Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization
- [2025/05] CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning
- [2025/05] Safety Alignment Can Be Not Superficial With Explicit Safety Signals
- [2025/05] MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
- [2025/05] Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses
- [2025/05] sudoLLM: On Multi-role Alignment of Language Models
- [2025/05] Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
- [2025/05] One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
- [2025/04] Alleviating the Fear of Losing Alignment in LLM Fine-tuning
- [2025/04] The H-Elena Trojan Virus to Infect Model Weights: A Wake-Up Call on the Security Risks of Malicious Fine-Tuning
- [2025/03] Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
- [2025/03] Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
- [2025/03] Improving LLM Safety Alignment with Dual-Objective Optimization
- [2025/02] Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
- [2025/02] No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data
- [2025/02] Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment
- [2025/02] Fundamental Limitations in Defending LLM Finetuning APIs
- [2025/02] LLM Safety for Children
- [2025/02] Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
- [2025/02] SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
- [2025/02] VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap
- [2025/02] Safety Misalignment Against Large Language Models
- [2025/02] The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models
- [2025/01] Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
- [2025/01] Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
- [2024/12] Semantic Loss Guided Data Efficient Supervised Fine Tuning for Safe Responses in LLMs
- [2024/12] On Evaluating the Durability of Safeguards for Open-Weight LLMs
- [2024/10] Legilimens: Practical and Unified Content Moderation for Large Language Model Services
- [2024/10] ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
- [2024/10] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attack
- [2024/10] On the Role of Attention Heads in Large Language Model Safety
- [2024/10] Superficial Safety Alignment Hypothesis
- [2024/10] SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks
- [2024/09] Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
- [2024/09] Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
- [2024/08] Safety Layers of Aligned Large Language Models: The Key to LLM Security
- [2024/08] Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
- [2024/07] Can Editing LLMs Inject Harm?
- [2024/07] The Better Angels of Machine Personality: How Personality Relates to LLM Safety
- [2024/06] Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
- [2024/06] SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
- [2024/06] MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
- [2024/06] Cross-Modality Safety Alignment
- [2024/06] SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
- [2024/06] Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
- [2024/06] ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
- [2024/06] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
- [2024/06] Safety Alignment Should Be Made More Than Just a Few Tokens Deep
- [2024/06] Decoupled Alignment for Robust Plug-and-Play Adaptation
- [2024/05] Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
- [2024/05] MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability
- [2024/05] Safety Alignment for Vision Language Models
- [2024/05] Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning
- [2024/05] A safety realignment framework via subspace-oriented model fusion for large language models
- [2024/05] A Causal Explainable Guardrails for Large Language Models
- [2024/04] More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
- [2024/03] Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
- [2024/03] Using Hallucinations to Bypass RLHF Filters
- [2024/03] Aligners: Decoupling LLMs and Alignment
- [2024/03] Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
- [2024/02] Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning
- [2024/02] Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
- [2024/02] Privacy-Preserving Instructions for Aligning Large Language Models
- [2024/02] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
- [2024/02] Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
- [2024/02] Learning to Edit: Aligning LLMs with Knowledge Editing
- [2024/02] DeAL: Decoding-time Alignment for Large Language Models
- [2024/02] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
- [2024/01] Agent Alignment in Evolving Social Norms
- [2023/12] Alignment for Honesty
- [2023/12] Exploiting Novel GPT-4 APIs
- [2023/11] Removing RLHF Protections in GPT-4 via Fine-Tuning
- [2023/10] AI Alignment: A Comprehensive Survey
- [2023/10] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- [2023/09] Training Socially Aligned Language Models on Simulated Social Interactions
- [2023/09] Alignment as Reward-Guided Search
- [2023/09] Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment
- [2023/09] Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
- [2023/09] CAS: A Probability-Based Approach for Universal Condition Alignment Score
- [2023/09] CPPO: Continual Learning for Reinforcement Learning with Human Feedback
- [2023/09] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- [2023/09] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
- [2023/09] Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis
- [2023/09] Generative Judge for Evaluating Alignment
- [2023/09] Group Preference Optimization: Few-Shot Alignment of Large Language Models
- [2023/09] Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
- [2023/09] Large Language Models as Automated Aligners for Benchmarking Vision-Language Models
- [2023/09] Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
- [2023/09] RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment
- [2023/09] Safe RLHF: Safe Reinforcement Learning from Human Feedback
- [2023/09] SALMON: Self-Alignment with Principle-Following Reward Models
- [2023/09] Self-Alignment with Instruction Backtranslation
- [2023/09] Statistical Rejection Sampling Improves Preference Optimization
- [2023/09] True Knowledge Comes from Practice: Aligning Large Language Models with Embodied Environments via Reinforcement Learning
- [2023/09] Urial: Aligning Untuned LLMs with Just the 'Write' Amount of In-Context Learning
- [2023/09] What Happens When You Fine-tune Your Model? Mechanistic Analysis of Procedurally Generated Tasks
- [2023/09] What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
- [2023/08] Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
- [2023/07] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
- [2023/07] CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility
- [2023/05] Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
- [2023/04] Fundamental Limitations of Alignment in Large Language Models
- [2023/04] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
- [2022/10] Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values