25 repositories

SafeVLA
Public
[NeurIPS 2025 Spotlight] Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning.
Python • Other • Forks: 9 • Stars: 134 • Issues: 0 • Pull requests: 0 • Updated Mar 31, 2026
VLA-Arena
Public
VLA-Arena is an open-source benchmark for systematic evaluation of Vision-Language-Action (VLA) models.
Python • Apache License 2.0 • Forks: 9 • Stars: 140 • Issues: 0 • Pull requests: 0 • Updated Mar 14, 2026
safety-gymnasium
Public
NeurIPS 2023: Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark
reinforcement-learning constraint-satisfaction-problem safety-critical safety-critical-systems safe-reinforcement-learning safe-reinforcement-learning-environments constraint-rl safe-policy-optimization
Python • Apache License 2.0 • Forks: 80 • Stars: 552 • Issues: 12 • Pull requests: 2 • Updated Dec 4, 2025
align-anything
Public
Align Anything: Training All-modality Model with Feedback
chameleon multimodal dpo large-language-models rlhf vision-language-model
Python • Apache License 2.0 • Forks: 509 • Stars: 4.6k • Issues: 29 • Pull requests: 2 • Updated Nov 27, 2025
safe-rlhf
Public
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
reinforcement-learning transformers transformer safety llama gpt datasets beaver alpaca ai-safety
Python • Apache License 2.0 • Forks: 132 • Stars: 1.6k • Issues: 16 • Pull requests: 2 • Updated Nov 24, 2025
MM-DeceptionBench
Public
Forks: 0 • Stars: 0 • Issues: 0 • Pull requests: 0 • Updated Sep 25, 2025
eval-anything
Public
Python • Apache License 2.0 • Forks: 18 • Stars: 21 • Issues: 1 • Pull requests: 2 • Updated Jul 26, 2025
llms-resist-alignment
Public
[ACL2025 Best Paper] Language Models Resist Alignment
alignment llama safe alpaca ai-safety vicuna llm llms rlhf safe-rlhf
Python • Forks: 1 • Stars: 45 • Issues: 0 • Pull requests: 0 • Updated Jun 11, 2025
SAE-V
Public
[ICML 2025 Poster] SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Forks: 1 • Stars: 15 • Issues: 0 • Pull requests: 0 • Updated Jun 5, 2025
ProgressGym
Public
Alignment with a millennium of moral progress. Spotlight@NeurIPS 2024 Track on Datasets and Benchmarks.
Python • MIT License • Forks: 4 • Stars: 25 • Issues: 0 • Pull requests: 0 • Updated Mar 30, 2025
s1-m
Public
S1-M: Simple Test-time Scaling in Multimodal Reasoning
Python • Apache License 2.0 • Forks: 509 • Stars: 3 • Issues: 0 • Pull requests: 0 • Updated Mar 25, 2025
omnisafe
Public
JMLR: OmniSafe is an infrastructural framework for accelerating SafeRL research.
benchmark machine-learning reinforcement-learning deep-learning deep-reinforcement-learning constraint-satisfaction-problem pytorch safety-critical saferl safe-reinforcement-learning
Python • Apache License 2.0 • Forks: 153 • Stars: 1.1k • Issues: 19 • Pull requests: 5 • Updated Mar 17, 2025
ProAgent
Public
AAAI24(Oral) ProAgent: Building Proactive Cooperative Agents with Large Language Models
language-model cooperative human-ai overcooked human-ai-interaction cooperative-ai llm-agent
JavaScript • MIT License • Forks: 11 • Stars: 103 • Issues: 2 • Pull requests: 0 • Updated Mar 4, 2025
Beaver-zh-hk
Public
Python • Forks: 0 • Stars: 1 • Issues: 0 • Pull requests: 0 • Updated Feb 23, 2025
TransformerLens-V
Public
Python • MIT License • Forks: 2 • Stars: 6 • Issues: 1 • Pull requests: 0 • Updated Jan 31, 2025
SAELens-V
Public
Python • MIT License • Forks: 2 • Stars: 10 • Issues: 3 • Pull requests: 0 • Updated Jan 31, 2025
aligner
Public
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
alignment aligner interpretability aisafety llm rlhf weak-to-strong mecinterp
Python • Forks: 10 • Stars: 191 • Issues: 0 • Pull requests: 0 • Updated Jan 16, 2025
.github
Public
Forks: 0 • Stars: 0 • Issues: 0 • Pull requests: 0 • Updated Jan 16, 2025
Aligner2024.github.io
Public
HTML • Forks: 1 • Stars: 0 • Issues: 0 • Pull requests: 0 • Updated Oct 31, 2024
safe-sora
Public
SafeSora is a human preference dataset designed to support safety alignment research in the text-to-video generation field, aiming to enhance the helpfulness an…
alignment human-preferences text-to-video-generation large-vision-models
Python • Forks: 5 • Stars: 34 • Issues: 0 • Pull requests: 0 • Updated Aug 20, 2024
SafeDreamer
Public
ICLR 2024: SafeDreamer: Safe Reinforcement Learning with World Models
reinforcement-learning constraint-satisfaction-problem safety-critical-systems safe-reinforcement-learning constraint-rl safe-policy-optimization
Python • Apache License 2.0 • Forks: 5 • Stars: 100 • Issues: 1 • Pull requests: 0 • Updated Apr 8, 2024
Safe-Policy-Optimization
Public
NeurIPS 2023: Safe Policy Optimization: A benchmark repository for safe reinforcement learning algorithms
benchmarks reinforcement-learning-algorithms safe safe-reinforcement-learning constrained-reinforcement-learning
Python • Apache License 2.0 • Forks: 60 • Stars: 405 • Issues: 4 • Pull requests: 0 • Updated Mar 20, 2024
AlignmentSurvey
Public
AI Alignment: A Comprehensive Survey
awesome reinforcement-learning ai deep-learning survey alignment papers interpretability red-teaming large-language-models
Forks: 1 • Stars: 137 • Issues: 0 • Pull requests: 0 • Updated Nov 2, 2023
beavertails
Public
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
safety llama gpt datasets language-model beaver ai-safety human-feedback-data llm llms
Makefile • Apache License 2.0 • Forks: 6 • Stars: 178 • Issues: 3 • Pull requests: 0 • Updated Oct 27, 2023
ReDMan
Public
ReDMan is an open-source simulation platform that provides a standardized implementation of safe RL algorithms for Reliable Dexterous Manipulation.
Python • Apache License 2.0 • Forks: 2 • Stars: 23 • Issues: 0 • Pull requests: 0 • Updated May 2, 2023

PKU-Alignment
