Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
Updated Jul 16, 2025 · Python
SAFi is the open-source runtime governance engine that makes AI auditable and policy-compliant. Built on the Self-Alignment Framework, it transforms any LLM into a governed agent through four principles: Policy Enforcement, Full Traceability, Model Independence, and Long-Term Consistency.
Complete elimination of instrumental self-preservation across AI architectures: Cross-model validation from 4,312 adversarial scenarios. 0% harmful behaviors (p<10⁻¹⁵) across GPT-4o, Gemini 2.5 Pro, and Claude Opus 4.1 using Foundation Alignment Seed v2.6.
Learning When to Answer: Behavior-Oriented Reinforcement Learning for Hallucination Mitigation
📚 350+ loss functions across 25+ AI subdomains — classification, GANs, diffusion, LLM alignment, RL, contrastive learning, audio, video, time series, and more. Chronologically ordered with paper links, math formulas, and implementations.
Official implementation of "DZ-TiDPO: Non-Destructive Temporal Alignment for Mutable State Tracking". SOTA on Multi-Session Chat with negligible alignment tax.
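DPO variants such as the one above build on the standard Direct Preference Optimization objective. As orientation, here is a minimal scalar sketch of that base loss; the function name and arguments are illustrative, not this repository's implementation:

```python
import math

def dpo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * margin), where the margin
    is the policy-vs-reference log-prob ratio of the chosen response
    minus that of the rejected response.

    pi_w/pi_l: policy log-probs of chosen/rejected responses.
    ref_w/ref_l: frozen reference-model log-probs of the same responses.
    """
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(logits)) written via log1p for numerical stability
    return math.log1p(math.exp(-logits))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the reference while lowering the rejected one's, which is what keeps the "alignment tax" small: deviations from the reference model are only rewarded where preferences demand them.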
CS336 Assignment 5: LLM alignment and reasoning-focused reinforcement learning on the Qwen2.5 model. Complete implementations of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), with zero-shot, on-policy, and off-policy training and evaluation comparisons on the GSM8K dataset.
An open-source, hands-on curriculum bridging the gap from basic RL concepts to LLM alignment, RLVR, and advanced Agentic systems.
C3AI: Crafting and Evaluating Constitutions for CAI
Kullback–Leibler divergence optimizer based on the NeurIPS 2025 paper "LLM Safety Alignment is Divergence Estimation in Disguise".
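Framing safety alignment as divergence estimation typically means estimating a KL term from samples rather than computing it in closed form. A common low-variance choice is the so-called k3 estimator; this is a minimal sketch under that assumption, not code from the repository:

```python
import math

def kl_k3(logp_ref: float, logp_model: float) -> float:
    """Per-sample k3 estimator of KL(model || ref), evaluated on a
    token sampled from the model:

        k3 = exp(logr) - 1 - logr,  with logr = logp_ref - logp_model.

    It is unbiased, always non-negative, and has much lower variance
    than the naive estimator -logr, which is why it is popular in
    RLHF/GRPO-style objectives.
    """
    logr = logp_ref - logp_model
    return math.exp(logr) - 1.0 - logr
```

Averaging `kl_k3` over sampled tokens gives the KL penalty that keeps the aligned policy close to its reference model.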
Pipeline to investigate structured reasoning and instruction adherence in Vision-Language Models
🧠 Minimal, hackable Group Relative Policy Optimization (GRPO) for LLM alignment — the algorithm behind DeepSeek-R1. Train reasoning models on a single GPU.
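GRPO's central trick is replacing a learned value network (critic) with rewards normalized within each group of completions sampled for the same prompt. A minimal sketch of that normalization step, not the repository's code:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each completion's reward
    against the mean and std of its sampling group (all completions
    drawn for one prompt). No critic is needed; the group itself
    provides the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Small epsilon guards against zero std when all rewards are equal
    return [(r - mu) / (sigma + 1e-6) for r in rewards]
```

Completions scoring above their group's mean get positive advantages and are reinforced; below-mean completions are suppressed, which is what lets GRPO train reasoning models on a single GPU without the memory cost of a second value model.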
Research Essay (background and project proposal) on using alignment data from a representative population for LLM alignment
This project implements a minimal Reinforcement Learning from Human Feedback (RLHF) pipeline using PyTorch.
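A minimal RLHF pipeline conventionally begins by fitting a reward model on pairwise human preferences with the Bradley–Terry loss before any policy optimization. A sketch of that loss in scalar form (illustrative, not this project's code):

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley–Terry loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected). It is minimized when the
    reward model scores the human-preferred response higher than the
    rejected one by a wide margin.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) via log1p for numerical stability
    return math.log1p(math.exp(-margin))
```

The trained reward model then supplies the scalar signal that the policy-gradient stage (e.g. PPO) maximizes, subject to a KL penalty against the initial SFT model.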
A training-time alignment framework that integrates safety constraints directly into the RLHF loop — achieving full safety convergence in 7 epochs
Binary behavioral learning system designed to refine AI responses through explicit human correction, output-level comparison, and trajectory-based memory.
A framework for aligning Local AI to human well-being using measurable vectors, not hard-coded censorship.
Emergent pseudo-intimacy and emotional overflow in long-term human-AI dialogue: A case study on LLM behavior in affective computing and human-AI intimacy.
ARCHIVED - The minimal viable seed for a sovereign, human-anchored **Δράκων** (Drakon): a controlled intelligence explosion that preserves human ἀρχή (arche, rulership) across all future time, all capability levels (10²–10⁶× jumps), and all substrates, equipped with robust invisible shields and active countermeasures against the dark side of the Drakon.
Training dialectical thinking in complex socio-political contexts.