A unified framework for training AI agents using multi-perspective critic ensembles.
RLAF (Reinforcement Learning from Agentic Feedback) combines innovations from the latest research in agentic reinforcement learning:
- ARPO (July 2025): Adaptive rollout based on entropy
- Open-AgentRL (Oct 2025): GRPO-TCR with tool-call reasoning
- KAT-Dev (Sept 2025): Multi-stage training pipeline
Traditional RL uses single scalar rewards. RLAF uses multi-perspective critic ensembles:
# Traditional RL: Single reward
reward = 0.75 # Good? Bad? Why?
# RLAF: Multi-critic feedback
feedbacks = [
    Feedback(critic="accuracy", score=0.9, reasoning="Factually correct"),
    Feedback(critic="policy", score=0.6, reasoning="SLA violation risk"),
    Feedback(critic="efficiency", score=0.8, reasoning="Could be faster"),
]
# Aggregated reward: 0.77 (with rich context!)

Key Benefits:
- Multi-perspective evaluation - Accuracy, reasoning, tool use, code quality, policy compliance
- Algorithm-agnostic - Supports ARPO, GRPO-TCR, PPO, DPO
- Production-ready - Not just research, built for real applications
- Cross-domain - ITSM, code generation, reasoning tasks, chatbots
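
The aggregated reward of 0.77 in the comparison above is consistent with a plain mean of the three critic scores. A minimal sketch of that arithmetic, assuming an unweighted average; the `Feedback` dataclass and `aggregate` helper here are stand-ins defined for illustration, not RLAF's actual classes:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Feedback:
    """Illustrative stand-in for one critic's feedback (not RLAF's real class)."""
    critic: str
    score: float
    reasoning: str


def aggregate(feedbacks: list[Feedback]) -> float:
    """Unweighted mean of the critic scores (an assumed aggregation rule)."""
    return mean(f.score for f in feedbacks)


feedbacks = [
    Feedback(critic="accuracy", score=0.9, reasoning="Factually correct"),
    Feedback(critic="policy", score=0.6, reasoning="SLA violation risk"),
    Feedback(critic="efficiency", score=0.8, reasoning="Could be faster"),
]

reward = aggregate(feedbacks)
print(f"Aggregated reward: {reward:.2f}")  # Aggregated reward: 0.77

# The per-critic reasoning stays available alongside the scalar.
for f in feedbacks:
    print(f"{f.critic}: {f.score} ({f.reasoning})")
```

Unlike a bare scalar, the list keeps each critic's reasoning, so a low policy score can be traced to its cause.
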
- Introduction to RLAF - Comprehensive article on RLAF's innovations and how it builds on ARPO, Open-AgentRL, and KAT-Dev
- Full Documentation - Guides, API reference, and more
pip install rlaf

Or install from source:

git clone https://github.com/cogniolab/cognio-rlaf.git
cd cognio-rlaf
pip install -e .

import asyncio
from rlaf import RLAFTrainer
from rlaf.agents import ActorAgent, CriticAgent, CriticEnsemble
from rlaf.core.trainer import TrainingConfig
async def main():
    # 1. Create actor (agent to train)
    actor = ActorAgent(
        name="my-agent",
        model="claude-3-5-sonnet-20241022",
        api_key="your-api-key",
    )
    # 2. Create multi-critic ensemble
    critics = CriticEnsemble([
        CriticAgent("accuracy-critic", "accuracy", api_key="your-api-key"),
        CriticAgent("reasoning-critic", "reasoning", api_key="your-api-key"),
    ])
    # 3. Configure training
    config = TrainingConfig(algorithm="arpo", max_iterations=10)
    # 4. Train!
    trainer = RLAFTrainer(actor=actor, critics=critics, config=config)
    results = await trainer.train(your_dataset)

asyncio.run(main())

Train an IT service management agent to triage incidents:
python examples/itsm_agent.py

Features:
- Actor: ITSM triage agent
- Critics: Accuracy, policy compliance, speed
- Algorithm: ARPO (adaptive exploration)
Train a Python code generation agent:
python examples/code_generation.py

Features:
- Actor: Code generator
- Critics: Correctness, code quality, efficiency
- Algorithm: GRPO-TCR (tool-call reasoning)
See examples/simple_demo.py for a minimal working example.
rlaf/
├── agents/              # Actor and Critic agents
│   ├── actor.py         # Agent being trained
│   └── critic.py        # Evaluation agents
├── algorithms/          # RL algorithms
│   ├── arpo.py          # Adaptive RPO (entropy-based)
│   ├── grpo_tcr.py      # Tool-call reasoning (Open-AgentRL)
│   ├── ppo.py           # Proximal Policy Optimization
│   └── dpo.py           # Direct Preference Optimization
├── feedback/            # Feedback collection
│   └── collector.py     # Multi-critic aggregation
├── rewards/             # Reward computation
│   └── aggregator.py    # Feedback → RL rewards
└── core/
    ├── base.py          # Base classes
    └── trainer.py       # Main trainer
Input Task
    ↓
[Actor] generates response
    ↓
[Critics] evaluate from multiple perspectives
    ├─ Accuracy Critic → score: 0.9
    ├─ Reasoning Critic → score: 0.8
    ├─ Tool Use Critic → score: 0.7
    └─ Policy Critic → score: 0.85
    ↓
[Feedback Collector] aggregates (weighted avg, voting, debate)
    ↓
[Reward Aggregator] converts to RL reward (with bonuses/penalties)
    ↓
[Algorithm] updates policy (ARPO/GRPO-TCR/PPO/DPO)
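
The Feedback Collector step in the flow above can be pictured with a small sketch. It follows the meanings this README gives for the aggregation modes (confidence-weighted average, majority vote on a quality threshold, highest-confidence critic wins). Everything else is an assumption made for illustration and is not taken from RLAF's source: the `Feedback` fields (including `confidence`), the confidence values, the 0.75 voting threshold, and the function names.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Hypothetical feedback record; `confidence` is an assumed field."""
    critic: str
    score: float       # quality score in [0, 1]
    confidence: float  # critic's self-reported confidence in [0, 1]


def weighted_average(feedbacks):
    """Mean of the scores, weighted by each critic's confidence."""
    total = sum(f.confidence for f in feedbacks)
    if total == 0:
        return 0.0
    return sum(f.score * f.confidence for f in feedbacks) / total


def voting(feedbacks, threshold=0.75):
    """1.0 if a strict majority of critics score at or above the threshold, else 0.0."""
    approvals = sum(f.score >= threshold for f in feedbacks)
    return 1.0 if approvals > len(feedbacks) / 2 else 0.0


def debate(feedbacks):
    """Score reported by the single most confident critic."""
    return max(feedbacks, key=lambda f: f.confidence).score


STRATEGIES = {
    "weighted_average": weighted_average,
    "voting": voting,
    "debate": debate,
}


def collect(feedbacks, strategy="weighted_average"):
    """Dispatch to one of the aggregation strategies by name."""
    return STRATEGIES[strategy](feedbacks)


# Scores from the diagram; confidence values are made up for the example.
feedbacks = [
    Feedback("accuracy", 0.9, confidence=0.9),
    Feedback("reasoning", 0.8, confidence=0.6),
    Feedback("tool_use", 0.7, confidence=0.5),
    Feedback("policy", 0.85, confidence=0.8),
]

for name in STRATEGIES:
    print(f"{name}: {collect(feedbacks, name):.2f}")
```

With these inputs the three strategies disagree noticeably (roughly 0.83, 1.00, and 0.90), which is why the choice of aggregation mode matters for the reward the algorithm finally sees.
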
From July 2025 paper (arXiv:2507.19849)
Key innovation: Entropy-based adaptive rollout
- High uncertainty → more exploration
- Low confidence → increase batch size
- Adaptive learning rate scaling
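
The first rule above (more exploration under high uncertainty) can be sketched as follows. This is an illustration of the idea only, not the ARPO paper's algorithm or RLAF's implementation: the helper names, the use of normalized Shannon entropy, and the rollout counts are all assumptions; only the 0.8 threshold mirrors the `entropy_threshold` setting shown in this README.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(len(probs))."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def rollouts_for_step(probs, entropy_threshold=0.8, base_rollouts=1, extra_rollouts=3):
    """Sample extra rollouts when the policy's next-step distribution is uncertain."""
    if normalized_entropy(probs) >= entropy_threshold:
        return base_rollouts + extra_rollouts
    return base_rollouts


confident = [0.97, 0.01, 0.01, 0.01]  # policy strongly prefers one action
uncertain = [0.25, 0.25, 0.25, 0.25]  # policy has no preference

print(rollouts_for_step(confident))  # 1
print(rollouts_for_step(uncertain))  # 4
```

Normalizing by the maximum possible entropy keeps the threshold meaningful regardless of how many actions or tokens the distribution covers.
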
config = TrainingConfig(
    algorithm="arpo",
    entropy_threshold=0.8,
    adaptive_rollout=True,
)

From Open-AgentRL (Oct 13, 2025)
Key innovation: Deliberative reasoning before tool calls
- 4B model outperforms 32B models
- Selective tool use (avoid over-calling)
- SOTA on AIME, GPQA, LiveCodeBench
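
As the name suggests, GRPO-TCR is a GRPO variant, and GRPO scores each rollout relative to the other rollouts sampled for the same prompt rather than against a learned value function. A minimal sketch of that group-relative advantage, under stated assumptions: the function name is hypothetical, population standard deviation is used (implementations differ on this), and nothing here reflects the tool-call-specific parts of Open-AgentRL or RLAF.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, each scored by the critic ensemble.
rewards = [0.9, 0.6, 0.8, 0.7]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Rollouts above the group mean get a positive advantage and are reinforced; those below it are discouraged. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.
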
config = TrainingConfig(
    algorithm="grpo-tcr",
    tool_call_reasoning=True,
    deliberative_mode=True,
)

From KAT-Dev (Sept 2025)
3-stage pipeline:
- Mid-training: Enhance LLM-as-agent capabilities
- RFT: Reinforcement fine-tuning with teacher trajectories
- Agentic RL: Full RL with critic ensemble
config = TrainingConfig(
    algorithm="kat",
    multi_stage=True,
    stages=["mid_train", "rft", "agentic_rl"],
)

RLAF supports multiple critic perspectives:
| Perspective | Evaluates | Example Use Case |
|---|---|---|
| accuracy | Factual correctness | Q&A, reasoning |
| reasoning | Logical soundness | Math, planning |
| tool_use | Tool efficiency | Agent workflows |
| code_quality | Code quality | Code generation |
| policy | SLA/rule compliance | ITSM, enterprise |
| speed | Response efficiency | Real-time systems |
| safety | Security/ethics | Production deployment |
Create custom perspectives:
custom_critic = CriticAgent(
    name="domain-expert",
    perspective="medical_accuracy",  # Custom perspective
    model="claude-3-5-sonnet-20241022",
    api_key="your-key",
)

RLAF offers multiple ways to aggregate multi-critic feedback:
# Confidence-weighted average
config.reward_aggregation = "weighted_average"
# Majority vote on quality threshold
config.reward_aggregation = "voting"
# Highest-confidence critic wins
config.reward_aggregation = "debate"
# Accept only high-agreement feedback
config.reward_aggregation = "consensus"

from rlaf.core.trainer import TrainingConfig
config = TrainingConfig(
    # Algorithm
    algorithm="arpo",  # arpo, grpo-tcr, kat, ppo, dpo
    # Training
    max_iterations=1000,
    batch_size=32,
    learning_rate=3e-4,
    # ARPO-specific
    entropy_threshold=0.8,
    adaptive_rollout=True,
    # GRPO-TCR-specific
    tool_call_reasoning=True,
    deliberative_mode=True,
    # Rewards
    reward_aggregation="weighted_average",
    # Logging
    checkpoint_every=100,
    eval_every=50,
)

from rlaf.core.base import BaseConfig
config = BaseConfig(
    model_name="claude-3-5-sonnet-20241022",
    temperature=0.7,
    max_tokens=2048,
    num_critics=3,
)

Run the test suite:

pytest tests/

Run examples:
# Simple demo
python examples/simple_demo.py
# ITSM agent
export ANTHROPIC_API_KEY="your-key"
python examples/itsm_agent.py
# Code generation
python examples/code_generation.py

Benchmarks comparing RLAF with baseline methods are available:
| Method | ITSM Triage | Code Generation | Reasoning | Avg. Score | Training Time |
|---|---|---|---|---|---|
| RLAF (ARPO) | 87.3% | 82.5% | 79.8% | 83.2% | 3.2h |
| RLAF (GRPO-TCR) | 85.1% | 84.2% | 81.3% | 83.5% | 4.1h |
| Open-AgentRL | 82.4% | 80.1% | 82.1% | 81.5% | 5.3h |
| PPO | 76.2% | 74.3% | 73.1% | 74.5% | 6.1h |
| DPO | 74.8% | 76.5% | 71.9% | 74.4% | 4.8h |
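
The Avg. Score column matches the unweighted mean of the three task columns, rounded to one decimal place. A short check using only the figures from the table above (the variable names are illustrative):

```python
# (ITSM Triage, Code Generation, Reasoning) scores and the reported Avg. Score, in percent.
rows = {
    "RLAF (ARPO)": ((87.3, 82.5, 79.8), 83.2),
    "RLAF (GRPO-TCR)": ((85.1, 84.2, 81.3), 83.5),
    "Open-AgentRL": ((82.4, 80.1, 82.1), 81.5),
    "PPO": ((76.2, 74.3, 73.1), 74.5),
    "DPO": ((74.8, 76.5, 71.9), 74.4),
}

computed = {}
for method, (scores, reported) in rows.items():
    computed[method] = sum(scores) / len(scores)
    print(f"{method:<16} computed={computed[method]:.1f} reported={reported:.1f}")
```
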
Key Findings:
- 12.4% improvement over supervised fine-tuning
- 35% faster training than Open-AgentRL
- 43% cost savings with intelligent model routing
- 40% fewer samples needed to reach 80% performance vs PPO
See full benchmarks: benchmarks/README.md
# Run all benchmarks
python benchmarks/run_all.py
# Generate charts
python benchmarks/visualize.py

We welcome contributions! See CONTRIBUTING.md for guidelines.
Key areas:
- New critic perspectives
- Additional RL algorithms
- Domain-specific examples
- Performance optimizations
If you use RLAF in your research, please cite:
@software{rlaf2025,
  title = {RLAF: Reinforcement Learning from Agentic Feedback},
  author = {Cognio Lab},
  year = {2025},
  url = {https://github.com/cogniolab/cognio-rlaf}
}

RLAF builds on these excellent projects:
- ARPO (July 2025): arXiv:2507.19849
- Open-AgentRL (Oct 2025): GitHub
- KAT-Dev (Sept 2025): Skywork AI Blog
- IBM Multi-Agent Learning: Research Blog
MIT License - see LICENSE file for details.
- Anthropic for Claude API
- OpenAI for RL research foundations
- Open-AgentRL team at Gen-Verse
- ARPO authors
- KAT-Dev team at Skywork/Kuaishou
Built with ❤️ by Cognio Lab
Making AI agents smarter through multi-perspective feedback.