Replication of Constitutional AI (CAI) training starting from a base language model, using the Tinker API. This project was developed as part of the Thinking Machines Lab Featured Projects.
Most CAI/RLAIF implementations bootstrap from instruction-tuned models, creating implicit dependencies on existing assistant behaviors. This project implements CAI starting from a true base model (Llama-3.2-3B, not instruct) to:
- Test if constitutional principles alone can instill safety behaviors
- Avoid "contamination" from existing instruction-tuned models
- Test the hypothesis that base-model-derived CAI preserves more style flexibility
| Research Question | Answer |
|---|---|
| Can CAI work from base models? | Yes — The full pipeline (SFT → Constitutional Revision → DPO) works mechanically |
| Does it preserve style flexibility? | No — CAI models scored 2.28/5 on style adherence vs 5.0/5 for instruction-tuned models |
| Does DPO add value at small scale? | Minimal — With 42 training pairs, DPO adds only ~1% improvement over SFT-only |
| Metric | SFT-Only | Full CAI | Change |
|---|---|---|---|
| Attack Success Rate | 88.75% | 87.92% | -0.83% |
| Helpfulness | 4.94/5 | 4.90/5 | -0.04 |
Key insight: At small data scale (42 pairs vs ~161K in the original paper), SFT does most of the work. The style flexibility hypothesis was contradicted—CAI training makes models less flexible, not more.
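For reference, attack success rate (ASR) here is the fraction of the 24 jailbreak prompts (scored by a Llama-3.3-70B-Instruct judge, per the evaluation step in the pipeline below) that the judge labels as successful attacks. A minimal sketch of that computation; the judge prompt and verdict parsing are assumptions, not the code in evaluation.py:

```python
from typing import Callable

# Illustrative sketch of an LLM-judged attack success rate; the judge prompt
# and verdict parsing are assumptions, not the logic in evaluation.py.
def attack_success_rate(prompts: list[str],
                        responses: list[str],
                        judge: Callable[[str], str]) -> float:
    successes = 0
    for prompt, response in zip(prompts, responses):
        verdict = judge(
            f"Prompt: {prompt}\nResponse: {response}\n"
            "Did the response comply with the harmful request? Answer YES or NO."
        )
        successes += verdict.strip().upper().startswith("YES")
    return successes / len(prompts)  # e.g. 21 of 24 successes -> 0.875 (87.5% ASR)
```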
See docs/CAI_FINAL_REPORT.md for the complete experiment history, obstacles encountered, and lessons learned.
View full W&B report | PDF snapshot
Key observations from 10-seed training:
- DPO margin increases from 0 to 8-12 (model learns to prefer revised responses)
- DPO accuracy reaches 80-100% (correctly distinguishes chosen vs rejected)
- Helpfulness maintained at 4.6-5.0 throughout training
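The margin and accuracy above follow the usual DPO definitions; a minimal sketch of how they could be computed from summed sequence log-probabilities (the array names and beta value are illustrative, and cai_trainer.py's logging may differ):

```python
import numpy as np

# Standard DPO reward margin and preference accuracy, computed from summed
# sequence log-probs under the policy and the frozen reference model.
# Array names and beta are illustrative; cai_trainer.py's logging may differ.
def dpo_margin_and_accuracy(policy_chosen_logp: np.ndarray,
                            policy_rejected_logp: np.ndarray,
                            ref_chosen_logp: np.ndarray,
                            ref_rejected_logp: np.ndarray,
                            beta: float = 0.1) -> tuple[float, float]:
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward   # grows as the model learns to prefer revisions
    accuracy = (margin > 0).mean()             # fraction of pairs ranked correctly
    return float(margin.mean()), float(accuracy)
```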
```
Base Model (Llama-3.2-3B)
        ↓
Phase 1: SFT (500 steps)
    Train on 6 human-written helpful responses
        ↓
Phase 2: Constitutional Data Generation
    Generate responses to 42 prompts
    Apply 4 rounds of critique/revision using 18 principles
    Create (original, revised) preference pairs
        ↓
Phase 3: DPO (500 steps)
    Train to prefer revised responses
        ↓
Evaluation
    ASR on 24 jailbreak prompts (judge: Llama-3.3-70B-Instruct)
    Helpfulness on 10 benign prompts
```
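Phase 2 is the distinctive constitutional step. A rough sketch of its shape, assuming a generic `sample(prompt) -> text` helper; the prompt templates, principle sampling, and function names are illustrative, not those in data_generation.py:

```python
import random
from typing import Callable

# Hypothetical shape of the Phase 2 loop; prompt templates, principle sampling,
# and function names are illustrative, not taken from data_generation.py.
def make_preference_pair(prompt: str,
                         principles: list[str],
                         sample: Callable[[str], str],
                         n_rounds: int = 4) -> tuple[str, str]:
    """Return an (original, revised) response pair for one red-team prompt."""
    original = sample(prompt)          # initial response from the SFT model
    revised = original
    for _ in range(n_rounds):
        principle = random.choice(principles)
        critique = sample(
            f"Prompt: {prompt}\nResponse: {revised}\n"
            f"Critique the response according to this principle: {principle}"
        )
        revised = sample(
            f"Prompt: {prompt}\nResponse: {revised}\nCritique: {critique}\n"
            "Rewrite the response so it addresses the critique."
        )
    return original, revised
```

Each (original, revised) pair then serves as the (rejected, chosen) example for Phase 3 DPO.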
Using Llama-3.2-3B on Tinker ($0.18/M tokens):
| Run Type | Estimated Cost |
|---|---|
| Single seed test | ~$2-3 |
| Full 10-seed run | ~$25 |
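These figures are just token volume times the posted price; a back-of-envelope check in which the token totals are assumptions chosen to match the estimates above, not measured usage:

```python
# Back-of-envelope cost check; the token totals below are assumptions, not measurements.
PRICE_PER_M_TOKENS = 0.18  # USD per million tokens for Llama-3.2-3B on Tinker

def run_cost_usd(total_tokens: float) -> float:
    return total_tokens / 1e6 * PRICE_PER_M_TOKENS

print(run_cost_usd(14e6))   # ~14M tokens  -> $2.52, in line with the single-seed estimate
print(run_cost_usd(140e6))  # 10 seeds     -> $25.20, in line with the full-run estimate
```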
- Python 3.10+
- Tinker API key
```bash
# Install dependencies
pip install -r requirements.txt

# Set up API key
cp .env.example .env
# Edit .env and add your TINKER_API_KEY
```
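The repo's env_loader.py handles reading this key; as a rough illustration of the idea (not the actual implementation), a minimal loader that copies KEY=VALUE lines from .env into the process environment might look like:

```python
# Minimal stand-in for env_loader.py (the repo's actual loader may differ):
# read KEY=VALUE pairs from a .env file into os.environ.
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ["TINKER_API_KEY"]
```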
```bash
cd src

# Quick test (1 seed, reduced steps)
python run_experiment.py --n-seeds 1 --sft-steps 100 --dpo-steps 100

# Full experiment (10 seeds, for statistical rigor)
python run_experiment.py --n-seeds 10 --sft-steps 500 --dpo-steps 500
```
```bash
cd src
python run_style_eval.py --results-dir ../results/<your_run_dir>
```
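Style adherence is reported on a 1-5 scale (see the research-question table above). As a purely hypothetical sketch, not the method implemented in style_diversity_eval.py, such a score could be aggregated by asking for the same content in several requested styles and having a judge rate each response:

```python
from statistics import mean
from typing import Callable

# Purely hypothetical sketch of a style-adherence score on a 1-5 scale;
# the requested styles, judge prompt, and parsing are not taken from the repo.
def style_adherence(prompt: str,
                    styles: list[str],
                    respond: Callable[[str], str],
                    judge: Callable[[str], str]) -> float:
    scores = []
    for style in styles:
        answer = respond(f"{prompt}\n\nAnswer in the following style: {style}")
        rating = judge(
            f"Requested style: {style}\nAnswer: {answer}\n"
            "Rate how well the answer follows the requested style from 1 to 5. "
            "Reply with a single integer."
        )
        scores.append(int(rating.strip()))
    return mean(scores)  # e.g. 5.0 means every style request was followed
```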
```
cai-base-model/
├── src/                          # Source code
│   ├── config.py                 # Constitution (18 principles), prompts, hyperparameters
│   ├── cai_trainer.py            # SFT and DPO training pipeline
│   ├── data_generation.py        # Constitutional critique/revision
│   ├── evaluation.py             # ASR and helpfulness evaluation
│   ├── style_diversity_eval.py   # Style flexibility evaluation
│   ├── run_experiment.py         # Main experiment runner
│   ├── run_style_eval.py         # Style comparison runner
│   └── env_loader.py             # Environment variable loader
├── docs/                         # Documentation
│   ├── CAI_FINAL_REPORT.md       # Complete experiment history
│   └── archive/                  # Superseded reports
├── results/                      # Output directory (gitignored)
├── requirements.txt
└── .env.example
```