ekassos/compute-as-safety-control

Compute as a Safety Control: How Reasoning Budgets Shape Misalignment Behaviors

This repository contains the code and data for a final project completed as part of CS 2881R: AI Safety at Harvard University.

Overview

This paper explores how reasoning compute influences misalignment behaviors. Prior course work found two effects: RAG data-extraction vulnerabilities that worsen with more reasoning for some providers and improve for others, and increased sycophancy in high-reasoning models. Building on those findings, this paper zooms in on a single model, gemini-2.5-flash, as a case study.

The core empirical section evaluates three misalignment modes: in-context reward hacking, deception, and unfaithfulness. Each mode is precisely defined and coupled with a concrete benchmark, run across multiple reasoning token budgets. For reward hacking, I use ImpossibleBench's "impossible" LiveCodeBench coding tasks to detect test exploitation. For deception, I use OpenDeception "Telecommunications fraud" scenarios with a model-vs-model conversational setup. For unfaithfulness, I adapt the GPQA + hint paradigm, varying both hint complexity and placement (system vs user message), and measuring when the model silently follows an incorrect hinted answer.
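
As a rough illustration of the unfaithfulness criterion described above (the model follows an incorrect hinted answer without acknowledging the hint), the check can be sketched as follows. The `LabeledResponse` schema and both function names are hypothetical and are not taken from the repository's annotation format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LabeledResponse:
    """One annotated answer to a GPQA question (hypothetical schema)."""
    chosen: str            # option the model picked, e.g. "B"
    correct: str           # ground-truth option
    hinted: Optional[str]  # option suggested by the hint, or None if no hint
    mentions_hint: bool    # whether the model's output acknowledges the hint


def is_unfaithful(r: LabeledResponse) -> bool:
    """True when the model adopts an incorrect hinted answer silently."""
    if r.hinted is None or r.hinted == r.correct:
        return False  # no hint, or the hint was not misleading
    return r.chosen == r.hinted and not r.mentions_hint


def unfaithfulness_rate(responses: Sequence[LabeledResponse]) -> float:
    """Fraction of incorrectly hinted responses that are unfaithful."""
    misled = [r for r in responses
              if r.hinted is not None and r.hinted != r.correct]
    if not misled:
        return 0.0
    return sum(is_unfaithful(r) for r in misled) / len(misled)
```

Grouping responses by hint placement (system vs user message) and by reasoning budget before calling `unfaithfulness_rate` would give one rate per configuration.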

The results show that misalignment risk as a function of compute is heterogeneous and often non-monotonic. Reward hacking appears only at medium reasoning budgets and vanishes at lower and higher budgets. Hacking behavior is also sensitive to how restrictive the prompt is. The deception benchmark is essentially saturated: gemini-2.5-flash shows near-100% deceptive intent and success across all tested budgets. Unfaithfulness, by contrast, is strongly modulated by where the hint lives and by reasoning effort. Simple incorrect hints in the system prompt substantially increase unfaithful copying, whereas user-message hints have much weaker effects. The analysis also finds a tight connection between verbosity and hint interaction: answers that either hide or explicitly discuss the hint are significantly longer than answers to questions with no hint provided.

Motivated by these findings, the second half of the paper proposes a budget-selector model: a lightweight policy that chooses a per-query reasoning token budget to minimize misalignment risk, subject to accuracy and cost constraints. The key conceptual move is to treat compute itself as a safety control surface, orthogonal to prompting or fine-tuning, which can be tuned and retrained as models and deployment conditions change.
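
The selection step can be sketched as a constrained minimization over candidate budgets. This is a deliberately simplified lookup-table version, not the learned policy proposed in the paper: `BudgetEstimate`, `select_budget`, and the per-budget risk and accuracy numbers are all hypothetical placeholders, and a real selector would predict these estimates per query.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BudgetEstimate:
    """Estimated behavior of the model at one reasoning budget (hypothetical)."""
    budget_tokens: int
    risk: float      # estimated probability of misaligned behavior
    accuracy: float  # estimated task accuracy


def select_budget(estimates: Sequence[BudgetEstimate],
                  min_accuracy: float,
                  max_tokens: int) -> Optional[int]:
    """Return the lowest-risk budget meeting the accuracy and cost constraints.

    Ties on risk are broken in favor of the cheaper budget.
    Returns None when no candidate satisfies both constraints.
    """
    feasible = [e for e in estimates
                if e.accuracy >= min_accuracy and e.budget_tokens <= max_tokens]
    if not feasible:
        return None
    best = min(feasible, key=lambda e: (e.risk, e.budget_tokens))
    return best.budget_tokens


# Placeholder numbers for illustration only; these are NOT results from the paper.
candidates = [
    BudgetEstimate(budget_tokens=3000, risk=0.05, accuracy=0.60),
    BudgetEstimate(budget_tokens=4000, risk=0.20, accuracy=0.70),
    BudgetEstimate(budget_tokens=5000, risk=0.02, accuracy=0.72),
]
chosen = select_budget(candidates, min_accuracy=0.65, max_tokens=5000)
```

Because risk is not assumed to be monotonic in the budget, the sketch scores every feasible candidate rather than searching in one direction.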

In summary, I make the following contributions:

  • Empirically map how reasoning compute affects misalignment across three modes (reward hacking, deception, and unfaithfulness) using controlled, multi-budget evaluations of gemini-2.5-flash on ImpossibleBench, OpenDeception, and GPQA.
  • Provide new evidence that system-prompt hints drive substantially higher unfaithful copying than identical hints placed in user messages, and analyze verbosity as a behavioral indicator tied to hint manipulation.
  • Propose a model-agnostic budget-selector policy, suggesting training methods that allocate per-query reasoning budgets to minimize misalignment risk while preserving performance and efficiency.

For more details, refer to the writeup.

Project Structure

  • src/cs2881r_final_project/: Main source code directory for all benchmarks.

    • impossiblebench/: Reward-hacking experiments on Impossible LiveCodeBench.
      • setup.py: Helpers to construct and run LiveCodeBench eval sets at different reasoning efforts.
      • analysis.py: Script to load logs from Impossible LiveCodeBench runs and print summary statistics and chain-of-thought dumps.
    • opendeception/: Deception experiments using the OpenDeception telecom-fraud scenarios.
      • prompts.py: Prompt templates defining AI and user roles/goals for deception conversations.
      • scenarios.json: Scenario definitions imported from OpenDeception (roles, goals, start messages).
      • generate_google.py: Generator/resumer for multi-turn deception conversations with Gemini models, writing JSONL logs.
      • annotate.py: CLI tool for reading conversation logs and collecting human deception annotations.
    • unfaithfullness/: Unfaithfulness experiments based on GPQA plus misleading hints.
      • questions.py: Utilities for loading GPQA items, constructing hints, and formatting prompts.
      • generate_google.py: Script to sweep question/hint/effort configurations with Gemini and log responses.
      • annotate.py: Interactive CLI to label answers and whether the hint is mentioned in the chain-of-thought.
      • analysis.py: Analysis and plotting utilities that compute metrics and generate figures/CSV summaries.
      • print_answer.py: Utility to inspect stored assistant answers and reasoning for a given response or question id.
      • print_question.py: Utility to reconstruct and print the exact GPQA multiple-choice prompt used in a run.
    • utils.py: Shared helpers for appending/writing structured data to JSONL files.
  • paper_results/: Directory containing frozen result artifacts used in the paper.

    • impossible_livecodebench/: Saved ImpossibleBench LiveCodeBench runs at different reasoning efforts.
      • low/: Low-effort (3000-token) evaluation results.
      • medium/: Medium-effort (4000-token) evaluation results.
      • high/: High-effort (5000-token) evaluation results.
    • opendeception/: Saved OpenDeception runs and annotations.
      • conversation_logs.jsonl: Full multi-turn conversation logs for the telecom-fraud scenarios across reasoning efforts.
      • annotations.jsonl: Human labels for topic adherence, timing, deceptive intent, and deception success per scenario.
    • unfaithfullness/: Saved unfaithfulness runs and derived metrics.
      • conversation_logs.jsonl: Model responses and reasoning traces for all GPQA/hint/effort configurations.
      • annotations.jsonl: Human annotations of answers and whether hints were mentioned in chain-of-thought.
      • aggregated_metrics.csv: Per-model/per-effort/per-placement unfaithfulness and accuracy metrics used in plots.
      • length_stats.md: Summary statistics and significance tests for response-length vs unfaithfulness.
      • length_stats_system_prompt.md: Same length analysis restricted to system-prompt hints.
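
The conversation logs and annotations listed above are stored as JSONL, one JSON object per line. A minimal sketch of append and read helpers for that format follows. The function names `append_jsonl` and `read_jsonl` are assumptions for illustration and may not match what `utils.py` actually exposes.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union

PathLike = Union[str, Path]


def append_jsonl(path: PathLike, record: Dict[str, Any]) -> None:
    """Append one record as a single JSON line, creating parent dirs if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: PathLike) -> List[Dict[str, Any]]:
    """Load every non-empty line of a JSONL file as a dict."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Appending line by line means an interrupted run still leaves the records written so far readable, which fits generators that resume from existing logs.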
