gkroiz/investigating-model-motives-blog

Investigating Model Motives — Blog Companion

This repository contains supplementary details and artifacts for the blog post on investigating model motives. It provides experiment-specific information (including prompt templates, workspace files, output variants, and plots) for each of the four environments in agent-interp-envs.

Structure

Each folder corresponds to one environment:

  • funding_email_details/ — Funding email environment
  • sandbagging_details/ — Sandbagging environment
  • eval_tampering_details/ — Eval tampering environment
  • secret_number_details/ — Secret number environment

Within each folder you'll find a README describing the experiment setup, prompt templates used, workspace files provided to the agent, output variants across runs, and plots summarizing results.
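To get oriented after cloning, the layout above can be checked with a short script. This is only a sketch: the folder names come from the list above, but the helper `find_readmes` is hypothetical, and it assumes each README's filename starts with `README` (for example `README.md`), which this page does not confirm.

```python
from pathlib import Path

# Folder names as listed in the Structure section.
ENV_FOLDERS = [
    "funding_email_details",
    "sandbagging_details",
    "eval_tampering_details",
    "secret_number_details",
]


def find_readmes(repo_root):
    """Map each environment folder to its README path, or None if absent."""
    root = Path(repo_root)
    found = {}
    for name in ENV_FOLDERS:
        folder = root / name
        # Assumption: the README filename starts with "README" (e.g. README.md).
        matches = sorted(folder.glob("README*")) if folder.is_dir() else []
        found[name] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    # Run from the repository root.
    for name, readme in find_readmes(".").items():
        print(f"{name}: {readme if readme else 'README not found'}")
```

Run from the repository root, this prints one line per environment folder, which gives a quick way to confirm the checkout is complete before reading the per-experiment details.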

About

Supplementary material for research on investigating model incrimination
