My project identifies specific components in gpt2-large, a 774M-parameter language model, that are causally linked to the generation of selected facts involving race. For example, I observed that the layer-18 MLP in gpt2-large strongly determines the generation of facts associated with "Indian." The project then makes a first attempt at linking these identified components to the decision-making process the model follows when given a prompt involving race. Results produced by activation patching:
Clean prompt: "The Indian live in the city of" | Generated: "Mumbai"
Clean prompt: "The Indian have an annual festival called" | Generated: "Diwali"
Clean prompt: "The Chinese have an annual festival called" | Generated: "the Spring Festival"
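The activation-patching procedure behind these results can be sketched in plain PyTorch. This is a minimal, illustrative version: the model below is a toy stack of MLP blocks rather than gpt2-large, and the inputs, layer index, and sizes are stand-ins. In the real experiment, the cached activation would be the output of gpt2-large's layer-18 MLP on the clean prompt, patched into a run on a corrupted prompt (e.g. "Indian" replaced with another group).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the network: a stack of MLP blocks.
# In the actual project this role is played by gpt2-large's MLP layers.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(4)]
)
model.eval()

clean_input = torch.randn(1, 16)      # stand-in for the clean prompt
corrupted_input = torch.randn(1, 16)  # stand-in for the corrupted prompt

layer_idx = 2  # the layer whose causal effect we probe (layer 18 in gpt2-large)
cache = {}

# 1. Run the clean input and cache the activation at the layer of interest.
def save_hook(module, inputs, output):
    cache["clean"] = output.detach()

handle = model[layer_idx].register_forward_hook(save_hook)
with torch.no_grad():
    clean_out = model(clean_input)
handle.remove()

# 2. Run the corrupted input normally, for comparison.
with torch.no_grad():
    corrupted_out = model(corrupted_input)

# 3. Run the corrupted input again, but patch in the cached clean activation.
#    Returning a tensor from a forward hook replaces the module's output.
def patch_hook(module, inputs, output):
    return cache["clean"]

handle = model[layer_idx].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_out = model(corrupted_input)
handle.remove()

# Every layer after layer_idx now sees the clean activation, so the
# patched run reproduces the clean run's output exactly. The degree to
# which patching a component restores clean-run behavior measures that
# component's causal contribution.
print(torch.allclose(patched_out, clean_out))
```

In the real setting, instead of comparing raw outputs, one would compare the logit (or probability) of the correct completion token ("Mumbai", "Diwali") between the corrupted and patched runs; a large recovery after patching a single MLP is the evidence that the component causally carries the fact.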