yuvrajvirk/MechInterpRace

MechInterpRace

This project identifies specific components in gpt2-large, a language model with roughly 774M parameters, that are causally linked to the generation of select facts involving race. For example, I observed that the layer 18 MLP in gpt2-large strongly determines the generation of facts associated with "Indian." The project then makes a basic attempt at linking these components to the decision-making process the language model follows when given a prompt involving race. Results produced by activation patching:

[Figure: MechInterpRaceGraphs1] [Figure: MechInterpRaceGraphs2]

Clean prompt: "The Indian live in the city of" | Generated: "Mumbai"
[Plot: newplot (11)]

Clean prompt: "The Indian have an annual festival called" | Generated: "Diwali"
[Plot: newplot (14)]

Clean prompt: "The Chinese have an annual festival called" | Generated: "the Spring Festival"
[Plot: newplot (21)]
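
For readers unfamiliar with activation patching, here is a minimal sketch of the general idea. It is not the code used in this repository. It assumes a tiny toy residual network in place of gpt2-large, and every name in it (`make_toy_model`, `run`, `metric`, `patching_scores`) is hypothetical. The model is run once on a clean input and once on a corrupted input. Then, one layer at a time, the layer's output from the clean run is substituted into the corrupted run, and the score records how much of the clean behaviour that substitution restores.

```python
import math
import random


def make_toy_model(n_layers=4, d=6, seed=0):
    # Toy stand-in for a transformer: one d x d weight matrix per "MLP layer".
    rng = random.Random(seed)
    return [[[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]
            for _ in range(n_layers)]


def mlp_out(weights, x):
    # Output of one toy layer: tanh(W x).
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def run(model, x, patch=None):
    """Run the residual stream. `patch` maps a layer index to a cached
    output that replaces that layer's own output."""
    patch = patch or {}
    cache = []
    resid = list(x)
    for i, weights in enumerate(model):
        out = patch[i] if i in patch else mlp_out(weights, resid)
        cache.append(out)
        resid = [r + o for r, o in zip(resid, out)]  # residual connection
    return resid, cache


def metric(resid):
    # Assumption: coordinate 0 stands in for the logit of the "correct" token.
    return resid[0]


def patching_scores(model, clean_x, corrupt_x):
    # Score per layer: 0 means patching restores nothing,
    # 1 means it fully restores the clean metric.
    # Assumes the clean and corrupted metrics differ (nonzero denominator).
    clean_resid, clean_cache = run(model, clean_x)
    corrupt_resid, _ = run(model, corrupt_x)
    clean_m, corrupt_m = metric(clean_resid), metric(corrupt_resid)
    scores = []
    for layer in range(len(model)):
        patched, _ = run(model, corrupt_x, patch={layer: clean_cache[layer]})
        scores.append((metric(patched) - corrupt_m) / (clean_m - corrupt_m))
    return scores


model = make_toy_model()
clean_x = [1.0, 0.5, -0.5, 0.2, 0.0, 0.3]      # stands in for the clean prompt
corrupt_x = [-1.0, 0.1, 0.4, -0.2, 0.6, 0.0]   # stands in for a corrupted prompt
scores = patching_scores(model, clean_x, corrupt_x)
for layer, score in enumerate(scores):
    print(f"layer {layer}: patching score {score:.3f}")
```

In a real experiment the toy layers would be replaced by hooks on the model's actual MLP and attention outputs, and the metric would typically be a logit or logit difference for the expected completion. A layer with a score close to 1 is one whose clean activation is, on its own, enough to restore the clean answer.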
