My project identifies specific components in gpt2-large, a 774M-parameter language model, that are causally linked to the generation of selected facts involving race. For example, I observed that the layer-18 MLP in gpt2-large strongly determines the generation of facts associated with "Indian." The project then makes a first attempt at linking these identified components to the decision-making process the model follows when given a prompt involving race. Results produced by activation patching:
Clean prompt: "The Indian live in the city of" | Generated: "Mumbai"
Clean prompt: "The Indian have an annual festival called" | Generated: "Diwali"
Clean prompt: "The Chinese have an annual festival called" | Generated: "the Spring Festival"
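The activation-patching procedure behind these results can be sketched in plain PyTorch. This is a minimal, illustrative version: the model below is a toy stack of MLP blocks rather than gpt2-large, and the inputs, layer index, and sizes are stand-ins. In the real experiment, the cached activation would be the output of gpt2-large's layer-18 MLP on the clean prompt, patched into a run on a corrupted prompt (e.g. "Indian" replaced with another group).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the network: a stack of MLP blocks.
# In the actual project this role is played by gpt2-large's MLP layers.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(4)]
)
model.eval()

clean_input = torch.randn(1, 16)      # stand-in for the clean prompt
corrupted_input = torch.randn(1, 16)  # stand-in for the corrupted prompt

layer_idx = 2  # the layer whose causal effect we probe (layer 18 in gpt2-large)
cache = {}

# 1. Run the clean input and cache the activation at the layer of interest.
def save_hook(module, inputs, output):
    cache["clean"] = output.detach()

handle = model[layer_idx].register_forward_hook(save_hook)
with torch.no_grad():
    clean_out = model(clean_input)
handle.remove()

# 2. Run the corrupted input normally, for comparison.
with torch.no_grad():
    corrupted_out = model(corrupted_input)

# 3. Run the corrupted input again, but patch in the cached clean activation.
#    Returning a tensor from a forward hook replaces the module's output.
def patch_hook(module, inputs, output):
    return cache["clean"]

handle = model[layer_idx].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_out = model(corrupted_input)
handle.remove()

# Every layer after layer_idx now sees the clean activation, so the
# patched run reproduces the clean run's output exactly. The degree to
# which patching a component restores clean-run behavior measures that
# component's causal contribution.
print(torch.allclose(patched_out, clean_out))
```

In the real setting, instead of comparing raw outputs, one would compare the logit (or probability) of the correct completion token ("Mumbai", "Diwali") between the corrupted and patched runs; a large recovery after patching a single MLP is the evidence that the component causally carries the fact.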