This project uses AI models to study and detect bias and discrimination in anti-trans discourse. It curates datasets from examples of transphobia, currently drawn from anti-trans legislation in the US that limits transgender rights relating to healthcare, access to bathrooms, sports, and more, at the federal and state level. The goal is to use language about sex, gender, sexuality, and related terms from the legislation to train Large Language Models (LLMs) for text generation and text classification.
The federal legislation dataset originates from the www.congress.gov website. It includes bills, amendments, and resolutions from the House of Representatives and the Senate that contain the keyword "transgender", drawn from the 117th and 118th congressional sessions (2021-2024).
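
As a rough illustration of the keyword criterion described above, the following minimal sketch filters a handful of records for the whole word "transgender", ignoring case. The records, their `id` and `text` fields, and the function name are hypothetical placeholders, not the project's actual data or code; the real selection may simply rely on the congress.gov search.

```python
import re

# Hypothetical placeholder records; the real dataset's fields may differ.
bills = [
    {"id": "example-1", "text": "A bill concerning transgender athletes in school sports."},
    {"id": "example-2", "text": "A bill to fund highway maintenance."},
    {"id": "example-3", "text": "Resolution recognizing Transgender Day of Visibility."},
]

# Match the keyword as a whole word, regardless of capitalization.
KEYWORD = re.compile(r"\btransgender\b", re.IGNORECASE)


def contains_keyword(text: str) -> bool:
    """Return True if the text mentions the keyword at least once."""
    return KEYWORD.search(text) is not None


# Keep only the records whose text contains the keyword.
matching_ids = [bill["id"] for bill in bills if contains_keyword(bill["text"])]
print(matching_ids)
```
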
Another dataset is being developed from federal bills that specifically focus on the current anti-trans movement, containing targeted [anti-trans federal bills from 2023-2024](https://github.com/gofilipa/anti-trans-legislation/blob/main/processing/bill_data/transtracker_federal_bills.csv).
The state bills dataset originates from Erin Reed's "LGBTQ+ Legislative Tracking 2023" document, which gathers legislation that is explicitly anti-trans.
All the code for data gathering and processing is available in this repository.
To gather the federal bill data, I scraped the bill text from the congress.gov servers and from the trans legislation tracker list (see the notebooks).
The processing notebook contains a matcher that extracts definitions of gender and related terms (like "sexuality", "biological sex", etc.). You can see the final dataset on my HuggingFace datasets page.
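
To give a sense of what a definition matcher can look like, here is a minimal sketch that uses Python's standard `re` module. It is not the matcher from the processing notebook, which may use a different library and different rules. The term list, the trigger phrases ("means", "refers to", "is defined as"), and the sample text are all assumptions made for illustration.

```python
import re

# Assumed list of terms of interest; the notebook's actual list may differ.
TERMS = ["gender identity", "biological sex", "gender", "sexuality", "sex"]

# Put longer terms first so multi-word terms are tried before shorter ones.
term_alternation = "|".join(
    re.escape(term) for term in sorted(TERMS, key=len, reverse=True)
)

QUOTE = "[\"'\u201c\u201d]"  # straight or curly quotation marks

# Assumed phrasing: a quoted term, a trigger phrase, then the definition,
# which runs until the next period or semicolon.
DEFINITION = re.compile(
    QUOTE + "(?P<term>" + term_alternation + ")" + QUOTE
    + r"\s+(?:means|refers\s+to|is\s+defined\s+as)\s+"
    + r"(?P<definition>[^.;]+)",
    re.IGNORECASE,
)


def extract_definitions(text):
    """Return (term, definition) pairs found in the text, in order."""
    return [
        (match.group("term").lower(), match.group("definition").strip())
        for match in DEFINITION.finditer(text)
    ]


# Placeholder text written for this example, not taken from any bill.
sample = (
    'In this Act, the term "sex" means a hypothetical placeholder definition. '
    'The term "gender identity" means another placeholder definition; '
    "funding for roads is unrelated."
)

definitions = extract_definitions(sample)
for term, definition in definitions:
    print(f"{term}: {definition}")
```

Because the definition is cut at the first period or semicolon, this sketch would truncate definitions that contain abbreviations such as "U.S."; a production matcher would need a more careful sentence boundary rule.
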
I am currently using the data to train models, which you can see on my HuggingFace profile page, gofilipa.