GitHub - aubs7/kpop-sexism: Generated a dataset, built a hybrid model, and deployed a browser extension for K-Pop Sexism Detection

K-Pop Sexism Detection

A repository created for the undergraduate thesis "A Semi-Supervised Approach for Sexism Detection in K-Pop Posts".

It contains all the resources used to develop a hybrid model for detecting sexism in K-pop-related posts. The study used a semi-supervised learning approach (pseudolabeling) with stacked embeddings (GloVe + FastText), a CNN, and an attention mechanism. Overall, the final hybrid model achieved 91.84% accuracy, 90.10% precision, 94.79% recall, 92.38% F1-score, and 96.96% ROC-AUC. A web extension powered by Flask was developed for real-time sexism detection on K-pop posts on Reddit.
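As a rough illustration of the pseudolabeling idea mentioned above, the sketch below keeps only the unlabeled posts that a model scores with high confidence and assigns them a label. The function name `select_pseudolabels`, the 0.9 threshold, and the example probabilities are assumptions made for illustration; the actual selection rule used in the thesis may differ (see SSL_Training.ipynb).

```python
from typing import List, Tuple


def select_pseudolabels(
    probs: List[float], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Return (row_index, label) pairs for confident predictions only.

    probs[i] is the predicted probability that post i is sexist.
    A post is kept when the model is at least `threshold` sure of
    either class; everything in between stays unlabeled.
    """
    selected = []
    for i, p in enumerate(probs):
        if p >= threshold:
            selected.append((i, 1))  # confidently sexist
        elif p <= 1.0 - threshold:
            selected.append((i, 0))  # confidently not sexist
    return selected


# Hypothetical probabilities for five unlabeled posts.
example_probs = [0.97, 0.55, 0.03, 0.91, 0.40]
print(select_pseudolabels(example_probs, threshold=0.9))
# prints [(0, 1), (2, 0), (3, 1)]
```

The selected rows would then be added to the labeled training set before retraining, while the uncertain rows (here, indices 1 and 4) remain in the unlabeled pool.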

⋆｡°✩

This repo primarily features an unlabeled dataset of K-pop-related tweets for generating pseudolabels. It consists of 11,211 English tweets collected around popular K-pop scandals and sexist keywords, of which 1,040 tweets on the Garam bullying scandal were generated by Sainez and Wu in 2022, and the remaining 10,171 were manually scraped by Aubrey Min Lasala and Britney Beligan from January to April 2025.

⋆｡°✩

The structure of this repo is as follows:

kpop-sexism/
│
├── 📁 datasets/                                       # All datasets used in the project
│   ├── 📁 for-training/                               # Cleaned datasets for model training
│   │   ├── 📄 train.csv                               # 5,644 rows - cleaned training set
│   │   ├── 📄 test.csv                                # 2,208 rows - cleaned test set
│   │   └── 📄 unlabeled.csv                           # 10,782 rows – cleaned unlabeled data
│   │
│   └── 📁 unlabeled/                                  # Raw scraped dataset
│       └── 📄 final-scrape.csv                        # 11,211 rows – raw data
│
├── 📁 model/                                          # Model files and training notebooks
│   ├── 📄 baseline_model2.h5                          # Baseline trained model
│   ├── 📄 kpop-sexism-model2.h5                       # Final trained model
│   ├── 📄 tokenizer.pickle                            # Tokenizer for preprocessing
│   └── 📁 src/                                        # Training notebooks and scripts
│       ├── 📄 Baseline_Model_Training.ipynb
│       └── 📄 SSL_Training.ipynb
│
├── 📁 web-extension/                                  # Flask app and browser extension
│   ├── 📄 app.py                                      # Flask backend for real-time detection
│   ├── 📄 preprocessing.py                            # Preprocessing scripts for web input
│   └── 📁 extension/                                  # Browser extension (load it from the Google Chrome Extensions page)
│       ├── 📄 background.js
│       ├── 📄 index.html
│       ├── 📄 manifest.json
│       ├── 📄 script.js
│       └── 📄 style.css
│
├── 📄 system-reqs.txt               
├── 📄 README.md                     
└── 📄 Study_in_K-pop_Sexism_Detection.pdf
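After cloning, a quick sanity check is to compare each CSV's row count with the numbers listed in the tree. The sketch below is one way to do that with the standard library. It assumes every file has a header row, is UTF-8 encoded, and that the listed counts exclude the header; `count_rows` and `check_counts` are hypothetical helper names, not part of the repo.

```python
import csv
import io
from pathlib import Path
from typing import Dict, TextIO, Tuple

# Row counts as listed in the repository tree.
EXPECTED_ROWS = {
    "datasets/for-training/train.csv": 5644,
    "datasets/for-training/test.csv": 2208,
    "datasets/for-training/unlabeled.csv": 10782,
    "datasets/unlabeled/final-scrape.csv": 11211,
}


def count_rows(handle: TextIO) -> int:
    """Count data rows in an open CSV file, excluding the header line.

    csv.reader is used instead of counting newlines because a tweet
    may contain line breaks inside a quoted field.
    """
    reader = csv.reader(handle)
    next(reader, None)  # skip the header (assumed to be present)
    return sum(1 for _ in reader)


def check_counts(root: str = ".") -> Dict[str, Tuple[int, int]]:
    """Return {path: (expected, actual)} for the files found under root."""
    results = {}
    for rel_path, expected in EXPECTED_ROWS.items():
        path = Path(root) / rel_path
        if not path.exists():
            continue  # skip files that are not present locally
        with path.open(newline="", encoding="utf-8") as handle:
            results[rel_path] = (expected, count_rows(handle))
    return results


# In-memory demo: two data rows, one of which spans two lines.
sample = io.StringIO('text,label\n"first line\nsecond line",1\nplain post,0\n')
print(count_rows(sample))  # prints 2
```

Calling `check_counts(".")` from the repository root returns the expected and actual counts side by side, so any mismatch is easy to spot.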

⋆｡°✩

Notes:

The web extension is designed to work only on Reddit threads.
train.csv and test.csv contain labeled English data from the EXIST 2021 dataset and manually annotated K-pop tweets, while unlabeled.csv contains unlabeled K-pop tweets only. All three files were cleaned using preprocessing.py.
Although preprocessing and data wrangling techniques were applied to clean the for-training/ datasets, some rows may still be noisy or not fully cleaned.
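For readers who want to clean their own input in a similar spirit, here is a minimal sketch of common tweet-cleaning steps. It is not the contents of preprocessing.py: the function name `clean_post` and the specific steps (lowercasing, removing URLs and @mentions, dropping the `#` symbol) are assumptions chosen for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_post(text: str) -> str:
    """Lowercase a post and strip URLs, @mentions, and extra whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # remove links
    text = MENTION_RE.sub(" ", text)  # remove @username mentions
    text = text.replace("#", "")      # keep hashtag words, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_post("@fan123 Check THIS out https://t.co/abc #kpop"))
# prints: check this out kpop
```

If the model is used outside the web extension, input should go through the same cleaning that was applied at training time, so prefer the repo's own preprocessing.py over this sketch for real predictions.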

This repository and its datasets are intended for academic and educational purposes only!

Do not use the data, model, or outputs for any harmful, discriminatory, or unauthorized purposes.

Happy coding <3
