This repository contains the Jupyter Notebook detailing the data analysis, preprocessing, and model training process for the Cyberbullying Detection project.
A live version of this project is deployed on Hugging Face Spaces.
This repository documents how the machine learning model behind the final web application was built. The focus is the development process, from raw data to a trained and evaluated classifier.
The main file in this repository is Cyberbullying.ipynb. This notebook includes:
- Data Loading and Cleaning: Importing the dataset of over 40,000 comments and performing initial preprocessing.
- Exploratory Data Analysis (EDA): Visualizing the distribution of the data.
- Text Preprocessing: Detailed steps for cleaning the text, including tokenization, stop word removal, and lemmatization using NLTK.
- Feature Extraction: Using the TF-IDF (Term Frequency-Inverse Document Frequency) method to convert text into numerical features.
- Model Training and Evaluation: Training and comparing multiple classifiers to select the best one based on performance metrics.
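
To make the feature extraction step concrete, here is a minimal pure-Python sketch of textbook TF-IDF weighting (`tf * log(N / df)`). It is not the notebook's code: the `tokenize` and `tf_idf` helpers, the tiny stop word set, and the three sample comments are all hypothetical stand-ins, and library implementations typically use a smoothed variant of the formula, so the exact numbers will differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; NLTK ships a much larger one).
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "so", "to", "and"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample comments, not taken from the project's dataset.
comments = [
    "You are so stupid",
    "You are a great friend",
    "Have a great day",
]
weights = tf_idf(comments)
for comment, w in zip(comments, weights):
    print(comment, "->", {term: round(value, 3) for term, value in w.items()})
```

Terms that appear in only one comment (such as "stupid" or "friend") receive a higher weight than terms shared across comments (such as "great"), which is what makes TF-IDF features useful for separating classes.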
The final model chosen was the Stochastic Gradient Descent (SGD) Classifier, which achieved an accuracy of 87% on the test set.
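
As a rough illustration of how such a model can be trained and scored, the sketch below chains TF-IDF features into an SGD classifier and measures accuracy on a held-out split. It assumes scikit-learn, which this README does not name explicitly. The toy `texts` and `labels`, the split ratio, and the default hyperparameters are placeholders rather than the notebook's actual settings, so the printed accuracy says nothing about the 87% figure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy placeholder data, not the project's dataset (1 = bullying, 0 = not).
texts = [
    "you are so stupid", "nobody likes you loser", "you are ugly and dumb",
    "shut up you idiot", "go away you worthless loser", "everyone hates you",
    "you are such a dumb idiot", "what an ugly stupid person",
    "have a great day", "thanks for your help", "nice work on the project",
    "see you at lunch tomorrow", "that was a great game", "happy birthday friend",
    "i love this song", "good luck with your exam",
]
labels = [1] * 8 + [0] * 8

# Hold out a quarter of the data, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# TF-IDF features feed a linear model trained with stochastic gradient descent.
model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=42))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is fitted on the training split only, so the test accuracy is not inflated by information leaking from the held-out comments.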
The cleaned-up code for the deployed Gradio application can be found in the cyberbullying-app repository.