GitHub - babli18/comment-classifier: Code to classify comments

Currently in progress:
Data - The dataset contains Google Play comments from users of the Netflix app.
Timeline of Data - 2018 to 2025
Source - Kaggle

data_preparation.ipynb - Contains code to clean the data, including features, missing rows, and language classification. I tried translating non-English comments to English using open-source models, but the accuracy was not good enough, so I changed this step to classify each comment as English or Not English.
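As an illustration of the English / Not English labelling step, here is a minimal sketch using a stop-word-overlap heuristic. This is not the method used in the notebook: the `ENGLISH_HINT_WORDS` set, the `label_language` function, and the `min_ratio` threshold are all hypothetical names and values chosen for the example.

```python
import re

# Small hand-picked set of common English function words.
# This list is an assumption for the sketch, not taken from the notebook.
ENGLISH_HINT_WORDS = frozenset({
    "the", "a", "an", "and", "is", "it", "to", "of", "in", "this", "that",
    "for", "on", "with", "not", "i", "you", "my", "was", "but",
})


def label_language(comment: str, min_ratio: float = 0.2) -> str:
    """Return "English" or "Not English" for a single comment."""
    # Keep only runs of ASCII letters and apostrophes; other scripts,
    # accented letters, and emojis produce no (or fragmented) tokens.
    tokens = re.findall(r"[a-z']+", comment.lower())
    if not tokens:
        return "Not English"
    # Share of tokens that are common English function words.
    hits = sum(token in ENGLISH_HINT_WORDS for token in tokens)
    return "English" if hits / len(tokens) >= min_ratio else "Not English"
```

A heuristic like this will mislabel very short English comments that contain no function words (for example a one-word comment), so a trained language detector is likely a better fit for real data.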

clustering.ipynb - Contains code for general text preprocessing such as lowercasing and removal of stop words, emojis, etc. I used S-BERT for embeddings in order to capture and preserve the context of each sentence, UMAP to reduce the dimensionality of the embeddings to get better clusters, and HDBSCAN for clustering because it can handle clusters of varying density. I experimented with clustering both with and without dimensionality reduction. This notebook is currently on hold; I will revisit it because the flow of my logic has changed.
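The preprocessing, embedding, reduction, and clustering steps described above could be sketched roughly as follows. This is an assumed outline, not the notebook's code: the `STOP_WORDS` set, the function names, the `all-MiniLM-L6-v2` checkpoint, and every hyperparameter value are illustrative choices. It relies on the third-party packages `sentence-transformers`, `umap-learn`, and `hdbscan`.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would
# likely use a fuller list from a library such as NLTK or spaCy).
STOP_WORDS = frozenset({"the", "a", "an", "and", "is", "it", "to", "of", "in", "this"})


def preprocess(text: str) -> str:
    """Lowercase, strip emojis/punctuation, and drop stop words."""
    text = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace;
    # this removes emojis and punctuation in one pass.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


def cluster_comments(comments, min_cluster_size=15, reduce_dims=True):
    """Embed comments with S-BERT, optionally reduce with UMAP, cluster with HDBSCAN."""
    # Third-party imports are kept local so preprocess() works without them.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    cleaned = [preprocess(c) for c in comments]

    # Assumed checkpoint; any Sentence-BERT model could be substituted.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    features = model.encode(cleaned)

    if reduce_dims:
        # Reduce the embedding dimensionality before clustering.
        reducer = umap.UMAP(
            n_components=5, n_neighbors=15, min_dist=0.0,
            metric="cosine", random_state=42,
        )
        features = reducer.fit_transform(features)

    # HDBSCAN assigns the label -1 to points it treats as noise.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(features)
```

The `reduce_dims` flag mirrors the with/without dimensionality reduction comparison mentioned above, so both variants can be run on the same input.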

classify_training.ipynb - Currently in progress. I intend to train a multiclass classifier.

Files (28 commits):
.gitignore
README.md
classify_training.ipynb
clustering.ipynb
data_preparation.ipynb
