babli18/comment-classifier

Currently in Progress:
Data - The dataset contains Google Play comments from users of the Netflix app.
Timeline of Data - 2018 to 2025
Source - Kaggle

data_preparation.ipynb - Contains code to clean the data, including features, missing rows, and language classification. I tried translating the non-English comments to English using open-source models, but the accuracy was not up to what was needed, so I changed the approach to classifying each comment as English or Not English.
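A minimal sketch of the English / Not English labeling step described above, not the notebook's actual code. The `content` column name, the helper names, and the use of the third-party `langdetect` package are all assumptions made for illustration. The detector is passed in as a parameter, so any other language-identification function can be swapped in.

```python
from typing import Callable, Optional


def detect_with_langdetect(text: str) -> str:
    """Return a language code using the third-party langdetect package (assumed choice)."""
    # Imported lazily so the rest of the sketch works without the dependency.
    from langdetect import detect
    return detect(text)


def label_language(text: Optional[str],
                   detector: Callable[[str], str] = detect_with_langdetect) -> Optional[str]:
    """Label a comment as 'English' or 'Not English'; None means the row should be dropped."""
    if text is None or not text.strip():
        return None  # missing or empty comment
    try:
        code = detector(text)
    except Exception:
        # Detectors can fail on emoji-only or very short text.
        return "Not English"
    return "English" if code == "en" else "Not English"


def prepare(rows, detector=detect_with_langdetect):
    """Drop rows with missing comments and attach a language label to the rest."""
    cleaned = []
    for row in rows:
        label = label_language(row.get("content"), detector)  # "content" is a hypothetical column
        if label is not None:
            cleaned.append({**row, "language": label})
    return cleaned
```

Calling `prepare(rows)` on a list of dicts returns only the rows that have a comment, each with an added `language` key.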

clustering.ipynb - Contains code for general text preprocessing such as lowercasing and removal of stop words, emojis, etc. I used S-BERT for the embeddings in order to capture and preserve the context of each sentence. I used UMAP to reduce the dimensionality of the embeddings to get better clusters, and HDBSCAN for clustering because it can handle clusters of varying density. I experimented with clustering both with and without dimensionality reduction. This notebook is currently on hold; I will revisit it, as the flow of my logic has changed.
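The pipeline above (preprocess, embed, reduce, cluster) can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: the stop-word list, the emoji ranges, the `all-MiniLM-L6-v2` model name, and the parameter values are all placeholders. The heavy libraries (`sentence-transformers`, `umap-learn`, `hdbscan`) are imported inside the function so the preprocessing helper can be used on its own.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "this", "i"}

# Rough emoji/pictograph ranges; not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def preprocess(text: str) -> str:
    """Lowercase, strip emojis and punctuation, and drop stop words."""
    text = EMOJI_RE.sub(" ", text.lower())
    tokens = re.findall(r"[a-z0-9']+", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def cluster_comments(comments, n_components=5, min_cluster_size=15):
    """Embed with S-BERT, reduce with UMAP, then cluster with HDBSCAN."""
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    embeddings = model.encode([preprocess(c) for c in comments])
    # Passing embeddings straight to HDBSCAN instead gives the
    # "without dimensionality reduction" variant.
    reduced = umap.UMAP(n_components=n_components, metric="cosine",
                        random_state=42).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return labels  # -1 marks points HDBSCAN treats as noise
```

`cluster_comments(list_of_comments)` returns one integer label per comment. The share of `-1` labels is a quick way to compare runs with and without the UMAP step.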

classify_training.ipynb - Currently in progress. The intent is to train a multiclass classifier.
