Classify tweets carrying the #disaster hashtag as either about a real disaster or irrelevant. The data set contains over 10,000 tweets, of which roughly half are about real disasters and half are irrelevant.
Methods include:
- Logistic regression with a bag-of-words embedding
- Logistic regression with a word2vec embedding, incorporating the semantic meaning of each word
- Decision tree
- Random forest
- Convolutional neural network (CNN), incorporating the structure of the text (word order)
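
As a rough illustration of the first method in the list, here is a minimal sketch of logistic regression on bag-of-words counts, written in plain Python so that it runs without extra dependencies. It is not the project's actual code: the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`, `train_logistic_regression`, `predict_proba`), the learning rate, the epoch count, and the six toy tweets are all assumptions made up for this example. In practice a library such as scikit-learn would likely be used instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(texts):
    """Map every distinct token in the corpus to a column index."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Return a token-count vector; tokens outside the vocabulary are ignored."""
    vector = [0.0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = float(count)
    return vector


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * f for w, f in zip(weights, features)) + bias
            error = sigmoid(z) - label  # derivative of the log loss w.r.t. z
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict_proba(text, vocab, weights, bias):
    """Estimated probability that the tweet is about a real disaster."""
    features = bag_of_words(text, vocab)
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


# Made-up toy tweets, NOT taken from the project's data set.
# Label 1 = real disaster, label 0 = irrelevant.
train_texts = [
    "Wildfire forces evacuation of hundreds of homes",
    "Earthquake destroys buildings, rescue teams deployed",
    "Flood waters rising, residents evacuated overnight",
    "My exam today was a total disaster lol",
    "This haircut is a disaster, never again",
    "Dinner party was a disaster but we laughed",
]
train_labels = [1, 1, 1, 0, 0, 0]

vocab = build_vocabulary(train_texts)
X = [bag_of_words(t, vocab) for t in train_texts]
weights, bias = train_logistic_regression(X, train_labels)

for tweet in ["Earthquake rescue teams evacuated residents",
              "My haircut was a disaster"]:
    print(f"{predict_proba(tweet, vocab, weights, bias):.2f}  {tweet}")
```

Because a bag-of-words vector only counts tokens, the word "disaster" contributes the same feature whether it is used literally or figuratively; the classifier has to rely on the surrounding words. That limitation is what the word2vec and CNN variants in the list try to address.
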
Inspired by the blog post https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e