399,333 tweets were collected using the Twitter API and stored in AWS S3.
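For context, the collection step might look roughly like the sketch below, using tweepy and boto3; the library choice, search query, credentials, bucket, and key names are placeholders and assumptions, not the project's actual values.

```python
import json
import boto3
import tweepy

# Placeholder credentials; real values would come from a config or secrets store.
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"
)
api = tweepy.API(auth, wait_on_rate_limit=True)
s3 = boto3.client("s3")

# Collect a batch of tweets for a topic and keep only the fields used downstream.
batch = []
for tweet in tweepy.Cursor(api.search_tweets, q="some topic", lang="en").items(1000):
    batch.append({
        "create_at": str(tweet.created_at),
        "tweet": tweet.text,
        "location": tweet.user.location,
    })

# Write the batch to the S3 bucket as a JSON file.
s3.put_object(
    Bucket="my-tweets-bucket",
    Key="raw/tweets_batch_001.json",
    Body=json.dumps(batch),
)
```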
In the Databricks environment, connect to the S3 bucket, mount the data, and read it through a Spark session.
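A minimal sketch of the mount-and-read step in Databricks; the bucket name, mount point, secret scope, and JSON layout are assumptions.

```python
# Databricks notebooks provide `dbutils` and `spark` automatically.
ACCESS_KEY = dbutils.secrets.get(scope="aws", key="access_key")
SECRET_KEY = dbutils.secrets.get(scope="aws", key="secret_key")
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")

# Mount the bucket so its contents are reachable under /mnt/tweets.
dbutils.fs.mount(
    source=f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@my-tweets-bucket",
    mount_point="/mnt/tweets",
)

# Read the mounted raw tweets into a PySpark DataFrame.
twitter_data = spark.read.json("/mnt/tweets/raw/")
```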
Data preprocessing:
Created a PySpark DataFrame object, twitter_data.
Checked for null values and dropped the rows that contained them.
Converted the create_at column to a datetime type.
Used regular expressions to clean the tweet and location columns (a sketch of these steps follows).
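A minimal sketch of these cleaning steps, assuming the column names above; the regex patterns and timestamp parsing shown here are illustrative assumptions.

```python
from pyspark.sql import functions as F

clean_df = (
    twitter_data
    # Drop rows with null values in the columns used downstream.
    .dropna(subset=["tweet", "location", "create_at"])
    # Parse the create_at string into a timestamp column.
    .withColumn("created_at", F.to_timestamp("create_at"))
    # Strip URLs, mentions, and non-alphanumeric characters from the tweet text.
    .withColumn("tweet", F.regexp_replace("tweet", r"http\S+|@\w+|[^A-Za-z0-9\s]", ""))
    # Keep only letters, spaces, and commas in the location.
    .withColumn("location", F.regexp_replace("location", r"[^A-Za-z\s,]", ""))
)
```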
TextBlob, a Python library for text analysis, was used to assign a sentiment to each tweet.
Created a Sentiment column whose value is 0 if a tweet has negative sentiment and 1 if it has positive sentiment.
After cleaning, 135,083 tweets remained, of which 45,760 had positive sentiment and 89,323 had negative sentiment.
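A minimal sketch of the labelling step, wrapping TextBlob in a UDF; treating zero polarity as neutral and dropping those rows is an assumption about how the 0/1 labels were produced.

```python
from textblob import TextBlob
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

@F.udf(returnType=IntegerType())
def label_sentiment(text):
    # TextBlob polarity ranges from -1 (negative) to +1 (positive).
    polarity = TextBlob(text).sentiment.polarity
    if polarity == 0:
        return None  # neutral tweets are left unlabeled and dropped below
    return 1 if polarity > 0 else 0

labelled_df = (
    clean_df
    .withColumn("Sentiment", label_sentiment("tweet"))
    .dropna(subset=["Sentiment"])
)
```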
Model:
Feature Engineering:
Tokenizer converts the tweet column to lowercase and splits it on whitespace, outputCol="tokens".
StopWordsRemover removes stopwords from the tokens, outputCol="filtered".
CountVectorizer converts the filtered tweets into a matrix of token counts, outputCol="cv".
IDF (inverse document frequency) weights terms by how informative they are across tweets and filters out sparse terms, outputCol="1gram_idf".
NGram (n=2) is a feature transformer that converts the input array of strings into an array of n-grams, outputCol="2gram".
HashingTF maps the sequence of bigrams to their term frequencies using the hashing trick, numFeatures=20000, outputCol="2gram_tf".
IDF is applied again to down-weight sparse bigram terms, outputCol="2gram_idf".
VectorAssembler merges the "1gram_idf" and "2gram_tf" columns into a single vector column, outputCol="rawFeatures".
ChiSqSelector uses the chi-squared test against the Sentiment label to reduce rawFeatures to the 16,000 most relevant features, outputCol="features". A sketch of the full pipeline follows this list.
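A minimal sketch of the feature pipeline described above; numFeatures=20000 and the 16,000 selected features come from the description, while other parameter values (for example minDocFreq) are assumptions.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    Tokenizer, StopWordsRemover, CountVectorizer, IDF, NGram,
    HashingTF, VectorAssembler, ChiSqSelector,
)

tokenizer = Tokenizer(inputCol="tweet", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")

# Unigram branch: token counts -> IDF weights.
cv = CountVectorizer(inputCol="filtered", outputCol="cv")
idf_1gram = IDF(inputCol="cv", outputCol="1gram_idf", minDocFreq=5)

# Bigram branch: 2-grams -> hashed term frequencies -> IDF weights.
ngram = NGram(n=2, inputCol="filtered", outputCol="2gram")
hashing_tf = HashingTF(inputCol="2gram", outputCol="2gram_tf", numFeatures=20000)
idf_2gram = IDF(inputCol="2gram_tf", outputCol="2gram_idf", minDocFreq=5)

# Merge the two branches and keep the 16,000 most relevant features.
assembler = VectorAssembler(inputCols=["1gram_idf", "2gram_tf"], outputCol="rawFeatures")
selector = ChiSqSelector(numTopFeatures=16000, featuresCol="rawFeatures",
                         labelCol="Sentiment", outputCol="features")

feature_pipeline = Pipeline(stages=[
    tokenizer, remover, cv, idf_1gram, ngram, hashing_tf, idf_2gram,
    assembler, selector,
])
```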
Model Development and Evaluation:
Data was split into 90% train and 10% test data.
The Sentiment column is the label: 0 → negative sentiment, 1 → positive sentiment.
We tried RandomForestClassifier and LogisticRegression models to classify whether a tweet in the test data is positive or negative.
With RandomForestClassifier we achieved 66% accuracy and a 72.87% ROC-AUC score.
The classification report for RandomForestClassifier is as follows:
LogisticRegression gave us an accuracy of 90.425% and a ROC-AUC score of 92.83%.
The classification report for LogisticRegression is as follows:
Since the LogisticRegression model gave better accuracy, its predictions were saved back to the AWS S3 bucket (a sketch of the training, evaluation, and write-back steps follows).
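A minimal sketch of the split, training, evaluation, and write-back steps; the evaluator choices, random seed, selected output columns, and output path are assumptions.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator, MulticlassClassificationEvaluator,
)

# Apply the feature pipeline, then split 90/10 into train and test sets.
featured_df = feature_pipeline.fit(labelled_df).transform(labelled_df)
train_df, test_df = featured_df.randomSplit([0.9, 0.1], seed=42)

# Train LogisticRegression on the selected features and score the test set.
lr = LogisticRegression(featuresCol="features", labelCol="Sentiment")
lr_model = lr.fit(train_df)
predictions = lr_model.transform(test_df)

accuracy = MulticlassClassificationEvaluator(
    labelCol="Sentiment", metricName="accuracy").evaluate(predictions)
roc_auc = BinaryClassificationEvaluator(
    labelCol="Sentiment", metricName="areaUnderROC").evaluate(predictions)
print(f"accuracy={accuracy:.4f}, ROC-AUC={roc_auc:.4f}")

# Write the predictions back to the mounted S3 bucket.
predictions.select("tweet", "Sentiment", "prediction") \
    .write.mode("overwrite").parquet("/mnt/tweets/predictions/")
```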
QuickSight Dashboard
Tweets after data preprocessing:
66% of the tweets had negative sentiment.
Top 10 locations by number of tweets: location does not appear to contribute to tweet sentiment, since each location has roughly equal percentages of negative and positive tweets.
Predictions:
Of the 8.9K negative tweets, the model correctly predicted 8.19K as having negative sentiment.