399,333 tweets were collected using the Twitter API and stored in AWS S3.
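For context, the collection step might look roughly like the sketch below, using tweepy and boto3; the library choice, search query, credentials, bucket, and key names are placeholders and assumptions, not the project's actual values.

```python
import json
import boto3
import tweepy

# Placeholder credentials; real values would come from a config or secrets store.
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"
)
api = tweepy.API(auth, wait_on_rate_limit=True)
s3 = boto3.client("s3")

# Collect a batch of tweets for a topic and keep only the fields used downstream.
batch = []
for tweet in tweepy.Cursor(api.search_tweets, q="some topic", lang="en").items(1000):
    batch.append({
        "create_at": str(tweet.created_at),
        "tweet": tweet.text,
        "location": tweet.user.location,
    })

# Write the batch to the S3 bucket as a JSON file.
s3.put_object(
    Bucket="my-tweets-bucket",
    Key="raw/tweets_batch_001.json",
    Body=json.dumps(batch),
)
```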
In the Databricks environment, connect to the S3 bucket, mount the data, and read it through a Spark session.
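A minimal sketch of the mount-and-read step in Databricks; the bucket name, mount point, secret scope, and JSON layout are assumptions.

```python
# Databricks notebooks provide `dbutils` and `spark` automatically.
ACCESS_KEY = dbutils.secrets.get(scope="aws", key="access_key")
SECRET_KEY = dbutils.secrets.get(scope="aws", key="secret_key")
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")

# Mount the bucket so its contents are reachable under /mnt/tweets.
dbutils.fs.mount(
    source=f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@my-tweets-bucket",
    mount_point="/mnt/tweets",
)

# Read the mounted raw tweets into a PySpark DataFrame.
twitter_data = spark.read.json("/mnt/tweets/raw/")
```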
Data preprocessing:
Created a PySpark DataFrame object, twitter_data.
Checked for null values and dropped the rows that contained them.
Converted the create_at column to a datetime type.
Used regular expressions to clean the tweet and location columns (a sketch of these steps follows).
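A minimal sketch of these cleaning steps, assuming the column names above; the regex patterns and timestamp parsing shown here are illustrative assumptions.

```python
from pyspark.sql import functions as F

clean_df = (
    twitter_data
    # Drop rows with null values in the columns used downstream.
    .dropna(subset=["tweet", "location", "create_at"])
    # Parse the create_at string into a timestamp column.
    .withColumn("created_at", F.to_timestamp("create_at"))
    # Strip URLs, mentions, and non-alphanumeric characters from the tweet text.
    .withColumn("tweet", F.regexp_replace("tweet", r"http\S+|@\w+|[^A-Za-z0-9\s]", ""))
    # Keep only letters, spaces, and commas in the location.
    .withColumn("location", F.regexp_replace("location", r"[^A-Za-z\s,]", ""))
)
```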
TextBlob, a Python library for text analysis, was used to assign a sentiment to each tweet.
Created a Sentiment column whose value is 0 if a tweet has negative sentiment and 1 if it has positive sentiment.
After cleaning, 135,083 tweets remained, of which 45,760 had positive sentiment and 89,323 had negative sentiment.
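A minimal sketch of the labelling step, wrapping TextBlob in a UDF; treating zero polarity as neutral and dropping those rows is an assumption about how the 0/1 labels were produced.

```python
from textblob import TextBlob
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

@F.udf(returnType=IntegerType())
def label_sentiment(text):
    # TextBlob polarity ranges from -1 (negative) to +1 (positive).
    polarity = TextBlob(text).sentiment.polarity
    if polarity == 0:
        return None  # neutral tweets are left unlabeled and dropped below
    return 1 if polarity > 0 else 0

labelled_df = (
    clean_df
    .withColumn("Sentiment", label_sentiment("tweet"))
    .dropna(subset=["Sentiment"])
)
```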
Model:
Feature Engineering:
Tokenizer converts the tweet column to lowercase and splits it on whitespace, outputCol="tokens".
StopWordsRemover removes stopwords from the tokens, outputCol="filtered".
CountVectorizer converts the filtered tweets into a matrix of token counts, outputCol="cv".
IDF (inverse document frequency) weights terms by how informative they are across tweets and filters out sparse terms, outputCol="1gram_idf".
NGram (n=2) is a feature transformer that converts the input array of strings into an array of n-grams, outputCol="2gram".
HashingTF maps the sequence of bigrams to their term frequencies using the hashing trick, numFeatures=20000, outputCol="2gram_tf".
IDF is applied again to down-weight sparse bigram terms, outputCol="2gram_idf".
VectorAssembler merges the "1gram_idf" and "2gram_tf" columns into a single vector column, outputCol="rawFeatures".
ChiSqSelector uses the chi-squared test against the Sentiment label to reduce rawFeatures to the 16,000 most relevant features, outputCol="features". A sketch of the full pipeline follows this list.
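A minimal sketch of the feature pipeline described above; numFeatures=20000 and the 16,000 selected features come from the description, while other parameter values (for example minDocFreq) are assumptions.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    Tokenizer, StopWordsRemover, CountVectorizer, IDF, NGram,
    HashingTF, VectorAssembler, ChiSqSelector,
)

tokenizer = Tokenizer(inputCol="tweet", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")

# Unigram branch: token counts -> IDF weights.
cv = CountVectorizer(inputCol="filtered", outputCol="cv")
idf_1gram = IDF(inputCol="cv", outputCol="1gram_idf", minDocFreq=5)

# Bigram branch: 2-grams -> hashed term frequencies -> IDF weights.
ngram = NGram(n=2, inputCol="filtered", outputCol="2gram")
hashing_tf = HashingTF(inputCol="2gram", outputCol="2gram_tf", numFeatures=20000)
idf_2gram = IDF(inputCol="2gram_tf", outputCol="2gram_idf", minDocFreq=5)

# Merge the two branches and keep the 16,000 most relevant features.
assembler = VectorAssembler(inputCols=["1gram_idf", "2gram_tf"], outputCol="rawFeatures")
selector = ChiSqSelector(numTopFeatures=16000, featuresCol="rawFeatures",
                         labelCol="Sentiment", outputCol="features")

feature_pipeline = Pipeline(stages=[
    tokenizer, remover, cv, idf_1gram, ngram, hashing_tf, idf_2gram,
    assembler, selector,
])
```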
Model Development and Evaluation:
Data was split into 90% train and 10% test data.
The Sentiment column is the label: 0 → negative sentiment, 1 → positive sentiment.
We tried RandomForestClassifier and LogisticRegression models to classify whether a tweet in the test data is positive or negative.
With RandomForestClassifier we achieved 66% accuracy and a 72.87% ROC-AUC score.
The classification report for RandomForestClassifier is as follows:
LogisticRegression gave us an accuracy of 90.425% and a ROC-AUC score of 92.83%.
The classification report for LogisticRegression is as follows:
Since the LogisticRegression model gave better accuracy, its predictions were saved back to the AWS S3 bucket (a sketch of the training, evaluation, and write-back steps follows).
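A minimal sketch of the split, training, evaluation, and write-back steps; the evaluator choices, random seed, selected output columns, and output path are assumptions.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator, MulticlassClassificationEvaluator,
)

# Apply the feature pipeline, then split 90/10 into train and test sets.
featured_df = feature_pipeline.fit(labelled_df).transform(labelled_df)
train_df, test_df = featured_df.randomSplit([0.9, 0.1], seed=42)

# Train LogisticRegression on the selected features and score the test set.
lr = LogisticRegression(featuresCol="features", labelCol="Sentiment")
lr_model = lr.fit(train_df)
predictions = lr_model.transform(test_df)

accuracy = MulticlassClassificationEvaluator(
    labelCol="Sentiment", metricName="accuracy").evaluate(predictions)
roc_auc = BinaryClassificationEvaluator(
    labelCol="Sentiment", metricName="areaUnderROC").evaluate(predictions)
print(f"accuracy={accuracy:.4f}, ROC-AUC={roc_auc:.4f}")

# Write the predictions back to the mounted S3 bucket.
predictions.select("tweet", "Sentiment", "prediction") \
    .write.mode("overwrite").parquet("/mnt/tweets/predictions/")
```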
QuickSight Dashboard
Tweets after data preprocessing:
66% of the tweets had negative sentiment.
Top 10 locations by number of tweets: location does not appear to contribute to tweet sentiment, since each location has roughly equal percentages of negative and positive tweets.
Predictions:
Of the 8.9K negative tweets, the model correctly predicted 8.19K as having negative sentiment.