USA Presidential Debates 2020 : EDA, Sentiment Analysis and Predictive Modelling

US President Donald Trump and President-Elect Joe Biden challenged each other face to face twice during the Presidential Debates. This project analyzes and visualizes a dataset containing the transcripts of those debates.

Code and Resources Used :

Python : 3.8.5
Libraries : pandas, seaborn, wordcloud, bs4, re, nltk, sklearn
Data : Kaggle, additional data scraped from Factba.se and Rev

Data Cleaning

Upon initial analysis of the data obtained from Kaggle we can identify a number of data cleaning and pre-processing steps required.

Null values : The 'minute' column of the first debate contains a null value. Upon inspection we find that it marks the start of the second segment, so we replace it with '00:00'.
Inconsistent speaker names : The same speaker appears under several names. Chris Wallace is represented as both 'Chris Wallace' and 'Chris Wallace : ', and President Trump as 'President Donald J. Trump', 'President Trump' and 'Donald Trump'. Each speaker's variants are normalised to a single name, for example 'Donald Trump'.
Inconsistent timeframe : The 'minute' column holds timestamps as strings. For analysis we convert them to seconds so that the time spoken by each candidate can be computed.
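
The cleaning steps above can be sketched in plain Python. This is a minimal illustration rather than the notebook's code: the alias table and the helper names `normalise_speaker` and `to_seconds` are assumptions, and the timestamp formats ('mm:ss' and 'hh:mm:ss') are inferred from the '00:00' placeholder.

```python
# Hypothetical alias table: variant spellings mapped to one canonical name.
SPEAKER_ALIASES = {
    "President Donald J. Trump": "Donald Trump",
    "President Trump": "Donald Trump",
    "Donald Trump": "Donald Trump",
    "Chris Wallace": "Chris Wallace",
}

def normalise_speaker(raw):
    """Strip stray whitespace and trailing colons, then apply the alias table."""
    name = raw.strip().rstrip(":").strip()
    return SPEAKER_ALIASES.get(name, name)

def to_seconds(stamp):
    """Convert 'mm:ss' or 'hh:mm:ss' to seconds; a missing value counts as '00:00'."""
    if stamp is None:
        stamp = "00:00"
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

print(normalise_speaker("Chris Wallace : "))  # Chris Wallace
print(to_seconds("01:20"))                    # 80
print(to_seconds("01:10:43"))                 # 4243
```

With pandas, the same helpers could be applied column-wise with `Series.map`.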

Data Analysis

We will start off with some basic EDA.

We will first find which candidate spoke for the longest time in one go, and what he said.

We will now look at the vocabulary size for both candidates.
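
Vocabulary size can be read as the number of distinct words a speaker used. The sketch below assumes a simple regex tokeniser; the notebook's tokenisation is not shown in this README and may differ. The helper name `vocabulary` and the sample sentences are made up for illustration.

```python
import re

def vocabulary(texts):
    """Return the set of distinct lowercase word tokens across all texts."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z']+", text.lower()))
    return words

sample = ["We will win.", "We will, we WILL."]
print(len(vocabulary(sample)))  # 3 distinct words: we, will, win
```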

We will now look at the total time spoken by the candidates and the moderators across both debates.

We can see that the debate was neck and neck between the two candidates.

We will now look at the total words spoken by the candidates and the moderators across both debates.

Interesting insight - Donald Trump spoke for a total of 30 seconds less than Joe Biden but spoke 1000 more words.

Word Frequency :

Let us look at the most frequent words used by the candidates and the moderators. For better analysis, we've performed some text processing operations such as :

Remove stopwords
Remove punctuation
Remove numbers
Convert words to lowercase
Expand contractions into full words
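
The operations listed above can be sketched as one small function followed by a frequency count. This is an illustration under stated assumptions: the contraction and stop word tables are tiny hypothetical stand-ins for fuller lists (such as the stop words shipped with `nltk`), and contractions are expanded before punctuation is stripped so that the apostrophes are still present.

```python
import re
from collections import Counter

# Tiny illustrative tables; a real run would use fuller lists.
CONTRACTIONS = {"don't": "do not", "we've": "we have", "it's": "it is"}
STOPWORDS = {"the", "a", "to", "is", "it", "we", "do", "not", "have", "and"}

def preprocess(text):
    """Lowercase, expand contractions, drop punctuation, numbers and stop words."""
    text = text.lower().replace("’", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Anything that is not a letter or whitespace (punctuation, digits) becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in STOPWORDS]

tokens = preprocess("We've created 3 jobs. Jobs, jobs and the economy!")
print(tokens)                          # ['created', 'jobs', 'jobs', 'jobs', 'economy']
print(Counter(tokens).most_common(1))  # [('jobs', 3)]
```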

Donald Trump

Joe Biden

Moderators ( Chris Wallace and Kristen Welker )

Bigram Frequency :

A bigram is a pair of consecutive written units such as letters, syllables, or words. We will now look at the most frequent bigrams used by the candidates.
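
For word bigrams, each token is paired with the token that follows it. A minimal sketch, with a made-up token list and a hypothetical helper name `bigrams`:

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = "law and order law and order now".split()
counts = Counter(bigrams(tokens))
print(counts[("law", "and")])  # 2
```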

Donald Trump

Joe Biden

WordClouds

Let us create some word clouds!

Donald Trump

Joe Biden

Flow of the debate

Hardly a minute went by during the debates without one of the candidates angrily interrupting the other, whether on the coronavirus pandemic, the Supreme Court, the economy or anything else, including each other’s families.

“Will you shut up, man?” Biden snapped at Trump at one point.

A good way to visualize the number of interruptions is to plot heatmaps of the flow of the debates.

First Presidential Debate :

Second Presidential Debate :

Vice Presidential Debate :

These heatmaps show the high number of interruptions and the amount of cross-talk during the first debate. In the second presidential debate, the mute button, or at least the threat of it, seemed to work, as Donald Trump and Joe Biden were more restrained.

Sentiment Analysis :

We will use the `SentimentIntensityAnalyzer` from the `nltk` package to calculate the sentiment of the sentences spoken by both candidates.

We measure sentiment with the compound score, which ranges from -1 (Most Negative) to +1 (Most Positive).

After calculating the sentiment (Positive, Neutral, Negative) we can visualize it.
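
Turning a compound score into one of the three labels needs a cut-off. The README does not state which thresholds were used; the sketch below assumes the commonly used VADER convention of plus or minus 0.05. In the project the score would come from `SentimentIntensityAnalyzer().polarity_scores(sentence)["compound"]`; hard-coded scores are used here so the example runs without the VADER lexicon. The helper name `label_sentiment` is hypothetical.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to Positive, Neutral or Negative."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

for score in (0.6, 0.0, -0.4):
    print(score, label_sentiment(score))
```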

Model Building

Now we will build a machine learning model trained on the debate transcripts, the town hall speeches and the data extracted from factba.se and rev.com. Given a quote, the model predicts whether it is more likely to have been spoken by Donald Trump or Joe Biden.

Data has been scraped using BeautifulSoup. After scraping the blocks of text, they've been split into sentences and given the appropriate 'speaker' label and then converted to a dataframe.

`train_test_split` from `sklearn` is used to split the data into training and test data in the ratio 80:20.
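
To show what an 80:20 split does, here is a standard-library stand-in for `train_test_split`. It is not sklearn's implementation; the function name `split_80_20` and the fixed seed are assumptions made so the example is reproducible.

```python
import random

def split_80_20(rows, seed=0):
    """Shuffle a copy of the rows, then cut it into 80% train and 20% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]

train, test = split_80_20(range(10))
print(len(train), len(test))  # 8 2
```

In the project itself, the equivalent sklearn call would be along the lines of `train_test_split(X, y, test_size=0.2)`.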

Next we use `TfidfVectorizer`, which transforms text into feature vectors that can be used as input to an estimator. The vectorizer builds its vocabulary from the text it is fitted on. Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a statistical measure of how important a word is to a document in a collection or corpus: it grows with how often the word appears in the document and shrinks with how many documents contain the word.
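
The weighting can be illustrated with a few lines of standard-library Python. This sketch uses raw counts for term frequency and the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which is the formulation documented as the `TfidfVectorizer` default; it omits the L2 normalisation that sklearn also applies, so the numbers will not match sklearn's output exactly. The documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in docs]

docs = [["china", "china", "jobs"], ["jobs", "folks"]]
weights = tfidf(docs)
print(weights[0])
```

Here "jobs" appears in both documents and gets the minimum weight of 1.0, while "china" is specific to the first document and used twice, so it is weighted more heavily.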

Then we use `LogisticRegression` to train the model. We can compute the `confusion_matrix` to find the accuracy of the model.

The confusion matrix gives us an accuracy of 85.83%.
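
Accuracy follows from a confusion matrix as the sum of the diagonal (correct predictions) divided by the sum of all cells. The counts below are made up for illustration and are not the project's results.

```python
def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows are true speakers, columns are predicted speakers (hypothetical counts).
cm = [[40, 10],
      [10, 40]]
print(accuracy(cm))  # 0.8
```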

Let us check the model's predictions for some different Joe Biden quotes.

Let us check the model's predictions for some different Donald Trump quotes.

Usage:

This project is best viewed in a notebook viewer, which can be accessed here:

EDA, Word Frequency plots, Bigram Frequency plots, Word clouds, Sentiment Analysis - here
Debate Flow Heatmaps - here
Web Scraper - here
ML Model - here

ritik-k/USA_election_debate
