ritik-k/USA_election_debate

USA Presidential Debates 2020 : EDA, Sentiment Analysis and Predictive Modelling

US President Donald Trump and President-Elect Joe Biden challenged each other face to face twice during the 2020 Presidential Debates. This project analyzes and visualizes a dataset that contains the transcripts of those debates.

Code and Resources Used :

  • Python : 3.8.5
  • Libraries : pandas, seaborn, wordcloud, bs4, re, nltk, sklearn
  • Data : Kaggle, additional data scraped from Factba.se and Rev

Data Cleaning

An initial look at the data obtained from Kaggle reveals several data cleaning and pre-processing steps that are required.

  • Null values : The 'minute' column of the first debate contains one null value. On inspection, it marks the start of the second segment, so we replace it with '00:00'.

  • Inconsistent speaker names : The same speaker appears under several names. Chris Wallace is represented as both 'Chris Wallace' and 'Chris Wallace : ', and President Trump as 'President Donald J. Trump', 'President Trump' and 'Donald Trump'. Each speaker's variants are normalised to a single name, for example 'Donald Trump'.

  • Inconsistent timeframe : The 'minute' column is stored as a string. For the analysis we convert it to the number of seconds spoken by each candidate.
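The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the alias table, the function names, and the idea that a turn's duration is the gap to the next turn's timestamp are all assumptions.

```python
# Hypothetical alias table: the variants come from the list above,
# but the mapping code itself is an assumption.
SPEAKER_ALIASES = {
    "President Donald J. Trump": "Donald Trump",
    "President Trump": "Donald Trump",
    "Donald Trump": "Donald Trump",
    "Chris Wallace": "Chris Wallace",
}


def normalize_speaker(raw):
    """Strip whitespace and trailing colons, then map known aliases to one name."""
    cleaned = raw.strip().rstrip(":").strip()
    return SPEAKER_ALIASES.get(cleaned, cleaned)


def to_seconds(stamp):
    """Convert 'mm:ss' or 'hh:mm:ss' to seconds; a missing stamp counts as '00:00'."""
    if not stamp:
        stamp = "00:00"
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total


def seconds_spoken(stamps):
    """Assumed rule: a turn lasts until the next turn starts (last turn unknown)."""
    starts = [to_seconds(s) for s in stamps]
    return [later - earlier for earlier, later in zip(starts, starts[1:])] + [None]
```

Because the timestamps restart at '00:00' in the second segment, a real implementation would probably need to apply `seconds_spoken` to each segment separately, otherwise the gap across the segment boundary comes out negative.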

Data Analysis

We will start off with some basic EDA.

We will first find the candidate who spoke for the longest time in one go, and what he said.

We will now look at the vocabulary size for both candidates.

We will now look at the total time spoken by the candidates and the moderators across both debates.

We can see that the two candidates were neck and neck in speaking time.

We will now look at the total words spoken by the candidates and the moderators across both debates.

Interesting insight - Donald Trump spoke for 30 seconds less than Joe Biden in total, yet spoke 1000 more words.
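One way to compute these totals, along with vocabulary size and speaking rate, is sketched below. It assumes each turn is available as a `(speaker, text, seconds)` tuple and splits words on whitespace; both the data layout and the `speaker_stats` name are hypothetical.

```python
from collections import defaultdict


def speaker_stats(turns):
    """Aggregate (speaker, text, seconds) turns into per-speaker totals."""
    stats = defaultdict(lambda: {"seconds": 0, "words": 0, "vocab": set()})
    for speaker, text, seconds in turns:
        words = text.lower().split()  # naive whitespace tokenisation
        entry = stats[speaker]
        entry["seconds"] += seconds
        entry["words"] += len(words)
        entry["vocab"].update(words)
    return {
        name: {
            "seconds": s["seconds"],
            "words": s["words"],
            "vocab_size": len(s["vocab"]),
            "words_per_second": s["words"] / s["seconds"] if s["seconds"] else 0.0,
        }
        for name, s in stats.items()
    }


# Placeholder turns, not taken from the transcripts.
example = speaker_stats([("A", "we will we can", 2), ("B", "yes", 1), ("A", "we did", 2)])
```

A speaker with fewer seconds but more words simply has a higher `words_per_second`, which is the pattern described above.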

Word Frequency :

Let us look at the most frequent words used by the candidates and the moderators. For better analysis, we've performed some text processing operations such as :

  • Remove stopwords

  • Remove punctuation

  • Remove numbers

  • Convert words to lowercase

  • Expand contractions into full words
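The text processing steps listed above could look roughly like this. It is only a sketch: the stopword set and contraction table are tiny placeholders (the project presumably uses nltk's stopword list), and `preprocess` is a hypothetical name.

```python
import re
import string

# Tiny illustrative tables (assumptions, not the project's real lists).
STOPWORDS = {"a", "an", "the", "is", "are", "not", "it", "to", "of", "and", "we", "will", "do", "i"}
CONTRACTIONS = {"it's": "it is", "don't": "do not", "we'll": "we will", "can't": "cannot", "i'm": "i am"}


def preprocess(text):
    """Lowercase, expand contractions, drop numbers and punctuation, remove stopwords."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes like straight ones
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOPWORDS]
```

Contractions are expanded before punctuation is stripped, since removing the apostrophe first would turn "don't" into "dont" and the lookup would miss it. The resulting tokens can be passed to `collections.Counter(...).most_common(n)` to get word frequencies.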

Donald Trump

Joe Biden

Moderators ( Chris Wallace and Kristen Welker )

Bigram Frequency :

A bigram is a pair of consecutive written units such as letters, syllables, or words. We will now look at the most frequent bigrams used by the candidates.
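For word-level bigrams, counting can be done by pairing each token with its successor. The sketch below assumes the text has already been tokenised into a list of words; `top_bigrams` is a hypothetical helper name.

```python
from collections import Counter


def top_bigrams(tokens, n=10):
    """Return the n most frequent pairs of consecutive tokens with their counts."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


# Placeholder tokens, not taken from the transcripts.
tokens = ["a", "b", "c", "a", "b", "d"]
most_common = top_bigrams(tokens, n=1)
```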

Donald Trump

Joe Biden

WordClouds

Let us create some word clouds!

Donald Trump

Joe Biden

Flow of the debate

Hardly a minute went by during the debates without one of the candidates angrily interrupting the other, whether on the coronavirus pandemic, the Supreme Court, the economy or anything else, including each other’s families.

“Will you shut up, man?” Biden snapped at Trump at one point.

A good way to visualize the number of interruptions is to plot heatmaps of the flow of the debates.

First Presidential Debate :

Second Presidential Debate :

Vice Presidential Debate :

These heatmaps show the large number of interruptions and the amount of cross-talk during the first debate. In the second presidential debate, the mute button, or at least the threat of it, seemed to work, as Donald Trump and Joe Biden were more restrained.

Sentiment Analysis :

We will use the SentimentIntensityAnalyzer from the nltk package to calculate the sentiment of each sentence spoken by the two candidates.

We measure sentiment with the compound score, which ranges from -1 (most negative) to +1 (most positive).

After classifying each sentence as Positive, Neutral or Negative, we can visualize the results.
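Turning a compound score into one of the three labels needs a cutoff. The sketch below assumes the commonly used VADER convention of ±0.05; the thresholds the notebook actually uses are not stated here, and `label_sentiment` is a hypothetical name.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a label (threshold is an assumed convention)."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# With nltk installed and the lexicon fetched via nltk.download("vader_lexicon"),
# the score for a sentence would come from:
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(sentence)["compound"]
```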

Model Building

Now we will build a machine learning model trained on the debate transcript data, town hall speeches data and the data extracted from factba.se and rev.com. The model will then predict whether a given quote is more likely to have been spoken by Donald Trump or Joe Biden.

Data has been scraped using BeautifulSoup. After scraping the blocks of text, they've been split into sentences and given the appropriate 'speaker' label and then converted to a dataframe.

train_test_split from sklearn is used to split the data into training and test data in the ratio 80:20.

Next we use TfidfVectorizer, which transforms text into feature vectors that can be used as input to an estimator. It builds its own vocabulary from the entire set of text it is fitted on. Tf-idf stands for term frequency-inverse document frequency; the tf-idf weight is a statistical measure of how important a word is to a document in a collection or corpus.
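To make the weighting concrete, here is a small hand-rolled tf-idf in plain Python. It follows what is, to my understanding, scikit-learn's default formula (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation), but it tokenises on whitespace only, so its numbers will not always match TfidfVectorizer exactly. The `tfidf` function is a hypothetical illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using smoothed idf and L2 norm."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Placeholder documents, not taken from the transcripts.
vectors = tfidf(["tax plan", "tax cut cut"])
```

In this toy corpus "tax" appears in both documents, so it receives a lower weight than "plan" or "cut", which each appear in only one.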

Then we train a LogisticRegression model on these features and compute the confusion_matrix, from which the accuracy of the model can be derived.
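A minimal end-to-end sketch of this pipeline with scikit-learn is shown below. The sentences and the `speaker_a` / `speaker_b` labels are invented placeholders rather than real quotes, and the use of `make_pipeline`, `stratify`, `random_state` and `max_iter` are choices made for the example, not settings confirmed by the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented placeholder sentences and labels (NOT the project's data).
sentences = [
    "we will build the economy back",
    "jobs and growth come first",
    "the economy is growing fast",
    "growth means more jobs",
    "we will cut taxes for growth",
    "health care is a right",
    "we must protect health care",
    "families need affordable care",
    "care for every family matters",
    "protect families and their health",
]
speakers = ["speaker_a"] * 5 + ["speaker_b"] * 5
labels = ["speaker_a", "speaker_b"]

# 80:20 split, stratified so both speakers appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    sentences, speakers, test_size=0.2, random_state=42, stratify=speakers
)

# Vectorise with tf-idf, then fit a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=labels)
```

Wrapping the vectorizer and classifier in one pipeline means a new quote can be classified directly with `model.predict(["some new sentence"])`, without vectorising it by hand.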

The confusion matrix gives us an accuracy of 85.83%.
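Accuracy follows from a confusion matrix as the sum of the diagonal (correct predictions) divided by the sum of all cells. The sketch below uses made-up counts purely to show the arithmetic; they are not the project's results.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Made-up counts: rows are true speakers, columns are predicted speakers.
example_accuracy = accuracy_from_confusion([[8, 2], [1, 9]])  # 17 correct out of 20
```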

Let us check the model's predictions for some different Joe Biden quotes.

Let us check the model's predictions for some different Donald Trump quotes.


Usage:

This project is best viewed in a notebook viewer, which can be accessed here:

  • EDA, Word Frequency plots, Bigram Frequency plots, Word clouds, Sentiment Analysis - here

  • Debate Flow Heatmaps - here

  • Web Scraper - here

  • ML Model - here
