-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathCase-Study-3-WriteUp.rmd
88 lines (59 loc) · 5.41 KB
/
Case-Study-3-WriteUp.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
---
title: "Case Study 3 - A Study of Demographic Differences against Party Preference"
author: "Elliot Pickens & Dean Gladish"
date: "May 22, 2018"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
\ \ \ \ \ \ \ Political party preference is typically thought to be associated with the demographics and geography of a populace. It is of interest to politicians, political scientists and the media alike to determine the extent of such correlation in order to understand which groups are most likely to vote for the party. Our case study, which uses data collected from U.S. adults from the 1980 and 2000 elections respectively as part of the National Election Studies project, is an investigation into the matter that allows us to model party preference using the logistic regression model. Specifically, we aim to address whether gender, regional, income, race, age, level of education, and union differences play a part in party preference over time.
## Data
\ \ \ \ \ \ \ The dataset that we analyzed consists of a binary indicator variable indicating Democratic Party preference as well as numerous other categorical variables corresponding to factors such as year, age, gender, race, region, income, unionized and educational status. The explanatory variables that we focus on are *gender*, *race*, *income*, *age* and *union*.
The following table gives our estimates of the important aspects (coefficients, etc.) of our model:
```{r, message = F, warning = F, echo = F}
library(pander)
m <- matrix(c("intercept", 1.648114, 0.221279, 7.448, 9.47e-14, "raceother",
-1.600862, 0.227855, -7.026, 2.13e-12, "racewhite", -1.852649, 0.182748, -10.138, 2e-16, "unionyes", 0.692776, 0.116536, 5.945, 2.77e-09, "incomemiddle 1/3", -0.265841, 0.113745, -2.337, 0.01943, "incomeupper 1/3", -0.491922, 0.114198, -4.308, 1.65e-05, "age", 0.007781, 0.002699, 2.883, 0.00395, "gendermale", -0.258311, 0.090486, -2.855, 0.00431), ncol = 5, byrow = T, nrow = 8)
colnames(m) <- c(" ", "Estimate", "Standard error", "z value", "P-value")
pander(m, caption = "Important Coefficients of our Logistic Regression Model")
```
The following set of plots represent what is essentially the relationship between our binary model and the data for the years 1980 and 2000.
```{r, message = F, warning = F, echo = F, fig.height = 3, fig.width = 6}
library(Sleuth3)
library(dplyr)
library(ggformula)
library(pander)
library(knitr)
library(stargazer)
library(car)
library(pander)
library(gridExtra)
library(broom)
library(ggthemes)
library(MASS)
library(leaps)
library(GGally)
nes <- read.csv("http://aloy.rbind.io/data/NES.csv")
library(ggformula)
library(car)
glm.basic <- glm(dem ~ 1, data = nes, family = binomial)
stp.inter.fwd <- stepAIC(glm.basic, scope = list(lower = ~1, upper = ~ year + region + union + income + educ + gender + race + age + I(age^2) +
age * year + age * region + age * union + age * income + age * educ + age * gender + age * race),
direction = "both", k = log(nrow(nes)), trace = 0)
par(mfrow=c(1, 4))
p2 <- crPlot(stp.inter.fwd, variable = "union")
p1 <- crPlot(stp.inter.fwd, variable = "gender")
p1 <- crPlot(stp.inter.fwd, variable = "race")
p1 <- crPlot(stp.inter.fwd, variable = "age")
p1 <- crPlot(stp.inter.fwd, variable = "income")
```
\ \ \ \ \ \ \ Through exploratory data analysis of significance and association, we found that interaction variables did not give us a closer fit to the data.
## Results:
\ \ \ \ \ \ \ Using the BIC criterion we obtained the following model:
\begin{center} $\widehat Y_i \{dem\} = \beta_0 + \beta_1 raceother + \beta_2 racewhite + \beta_3 unionyes + \beta_4 incomemiddle \ 1/3 + \beta_5 incomeupper \ 1/3 + \beta_6 age + \beta_7 gendermale$ \end{center}
The model shown above can be interpreted as follows:
## Discussion:
\ \ \ \ \ \ \ Out of the eight features that were contained in the original datset we ended up using only five to predict party status in our model, and this model does not include any interaction, or otherwise transformed variables. While, the simplicity of the model may suggest robust-ness it may may be an over simplification of the the interaction that we are ultimately trying to model. That being said (and Occam's Razor suggests) that this simplification may infact be a good thing, given that we are trying to create a model that works under a very broad range of circumstances. This may leave our model vunerable to mis-classifying very specific groups of people, but if we wish to provide an accurate model for each and every group of people we may need the help of some political scientists that have understanding of those specific groups. Overall our model seems to pedict party affiliation fairly well without completely violating its underlying assumptions or completely overfitting itself to the data.
\ \ \ \ \ \ \ In the future, however we would suggest that another sample of the population be taken. Given that each of the data points we used in this analysis were collected in either 1980 or 2000 there may be some underlying time related pattern that is subtly influencing our data. Economies, societies, and political parties all undergo transformations over time, and we may be picking up on some of those changes in our model. Alternatively, it is possible that changes have occured since 1980 & 2000 that may render this model less effective.